
bug: frontend node somehow disconnects from meta node when creating multiple CDC tables of a single CDC source #19349

Closed
lmatz opened this issue Nov 12, 2024 · 2 comments
lmatz commented Nov 12, 2024

https://buildkite.com/risingwave-test/sysbench-cdc/builds/803#01931b7f-fc3b-4298-9460-20405270f932/480

The test uses sysbench as the data generator.

Steps:

  1. The pipeline first creates "${BENCHMARK_SYSBENCH_TABLES}" tables in PG: https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/prepare.template.yaml#L25-L44
  2. Then the pipeline creates one CDC source: https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/start.template.yaml#L54-L62
  3. Then the pipeline uses the same CDC source to create "${BENCHMARK_SYSBENCH_TABLES}" CDC tables in RW (see the DDL sketch after this list): https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/start.template.yaml#L54-L62
  4. Finally, the pipeline inserts data into PG: https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/start.template.yaml#L109-L129
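
For reference, steps 2 and 3 boil down to DDL along the following lines, issued against the RW frontend. This is a minimal sketch only: the host names, credentials, and connector options are placeholders rather than values from the linked manifests, while the per-table CREATE TABLE statement matches the one visible in the frontend log below.

```bash
#!/usr/bin/env bash
# Minimal sketch of steps 2-3. Host names, credentials, and connector options
# are placeholders, not taken from the linked manifests.
set -euo pipefail

RW_PSQL="psql -h benchmark-risingwave-frontend -p 4566 -U root -d dev"

# Step 2: one shared CDC source pointing at the upstream Postgres instance.
$RW_PSQL -c "CREATE SOURCE sbtest WITH (
    connector = 'postgres-cdc',
    hostname = 'sysbench-postgresql',
    port = '5432',
    username = 'postgres',
    password = 'postgres',
    database.name = 'sbtest',
    schema.name = 'public'
);"

# Step 3: one CDC table per upstream table, all sharing the source above.
# The column list matches the statement seen in the frontend log.
for i in $(seq 1 "${BENCHMARK_SYSBENCH_TABLES}"); do
    $RW_PSQL -c "CREATE TABLE sbtest${i} (
        id INT PRIMARY KEY,
        k INT,
        c CHARACTER VARYING,
        pad CHARACTER VARYING
    ) FROM sbtest TABLE 'public.sbtest${i}';"
done
```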

In a series of tests, we found that if "${BENCHMARK_SYSBENCH_TABLES}" <= 700, the test passes, but if >= 750, it fails (error details below). We tried several times, and the same test pipeline always fails once the table count goes beyond 750; more precisely, in 4 runs it failed after creating 723, 724, 725, and 730 tables respectively.

namespace = "sysbench-cdc-20241111-154325"
dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&from=1731333095031&to=1731334152598&var-datasource=cdtasocg64074c&var-namespace=sysbench-cdc-20241111-135408&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All

the log of the container that creates the CDC tables in RW: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-job-pqrcg%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331846992%22,%22to%22:%221731335263137%22%7D%7D%7D&orgId=1

frontend log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-risingwave-frontend-f-8644f6b6b5-t69b4%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331625799%22,%22to%22:%221731335351172%22%7D%7D%7D&orgId=1

meta-node log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-risingwave-meta-m-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331625799%22,%22to%22:%221731335351172%22%7D%7D%7D&orgId=1

metastore-pg log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22metastore-postgresql-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331846992%22,%22to%22:%221731335263137%22%7D%7D%7D&orgId=1

After 725 tables, we encounter the following error in the container that creates the CDC tables:

  2024-11-11 22:03:27.759  CREATE_TABLE
  2024-11-11 22:03:28.462  CREATE_TABLE
  2024-11-11 22:03:29.226  CREATE_TABLE
  2024-11-11 22:03:29.911  CREATE_TABLE
  2024-11-11 22:03:30.396  ERROR:  Failed to run the query
  2024-11-11 22:03:30.396  Caused by these errors (recent errors listed first):
  2024-11-11 22:03:30.396    1: gRPC request to meta service failed: Unknown error
  2024-11-11 22:03:30.396    2: transport error
  2024-11-11 22:03:30.396    3: connection error
  2024-11-11 22:03:30.396    4: stream closed because of a broken pipe
  2024-11-11 22:03:30.475  server closed the connection unexpectedly
  2024-11-11 22:03:30.475  This probably means the server terminated abnormally
  2024-11-11 22:03:30.475  before or while processing the request.
  2024-11-11 22:03:30.475  connection to server was lost

At about the same time, the frontend node logs:

  2024-11-11T14:03:29.694563241Z  INFO pgwire::pg_server: accept connection peer_addr=10.0.84.72:37340
  2024-11-11T14:03:29.694616263Z  INFO pgwire::pg_server: accept connection peer_addr=10.0.84.72:37356
  2024-11-11T14:03:30.396522113Z  WARN risingwave_common_service::observer_manager: Receives meta's notification err error=status: Unknown, message: "h2 protocol error: error reading a body from connection", details: [], metadata: MetadataMap { headers: {} }: error reading a body from connection: stream closed because of a broken pipe
  2024-11-11T14:03:30.396657655Z  WARN risingwave_rpc_client::meta_client: refresh meta member client failed error=gRPC request to meta service failed: The operation was cancelled: transport error: operation was canceled: connection closed
  2024-11-11T14:03:30.396690303Z  WARN handle_query{mode="simple query" session_id=1 sql=CREATE TABLE sbtest726 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING) FROM sbtest TABLE 'public.sbtest726'}: risingwave_rpc_client::meta_client: force refresh meta client failed error=gRPC request to meta service failed: The operation was cancelled: transport error: operation was canceled: connection closed
  2024-11-11T14:03:30.396783079Z ERROR handle_query{mode="simple query" session_id=1 sql=CREATE TABLE sbtest726 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING) FROM sbtest TABLE 'public.sbtest726'}: pgwire::pg_protocol: error when process message error=Failed to run the query: gRPC request to meta service failed: Unknown error: transport error: connection error: stream closed because of a broken pipe
  2024-11-11T14:03:30.451651268Z  INFO risingwave_rt: received SIGTERM, shutting down...
  2024-11-11T14:03:30.475595628Z  WARN risingwave_rpc_client::meta_client: failed to unregister from meta service error=gRPC request to meta service failed: The service is currently unavailable: transport error: dns error: failed to lookup address information: Name or service not known worker_id=1

But there is no error in the meta node's log.

And looking at the dashboard:
[screenshot SCR-20241112-lkf: Grafana dev dashboard]

Everything seems OK: the CN's memory usage is below its 13 GB limit, kubectl describe for the pipeline's pods does not show any component being OOM-killed or restarted, and the meta node and frontend node are both quite idle.
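
For reference, a minimal sketch of the kind of check described above, assuming kubectl access to the benchmark namespace (the namespace and pod name are taken from the links above but may differ per run):

```bash
# Minimal sketch; namespace and pod name may differ per run.
NS=sysbench-cdc-20241111-154325

# Restart counts and phase for every component in the namespace.
kubectl -n "$NS" get pods

# Inspect a specific pod for OOMKilled terminations or restart events.
kubectl -n "$NS" describe pod benchmark-risingwave-meta-m-0 \
  | grep -iE 'oomkilled|restart|last state|reason'
```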

The github-actions bot added this to the release-2.2 milestone on Nov 12, 2024.
lmatz commented Nov 18, 2024

False alarm.

The disconnection was caused by the infrastructure, not by RW.
However, increasing the number of CDC tables in the test triggers repeated CN OOMs and thus a CrashLoopBackOff.

Will open another issue.

lmatz closed this as completed on Nov 18, 2024.