
bug: frontend node somehow disconnects from meta node when creating multiple CDC tables of a single CDC source #19349

Closed
lmatz opened this issue Nov 12, 2024 · 2 comments
lmatz commented Nov 12, 2024

https://buildkite.com/risingwave-test/sysbench-cdc/builds/803#01931b7f-fc3b-4298-9460-20405270f932/480

The test uses sysbench as the data generator.

Steps:

  1. The pipeline first creates "${BENCHMARK_SYSBENCH_TABLES}" tables in PG: https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/prepare.template.yaml#L25-L44
  2. Then the pipeline creates one CDC source: https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/start.template.yaml#L54-L62
  3. Then the pipeline uses the same CDC source to create "${BENCHMARK_SYSBENCH_TABLES}" CDC tables in RW (see the DDL sketch after this list): https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/start.template.yaml#L54-L62
  4. Finally, the pipeline inserts data into PG: https://github.com/risingwavelabs/kube-bench/blob/main/manifests/benchmarks/sysbench-pg-cdc/start.template.yaml#L109-L129
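
For reference, steps 2 and 3 boil down to DDL along the following lines, issued against the RW frontend. This is a minimal sketch only: the host names, credentials, and connector options are placeholders rather than values from the linked manifests, while the per-table CREATE TABLE statement matches the one visible in the frontend log below.

```bash
#!/usr/bin/env bash
# Minimal sketch of steps 2-3. Host names, credentials, and connector options
# are placeholders, not taken from the linked manifests.
set -euo pipefail

RW_PSQL="psql -h benchmark-risingwave-frontend -p 4566 -U root -d dev"

# Step 2: one shared CDC source pointing at the upstream Postgres instance.
$RW_PSQL -c "CREATE SOURCE sbtest WITH (
    connector = 'postgres-cdc',
    hostname = 'sysbench-postgresql',
    port = '5432',
    username = 'postgres',
    password = 'postgres',
    database.name = 'sbtest',
    schema.name = 'public'
);"

# Step 3: one CDC table per upstream table, all sharing the source above.
# The column list matches the statement seen in the frontend log.
for i in $(seq 1 "${BENCHMARK_SYSBENCH_TABLES}"); do
    $RW_PSQL -c "CREATE TABLE sbtest${i} (
        id INT PRIMARY KEY,
        k INT,
        c CHARACTER VARYING,
        pad CHARACTER VARYING
    ) FROM sbtest TABLE 'public.sbtest${i}';"
done
```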

In a series of tests, we found that if "${BENCHMARK_SYSBENCH_TABLES}" <= 700, the test passes, but if >= 750, it fails (error details below). We tried several times, and the same test pipeline always fails once the table count goes beyond 750; more precisely, in 4 runs it failed after creating 723, 724, 725, and 730 tables respectively.

namespace = "sysbench-cdc-20241111-154325"
dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&from=1731333095031&to=1731334152598&var-datasource=cdtasocg64074c&var-namespace=sysbench-cdc-20241111-135408&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All

the log of the container that creates the CDC tables in RW: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-job-pqrcg%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331846992%22,%22to%22:%221731335263137%22%7D%7D%7D&orgId=1

frontend log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-risingwave-frontend-f-8644f6b6b5-t69b4%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331625799%22,%22to%22:%221731335351172%22%7D%7D%7D&orgId=1

meta-node log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-risingwave-meta-m-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331625799%22,%22to%22:%221731335351172%22%7D%7D%7D&orgId=1

metastore-pg log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22metastore-postgresql-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331846992%22,%22to%22:%221731335263137%22%7D%7D%7D&orgId=1

After 725 tables, we encounter the following error in the container that creates the CDC tables:

  2024-11-11 22:03:27.759  CREATE_TABLE
  2024-11-11 22:03:28.462  CREATE_TABLE
  2024-11-11 22:03:29.226  CREATE_TABLE
  2024-11-11 22:03:29.911  CREATE_TABLE
  2024-11-11 22:03:30.396  ERROR:  Failed to run the query
  2024-11-11 22:03:30.396  Caused by these errors (recent errors listed first):
  2024-11-11 22:03:30.396    1: gRPC request to meta service failed: Unknown error
  2024-11-11 22:03:30.396    2: transport error
  2024-11-11 22:03:30.396    3: connection error
  2024-11-11 22:03:30.396    4: stream closed because of a broken pipe
  2024-11-11 22:03:30.475  server closed the connection unexpectedly
  2024-11-11 22:03:30.475  This probably means the server terminated abnormally
  2024-11-11 22:03:30.475  before or while processing the request.
  2024-11-11 22:03:30.475  connection to server was lost

At about the same time, the frontend node logs:

  2024-11-11T14:03:29.694563241Z  INFO pgwire::pg_server: accept connection peer_addr=10.0.84.72:37340
  2024-11-11T14:03:29.694616263Z  INFO pgwire::pg_server: accept connection peer_addr=10.0.84.72:37356
  2024-11-11T14:03:30.396522113Z  WARN risingwave_common_service::observer_manager: Receives meta's notification err error=status: Unknown, message: "h2 protocol error: error reading a body from connection", details: [], metadata: MetadataMap { headers: {} }: error reading a body from connection: stream closed because of a broken pipe
  2024-11-11T14:03:30.396657655Z  WARN risingwave_rpc_client::meta_client: refresh meta member client failed error=gRPC request to meta service failed: The operation was cancelled: transport error: operation was canceled: connection closed
  2024-11-11T14:03:30.396690303Z  WARN handle_query{mode="simple query" session_id=1 sql=CREATE TABLE sbtest726 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING) FROM sbtest TABLE 'public.sbtest726'}: risingwave_rpc_client::meta_client: force refresh meta client failed error=gRPC request to meta service failed: The operation was cancelled: transport error: operation was canceled: connection closed
  2024-11-11T14:03:30.396783079Z ERROR handle_query{mode="simple query" session_id=1 sql=CREATE TABLE sbtest726 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING) FROM sbtest TABLE 'public.sbtest726'}: pgwire::pg_protocol: error when process message error=Failed to run the query: gRPC request to meta service failed: Unknown error: transport error: connection error: stream closed because of a broken pipe
  2024-11-11T14:03:30.451651268Z  INFO risingwave_rt: received SIGTERM, shutting down...
  2024-11-11T14:03:30.475595628Z  WARN risingwave_rpc_client::meta_client: failed to unregister from meta service error=gRPC request to meta service failed: The service is currently unavailable: transport error: dns error: failed to lookup address information: Name or service not known worker_id=1

But there is no error in the meta node's log.

And looking at the dashboard:
[screenshot SCR-20241112-lkf: Grafana dev dashboard]

Everything seems OK: the CN's memory usage is below its 13 GB limit, kubectl describe for the pipeline's pods does not show any component being OOM-killed or restarted, and the meta node and frontend node are both quite idle.
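
For reference, a minimal sketch of the kind of check described above, assuming kubectl access to the benchmark namespace (the namespace and pod name are taken from the links above but may differ per run):

```bash
# Minimal sketch; namespace and pod name may differ per run.
NS=sysbench-cdc-20241111-154325

# Restart counts and phase for every component in the namespace.
kubectl -n "$NS" get pods

# Inspect a specific pod for OOMKilled terminations or restart events.
kubectl -n "$NS" describe pod benchmark-risingwave-meta-m-0 \
  | grep -iE 'oomkilled|restart|last state|reason'
```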

The github-actions bot added this to the release-2.2 milestone on Nov 12, 2024.
lmatz commented Nov 18, 2024

False alarm.

The disconnection was caused by the infrastructure, not by RW.
However, increasing the number of CDC tables in the test triggers repeated CN OOMs and thus a CrashLoopBackOff.

Will open another issue.

lmatz closed this as completed on Nov 18, 2024.