In a series of tests, we found that if "${BENCHMARK_SYSBENCH_TABLES}" <= 700 the test passes, but if it is >= 750 the test fails (error details below). We tried several times, and the same test pipeline always fails once the setting goes beyond 750; more precisely, in 4 runs it failed at 723, 724, 725, and 730 tables respectively.
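For context, each CDC table in the test is created with a statement of the following shape. This is a minimal sketch assuming a shared Postgres CDC source: the source name `sbtest` and the column list come from the frontend log below, while the connection properties are placeholders.

```sql
-- Shared Postgres CDC source (connection properties are placeholders).
CREATE SOURCE sbtest WITH (
    connector = 'postgres-cdc',
    hostname = '<pg-host>',
    port = '5432',
    username = '<user>',
    password = '<password>',
    database.name = '<db>',
    schema.name = 'public'
);

-- One CREATE TABLE per sysbench table; in the failing runs the error
-- shows up around the 723rd-730th statement.
CREATE TABLE sbtest726 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING)
FROM sbtest TABLE 'public.sbtest726';
```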
After 725 tables, we encounter the following error from the container that creates the CDC tables:
```
2024-11-11 22:03:27.759  CREATE_TABLE
2024-11-11 22:03:28.462  CREATE_TABLE
2024-11-11 22:03:29.226  CREATE_TABLE
2024-11-11 22:03:29.911  CREATE_TABLE
2024-11-11 22:03:30.396  ERROR: Failed to run the query
2024-11-11 22:03:30.396
2024-11-11 22:03:30.396  Caused by these errors (recent errors listed first):
2024-11-11 22:03:30.396    1: gRPC request to meta service failed: Unknown error
2024-11-11 22:03:30.396    2: transport error
2024-11-11 22:03:30.396    3: connection error
2024-11-11 22:03:30.396    4: stream closed because of a broken pipe
2024-11-11 22:03:30.396
2024-11-11 22:03:30.475  server closed the connection unexpectedly
2024-11-11 22:03:30.475  This probably means the server terminated abnormally
2024-11-11 22:03:30.475  before or while processing the request.
2024-11-11 22:03:30.475  connection to server was lost
```
At about the same time, the frontend node log shows:
```
2024-11-11T14:03:29.694563241Z INFO pgwire::pg_server: accept connection peer_addr=10.0.84.72:37340
2024-11-11T14:03:29.694616263Z INFO pgwire::pg_server: accept connection peer_addr=10.0.84.72:37356
2024-11-11T14:03:30.396522113Z WARN risingwave_common_service::observer_manager: Receives meta's notification err error=status: Unknown, message: "h2 protocol error: error reading a body from connection", details: [], metadata: MetadataMap { headers: {} }: error reading a body from connection: stream closed because of a broken pipe
2024-11-11T14:03:30.396657655Z WARN risingwave_rpc_client::meta_client: refresh meta member client failed error=gRPC request to meta service failed: The operation was cancelled: transport error: operation was canceled: connection closed
2024-11-11T14:03:30.396690303Z WARN handle_query{mode="simple query" session_id=1 sql=CREATE TABLE sbtest726 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING) FROM sbtest TABLE 'public.sbtest726'}: risingwave_rpc_client::meta_client: force refresh meta client failed error=gRPC request to meta service failed: The operation was cancelled: transport error: operation was canceled: connection closed
2024-11-11T14:03:30.396783079Z ERROR handle_query{mode="simple query" session_id=1 sql=CREATE TABLE sbtest726 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING) FROM sbtest TABLE 'public.sbtest726'}: pgwire::pg_protocol: error when process message error=Failed to run the query: gRPC request to meta service failed: Unknown error: transport error: connection error: stream closed because of a broken pipe
2024-11-11T14:03:30.451651268Z INFO risingwave_rt: received SIGTERM, shutting down...
2024-11-11T14:03:30.475595628Z WARN risingwave_rpc_client::meta_client: failed to unregister from meta service error=gRPC request to meta service failed: The service is currently unavailable: transport error: dns error: failed to lookup address information: Name or service not known worker_id=1
```
But there is no error in the meta node's log.
And looking at the dashboard:
Everything seems OK: the CN's memory usage is below its 13 GB limit, and the pipeline's `kubectl describe` does not show any component being OOM-killed or restarted. The meta node and frontend node are both quite idle.
Even if we create multiple publications in PG and then create the same number of sources in RW, to amortize the CDC tables across all the sources, we still fail to create more than 713 CDC tables.
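A minimal sketch of that amortization, assuming the sysbench tables in PG are split across publications and each publication backs its own CDC source in RW (all names and the split point are illustrative):

```sql
-- In PostgreSQL: split the sysbench tables across several publications.
CREATE PUBLICATION sbtest_pub_1 FOR TABLE sbtest1, sbtest2 /* ..., sbtest400 */;
CREATE PUBLICATION sbtest_pub_2 FOR TABLE sbtest401, sbtest402 /* ..., sbtest800 */;

-- In RisingWave: one CDC source per publication (connection properties are
-- placeholders), then spread the CDC tables across the sources.
CREATE SOURCE sbtest_src_1 WITH (
    connector = 'postgres-cdc',
    hostname = '<pg-host>',
    port = '5432',
    username = '<user>',
    password = '<password>',
    database.name = '<db>',
    schema.name = 'public',
    publication.name = 'sbtest_pub_1'
);

CREATE TABLE sbtest1 (id INT PRIMARY KEY, k INT, c CHARACTER VARYING, pad CHARACTER VARYING)
FROM sbtest_src_1 TABLE 'public.sbtest1';
```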
The disconnection is caused by the infrastructure, not by RW.
However, increasing the number of CDC tables in the test triggers repeated CN OOMs and thus a CrashLoopBackOff.
https://buildkite.com/risingwave-test/sysbench-cdc/builds/803#01931b7f-fc3b-4298-9460-20405270f932/480
The test uses `sysbench` as the data generator.
namespace = "sysbench-cdc-20241111-154325"
dashboard: https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&from=1731333095031&to=1731334152598&var-datasource=cdtasocg64074c&var-namespace=sysbench-cdc-20241111-135408&var-instance=benchmark-risingwave&var-pod=All&var-component=All&var-table=All
the log of the container that creates CDC table in RW: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-job-pqrcg%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331846992%22,%22to%22:%221731335263137%22%7D%7D%7D&orgId=1
frontend log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-risingwave-frontend-f-8644f6b6b5-t69b4%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331625799%22,%22to%22:%221731335351172%22%7D%7D%7D&orgId=1
meta-node log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22benchmark-risingwave-meta-m-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331625799%22,%22to%22:%221731335351172%22%7D%7D%7D&orgId=1
metastore-pg log: https://grafana.test.risingwave-cloud.xyz/explore?schemaVersion=1&panes=%7B%22mvw%22:%7B%22datasource%22:%22edw30bp59yccgb%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22sysbench-cdc-20241111-135408%5C%22,%20pod%3D%5C%22metastore-postgresql-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22edw30bp59yccgb%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221731331846992%22,%22to%22:%221731335263137%22%7D%7D%7D&orgId=1