Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[clickhouse] Long running QA tests for replicated cluster #6953

Open
Tracked by #5999
karencfv opened this issue Oct 30, 2024 · 0 comments
Open
Tracked by #5999

[clickhouse] Long running QA tests for replicated cluster #6953

karencfv opened this issue Oct 30, 2024 · 0 comments
Assignees
Labels
clickhouse Related to the ClickHouse metrics DBMS Testing & Analysis Tests & Analyzers

Comments

@karencfv
Copy link
Contributor

Overview

As part of the work to roll out replicated ClickHouse we'll be needing some long running testing to ensure stability of the replicated cluster. Specifically, we'll be wanting to know how stable the replicated cluster is when left alone for a while under load (i.e., days, weeks).

We'll need to monitor the system and answer the following questions periodically during a long period of time (a month or so?):

  • Is data consistent across all replicas?
  • Is query performance acceptable under load?
  • Are the queue lengths acceptable under load?
  • Do queue lengths grow over time or are they consistent depending on the load?
  • <more?>

Implementation

We'll probably want to use clickhouse-admin to extract information from the system. There is a clickhouse-admin binary already installed in each of the clickhouse-{server|keeper} nodes.

To retrieve the information we need, we can leverage the following native ClickHouse tooling:

Altinity has a pretty cool ClickHouse stress test suite. We can probably use it for running stress tests, or take inspiration from it to create our own stress tests.

@karencfv karencfv added clickhouse Related to the ClickHouse metrics DBMS Testing & Analysis Tests & Analyzers labels Oct 30, 2024
@karencfv karencfv self-assigned this Oct 30, 2024
karencfv added a commit that referenced this issue Nov 6, 2024
## Overview

This commit implements a new clickhouse-admin endpoint to retrieve and
parse information from
`[system.distributed_ddl_queue](https://clickhouse.com/docs/en/operations/system-tables/distributed_ddl_queue)`.

## Purpose

As part of [stage 1](https://rfd.shared.oxide.computer/rfd/0468) of
rolling out replicated ClickHouse, we'll be needing to monitor the
installed but not visible replicated cluster on our dogfood rack. This
endpoint will aid in said monitoring by providing information about
distributed ddl queries.

## Testing

```console
$ cargo run --bin=clickhouse-admin-server -- run -c ./smf/clickhouse-admin-server/config.toml -a [::1]:8888 -l [::1]:22001 -b /Users/karcar/src/omicron/out/clickhouse/clickhouse
   Compiling omicron-clickhouse-admin v0.1.0 (/Users/karcar/src/omicron/clickhouse-admin)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 3.79s
     Running `target/debug/clickhouse-admin-server run -c ./smf/clickhouse-admin-server/config.toml -a '[::1]:8888' -l '[::1]:22001' -b /Users/karcar/src/omicron/out/clickhouse/clickhouse`
note: configured to log to "/dev/stdout"
{"msg":"listening","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:41.746887Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:197"}
{"msg":"accepted connection","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:55.674044Z","hostname":"ixchel","pid":55286,"local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:1105","remote_addr":"[::1]:59735"}
{"msg":"Retrieved data from `system.distributed_ddl_queue`","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148241Z","hostname":"ixchel","pid":55286,"component":"ClickhouseCli","file":"clickhouse-admin/types/src/lib.rs:1018","output":"\"{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"3\\\"}\\n{\\\"entry\\\":\\\"query-0000000001\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:17:08\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22002,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n{\\\"entry\\\":\\\"query-0000000000\\\",\\\"entry_version\\\":5,\\\"initiator_host\\\":\\\"ixchel\\\",\\\"initiator_port\\\":22001,\\\"cluster\\\":\\\"oximeter_cluster\\\",\\\"query\\\":\\\"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster\\\",\\\"settings\\\":{\\\"load_balancing\\\":\\\"random\\\",\\\"output_format_json_quote_64bit_integers\\\":\\\"0\\\"},\\\"query_create_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"host\\\":\\\"::1\\\",\\\"port\\\":22001,\\\"status\\\":\\\"Finished\\\",\\\"exception_code\\\":0,\\\"exception_text\\\":\\\"\\\",\\\"query_finish_time\\\":\\\"2024-11-01 16:16:45\\\",\\\"query_duration_ms\\\":\\\"4\\\"}\\n\""}
{"msg":"request completed","v":0,"name":"clickhouse-admin-server","level":30,"time":"2024-11-04T07:29:56.148459Z","hostname":"ixchel","pid":55286,"uri":"/distributed-ddl-queue","method":"GET","req_id":"fb46b182-2573-4daa-a791-118dad20a3c3","remote_addr":"[::1]:59735","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.12.0/src/server.rs:950","latency_us":473916,"response_code":"200"}
```

```console
$ curl http://[::1]:8888/distributed-ddl-queue
[{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"3"},{"entry":"query-0000000001","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID '701a3dd3-10f0-4f5d-b5b2-0ad11bcf2b17' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random"},"query_create_time":"2024-11-01 16:17:08","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:17:08","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22002,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"},{"entry":"query-0000000000","entry_version":5,"initiator_host":"ixchel","initiator_port":22001,"cluster":"oximeter_cluster","query":"CREATE DATABASE IF NOT EXISTS db1 UUID 'a49757e4-179e-42bd-866f-93ac43136e2d' ON CLUSTER oximeter_cluster","settings":{"load_balancing":"random","output_format_json_quote_64bit_integers":"0"},"query_create_time":"2024-11-01 16:16:45","host":"::1","port":22001,"status":"Finished","exception_code":0,"exception_text":"","query_finish_time":"2024-11-01 16:16:45","query_duration_ms":"4"}]
```

### Caveat

I have purposely not written an integration test as the port situation
is getting out of hand 😅 I'm working on a better implementation of
running these integration tests that involves less hardcoded ports. I
will include an integration test for this endpoint in that PR

Related: #6953
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clickhouse Related to the ClickHouse metrics DBMS Testing & Analysis Tests & Analyzers
Projects
None yet
Development

No branches or pull requests

1 participant