[Segment Replication + Remote Store] GA test planning #8109

Closed · mch2 opened this issue Jun 16, 2023 · 6 comments

Labels: enhancement, Indexing:Replication, Storage, v2.10.0

mch2 commented Jun 16, 2023

This issue contains a running list of testing required in preparation for the GA release of segment replication with remote storage.

Objective

  1. Show trade-offs when compared to document replication.
  2. Show trade-offs when compared to segment node-node replication.
  3. Learn break points & bottlenecks depending on cluster configuration/workload.

General Hypothesis

  1. Segrep + remote store (S&R) will achieve better ingest throughput, search latencies, and replication lag times as replica count grows. This is attributed to primary shards not incurring an incremental network cost per replica.
  2. S&R will see higher primary refresh times than node-node replication. This is attributed to primary shards synchronously uploading to the remote store in the refresh path.
  3. Overloaded clusters should trigger remote store and/or segment replication backpressure.
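For context, a minimal sketch of the S&R index configuration these hypotheses assume. Setting names reflect the 2.x experimental remote store settings and may differ by version; `my-segment-repo` is a placeholder repository name, not a prescribed value.

```bash
# Sketch only: create an index with segment replication and remote-store-backed segments.
# Assumes a repository named "my-segment-repo" is already registered on the cluster;
# exact remote store setting names vary across 2.x releases.
curl -XPUT "http://localhost:9200/test-index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 10,
    "index.number_of_replicas": 1,
    "index.replication.type": "SEGMENT",
    "index.remote_store.enabled": true,
    "index.remote_store.repository": "my-segment-repo"
  }
}'
```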

Performance Benchmark Plan

Each use case listed below should also be performed with docrep and node-node segrep, and where possible added to our nightly runs on https://opensearch.org/benchmarks for easy ongoing comparison.

Metrics that should be captured in addition to what OSB reports:

  • Replication lag - Add to OSB.
  • Network Throughput
  • IOPS
  • Thread pool stats - I think this may already be an optional return from OSB.
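A rough sketch of where these numbers could be pulled from outside of OSB. The endpoints below exist in 2.x, but the exact response fields (and whether replication lag is surfaced per shard or per node) may vary by version.

```bash
# Replication lag and bytes behind, per replica shard.
curl -s "http://localhost:9200/_cat/segment_replication?v"

# Thread pool queues/rejections across nodes.
curl -s "http://localhost:9200/_nodes/stats/thread_pool?pretty"

# Network throughput and IOPS are OS-level; sample them on each node, e.g.:
iostat -xm 5
```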

All clusters should have 3 dedicated cluster manager nodes.
Small cluster = ~3 nodes
Large cluster = ~40 nodes
Use m5.xlarge node type for consistency.

| Use case | Workload(s) | Variations | Notes | Why | Assignee |
| --- | --- | --- | --- | --- | --- |
| Ingest append-only | SO | Small cluster (3 nodes, 10 shards), 1 replica case and max replica case | Small cluster here is equivalent to existing nightlies. | General ingest performance | |
| Update heavy / large dataset | http_logs | Large cluster (~10 GB per shard), 1 replica case & max replica case | | Validate assertions that segrep performs even better in update/delete cases. | |
| | http_logs | Large cluster (~50 GB per shard), 1 replica case & max replica case | | Comparison as shards grow in size | |
| | http_logs | Small cluster (~10 GB per shard), 1 replica case & max replica case | | | |
| | http_logs | Small cluster (50 GB per shard), 1 replica case & max replica case | | | |
| Concurrent ingest/search with deep replica counts | TBD | | No OSB workload exists today that concurrently performs ingest & search. | Validate hypothesis 1 with consistent ingest/search traffic | |
| Zero replicas configured | any | | | Ensure no downgrade to docrep. | |
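As a hedged sketch, one of the http_logs rows above could be driven with OSB against a pre-provisioned S&R cluster roughly as follows; the endpoint, client count, and replica count are placeholders, not prescribed values.

```bash
# benchmark-only skips provisioning and runs the workload against an existing cluster.
opensearch-benchmark execute-test \
  --workload=http_logs \
  --pipeline=benchmark-only \
  --target-hosts=my-cluster-endpoint:9200 \
  --workload-params="number_of_replicas:1,bulk_indexing_clients:8" \
  --client-options="timeout:60"
```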

Additional tests (nice to have):

  • Longevity testing - longer running test with a large cluster.
  • Architecture comparison (x86 vs. arm)

Feature Stability and Testing

This is a list of specific test cases that should be performed either through existing/new ITs or manually.
We are working on enabling S+R across our entire :server test suite as part of this issue, but are hitting challenges, so in the meantime this is a running list of the most critical cases.

| Feature / Type | Test Case | Notes |
| --- | --- | --- |
| Backpressure | | |
| | Primary slow to upload | |
| | Replica slow to fetch | |
| | Slow checkpoint publish | |
| Relocation | | |
| | Primary-Primary | |
| | Replica or Primary to Replica | |
| Shard events | | |
| | Delete index | |
| | Replica removal | |
| | Replica add | |
| | Primary drop | |
| | Replica drop | |
| | Replica restart | |
| | Primary drop | |
| | Trigger a drop during segment download | |
| Segrep Stats API | | |
| | All metrics present | |
| Mem leak / heap analysis | | This would be manual; fetch a heap dump for analysis. |
| | Constant calls to segrep stats API | |
| | Before/after replication events on primary | |
| | Before/after replication events on replica | |
| BWC | | |
| | Test node-node for no breakage. Add new tests in next release for remote store. | |
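For the mem leak / heap analysis rows, a minimal sketch of the manual capture loop, assuming jcmd from the JDK is available on the node; the PID and output paths are placeholders.

```bash
# Baseline heap dump before driving load.
jcmd <opensearch-pid> GC.heap_dump /tmp/heap-before.hprof

# Hammer the segrep stats API while replication events are in flight.
for i in $(seq 1 1000); do
  curl -s "http://localhost:9200/_cat/segment_replication?v" > /dev/null
done

# Second dump after replication events; diff the two in MAT or a similar tool.
jcmd <opensearch-pid> GC.heap_dump /tmp/heap-after.hprof
```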

Existing test packages to run with SR enabled in :server:

  • cluster
  • gateway
  • index
  • indexing
  • indices
  • ingest
  • recovery
  • remotestore
  • update
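A sketch of running one of these packages' integration tests from the repo root; the test filter patterns are illustrative, and forcing SEGMENT replication as the suite-wide default is the open work mentioned above.

```bash
# Run :server internal cluster tests for a single package, e.g. recovery.
./gradlew ':server:internalClusterTest' --tests "org.opensearch.recovery.*"

# Or target the segment replication ITs directly.
./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.*"
```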


mch2 commented Jun 20, 2023

Added an issue to opensearch-benchmark for adding replication lag metric - opensearch-project/opensearch-benchmark#339. This may require moving the lag metric to node stats similar to other captured metrics.

@ankitkala

  • Definitely agree with adding a telemetry device in OSB to capture the replication lag. Here is a similar change we did for CCR lag.

> Network Throughput
> IOPS
> Thread pool stats - I think this may already be an optional return from OSB.

These metrics can be published to a different cluster by OSB, but I don't think they are part of the test result.
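For reference, a sketch of the OSB datastore configuration that would ship those per-sample metrics to an external cluster. The section and key names are recalled from ~/.benchmark/benchmark.ini and may differ by OSB version; the host and credentials are placeholders.

```bash
# Sketch: append a results-publishing datastore to OSB's config (verify keys for your OSB version).
cat >> ~/.benchmark/benchmark.ini <<'EOF'
[results_publishing]
datastore.type = opensearch
datastore.host = metrics-cluster-endpoint
datastore.port = 9200
datastore.secure = true
datastore.user = admin
datastore.password = <password>
EOF
```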


mch2 commented Jun 26, 2023

Splitting this out into individual runs:

Here's what I'm thinking in terms of set up -
All clusters should have 3 dedicated cluster manager nodes.
Small cluster = ~3 nodes
Large cluster = ~40 nodes
Use m5.xlarge node type for consistency.

| Use case | Workload(s) | Variations | Notes | Why | Assignee |
| --- | --- | --- | --- | --- | --- |
| Ingest append-only | SO | Small cluster (3 nodes, 10 shards), 1 replica case and max replica case | Small cluster here is equivalent to existing nightlies. | General ingest performance | |
| Update heavy / large dataset | http_logs | Large cluster (~10 GB per shard), 1 replica case & max replica case | | Validate assertions that segrep performs even better in update/delete cases. | |
| | http_logs | Large cluster (~50 GB per shard), 1 replica case & max replica case | | Comparison as shards grow in size | |
| | http_logs | Small cluster (~10 GB per shard), 1 replica case & max replica case | | | |
| | http_logs | Small cluster (50 GB per shard), 1 replica case & max replica case | | | |
| Concurrent ingest/search with deep replica counts | TBD | | No OSB workload exists today that concurrently performs ingest & search. | Validate hypothesis 1 with consistent ingest/search traffic | |
| Zero replicas configured | any | | | Ensure no downgrade to docrep. | |

@ankitkala

For the small cluster configuration, let's test with shards in multiples of 3 (currently testing with 3, 6, 12, 24). Similar configuration is being used for remote store perf testing as well.


mch2 commented Jun 27, 2023

> Similar configuration is being used for remote store perf testing as well

Where can we track that testing? I expect there is overlap here given node-node replication is not used with remote store?


tlfeng commented Jul 6, 2023

About the performance testing, I created a dedicated issue #8874 to track it.
I will run the performance tests locally to satisfy all the requirements on this page, and I plan to post the steps for locally running the benchmark test script that exists in the opensearch-build repository there.
To run these test scenarios nightly on the CI server, some additional work is needed; I created two issues for that:
opensearch-project/opensearch-benchmark#339 Capture and report replication lag metric
opensearch-project/opensearch-build#3701 Integrate feature of increasing the data size with benchmark test script

@Bukhtawar Bukhtawar added Indexing:Replication Issues and PRs related to core replication framework eg segrep Storage Issues and PRs relating to data and metadata storage labels Jul 27, 2023