
Introduction

Repairs a Cassandra cluster using read repairs. Supports the same options as nodetool repair where possible.

Usage

ic-repair [<option> ...]

Option                         Description
-u,--username                  Cassandra username to connect with
-pw,--password                 Cassandra password to connect with
-h,--host                      Host to connect to. Defaults to localhost
-p,--port                      Port to connect to. Defaults to 9042
-ssl                           Enable connecting with SSL. Uses JSSE.
-f,--file                      File to load/save repair state.
-fresh                         Forces a fresh repair (i.e. no resuming of a previous repair)
-report                        Print report of repair state and exit.
-nosys,--exclude-system        Exclude repair of system keyspaces
-s,--steps                     Steps per token range. Defaults to 1.
-pr,--partitioner-range        Perform partitioner range repair.
-t,--threads <threads>         Maximum number of parallel repairs. Defaults to the number of available processors.
-r,--retry <max_retry>         Maximum number of retries when there are unavailable nodes.
-d,--retry-delay <delay_ms>    Base delay between retries when nodes are unavailable.
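
For example, a typical invocation might connect to one node, limit parallelism, and persist progress to a state file so that an interrupted run can be resumed later (the hostname, credentials, and file name below are placeholders):

ic-repair -h cassandra1.example.com -u cassandra -pw cassandra -t 4 -f repair_state.json

The same state file can then be inspected without running a repair:

ic-repair -f repair_state.json -report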

SSL

The -ssl flag enables connecting with SSL using JSSE. You can configure SSL settings via the standard JSSE system properties.

Example of connecting with SSL:

ic-repair -Djavax.net.ssl.trustStore=/path/to/client.truststore -Djavax.net.ssl.trustStorePassword=password123 -ssl
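
If you do not already have a truststore, one can typically be built from the cluster's certificate with the JDK keytool; the alias, certificate file, and truststore path below are only examples:

keytool -importcert -noprompt -alias cassandra -file cassandra-node.crt -keystore /path/to/client.truststore -storepass password123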

Throttling

Read repairs can be quite intensive on the cluster, so you will want to adjust the maximum number of parallel read repairs with the -t flag. The optimal setting may vary between tables, since the data model and compaction strategy have a large impact on the performance of read repairs. You may therefore want to run the repair for each table separately with different maximum request parameters.
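
As a sketch, a deliberately gentle run on a busy cluster might cap parallelism well below the processor count and keep its own state file (the value of 2 and the file name are illustrative only):

ic-repair -t 2 -f repair_state_table_a.json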

Unavailable nodes

If a node becomes unavailable, the repair application will wait up to max_retry times for the node to become available. It waits delay_ms milliseconds before the first retry and increases this delay exponentially on each subsequent retry. Once the maximum number of retry attempts is reached, the repair is suspended.
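
As a rough worked example, a run started with -r 5 -d 10000 would retry at most five times; assuming the delay doubles on each attempt (the exact growth factor is not documented here), the waits would be roughly 10s, 20s, 40s, 80s and 160s before the repair is suspended:

ic-repair -r 5 -d 10000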

Background

The problem with standard repairs occurs when there are large amounts of inconsistency, as any difference in the Merkle tree requires streaming replicas from all nodes involved, which can lead to:

  • Running out of disk space due to sending multiple replicas
  • Lots of sstables from streaming sstable sections for the inconsistent token range
  • Compactions falling behind from all the sstables being streamed
  • High read latency from an increase in sstables per read
  • High CPU usage from sorting the large number of sstables into buckets

In our experience, we have seen such repairs lead to cluster outages. The aim of this application is to avoid these issues by relying on read repairs, which in comparison just send a mutation with the correct version of the row to the nodes missing it. Additionally, this application supports suspending and resuming the repair, and it can handle nodes going down. This makes it more robust than even tools such as Cassandra Reaper.

Please see https://www.instaclustr.com/support/documentation/announcements/instaclustr-open-source-project-status/ for Instaclustr support status of this project.