Testing

Approaches

Test-driven development. Relevant etcd raft tests have been ported to Dragonboat to ensure that all corner cases identified by the etcd project are handled.
High test coverage. Extensively tested by unit testing and monkey testing.
Linearizability checkers. Jepsen's Knossos and porcupine are used to check whether I/Os are linearizable.
Fuzz testing using go-fuzz.
I/O error injection tests. charybdefs from scylladb is used to inject I/O errors into the underlying file-system to ensure that Dragonboat handles them correctly.
Power loss tests. We test the system to see what actually happens after power loss.

Monkey Testing

Setup

5 NodeHosts and 3 Drummer servers per process
hundreds of Raft clusters per process
randomly kill and restart NodeHosts and Drummer servers, each NodeHost usually stays online for a few minutes
randomly delete all data owned by a certain NodeHost to emulate permanent disk failure
randomly drop and re-order messages exchanged between NodeHosts
randomly partition NodeHosts from the rest of the network
for selected instances, snapshotting and log compaction happen all the time in the background
committed entries are applied with random delays
snapshots are captured and applied with random delays
a set of background workers keeps writing to and reading from random Raft clusters with stale read checks
client activity history files are verified by linearizability checkers such as Jepsen's Knossos
run hundreds of the processes described above concurrently on each test server, 30 minutes per iteration, many iterations every night
run concurrently on many servers every night
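
The stale read check mentioned in the list above could be sketched roughly as follows. This is a minimal illustration, not Dragonboat's actual test code: it assumes each worker writes monotonically increasing counter values, and the names StaleReadChecker, WriteAcked, ReadFloor and CheckRead are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// StaleReadChecker (hypothetical) remembers the largest counter value whose
// write has been acknowledged. A read that starts after that acknowledgement
// must observe a value at least that large; anything smaller is stale.
type StaleReadChecker struct {
	mu       sync.Mutex
	maxAcked uint64
}

// WriteAcked records that a write of value v has been acknowledged.
func (c *StaleReadChecker) WriteAcked(v uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v > c.maxAcked {
		c.maxAcked = v
	}
}

// ReadFloor returns the smallest value that a read started now may return.
// Call it before issuing the read.
func (c *StaleReadChecker) ReadFloor() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.maxAcked
}

// CheckRead reports whether the observed value respects the floor captured
// before the read was issued.
func CheckRead(floor, observed uint64) bool {
	return observed >= floor
}

func main() {
	c := &StaleReadChecker{}
	c.WriteAcked(5) // a write of 5 has completed

	floor := c.ReadFloor()           // captured before the read starts
	fmt.Println(CheckRead(floor, 5)) // true: the read saw the latest write
	fmt.Println(CheckRead(floor, 3)) // false: the read returned stale data
}
```

A check of this kind only catches reads that go backwards relative to acknowledged writes; full linearizability checking is left to tools such as Knossos.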

Checks

no linearizability violation
no cluster is permanently stuck
state machines must be in sync
cluster membership must be consistent
raft log saved in LogDB must be consistent
no zombie cluster node
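
As an illustration of the "state machines must be in sync" check, the sketch below compares a deterministic hash of each replica's state. It assumes, purely for illustration, that the state is a string-to-string map; stateHash and inSync are hypothetical helpers and do not describe how Dragonboat's tests are actually implemented.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// stateHash returns a deterministic SHA-256 digest of a key-value state.
// Keys are sorted first because Go map iteration order is random, and each
// field is length-prefixed so that ("ab","c") and ("a","bc") hash differently.
func stateHash(kv map[string]string) [32]byte {
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	var lenBuf [8]byte
	writeField := func(s string) {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(s)))
		h.Write(lenBuf[:])
		h.Write([]byte(s))
	}
	for _, k := range keys {
		writeField(k)
		writeField(kv[k])
	}

	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// inSync reports whether all replicas hold identical state.
func inSync(replicas []map[string]string) bool {
	if len(replicas) == 0 {
		return true
	}
	first := stateHash(replicas[0])
	for _, r := range replicas[1:] {
		if stateHash(r) != first {
			return false
		}
	}
	return true
}

func main() {
	a := map[string]string{"k1": "v1", "k2": "v2"}
	b := map[string]string{"k2": "v2", "k1": "v1"}
	c := map[string]string{"k1": "v1", "k2": "other"}

	fmt.Println(inSync([]map[string]string{a, b}))    // true
	fmt.Println(inSync([]map[string]string{a, b, c})) // false
}
```

Comparing digests rather than full states keeps the check cheap when hundreds of clusters have to be verified at the end of each iteration.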

Results

Some history files in Jepsen's Knossos edn format have been made publicly available.

Benchmark

Setup

Three servers, each with a single 22-core Intel Xeon E5-2696v4 processor; all cores can boost to 2.8GHz
40GE Mellanox NIC
Intel 900P for storing RocksDB's WAL and Intel P3700 1.6TB for storing all other data
Ubuntu 16.04 with Spectre and Meltdown patches, ext4 file-system

Benchmark method

48 Raft clusters on three NodeHost instances across three servers
Each Raft node is backed by an in-memory Key-Value data store as RSM
Mostly update operations in the Key-Value store
All I/O requests are launched from local processes
Each request is handled in its own goroutine, which gives a simple threading model and makes applications easier to debug
fsync is strictly honored
MutualTLS is disabled

Intel Optane SSD

Compared with enterprise NVMe SSDs such as the Intel P3700, an Optane-based SSD doesn't increase throughput when the payload is 16 or 128 bytes. It does slightly increase throughput when the payload size is 1024 bytes. It also improves write latency when the payload size is 1024 bytes.
