Engel A. Sanchez edited this page Jun 11, 2014 · 9 revisions

Overview

This library manages consensus groups over a cluster, where each consensus group (an ensemble) can execute linearizable key/value (K/V) operations over a partition of a key set. riak_ensemble was designed and created to implement strongly consistent K/V operations in Riak, but it uses pluggable backends so it can be used elsewhere. Maintaining data consistency across ensemble peers (the sync operation) is also pluggable and implemented in Riak using its AAE (active anti-entropy) system.

Each ensemble has a leader that coordinates request processing with the other ensemble peers, which follow it. Leadership changes over time if there are failures. Although having a leader is not required for the safety of the Paxos consensus algorithm, it is a very important optimization. A lease system keeps leadership as stable as possible: while the lease is current, no new elections occur, which avoids leader flapping. This is an availability trade-off, because no new leader can emerge until the lease expires, even when the leader has actually failed.
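The lease trade-off can be made concrete with a small model. The following is only a sketch in Python, not riak_ensemble code: the class name, field names and the 10-second lease duration are all assumptions chosen for illustration, and the real library uses its own timers and configuration.

```python
from dataclasses import dataclass


@dataclass
class FollowerLeaseView:
    """A follower's view of the leader lease (hypothetical model)."""

    lease_duration: float          # seconds a lease stays valid after a renewal
    lease_renewed_at: float = 0.0  # time of the last renewal seen from the leader

    def renew(self, now: float) -> None:
        # Called whenever the leader successfully refreshes its lease.
        self.lease_renewed_at = now

    def lease_expires_at(self) -> float:
        return self.lease_renewed_at + self.lease_duration

    def may_start_election(self, now: float) -> bool:
        # No election is allowed while the lease is still current.
        return now >= self.lease_expires_at()


def unavailable_window(view: FollowerLeaseView, leader_failed_at: float) -> float:
    """Seconds between the leader failing and the earliest possible election."""
    return max(0.0, view.lease_expires_at() - leader_failed_at)


view = FollowerLeaseView(lease_duration=10.0)  # 10 s is an arbitrary example value
view.renew(now=100.0)
```

In this model, a leader that fails at time 103 leaves the ensemble without a leader for 7 seconds, since followers refuse to hold an election before time 110. A shorter lease shrinks that window but makes flapping more likely.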

Clients can send K/V requests to an ensemble using the riak_ensemble_client module. Routers help those requests reach the current leader of the ensemble (see riak_ensemble_router). The leader will either coordinate writes with its followers or serve reads directly if it believes its data is up to date*. As much as possible, riak_ensemble tries to have the reply sent straight from the last process that handled the request to the original client, instead of relaying it back up the original request chain.

* A TODO item in the code shows we might need to make a better effort to determine when the leader can serve reads.
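The direct-reply idea described above can be illustrated with a toy model. This is a sketch in Python under stated assumptions, not the library's implementation: `Request`, `router`, `leader` and `client_get` are hypothetical names, a `queue.Queue` stands in for the client's mailbox, and a plain dict stands in for the leader's data.

```python
import queue


class Request:
    """A K/V read request that carries the client's reply address with it."""

    def __init__(self, key, reply_to):
        self.key = key
        self.reply_to = reply_to  # the original client's mailbox


def leader(request, store):
    # Last process to handle the request: reply straight to the client.
    request.reply_to.put(("ok", store.get(request.key)))


def router(request, store):
    # Forward only. The router never sees or relays the reply.
    leader(request, store)


def client_get(key, store):
    mailbox = queue.Queue()
    router(Request(key, mailbox), store)
    return mailbox.get(timeout=1)  # reply arrives directly from the leader
```

Because the reply address travels with the request, intermediate hops do not need to stay involved once they have forwarded it.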

Process hierarchy

  • riak_ensemble_sup
    • riak_ensemble_router_sup
      • riak_ensemble_router_1
      • riak_ensemble_router_2
      • ...
      • riak_ensemble_router_7
    • riak_ensemble_storage
    • riak_ensemble_peer_sup
      • riak_ensemble_peer
      • ...
    • riak_ensemble_manager

The children of riak_ensemble_sup are part of a rest_for_one supervision hierarchy and are started in the order above. When a child of riak_ensemble_sup dies, the children started after it are terminated, and then the dead child and those later children are restarted. So the death of riak_ensemble_storage, for example, would take down riak_ensemble_peer_sup and riak_ensemble_manager, and all three would then be restarted in the original order.
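The restart rule can be expressed as a small function. This is a Python sketch of the rest_for_one rule applied to the child order listed above, not supervisor code; the function name and the `CHILDREN` list are assumptions made for illustration.

```python
# Children of riak_ensemble_sup, in start order (taken from the hierarchy above).
CHILDREN = [
    "riak_ensemble_router_sup",
    "riak_ensemble_storage",
    "riak_ensemble_peer_sup",
    "riak_ensemble_manager",
]


def rest_for_one_restarts(children, failed):
    """Return the children that are restarted, in start order, when `failed` dies.

    Under rest_for_one, the failed child and every child started after it
    are restarted; children started before it are left untouched.
    """
    index = children.index(failed)
    return children[index:]


restarted = rest_for_one_restarts(CHILDREN, "riak_ensemble_storage")
```

Here `restarted` contains riak_ensemble_storage, riak_ensemble_peer_sup and riak_ensemble_manager, while riak_ensemble_router_sup keeps running.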

Managing ensembles in a cluster

Adding nodes to a cluster and removing them from it is done through the riak_ensemble_manager module. All nodes participating in the cluster need to have a running manager process. Before this can happen, the ensemble system must be activated on one and only one node*. Managing ensembles is itself done by an ensemble: the root ensemble (see riak_ensemble_root), which is started on activation.

* Previous versions allowed multiple active clusters to be joined into one, but some nasty edge cases were possible.
