
Spike: Use raft consensus for networking #1591

Closed
ch1bo opened this issue Sep 2, 2024 · 7 comments
ch1bo commented Sep 2, 2024

Why

We created a new test suite covering the resilience of our network stack in #1532 (see also #1106, #1436, #1505). With this in place, we can now explore various means to reach our goal of a crash-tolerant network layer.

This fairly old research paper surveyed various consensus protocols used in the blockchain space and reminds us of the correspondence between consensus and broadcast:

the form of consensus relevant for blockchain is technically known as atomic
broadcast

Furthermore, it lists at least one of these early, permissioned blockchains that achieved crash tolerance of $t < n/2$ faults simply by using etcd with its Raft consensus algorithm.

What

  • Run fault injection tests with a hydra-node network connected through etcd
  • Create a PR with results of the spike, but do not merge it

How

  • Create a Hydra.Network.Etcd network component that implements broadcast using the replicated log of etcd (see the sketch below)
    • Move authentication into the application
    • Re-use --peer from the command line
    • Fork etcd when starting
    • Message type + sender as key
    • Use watch and revisions to be notified of messages while offline
  • Maybe: Compact messages (this will upper-bound the resilience to faults?) or use leases
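
For orientation, here is a minimal sketch of what such a component could look like. The Network and NetworkComponent shapes and the putMessage / watchMessages helpers are simplified stand-ins for this write-up, not the actual hydra-node types or a real etcd client API:

{-# LANGUAGE OverloadedStrings #-}

-- Sketch only: simplified stand-ins, not the actual hydra-node interfaces.
module Hydra.Network.Etcd where

import Control.Concurrent.Async (withAsync)
import Data.ByteString (ByteString)

-- | Handle used by the protocol logic to send a message to all peers.
newtype Network m msg = Network {broadcast :: msg -> m ()}

-- | A network component is given a callback for delivered messages and
-- provides a 'Network' handle to the inner action.
type NetworkComponent m msg a = (msg -> m ()) -> (Network m msg -> m a) -> m a

-- | Broadcast by putting values into etcd's replicated log; deliver by
-- watching the corresponding key prefix.
withEtcdNetwork ::
  -- | Key under which this node publishes, e.g. "msg/<sender>".
  ByteString ->
  NetworkComponent IO ByteString a
withEtcdNetwork senderKey deliver action =
  -- Watch all message keys in the background and deliver every new value.
  withAsync (watchMessages "msg/" deliver) $ \_ ->
    action Network {broadcast = putMessage senderKey}

-- Hypothetical etcd operations, e.g. backed by etcdctl or a gRPC client.
putMessage :: ByteString -> ByteString -> IO ()
putMessage _key _value = error "sketch: put key/value into the etcd cluster"

watchMessages :: ByteString -> (ByteString -> IO ()) -> IO ()
watchMessages _prefix _deliver = error "sketch: watch prefix and deliver new values"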
@ch1bo ch1bo added the spike label Sep 2, 2024
@ch1bo ch1bo changed the title from "Spike: Using raft consensus for networking" to "Spike: Use raft consensus for networking" Sep 2, 2024
@ch1bo ch1bo self-assigned this Sep 2, 2024
@ch1bo ch1bo linked a pull request Sep 13, 2024 that will close this issue
ch1bo commented Sep 13, 2024

After directly identifying a required change in the semantics of our NetworkComponents in #1624, I started the implementation yesterday and have achieved so far:

  • Initialization of an etcd cluster by re-using the existing --peer (and other) command line arguments (see the sketch below)
  • Sending and receiving of messages: very basic, hex-encoded, using etcdctl as "client" and by polling a single key
  • First manual testing and some end-to-end tests / benchmarks working -> fragile (likely because of process calls and polling)
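
To illustrate the first point, a rough sketch of how --peer style arguments could be mapped onto etcd's static clustering flags. The Peer type, the default 2379/2380 ports and the assumption of one node per host are illustrative choices; only the flag names come from the etcd clustering guide referenced in the notes below:

import Data.List (intercalate)

-- Hypothetical peer description derived from --peer / --node-id arguments.
data Peer = Peer {peerName :: String, peerHost :: String}

-- | Build the command line for a local etcd instance forming a static
-- cluster with the given peers. Assumes etcd's default ports (2379 client,
-- 2380 peer) and that every node knows the full, static topology.
etcdArgs :: Peer -> [Peer] -> [String]
etcdArgs self others =
  [ "--name", peerName self
  , "--listen-peer-urls", peerUrl self
  , "--initial-advertise-peer-urls", peerUrl self
  , "--listen-client-urls", clientUrl self
  , "--advertise-client-urls", clientUrl self
  , "--initial-cluster-state", "new"
  , "--initial-cluster", intercalate "," [peerName p <> "=" <> peerUrl p | p <- self : others]
  ]
 where
  peerUrl p = "http://" <> peerHost p <> ":2380"
  clientUrl p = "http://" <> peerHost p <> ":2379"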

Notes so far:

  • First step: starting an etcd instance in the background of the network component.
  • When making etcd available to the hydra-node (REPL), I stumbled over cached paths (somewhere in dist-newstyle)
  • Separate --host and --port options are annoying, we should just parse a full Host.
  • etcd has a few command line arguments to get right, this guide was helpful: https://etcd.io/docs/v3.5/op-guide/clustering/
  • Etcd client libraries are not in the best state in Haskell land, so I resorted to just invoking etcdctl for now (this will have horrible performance); see the sketch below
    • Need to marshal binary encoded messages to etcdctl put through stdin using hex
    • Can use etcdctl get -w json to get a JSON encoded result with base64 encoded values (of hex encoded bytes)
    • Should really use a gRPC or at least an HTTP/JSON client
  • Who should be signing and verifying messages? Re-use the authentication layer or build it into the etcd network component (and re-use keys for transport-level security)?
    • Decided to re-use withAuthentication and do some plumbing
    • Expand the list of valid senders to include "us" (because we do not take short-cuts anymore)
  • Stopping the last etcd instance does not work properly. It seems like it is not gracefully handling the SIGTERM while reconnecting to other nodes.
  • Surprisingly, the etcd network component even works if we just poll a single key in a busy loop and re-deliver this message over and over, at least in a manual, interactive test using the hydra-tui.
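
As a concrete illustration of the etcdctl workaround mentioned above, a rough sketch of pushing a binary message through etcdctl via stdin, hex-encoded. The putViaEtcdctl helper and its use of readProcess are assumptions for illustration, not the code on the branch:

import qualified Data.ByteString as BS
import qualified Data.ByteString.Base16 as Base16
import qualified Data.ByteString.Char8 as BS8
import System.Process (readProcess)

-- | Put a binary message under the given key by piping a hex-encoded value
-- to 'etcdctl put <key>' via stdin (etcdctl reads the value from stdin when
-- it is not given as an argument).
putViaEtcdctl :: String -> BS.ByteString -> IO ()
putViaEtcdctl key msg = do
  _ <- readProcess "etcdctl" ["put", key] (BS8.unpack (Base16.encode msg))
  pure ()

-- Reading back works via 'etcdctl get -w json', which yields base64-encoded
-- values that then still need to be hex-decoded to recover the original bytes.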

ch1bo commented Sep 17, 2024

Continued and got the networked offline-mode nodes and the 3-node benchmarks to run today:

λ cabal bench hydra-cluster --benchmark-options="datasets datasets/3-nodes.json --timeout 1000s"
[...]
Writing results to: /tmp/bench-3-nodes.json-701cf5c6814573b8/results.csv
Confirmed txs/Total expected txs: 9000/9000 (100.00 %)
Average confirmation time (ms): 53.090586988
P99: 65.27539939ms
P95: 62.45322745ms
P50: 53.0196145ms
Invalid txs: 0
Benchmark bench-e2e: FINISH

It's likely slow because the current state of the branch is still invoking etcdctl put for sending and parsing the stdout of etcdctl watch.

Next steps

  • Figure out how to handle unreliable network connections
    • Submissions while not connected to the majority seem to just get swallowed
    • How can a minority catch up after reconnecting? From which revision on to watch?
  • Run fault injection tests against this
  • Summarize findings

Notes from today are in the logbook: https://github.com/cardano-scaling/hydra/wiki/Logbook-2024-H1#sn-on-using-etcd-for-networking

ch1bo commented Sep 23, 2024

Persisted some of the original thoughts about Network component properties in #1649.

This updated architecture diagram also mentions the same properties and shows how etcd is embedded in the hydra-node:

[Image: updated architecture diagram of the hydra-node with embedded etcd]

ch1bo commented Sep 23, 2024

Got the etcd-based network running in our fault injection test suite now, both locally and on GitHub Actions: https://github.com/cardano-scaling/hydra/actions/runs/10998675568

Notes from making it work:

  • Benchmarks using docker containers failed because etcd is sensitive to inconsistencies between the --host/--port and --peer command line arguments.

  • Tested with various packet loss percentages: 10, 15, 20, 25 all still seem fine!

  • With 40 percent packet loss I see the bench driver fail: Something went wrong while waiting for all confirmations: at bench/Bench/EndToEnd.hs:482:33 in main:Bench.EndToEnd: waitNext: ConnectionClosed and later time out.

  • As the inlined call to pumba makes clear, this only tests outbound traffic from alice to bob and carol!

    nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
      --duration 20m \
      --target 172.16.238.20 \
      --target 172.16.238.30 \
      loss --percent ${percent} \
      "re2:hydra-node-1" &
    
  • This pumba invocation should result in 10 percent packet loss to alice:

    nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
      --duration 20m \
      --target 172.16.238.10 \
      loss --percent 10 \
      "re2:hydra-node-"

Next steps:

  • Use pumba (or similar) to test crashes of a node

ch1bo commented Sep 25, 2024

To run the packet drop tests locally, we just need to use the instructions from the fault tolerance GitHub action (also updated on this branch). They are persisted here as well for easy reproducibility.

Fault injection

Build the pumba-compatible docker image:

nix build .#docker-hydra-node-for-netem
./result | docker load

Set up a local devnet like in https://hydra.family/head-protocol/docs/getting-started/, but using the updated local hydra-node from the .#demo dev shell:

cd demo
./prepare-devnet.sh
docker compose up -d cardano-node
sudo chmod a+w devnet/node.socket
export CARDANO_NODE_SOCKET_PATH=devnet/node.socket
./seed-devnet.sh $(which cardano-cli) $(which hydra-node)

Start the pumba-compatible hydra nodes:

docker compose -f docker-compose.yaml -f docker-compose-netem.yaml up -d hydra-node-{1,2,3}

Now we can run the workload with:

mkdir -p benchmarks
nix run .#hydra-cluster.components.benchmarks.bench-e2e -- demo \
  --output-directory=benchmarks \
  --scaling-factor=100 \
  --timeout=1000s \
  --testnet-magic 42 \
  --node-socket=devnet/node.socket \
  --hydra-client=localhost:4001 \
  --hydra-client=localhost:4002 \
  --hydra-client=localhost:4003

While this runs, we can start dropping packets from alice with, e.g. 10 percent likelihood:

nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
  --duration 20m \
  --target 172.16.238.20 \
  --target 172.16.238.30 \
  loss --percent 10 \
  "re2:hydra-node-1"

And/or drop packets to alice with, e.g. 10 percent likelihood:

nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
  --duration 20m \
  --target 172.16.238.10 \
  loss --percent 10 \
  "re2:hydra-node-"

ch1bo commented Sep 26, 2024

Solved the two scenarios (a rough sketch follows below):

  • Broadcast while on a minority cluster -> persist outbound messages and retry
  • Deliver messages when (re-)connecting to the majority cluster -> keep a last known revision and watch starting with it
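
A condensed sketch of both mechanisms; tryPut (a put that fails or times out while only a minority is reachable) and watchFrom (an etcd watch starting at a given revision) are hypothetical helpers, not the actual implementation:

import Control.Concurrent (threadDelay)
import Control.Monad (unless)

-- | Keep retrying a broadcast until the put is accepted by the majority
-- cluster; outbound messages would additionally be persisted so retries
-- survive a crash of our own node.
retryingBroadcast :: (msg -> IO Bool) -> msg -> IO ()
retryingBroadcast tryPut msg = do
  ok <- tryPut msg
  unless ok $ do
    threadDelay 1000000 -- back off for a second before retrying
    retryingBroadcast tryPut msg

-- | Resume delivery after (re-)connecting by watching from the revision
-- after the last one we processed; etcd replays everything newer than that,
-- which also avoids re-delivering messages we already handled.
resumeDelivery :: IO Integer -> (Integer -> (msg -> IO ()) -> IO ()) -> (msg -> IO ()) -> IO ()
resumeDelivery loadLastRevision watchFrom deliver = do
  lastRev <- loadLastRevision
  watchFrom (lastRev + 1) deliver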

Kept notes in the logbook and will summarize findings tomorrow: https://github.com/cardano-scaling/hydra/wiki/Logbook-2024-H1#2024-09-26

ch1bo commented Sep 27, 2024

Summary of Raft-based network using etcd

  • Seems very robust, but "overkill"?
  • Requires etcd as runtime dependency
  • Similar user experience
    • Full, static topology with listing everyone as --peer
    • Simpler configuration via peer discovery possible
  • Introspectable network / cluster state
    • could improve debugging experience
  • etcd has a few features we have not thought of yet
    • use TLS to secure peer connections
    • separate advertised and binding addresses
  • Unclear yet whether this (mis-)use of etcd would work in the long run

Approach

  • Broadcast implemented as a put to the etcd cluster works well
    • Using individual keys per peer is not needed, but handy for debugging
  • Need an outbound buffer to handle failure to put
    • unsuccessful when disconnected / connected to only a minority cluster
    • this highlights the possibility that broadcast can fail -> discuss this abstraction with the researchers!
  • Relying on the overall revision to deliver on recovery works well
    • This also de-duplicates any past messages
    • Maybe not needed as our protocol seems resilient against duplication
  • Cluster configuration would be less fragile if we knew everyone's --node-id
  • Have not tested the compaction feature, but it could be used to prune the revision history and bound memory/disk usage
  • PeerConnected and PeerDisconnected would need to change
    • we might get some information from etcd
    • more likely though, our own connectivity is what is interesting
    • distinguish between connected (to majority) or disconnected (only on minority)
  • Protocol versioning would need to change
    • could be trivial by using versioned keys or values (see the sketch below)
  • Re-used Authenticate network component to provide message signing
    • Alternative: move message signing / verification into application logic
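
To make the per-sender keys and versioned keys concrete, a hypothetical key scheme; the exact format is an assumption for illustration, not what the spike branch uses:

-- | Build the etcd key under which a party publishes its messages. A version
-- prefix makes protocol upgrades visible in the key space, and the per-sender
-- suffix keeps the replicated log easy to inspect while debugging.
messageKey :: Int -> String -> String
messageKey version sender = "v" <> show version <> "/msg/" <> sender

-- For example: messageKey 1 "alice" == "v1/msg/alice"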

Performance

  • Slower when comparing average confirmation time benchmarks on GitHub Actions runs:
    • single node: 4ms -> 40ms
    • three nodes: 20ms -> 100ms
  • Faster when comparing average confirmation time benchmarks on my machine:
    • single node: 26ms -> 26ms
    • three nodes: 100ms -> 50ms
  • Unclear if faster or slower?
    • Possible explanation: the workload is spread across more threads and GitHub-hosted runners may have poor multi-thread performance
  • Definitely some potential for improvement, as the implementation spawns etcdctl processes on every broadcast
  • Also, persisting the last known revision is a synchronous write (could be asynchronous, best-effort)
  • Is etcd ensuring durability of its write-ahead log?
