Update networking page #1606

Merged
merged 2 commits into from
Sep 9, 2024
2 changes: 1 addition & 1 deletion .github/workflows/ci-nix.yaml
@@ -316,7 +316,7 @@ jobs:
- name: 🚧 Setup Node.js
uses: actions/setup-node@v4
with:
-          node-version: 16
+          node-version: 18
cache: 'yarn'
cache-dependency-path: docs/yarn.lock

66 changes: 60 additions & 6 deletions docs/docs/dev/architecture/networking.md
@@ -1,11 +1,7 @@
# Networking

-This document provides details about the Hydra networking layer, which encompasses the network of Hydra nodes where heads can be opened.
-
-:::warning
-
-🛠 This document is a work in progress. We recognize that the current state of networking is suboptimal, serving as an initial implementation to establish a functional basis. Efforts are underway to enhance the network dynamics through a proposed improvement initiative, detailed in [this proposal](https://github.com/input-output-hk/hydra/pull/237).
-:::
+This page provides details about the Hydra networking layer, which encompasses
+the network of Hydra nodes where heads can be opened.

## Questions

@@ -30,6 +26,64 @@ This document provides details about the Hydra networking layer, which encompass

## Investigations

### Network resilience

In August 2024, we added network resilience tests, implemented as a GitHub
Actions workflow in [network-test.yaml](https://github.com/cardano-scaling/hydra/blob/master/.github/workflows/network-test.yaml).

The approach is to use [Pumba](https://github.com/alexei-led/pumba) to inject
network faults into a Docker-based setup. This is effective because Pumba
exposes the
[NetEm](https://srtlab.github.io/srt-cookbook/how-to-articles/using-netem-to-emulate-networks.html)
capability, which allows very powerful manipulation of the containers'
networking stack.
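
For illustration, a Pumba invocation along these lines injects percentage-based
packet loss into a running container. This is a sketch, not the exact command
from the workflow; the container name and duration are hypothetical:

```shell
# Inject 5% egress packet loss into the container for five minutes.
# "hydra-node-1" is an illustrative container name; the real workflow
# targets the containers of the three-node setup described below.
pumba netem --duration 5m \
  loss --percent 5 \
  hydra-node-1
```

Under the hood, Pumba runs NetEm (`tc qdisc ... netem loss 5%`) inside the
container's network namespace, so no change to the Hydra node itself is needed.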

Initially, we set up percentage-based packet loss in one specific scenario:
a three-node setup between `Alice`, `Bob`, and `Carol`.

With this setup, we tested the following scenarios:

- Three nodes, 900 transactions ("scaling=10"):
- 1% packet loss to both peers: ✅ Success
- 2% packet loss to both peers: ✅ Success
- 3% packet loss to both peers: ✅ Success
- 4% packet loss to both peers: ✅ Success
- 5% packet loss to both peers: Sometimes works, sometimes fails
- 10% packet loss to both peers: Sometimes works, sometimes fails
  - 20% packet loss to both peers: ❌ Failure

- Three nodes, 4500 transactions ("scaling=50"):
- 1% packet loss to both peers: ✅ Success
- 2% packet loss to both peers: ✅ Success
- 3% packet loss to both peers: ✅ Success
- 4% packet loss to both peers: Sometimes works, sometimes fails
- 5% packet loss to both peers: Sometimes works, sometimes fails
  - 10% packet loss to both peers: ❌ Failure
  - 20% packet loss to both peers: ❌ Failure

"Success" here means that _all_ transactions were processed; "Failure" means
one or more transactions did not get confirmed by all participants within a
particular timeframe.

The main conclusion is that there is a limit to the amount of packet loss we
can sustain, and that limit depends on how many transactions we are trying to
send (naturally, [given that the loss percentage applies per
packet](http://www.voiptroubleshooter.com/indepth/burstloss.html)).
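
A rough way to see why per-packet loss compounds with workload size: ignoring
retransmission (which TCP adds on top), the probability that at least one of
`n` independently sent packets is lost at per-packet loss rate `p` is
`1 - (1 - p)^n`. A quick back-of-the-envelope calculation (the values of `p`
and `n` here are illustrative, not measurements from the test runs):

```shell
# Probability of at least one lost packet among n packets, at 5% per-packet
# loss. Purely illustrative: no retransmission or burst correlation modelled.
p=0.05
for n in 10 100 1000; do
  awk -v p="$p" -v n="$n" 'BEGIN { printf "n=%d: %.3f\n", n, 1 - (1 - p)^n }'
done
# Prints:
# n=10: 0.401
# n=100: 0.994
# n=1000: 1.000
```

So even a modest per-packet loss rate almost guarantees some loss events once
the number of packets grows, which matches the observed sensitivity to the
transaction count.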

You can keep an eye on the runs of this action here: [Network fault
tolerance](https://github.com/cardano-scaling/hydra/actions/workflows/network-test.yaml).

The main things to note are:

- Overall, the CI job will succeed even if every scenario fails. This is,
  ultimately, due to a bug in [GitHub
  Actions](https://github.com/actions/runner/issues/2347) that prevents one
  from declaring an explicit pass-or-fail expectation per scenario. The impact
  is that you should manually check this job on each of your PRs.
- It's okay to see certain configurations fail, but it is not expected to see
  them _all_ fail, least of all the zero-loss cases. Anything that looks
  suspicious should be investigated.


### Ouroboros

We held a meeting with the networking team on February 14, 2022, to explore the integration of the Ouroboros network stack into Hydra. During the discussion, there was a notable focus on performance, with Neil Davies providing insightful performance metrics.