perf: don't allocate in UDP send & recv path #2093

mxinden · 2024-09-08T09:49:55Z

Description

This change is best summarized by the process function signature.

On the main branch, the process function looks like this:

pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {

- It takes as input an optional reference to a Datagram. That Datagram owns an allocation of the UDP payload, i.e. a Vec<u8>. Thus for each incoming UDP datagram, its payload is allocated in a new Vec.
- It returns as output an owned Output. Most relevantly the Output variant Output::Datagram(Datagram) contains a Datagram that again owns an allocation of the UDP payload, i.e. a Vec<u8>. Thus for each outgoing UDP datagram too, its payload is allocated in a new Vec.

This commit changes the process function to:

pub fn process_into_buffer<'a>(
    &mut self,
    input: Option<Datagram<&[u8]>>,
    now: Instant,
    write_buffer: &'a mut Vec<u8>,
) -> Output<&'a [u8]> {

- It takes as input an optional Datagram<&[u8]>. But contrary to before, Datagram<&[u8]> does not own an allocation of the UDP payload, but represents a view into a long-lived receive buffer containing the UDP payload.
- It returns as output an Output<&'a [u8]> where the Output::Datagram(Datagram<&'a [u8]>) variant does not own an allocation of the UDP payload, but here as well represents a view into a long-lived write buffer the payload is written into. That write buffer lives outside of neqo_transport::Connection and is provided to process_into_buffer as write_buffer: &'a mut Vec<u8>. Note that both write_buffer and Output use the lifetime 'a, i.e. the latter is a view into the former.
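The lifetime relationship between write_buffer and the returned Output can be sketched in isolation. This is a minimal sketch, not the neqo implementation: the Connection and Output types below are hypothetical stand-ins, reduced to what is needed to show that the returned slice borrows from the caller's buffer.

```rust
/// Hypothetical stand-in for `neqo_transport::Output`, reduced to two variants.
#[derive(Debug, PartialEq)]
enum Output<D> {
    None,
    Datagram(D),
}

/// Hypothetical stand-in for `Connection`; it only queues payloads to send.
struct Connection {
    pending: Vec<Vec<u8>>,
}

impl Connection {
    /// Writes the next pending payload into `write_buffer` and returns a view
    /// into it. The returned slice borrows `write_buffer` for `'a`, so the
    /// caller cannot touch the buffer while the view is alive.
    fn process_into_buffer<'a>(&mut self, write_buffer: &'a mut Vec<u8>) -> Output<&'a [u8]> {
        write_buffer.clear();
        match self.pending.pop() {
            Some(payload) => {
                write_buffer.extend_from_slice(&payload);
                Output::Datagram(&write_buffer[..])
            }
            None => Output::None,
        }
    }
}

fn main() {
    let mut conn = Connection {
        pending: vec![b"second".to_vec(), b"first".to_vec()],
    };
    // Allocated once by the caller, reused for every outgoing datagram.
    let mut write_buffer = Vec::with_capacity(1500);
    loop {
        match conn.process_into_buffer(&mut write_buffer) {
            Output::Datagram(view) => println!("would send {} bytes", view.len()),
            Output::None => break,
        }
    }
}
```

The borrow ends with each loop iteration, which is what lets the same write_buffer be handed in again on the next call.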

This change to the process function enables the following:

1. A user of neqo_transport (e.g. neqo_bin) has the OS write incoming UDP datagrams into a long-lived receive buffer (via e.g. recvmmsg).
2. They pass that receive buffer to neqo_transport::Connection::process_into_buffer along with a long-lived write buffer.
3. process_into_buffer reads the UDP datagram from the long-lived receive buffer through the Datagram<&[u8]> view and writes outgoing datagrams into the provided long-lived write_buffer, returning a view into said buffer via a Datagram<&'a [u8]>.
4. The user, after having called process_into_buffer, can then pass the write buffer to the OS (e.g. via sendmsg).
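The steps above can be sketched with the standard library alone. This is a minimal sketch under assumptions: std::net::UdpSocket uses recv_from and send_to rather than recvmmsg and sendmsg, and the free function process_into_buffer is a hypothetical stand-in for the transport that merely upper-cases its input.

```rust
use std::io;
use std::net::UdpSocket;

/// Hypothetical stand-in for the transport: upper-cases the input payload
/// into `write_buffer` and returns a view into that buffer.
fn process_into_buffer<'a>(input: Option<&[u8]>, write_buffer: &'a mut Vec<u8>) -> Option<&'a [u8]> {
    write_buffer.clear();
    let input = input?;
    write_buffer.extend(input.iter().map(u8::to_ascii_uppercase));
    Some(&write_buffer[..])
}

/// Steps 1 to 4 for a single datagram, using only caller-provided buffers.
fn serve_one(socket: &UdpSocket, recv_buffer: &mut [u8], write_buffer: &mut Vec<u8>) -> io::Result<usize> {
    // 1. The OS writes the incoming datagram into the long-lived receive buffer.
    let (len, peer) = socket.recv_from(recv_buffer)?;
    // 2. and 3. The transport reads a view of it and writes into the write buffer.
    match process_into_buffer(Some(&recv_buffer[..len]), write_buffer) {
        // 4. The view into the write buffer is handed back to the OS.
        Some(out) => socket.send_to(out, peer),
        None => Ok(0),
    }
}

fn main() -> io::Result<()> {
    let server = UdpSocket::bind("127.0.0.1:0")?;
    let client = UdpSocket::bind("127.0.0.1:0")?;
    let mut recv_buffer = vec![0u8; 65_535]; // allocated once
    let mut write_buffer = Vec::with_capacity(65_535); // allocated once
    for msg in [&b"ping"[..], &b"pong"[..]] {
        client.send_to(msg, server.local_addr()?)?;
        serve_one(&server, &mut recv_buffer, &mut write_buffer)?;
        let mut reply = [0u8; 16];
        let (n, _) = client.recv_from(&mut reply)?;
        println!("{}", String::from_utf8_lossy(&reply[..n]));
    }
    Ok(())
}
```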

To summarize, a user can receive and send UDP datagrams without allocating in the UDP I/O path.

As an aside, the above is compatible with GSO and GRO, where a send or receive buffer contains a number of consecutive UDP datagram segments.
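To illustrate what "consecutive segments in one buffer" means, here is a minimal sketch. The helpers push_segments and split_segments are hypothetical names, and no actual GSO or GRO socket options are involved; only the buffer layout is shown.

```rust
/// Hypothetical helper: appends segments back to back, as a GSO-style send
/// buffer would hold them. All but the last segment share one size.
fn push_segments(send_buffer: &mut Vec<u8>, segments: &[&[u8]]) {
    for segment in segments {
        send_buffer.extend_from_slice(segment);
    }
}

/// Hypothetical helper: splits a GRO-style receive buffer back into payloads.
/// Panics if `segment_size` is 0, which is a requirement of `chunks`.
fn split_segments(recv_buffer: &[u8], segment_size: usize) -> impl Iterator<Item = &[u8]> {
    recv_buffer.chunks(segment_size)
}

fn main() {
    let mut buffer = Vec::with_capacity(64);
    push_segments(&mut buffer, &[&b"aaaa"[..], &b"bbbb"[..], &b"cc"[..]]);
    // Only the last segment may be shorter than the segment size.
    for payload in split_segments(&buffer, 4) {
        println!("{} byte segment", payload.len());
    }
}
```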

Performance impact

Early benchmarks are promising, showing e.g. a 10% improvement in the Download benchmark, and up to 40% improvement in the neqo-neqo-reno-pacing benchmark.

This pull request

1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.
       time:   [102.43 ms 102.62 ms 102.81 ms]
       thrpt:  [972.68 MiB/s 974.46 MiB/s 976.24 MiB/s]
change:
       time:   [-9.6439% -9.2860% -8.9313%] (p = 0.00 < 0.05)
       thrpt:  [+9.8072% +10.237% +10.673%]
Client/server transfer results

Transfer of 33554432 bytes over loopback.

Client	Server	CC	Pacing	Mean [ms]	Min [ms]	Max [ms]	Relative
neqo	neqo	reno	on	152.5 ± 87.8	95.6	365.6	1.00
neqo	neqo	reno		141.6 ± 67.7	94.9	326.1	1.00
neqo	neqo	cubic	on	170.4 ± 121.9	94.6	622.5	1.00
neqo	neqo	cubic		131.4 ± 48.4	95.6	298.7	1.00

Current `main` for comparison

1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.
       time:   [112.72 ms 113.13 ms 113.51 ms]
       thrpt:  [880.95 MiB/s 883.97 MiB/s 887.14 MiB/s]
change:
       time:   [-2.1601% -1.6758% -1.1570%] (p = 0.00 < 0.05)
       thrpt:  [+1.1705% +1.7044% +2.2078%]
Found 2 outliers among 100 measurements (2.00%)

2 (2.00%) low mild
Client/server transfer results

Transfer of 33554432 bytes over loopback.

Client	Server	CC	Pacing	Mean [ms]	Min [ms]	Max [ms]	Relative
neqo	neqo	reno	on	260.8 ± 159.4	127.4	691.6	1.00
neqo	neqo	reno		221.2 ± 90.7	139.6	432.0	1.00
neqo	neqo	cubic	on	214.5 ± 87.0	125.0	375.0	1.00
neqo	neqo	cubic		236.2 ± 118.1	136.8	540.0	1.00

https://github.com/mozilla/neqo/actions/runs/10850817785

Pull request status

This pull request is ready for review.

Replaces #2076.
Part of #1693.
Closes #1922.


github-actions · 2024-09-08T09:50:28Z

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

Linux: Debug Release
macOS: Debug Release
Windows: Debug Release

github-actions · 2024-09-08T10:14:41Z

Failed Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. aioquic: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
neqo-latest vs. go-x-net: H DC LR M B U A L2 C2 6
neqo-latest vs. haproxy: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
neqo-latest vs. kwik: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
neqo-latest vs. lsquic: H DC LR C20 M S R Z 3 B U E A L2 C1 C2 6 V2
neqo-latest vs. msquic: H DC LR C20 M S R Z B U L1 L2 C1 C2 6 V2
neqo-latest vs. mvfst: H DC LR M R Z 3 B U L2 C2 6
neqo-latest vs. neqo: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
neqo-latest vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
neqo-latest vs. nginx: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
neqo-latest vs. ngtcp2: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
neqo-latest vs. picoquic: H DC LR C20 M S R Z 3 B U E A L2 C1 C2 6 V2
neqo-latest vs. quic-go: H DC LR C20 M S R Z 3 B U A L1 L2 C2 6
neqo-latest vs. quiche: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
neqo-latest vs. quinn: H DC LR C20 M S R Z 3 B U E A L1 L2 C2 6
neqo-latest vs. s2n-quic: H DC LR C20 M S R 3 B U E A L1 L2 C1 C2 6
neqo-latest vs. xquic: H DC LR C20 M R Z 3 B U L1 L2 C1 C2 6

neqo-latest as server

aioquic vs. neqo-latest: H DC LR C20 M S R Z 3 B A L1 L2 C1 C2 6 V2
chrome vs. neqo-latest: 3
go-x-net vs. neqo-latest: H DC LR M B U A L2 C2 6
kwik vs. neqo-latest: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
lsquic vs. neqo-latest: H DC LR M S R 3 B E A L2 C1 C2 6 V2
msquic vs. neqo-latest: H DC LR C20 M S R Z B A L1 L2 C1 C2 6 V2
mvfst vs. neqo-latest: H DC LR M 3 B L2 C2 6
neqo vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
ngtcp2 vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
picoquic vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2
quic-go vs. neqo-latest: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
quiche vs. neqo-latest: H DC LR M S R Z 3 B A L1 L2 C1 C2 6
quinn vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6
s2n-quic vs. neqo-latest: H DC LR M S R 3 B E A L1 L2 C1 C2 6
xquic vs. neqo-latest: H DC LR C20 S R Z 3 B U A L1 L2 C1 C2 6

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. aioquic: E
neqo-latest vs. go-x-net: C20 S R Z 3 E L1 C1 V2
neqo-latest vs. haproxy: E
neqo-latest vs. kwik: E
neqo-latest vs. msquic: 3 E
neqo-latest vs. mvfst: C20 S E V2
neqo-latest vs. nginx: E V2
neqo-latest vs. quic-go: E V2
neqo-latest vs. quiche: E V2
neqo-latest vs. quinn: V2
neqo-latest vs. s2n-quic: Z V2
neqo-latest vs. xquic: S E V2

neqo-latest as server

aioquic vs. neqo-latest: U E
chrome vs. neqo-latest: H DC LR C20 M S R Z B U E A L1 L2 C1 C2 6 V2
go-x-net vs. neqo-latest: C20 S R Z 3 E L1 C1 V2
kwik vs. neqo-latest: E
lsquic vs. neqo-latest: C20 Z U
msquic vs. neqo-latest: 3 E
mvfst vs. neqo-latest: C20 S R U E V2
quic-go vs. neqo-latest: E V2
quiche vs. neqo-latest: C20 U E V2
s2n-quic vs. neqo-latest: C20 Z U V2
xquic vs. neqo-latest: E V2

neqo-common/src/codec.rs

neqo-common/src/datagram.rs

neqo-transport/src/connection/mod.rs

codecov · 2024-09-14T12:24:55Z

Codecov Report

Attention: Patch coverage is 98.48837% with 13 lines in your changes missing coverage. Please review.

Project coverage is 95.37%. Comparing base (eb92e43) to head (f866a2a).

Files with missing lines	Patch %	Lines
neqo-transport/src/connection/mod.rs	96.59%	6 Missing ⚠️
neqo-transport/src/server.rs	94.91%	3 Missing ⚠️
neqo-http3/src/server.rs	93.10%	2 Missing ⚠️
neqo-transport/src/packet/mod.rs	98.86%	1 Missing ⚠️
neqo-udp/src/lib.rs	98.38%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2093      +/-   ##
==========================================
+ Coverage   95.35%   95.37%   +0.01%     
==========================================
  Files         112      112              
  Lines       36357    36715     +358     
==========================================
+ Hits        34669    35016     +347     
- Misses       1688     1699      +11

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2024-09-14T12:25:02Z

Benchmark results

Performance differences relative to 55e3a93.

coalesce_acked_from_zero 1+1 entries: 💚 Performance has improved.

       time:   [98.877 ns 99.204 ns 99.534 ns]
       change: [-12.567% -11.944% -11.215%] (p = 0.00 < 0.05)
Found 14 outliers among 100 measurements (14.00%)

10 (10.00%) high mild

4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: 💚 Performance has improved.

       time:   [116.94 ns 117.71 ns 118.88 ns]
       change: [-33.464% -33.143% -32.768%] (p = 0.00 < 0.05)
Found 17 outliers among 100 measurements (17.00%)

1 (1.00%) low mild

4 (4.00%) high mild

12 (12.00%) high severe

coalesce_acked_from_zero 10+1 entries: 💚 Performance has improved.

       time:   [116.55 ns 117.02 ns 117.57 ns]
       change: [-39.686% -35.379% -32.759%] (p = 0.00 < 0.05)
Found 13 outliers among 100 measurements (13.00%)

1 (1.00%) low severe

1 (1.00%) low mild

2 (2.00%) high mild

9 (9.00%) high severe

coalesce_acked_from_zero 1000+1 entries: 💚 Performance has improved.

       time:   [97.876 ns 98.011 ns 98.159 ns]
       change: [-31.803% -31.179% -30.566%] (p = 0.00 < 0.05)
Found 7 outliers among 100 measurements (7.00%)

2 (2.00%) high mild

5 (5.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.

       time:   [111.73 ms 111.78 ms 111.83 ms]
       change: [+0.2898% +0.3546% +0.4219%] (p = 0.00 < 0.05)
Found 11 outliers among 100 measurements (11.00%)

6 (6.00%) low mild

5 (5.00%) high mild

transfer/pacing-false/varying-seeds: No change in performance detected.

       time:   [26.340 ms 27.597 ms 28.883 ms]
       change: [-8.9450% -3.4441% +2.3427%] (p = 0.25 > 0.05)
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high mild

transfer/pacing-true/varying-seeds: No change in performance detected.

       time:   [34.417 ms 36.080 ms 37.798 ms]
       change: [-11.452% -5.8066% +0.6803%] (p = 0.07 > 0.05)
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high mild

transfer/pacing-false/same-seed: No change in performance detected.

       time:   [26.051 ms 26.928 ms 27.833 ms]
       change: [-5.5948% -1.5392% +3.1644%] (p = 0.50 > 0.05)
Found 2 outliers among 100 measurements (2.00%)

2 (2.00%) high mild

transfer/pacing-true/same-seed: No change in performance detected.

       time:   [43.023 ms 45.160 ms 47.311 ms]
       change: [-3.2363% +3.2694% +10.031%] (p = 0.33 > 0.05)
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high mild

1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.

       time:   [104.12 ms 104.36 ms 104.60 ms]
       thrpt:  [956.00 MiB/s 958.22 MiB/s 960.45 MiB/s]
change:
       time:   [-10.280% -9.9749% -9.6778%] (p = 0.00 < 0.05)
       thrpt:  [+10.715% +11.080% +11.458%]

1-conn/10_000-parallel-1b-resp (aka. RPS)/client: 💔 Performance has regressed.

       time:   [326.60 ms 329.71 ms 332.82 ms]
       thrpt:  [30.047 Kelem/s 30.330 Kelem/s 30.619 Kelem/s]
change:
       time:   [+1.3183% +2.9210% +4.4778%] (p = 0.00 < 0.05)
       thrpt:  [-4.2859% -2.8381% -1.3012%]
Found 2 outliers among 100 measurements (2.00%)

1 (1.00%) low mild

1 (1.00%) high mild

1-conn/1-1b-resp (aka. HPS)/client: 💔 Performance has regressed.

       time:   [34.715 ms 34.913 ms 35.130 ms]
       thrpt:  [28.465  elem/s 28.643  elem/s 28.806  elem/s]
change:
       time:   [+2.6418% +3.4761% +4.2420%] (p = 0.00 < 0.05)
       thrpt:  [-4.0693% -3.3593% -2.5738%]
Found 13 outliers among 100 measurements (13.00%)

6 (6.00%) low mild

7 (7.00%) high severe

Client/server transfer results

Transfer of 33554432 bytes over loopback.

Client	Server	CC	Pacing	Mean [ms]	Min [ms]	Max [ms]	Relative
msquic	msquic			220.4 ± 139.4	101.6	645.6	1.00
neqo	msquic	reno	on	279.0 ± 120.2	207.5	593.1	1.00
neqo	msquic	reno		281.0 ± 122.0	204.7	613.9	1.00
neqo	msquic	cubic	on	260.9 ± 78.4	206.9	456.7	1.00
neqo	msquic	cubic		215.8 ± 17.0	193.7	244.4	1.00
msquic	neqo	reno	on	120.7 ± 84.2	80.8	363.3	1.00
msquic	neqo	reno		90.4 ± 20.2	79.8	176.8	1.00
msquic	neqo	cubic	on	96.3 ± 26.5	82.7	218.0	1.00
msquic	neqo	cubic		94.6 ± 21.2	82.3	190.5	1.00
neqo	neqo	reno	on	157.4 ± 75.9	99.5	354.0	1.00
neqo	neqo	reno		120.6 ± 27.6	95.0	212.1	1.00
neqo	neqo	cubic	on	172.4 ± 89.8	100.2	405.5	1.00
neqo	neqo	cubic		163.5 ± 88.5	98.4	365.1	1.00

⬇️ Download logs

mxinden · 2024-09-14T13:25:21Z

1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.

time: [98.680 ms 98.918 ms 99.154 ms]
thrpt: [1008.5 MiB/s 1010.9 MiB/s 1013.4 MiB/s]
change:
time: [-12.936% -12.560% -12.190%] (p = 0.00 < 0.05)
thrpt: [+13.882% +14.364% +14.858%]

This looks promising.

mxinden · 2024-09-30T17:54:46Z

.github/workflows/polonius.yml

+# This workflow first removes the workarounds necessary to please NLL and then
+# runs with Polonius to ensure each workaround only fixes the false-positive of
+# NLL and doesn't mask an actually undefined behavior.


Somewhat unconventional. Still, I think statically proving unsafe code to be safe is worth it. As always, open for alternative suggestions.

mxinden · 2024-09-30T18:03:49Z

neqo-transport/src/packet/mod.rs

+    /// The size limit of [`Self::encoder`], i.e. the maximum number of bytes to
+    /// be encoded into [`Self::encoder`]. Note that [`Self::encoder`] might
+    /// contain one or more packets already.
    limit: usize,


Previously we would allocate a new Vec for each UDP datagram, with the capacity of the Vec set to the limit.

Now we have one long-lived buffer with a fixed capacity. To enforce the limit, PacketBuilder now carries it explicitly in its limit: usize field, instead of implicitly depending on the capacity of the Vec behind PacketBuilder::encoder.
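A minimal sketch of the idea, assuming a much reduced builder: the Builder type and its methods below are hypothetical and not the actual PacketBuilder API. The point is only that the limit is a field of the builder, independent of the capacity of the shared Vec.

```rust
/// Hypothetical, much reduced stand-in for `PacketBuilder`.
struct Builder<'a> {
    /// Shared, long-lived buffer; its capacity says nothing about the limit.
    encoder: &'a mut Vec<u8>,
    /// Maximum total length of `encoder`, including packets already in it.
    limit: usize,
}

impl Builder<'_> {
    /// Bytes that may still be written before hitting the limit.
    fn remaining(&self) -> usize {
        self.limit.saturating_sub(self.encoder.len())
    }

    /// Appends `data` only if it fits under the explicit limit.
    fn try_encode(&mut self, data: &[u8]) -> bool {
        if data.len() > self.remaining() {
            return false;
        }
        self.encoder.extend_from_slice(data);
        true
    }
}

fn main() {
    // Capacity is far larger than the limit enforced below.
    let mut buffer = Vec::with_capacity(4096);
    let mut builder = Builder { encoder: &mut buffer, limit: 8 };
    println!("fits: {}", builder.try_encode(&[0u8; 5]));
    println!("fits: {}", builder.try_encode(&[0u8; 4]));
    println!("remaining: {}", builder.remaining());
}
```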

mxinden · 2024-09-30T18:49:39Z

.github/workflows/polonius.diff

See .github/workflows/polonius.yml right below.

mxinden · 2024-09-30T19:08:08Z

This pull request is ready for a full review. All TODOs are addressed.

larseggert

LGTM overall, some suggestions.

larseggert · 2024-10-03T06:06:55Z

neqo-bin/src/server/mod.rs

-    }
-
-    async fn process(&mut self, mut dgram: Option<&Datagram>) -> Result<(), io::Error> {
+    async fn process(&mut self, mut socket_inx: Option<usize>) -> Result<(), io::Error> {


larseggert · 2024-10-03T06:11:43Z

neqo-http3/src/server.rs

+    pub fn process_output(&mut self, now: Instant) -> Output {
+        self.process(None, now)


Codecov says these lines are not covered, is this maybe an unused function?

larseggert · 2024-10-03T06:14:42Z

neqo-transport/src/connection/mod.rs

@@ -2319,7 +2375,8 @@ impl Connection {

        // Frames for different epochs must go in different packets, but then these
        // packets can go in a single datagram
-        let mut encoder = Encoder::with_capacity(profile.limit());
+        assert_eq!(out.len(), 0);


Should this rather be a debug_assert? Or return an error when the condition hits?

larseggert · 2024-10-03T06:15:46Z

neqo-transport/src/packet/mod.rs

+        // TODO: I don't know what the 64 is all about. Thus leaving the infer_limit function intact
+        // for now.


Ping @martinthomson

We should also maybe make 64 and 2048 appropriately-named consts, if we determine what they do & that we want to keep them.

larseggert · 2024-10-03T06:17:57Z

neqo-transport/src/server.rs

+    pub fn process_output(&mut self, now: Instant) -> Output {
+        self.process(None, now)


Not covered by tests; unused function?

larseggert · 2024-10-03T06:24:50Z

neqo-transport/src/connection/mod.rs

+        now: Instant,
+        out: &'a mut Vec<u8>,
+    ) -> Output<&'a [u8]> {
+        assert!(out.is_empty());


debug_assert or return error?

larseggert · 2024-10-03T06:26:25Z

neqo-transport/src/packet/mod.rs

        let burn = prot.encrypt(0, &[], &[]).expect("burn OK");
        assert_eq!(burn.len(), prot.expansion());


Unrelated to this PR, but these should probably return errors and not panic.

Note that this particular line is part of a unit test. What would be the benefit of returning an error over panicking right away, @larseggert?

(Agreed on your comments above on non-unit-test lines, using debug_assert and returning an error in release mode.)
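The agreed pattern, debug_assert in debug builds plus an error in release builds, could look roughly like the following. This is a sketch: the Error variant and the function body are hypothetical, only the shape of the check is the point.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum Error {
    NonEmptySendBuffer,
}

/// Requires an empty `out`. Debug builds panic on misuse so that tests catch
/// it early; release builds skip the assertion and return an error instead.
fn process_output<'a>(out: &'a mut Vec<u8>) -> Result<&'a [u8], Error> {
    debug_assert!(out.is_empty(), "send buffer must be empty");
    if !out.is_empty() {
        return Err(Error::NonEmptySendBuffer);
    }
    out.extend_from_slice(b"packet");
    Ok(&out[..])
}

fn main() {
    let mut out = Vec::with_capacity(1500);
    let packet = process_output(&mut out).expect("buffer starts out empty");
    println!("{} bytes", packet.len());
}
```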

larseggert · 2024-10-03T06:27:21Z

neqo-transport/src/server.rs

-    ) -> Output {
+        out: &'a mut Vec<u8>,
+    ) -> Output<&'a [u8]> {
+        assert!(out.is_empty());


debug_assert or return error?

larseggert · 2024-10-03T06:27:39Z

neqo-transport/src/server.rs

+        now: Instant,
+        out: &'a mut Vec<u8>,
+    ) -> Output<&'a [u8]> {
+        assert!(out.is_empty());


debug_assert or return error?

larseggert · 2024-10-03T06:27:57Z

neqo-transport/src/server.rs

@@ -426,11 +440,17 @@ impl Server {

    /// Iterate through the pending connections looking for any that might want
    /// to send a datagram.  Stop at the first one that does.
-    fn process_next_output(&mut self, now: Instant) -> Output {
+    fn process_next_output<'a>(&mut self, now: Instant, out: &'a mut Vec<u8>) -> Output<&'a [u8]> {
+        assert!(out.is_empty());


debug_assert or return error?

martinthomson

It seems to me like you could do this in pieces, starting with the encoder/decoder changes.

martinthomson · 2024-10-03T07:19:45Z

neqo-transport/src/connection/mod.rs

+    fn process_output_into_buffer<'a>(
+        &mut self,
+        now: Instant,
+        out: &'a mut Vec<u8>,


Why &mut Vec<u8> and not &mut [u8]? I don't think that -- as long as we are accepting someone else's memory -- we should be reallocating it.
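For illustration, an encoder over &mut [u8] that can never reallocate the caller's memory might look like this. It is a sketch only: SliceEncoder and OutOfSpace are hypothetical names and do not reflect the actual neqo Encoder.

```rust
/// Hypothetical error returned when the caller's memory is exhausted.
#[derive(Debug, PartialEq)]
struct OutOfSpace;

/// Hypothetical encoder over caller-provided memory. It tracks how much it
/// has written and fails instead of growing the buffer.
struct SliceEncoder<'a> {
    buf: &'a mut [u8],
    len: usize,
}

impl<'a> SliceEncoder<'a> {
    fn new(buf: &'a mut [u8]) -> Self {
        Self { buf, len: 0 }
    }

    /// Copies `data` behind what was written so far, or fails without
    /// writing anything if it does not fit.
    fn encode(&mut self, data: &[u8]) -> Result<(), OutOfSpace> {
        let end = self.len.checked_add(data.len()).ok_or(OutOfSpace)?;
        let dst = self.buf.get_mut(self.len..end).ok_or(OutOfSpace)?;
        dst.copy_from_slice(data);
        self.len = end;
        Ok(())
    }

    /// The bytes written so far.
    fn written(&self) -> &[u8] {
        &self.buf[..self.len]
    }
}

fn main() {
    let mut memory = [0u8; 4];
    let mut encoder = SliceEncoder::new(&mut memory);
    println!("{:?}", encoder.encode(&[1u8, 2, 3]));
    println!("{:?}", encoder.encode(&[4u8, 5]));
    println!("{:?}", encoder.written());
}
```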

martinthomson · 2024-10-03T07:21:03Z

neqo-transport/src/connection/mod.rs

@@ -1146,18 +1162,37 @@ impl Connection {
        }
    }

-    /// Process input and generate output.
+    /// Same as [`Connection::process_into_buffer`] but allocating output into
+    /// new [`Vec`].
    #[must_use = "Output of the process function must be handled"]
    pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {


Why not Option<impl Into<Datagram<&'a [u8]>>>?
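A minimal sketch of what that signature would allow, assuming a reduced Datagram with a single field and a hypothetical free function process. One consequence worth noting: a bare None no longer infers its type and needs an annotation.

```rust
/// Hypothetical, reduced stand-in for the generic `Datagram`.
struct Datagram<D> {
    d: D,
}

/// Borrowing conversion from an owned datagram to a view.
impl<'a> From<&'a Datagram<Vec<u8>>> for Datagram<&'a [u8]> {
    fn from(value: &'a Datagram<Vec<u8>>) -> Self {
        Self { d: &value.d }
    }
}

/// Accepts anything convertible into a borrowed datagram and returns the
/// payload length, or 0 when there is no input.
fn process<'a>(input: Option<impl Into<Datagram<&'a [u8]>>>) -> usize {
    match input {
        Some(d) => {
            let d: Datagram<&'a [u8]> = d.into();
            d.d.len()
        }
        None => 0,
    }
}

fn main() {
    let owned = Datagram { d: vec![1u8, 2, 3] };
    // A reference to an owned datagram and a borrowed datagram both work.
    println!("{}", process(Some(&owned)));
    println!("{}", process(Some(Datagram { d: &b"ab"[..] })));
    // `None` alone cannot infer the `impl Into` type, hence the turbofish.
    println!("{}", process(None::<Datagram<&[u8]>>));
}
```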

martinthomson · 2024-10-03T07:21:09Z

neqo-transport/src/connection/mod.rs

+    #[must_use = "Output of the process function must be handled"]
+    pub fn process_into_buffer<'a>(
+        &mut self,
+        input: Option<Datagram<&[u8]>>,


martinthomson · 2024-10-03T07:22:09Z

neqo-transport/src/connection/mod.rs

+        let d = Datagram::new(
+            d.source(),
+            d.destination(),
+            d.tos(),
+            d[d.len() - remaining..].to_vec(),
+            Some(d.segment_size()),
+        );


Maybe you could implement a to_owned() for Datagram rather than do it this way.

martinthomson · 2024-10-03T07:23:23Z

neqo-transport/src/connection/mod.rs

-        let mut slc = &d[..];
-        let mut dcid = None;
+    fn input_path(&mut self, path: &PathRef, d: Datagram<&[u8]>, now: Instant) -> Res<()> {
+        for mut slc in d.iter_segments() {


I have to say, this is too much for me. The idea that a datagram is actually multiple datagrams is not something I'm comfortable with.

Making this function more complicated is less than ideal as well.

neqo-common/src/datagram.rs

martinthomson · 2024-10-03T07:33:45Z

neqo-common/src/datagram.rs

-            d: d.into(),
-        }
-    }
+impl Copy for Datagram<&[u8]> {}


Really? That's a pretty large thing to throw around like that. I might prefer not to have it implement Copy. I understand why this might be convenient, but that convenience hides actual work. A Datagram will be at least 59 bytes in size, plus adjustments for alignment.
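The size can be checked with std::mem::size_of. This is a sketch under assumptions: the field names follow the diff above, but the field types (SocketAddr for the addresses, u8 for tos) are guesses for illustration, and the exact number printed depends on the platform and Rust version.

```rust
use std::mem::size_of;
use std::net::SocketAddr;

/// Hypothetical layout; field names follow the diff, field types are assumed.
#[allow(dead_code)]
struct Datagram<D> {
    src: SocketAddr,
    dst: SocketAddr,
    tos: u8,
    segment_size: usize,
    d: D,
}

fn main() {
    // Every implicit copy of a `Copy` type moves this many bytes.
    println!("Datagram<&[u8]>: {} bytes", size_of::<Datagram<&[u8]>>());
    println!("SocketAddr:      {} bytes", size_of::<SocketAddr>());
    println!("&[u8]:           {} bytes", size_of::<&[u8]>());
}
```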

martinthomson · 2024-10-03T07:34:28Z

neqo-common/src/datagram.rs

+            src,
+            dst,
+            tos,
+            segment_size: segment_size.unwrap_or_else(|| d.as_ref().len()),


Suggested change

-            segment_size: segment_size.unwrap_or_else(|| d.as_ref().len()),
+            segment_size: segment_size.unwrap_or(usize::MAX),

You always iterate and truncate the last chunk, so this is cheaper.
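A small sketch of why the suggestion works, assuming segments are iterated with slice::chunks (the helper name segments is hypothetical): a chunk size of usize::MAX yields the whole payload as a single segment, so there is no need to look up the payload length for the default.

```rust
/// Hypothetical helper: iterates the segments of a payload. Without an
/// explicit segment size the whole payload is one segment, because `chunks`
/// truncates the last (here: only) chunk to what is available.
/// Panics on `Some(0)`, which is a requirement of `chunks`.
fn segments(payload: &[u8], segment_size: Option<usize>) -> impl Iterator<Item = &[u8]> {
    payload.chunks(segment_size.unwrap_or(usize::MAX))
}

fn main() {
    let payload = [1u8, 2, 3];
    println!("default: {} segment(s)", segments(&payload, None).count());
    println!("size 2:  {} segment(s)", segments(&payload, Some(2)).count());
}
```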

martinthomson · 2024-10-03T07:36:26Z

neqo-common/src/datagram.rs

+    }
+}
+
+impl From<Datagram<&[u8]>> for Datagram {


Should this be a ToOwned implementation instead?

I don't think implementing ToOwned is possible.

pub trait ToOwned {
    type Owned: Borrow<Self>;
    fn to_owned(&self) -> Self::Owned;
}

https://doc.rust-lang.org/alloc/borrow/trait.ToOwned.html

ToOwned requires the associated type Owned to implement Borrow<Self>, in other words for Datagram<Vec<u8>> to implement Borrow<Datagram<&'a [u8]>>. I don't think that is possible, see #2093 (comment).
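For reference, the From-based conversion from a borrowed to an owned datagram can be sketched as follows. The Datagram here is a hypothetical two-field reduction of the real type; the impl header mirrors the one in the diff above.

```rust
/// Hypothetical, reduced stand-in for the generic `Datagram`.
#[derive(Debug, PartialEq)]
struct Datagram<D = Vec<u8>> {
    segment_size: usize,
    d: D,
}

/// Borrowed to owned: this is where the single payload allocation happens.
impl From<Datagram<&[u8]>> for Datagram {
    fn from(value: Datagram<&[u8]>) -> Self {
        Self {
            segment_size: value.segment_size,
            d: value.d.to_vec(),
        }
    }
}

fn main() {
    let receive_buffer = [1u8, 2, 3];
    let view: Datagram<&[u8]> = Datagram { segment_size: 2, d: &receive_buffer[..] };
    // The owned copy outlives the receive buffer's next reuse.
    let owned: Datagram = view.into();
    println!("{owned:?}");
}
```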

martinthomson · 2024-10-03T07:36:50Z

neqo-common/src/datagram.rs

    #[must_use]
    fn deref(&self) -> &Self::Target {
        &self.d
    }
 }

-impl std::fmt::Debug for Datagram {
+impl<'a> From<&'a Datagram> for Datagram<&'a [u8]> {


Should this be a Borrow implementation instead?

I don't think implementing Borrow is possible.

Here is the trait definition:

pub trait Borrow<Borrowed: ?Sized> {
    fn borrow(&self) -> &Borrowed;
}

https://doc.rust-lang.org/alloc/borrow/trait.Borrow.html

The problematic part is the return value, i.e. &Borrowed. It would need to return a &Datagram<&'a [u8]>. Given that the Datagram<&'a [u8]> would need to be instantiated within the borrow implementation, the reference would point to a temporary value.

I have created the following simplified playground as a showcase.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=177f0e0f97ebbce7376092c924c72c55
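Independent of the playground, the contrast can be sketched like this: From returns the borrowed view by value, so nothing has to outlive the function body, whereas Borrow::borrow would have to return a reference to such a view. The Datagram below is a hypothetical two-field reduction of the real type.

```rust
/// Hypothetical, reduced stand-in for the generic `Datagram`.
#[derive(Debug, PartialEq)]
struct Datagram<D = Vec<u8>> {
    segment_size: usize,
    d: D,
}

/// Owned to borrowed: the view is constructed here and returned by value.
/// `Borrow::borrow` would instead need `&Datagram<&[u8]>`, i.e. a reference
/// to a value that only exists as a temporary inside the method.
impl<'a> From<&'a Datagram> for Datagram<&'a [u8]> {
    fn from(value: &'a Datagram) -> Self {
        Self {
            segment_size: value.segment_size,
            d: &value.d,
        }
    }
}

fn main() {
    let owned = Datagram { segment_size: 3, d: vec![1u8, 2, 3] };
    let view = Datagram::<&[u8]>::from(&owned);
    println!("{view:?}");
}
```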

mxinden · 2024-10-03T17:22:03Z

Thank you for the reviews @larseggert and @martinthomson.

It seems to me like you could do this in pieces, starting with the encoder/decoder changes.

Sounds good. I will address the comments here and then break this large pull request into smaller ones.

Back to draft for now.

mxinden mentioned this pull request Sep 8, 2024

perf: don't allocate in UDP recv path #2076

Closed

martinthomson reviewed Sep 9, 2024

mxinden added 5 commits September 9, 2024 09:30

Merge Encoder impl blocks

b334e84

Fix tests

2db53a2

clippy

9fef795

fix some, ignore some

995c499

Always run bench

1c653de

mxinden added 17 commits September 14, 2024 15:43

First process_input then process_http3

c05bc64

Revert "fix some, ignore some"

08eba9d

This reverts commit 995c499.

Remove process_multiple_input

ae112c8

Cleanup classic process fn delegating to process_x_2

828da75

Consolidate process functions

dfa33b2

Rename process to process_alloc

763b391

Rename process_into to process

8dee7b3

New TODO

5576875

Copy only for Datagram &[u8]

b9457bb

Fix more tests

94d1a68

remove all public process_output

e2d1452

One can just use process(None, ...)

Merge branch 'main' of https://github.com/mozilla/neqo into send-recv…

2e2a76e

…-no-alloc

Remove process_2

1947d33

Intra doc links

8ea56b2

Thread local receive buffer

3df6660

Cleanup UdpSocket::recv_inner

52dfa91

Revert "Thread local receive buffer"

8699209

This reverts commit 3df6660.

mxinden added 3 commits September 30, 2024 19:42

Copy segment size when saving datagram

5ad1f35

Line break comment

136a8c4

Use Datagram::num_segments

b070374

mxinden added 4 commits September 30, 2024 20:05

Revert "Remove Http3Client::process_input"

55219b8

This reverts commit 760aade.

Mark process_input for testing only

e247d47

Fix GRO test

6b30ee5

Remove reference to process_input

f57d1b2

mxinden marked this pull request as ready for review September 30, 2024 19:06

mxinden mentioned this pull request Oct 1, 2024

ci: macos failing with bindings/nspr_io.h:7:10: fatal error: 'prio.h' file not found #2141

Closed

Merge branch 'main' of https://github.com/mozilla/neqo into send-recv…

f866a2a

…-no-alloc

larseggert approved these changes Oct 3, 2024

larseggert mentioned this pull request Oct 3, 2024

chore: prepare v0.9.1 release #2148

Merged

martinthomson reviewed Oct 3, 2024

mxinden marked this pull request as draft October 3, 2024 17:22

mxinden added 6 commits October 3, 2024 19:45

%s/_inx/_index

24ba2d4

Return error and do debug_assert on non-empty send buf

fe6ed10

%s/with/in

6aa438d

Remove impl Copy for Datagram

364f5bd

Introduce BorrowedDatagram

f5b2d7f

Encoder encode into &mut [u8] instead of &mut Vec<u8>

bbb93cc
