-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: don't allocate in UDP send & recv path #2093
base: main
Are you sure you want to change the base?
Conversation
This change is best summarized by the `process` function signature. On `main` branch the `process` function looks as such: ```rust pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output { ``` - It takes as **input** an optional reference to a `Datagram`. That `Datagram` owns an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each incoming UDP datagram, its payload is allocated in a new `Vec`. - It returns as **output** an owned `Output`. Most relevantly the `Output` variant `Output::Datagram(Datagram)` contains a `Datagram` that again owns an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each outgoing UDP datagram too, its payload is allocated in a new `Vec`. This commit changes the `process` function to: ```rust pub fn process_into<'a>( &mut self, input: Option<Datagram<&[u8]>>, now: Instant, write_buffer: &'a mut Vec<u8>, ) -> Output<&'a [u8]> { ``` (Note the rename to `process_into` is temporary.) - It takes as **input** an optional `Datagram<&[u8]>`. But contrary to before, `Datagram<&[u8]>` does not own an allocation of the UDP payload, but represents a view into a long-lived receive buffer containing the UDP payload. - It returns as **output** an `Output<&'a [u8]>` where the `Output::Datagram(Datagram<&'a [u8]>)` variant does not own an allocation of the UDP payload, but here as well represents a view into a long-lived write buffer the payload is written into. That write buffer lives outside of `neqo_transport::Connection` and is provided to `process` as `write_buffer: &'a mut Vec<u8>`. Note that both `write_buffer` and `Output` use the lifetime `'a`, i.e. the latter is a view into the former. This change to the `process` function enables the following: 1. A user of `neqo_transport` (e.g. `neqo_bin`) has the OS write incoming UDP datagrams into a long-lived receive buffer (via e.g. `recvmmsg`). 2. They pass that receive buffer to `neqo_transport::Connection::process` along with a long-lived write buffer. 3. `process` reads the UDP datagram from the long-lived receive buffer through the `Datagram<&[u8]>` view and writes outgoing datagrams into the provided long-lived `write_buffer`, returning a view into said buffer via a `Datagram<&'a [u8]>`. 4. The user, after having called `process` can then pass the write buffer to the OS (e.g. via `sendmsg`). To summarize a user can receive and send UDP datagrams, without allocation in the UDP IO path. As an aside, the above is compatible with GSO and GRO, where a send and receive buffer contains a consecutive number of UDP datagram segments.
Failed Interop TestsQUIC Interop Runner, client vs. server neqo-latest as client
neqo-latest as server
All resultsSucceeded Interop TestsQUIC Interop Runner, client vs. server neqo-latest as client
neqo-latest as server
Unsupported Interop TestsQUIC Interop Runner, client vs. server neqo-latest as client
neqo-latest as server
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2093 +/- ##
==========================================
+ Coverage 95.35% 95.37% +0.01%
==========================================
Files 112 112
Lines 36357 36715 +358
==========================================
+ Hits 34669 35016 +347
- Misses 1688 1699 +11 ☔ View full report in Codecov by Sentry. |
Benchmark resultsPerformance differences relative to 55e3a93. coalesce_acked_from_zero 1+1 entries: 💚 Performance has improved.time: [98.877 ns 99.204 ns 99.534 ns] change: [-12.567% -11.944% -11.215%] (p = 0.00 < 0.05) coalesce_acked_from_zero 3+1 entries: 💚 Performance has improved.time: [116.94 ns 117.71 ns 118.88 ns] change: [-33.464% -33.143% -32.768%] (p = 0.00 < 0.05) coalesce_acked_from_zero 10+1 entries: 💚 Performance has improved.time: [116.55 ns 117.02 ns 117.57 ns] change: [-39.686% -35.379% -32.759%] (p = 0.00 < 0.05) coalesce_acked_from_zero 1000+1 entries: 💚 Performance has improved.time: [97.876 ns 98.011 ns 98.159 ns] change: [-31.803% -31.179% -30.566%] (p = 0.00 < 0.05) RxStreamOrderer::inbound_frame(): Change within noise threshold.time: [111.73 ms 111.78 ms 111.83 ms] change: [+0.2898% +0.3546% +0.4219%] (p = 0.00 < 0.05) transfer/pacing-false/varying-seeds: No change in performance detected.time: [26.340 ms 27.597 ms 28.883 ms] change: [-8.9450% -3.4441% +2.3427%] (p = 0.25 > 0.05) transfer/pacing-true/varying-seeds: No change in performance detected.time: [34.417 ms 36.080 ms 37.798 ms] change: [-11.452% -5.8066% +0.6803%] (p = 0.07 > 0.05) transfer/pacing-false/same-seed: No change in performance detected.time: [26.051 ms 26.928 ms 27.833 ms] change: [-5.5948% -1.5392% +3.1644%] (p = 0.50 > 0.05) transfer/pacing-true/same-seed: No change in performance detected.time: [43.023 ms 45.160 ms 47.311 ms] change: [-3.2363% +3.2694% +10.031%] (p = 0.33 > 0.05) 1-conn/1-100mb-resp (aka. Download)/client: 💚 Performance has improved.time: [104.12 ms 104.36 ms 104.60 ms] thrpt: [956.00 MiB/s 958.22 MiB/s 960.45 MiB/s] change: time: [-10.280% -9.9749% -9.6778%] (p = 0.00 < 0.05) thrpt: [+10.715% +11.080% +11.458%] 1-conn/10_000-parallel-1b-resp (aka. RPS)/client: 💔 Performance has regressed.time: [326.60 ms 329.71 ms 332.82 ms] thrpt: [30.047 Kelem/s 30.330 Kelem/s 30.619 Kelem/s] change: time: [+1.3183% +2.9210% +4.4778%] (p = 0.00 < 0.05) thrpt: [-4.2859% -2.8381% -1.3012%] 1-conn/1-1b-resp (aka. HPS)/client: 💔 Performance has regressed.time: [34.715 ms 34.913 ms 35.130 ms] thrpt: [28.465 elem/s 28.643 elem/s 28.806 elem/s] change: time: [+2.6418% +3.4761% +4.2420%] (p = 0.00 < 0.05) thrpt: [-4.0693% -3.3593% -2.5738%] Client/server transfer resultsTransfer of 33554432 bytes over loopback.
|
This looks promising. |
# This workflow first removes the workarounds necessary to please NLL and then | ||
# runs with Polonius to ensure each workaround only fixes the false-positive of | ||
# NLL and doesn't mask an actually undefined behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat unconventional. Still, I think statically proving unsafe
code to be safe is worth it. As always, open for alternative suggestions.
/// The size limit of [`Self::encoder`], i.e. the maximum number of bytes to | ||
/// be encoded into [`Self::encoder`]. Note that [`Self::encoder`] might | ||
/// contain one or more packets already. | ||
limit: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we would allocate a new Vec
for each UDP datagram, with the capacity of the Vec
set to the limit.
Now we have one long-lived receive buffer, with a fixed capacity. In order to enforce the limit PacketBuilder
now owns the limit in its limit: usize
, i.e. is explicit about it, instead of implicitly depending on the capacity of the Vec
of PacketBuilder::encoder
.
.github/workflows/polonius.diff
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See .github/workflows/polonius.yml
right below.
This pull request is ready for a full review. All TODOs are addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, some suggestions.
neqo-bin/src/server/mod.rs
Outdated
} | ||
|
||
async fn process(&mut self, mut dgram: Option<&Datagram>) -> Result<(), io::Error> { | ||
async fn process(&mut self, mut socket_inx: Option<usize>) -> Result<(), io::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_inx
?
pub fn process_output(&mut self, now: Instant) -> Output { | ||
self.process(None, now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov says these lines are not covered, is this maybe an unused function?
neqo-transport/src/connection/mod.rs
Outdated
@@ -2319,7 +2375,8 @@ impl Connection { | |||
|
|||
// Frames for different epochs must go in different packets, but then these | |||
// packets can go in a single datagram | |||
let mut encoder = Encoder::with_capacity(profile.limit()); | |||
assert_eq!(out.len(), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this rather be a debug_assert
? Or return an error when the condition hits?
// TODO: I don't know what the 64 is all about. Thus leaving the infer_limit function intact | ||
// for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @martinthomson
We should also maybe make 64
and 2048
appropriately-named const
s, if we determine what they do & that we want to keep them.
pub fn process_output(&mut self, now: Instant) -> Output { | ||
self.process(None, now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not covered by tests; unused function?
neqo-transport/src/connection/mod.rs
Outdated
now: Instant, | ||
out: &'a mut Vec<u8>, | ||
) -> Output<&'a [u8]> { | ||
assert!(out.is_empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug_assert
or return error?
let burn = prot.encrypt(0, &[], &[]).expect("burn OK"); | ||
assert_eq!(burn.len(), prot.expansion()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but these should probably return errors and not panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this particular line is part of a unit test. What would be the benefit of returning an error over right away panicking @larseggert?
(Agreed on your comments above on non-unit-test lines, using debug_assert
and returning an error in release mode.)
neqo-transport/src/server.rs
Outdated
) -> Output { | ||
out: &'a mut Vec<u8>, | ||
) -> Output<&'a [u8]> { | ||
assert!(out.is_empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug_assert
or return error?
neqo-transport/src/server.rs
Outdated
now: Instant, | ||
out: &'a mut Vec<u8>, | ||
) -> Output<&'a [u8]> { | ||
assert!(out.is_empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug_assert
or return error?
neqo-transport/src/server.rs
Outdated
@@ -426,11 +440,17 @@ impl Server { | |||
|
|||
/// Iterate through the pending connections looking for any that might want | |||
/// to send a datagram. Stop at the first one that does. | |||
fn process_next_output(&mut self, now: Instant) -> Output { | |||
fn process_next_output<'a>(&mut self, now: Instant, out: &'a mut Vec<u8>) -> Output<&'a [u8]> { | |||
assert!(out.is_empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug_assert
or return error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me like you could do this in pieces, starting with the encoder/decoder changes.
fn process_output_into_buffer<'a>( | ||
&mut self, | ||
now: Instant, | ||
out: &'a mut Vec<u8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why &mut Vec<u8>
and not &mut [u8]
? I don't think that -- as long as we are accepting someone else's memory -- we should be reallocating it.
@@ -1146,18 +1162,37 @@ impl Connection { | |||
} | |||
} | |||
|
|||
/// Process input and generate output. | |||
/// Same as [`Connection::process_into_buffer`] but allocating output into | |||
/// new [`Vec`]. | |||
#[must_use = "Output of the process function must be handled"] | |||
pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not Option<impl Into<Datagram<&'a [u8]>>>
?
neqo-transport/src/connection/mod.rs
Outdated
#[must_use = "Output of the process function must be handled"] | ||
pub fn process_into_buffer<'a>( | ||
&mut self, | ||
input: Option<Datagram<&[u8]>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above.
let d = Datagram::new( | ||
d.source(), | ||
d.destination(), | ||
d.tos(), | ||
d[d.len() - remaining..].to_vec(), | ||
Some(d.segment_size()), | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could implement a to_owned()
for Datagram
rather than do it this way.
let mut slc = &d[..]; | ||
let mut dcid = None; | ||
fn input_path(&mut self, path: &PathRef, d: Datagram<&[u8]>, now: Instant) -> Res<()> { | ||
for mut slc in d.iter_segments() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to say, this is too much for me. The idea that a datagram is actually multiple datagrams is not something I'm comfortable with.
Making this function more complicated is less than ideal as well.
neqo-common/src/datagram.rs
Outdated
d: d.into(), | ||
} | ||
} | ||
impl Copy for Datagram<&[u8]> {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? That's a pretty large thing to throw around like that. I might prefer not to have it implement Copy. I understand why this might be convenient, but that convenience hides actual work. A Datagram will be at least 59 bytes in size, plus adjustments for alignment.
src, | ||
dst, | ||
tos, | ||
segment_size: segment_size.unwrap_or_else(|| d.as_ref().len()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
segment_size: segment_size.unwrap_or_else(|| d.as_ref().len()), | |
segment_size: segment_size.unwrap_or(usize::MAX), |
You always iterate and truncate the last chunk, so this is cheaper.
neqo-common/src/datagram.rs
Outdated
} | ||
} | ||
|
||
impl From<Datagram<&[u8]>> for Datagram { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a ToOwned implementation instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think implementing ToOwned
is possible.
pub trait ToOwned {
type Owned: Borrow<Self>;
fn to_owned(&self) -> Self::Owned;
}
https://doc.rust-lang.org/alloc/borrow/trait.ToOwned.html
ToOwned
requires the associated type Owned
to implement Borrow<Self>
, in other words for Datagram<Vec<u8>>
to implement Borrow<Datagram<&'a [u8]>
. I don't think that is possible, see #2093 (comment).
neqo-common/src/datagram.rs
Outdated
#[must_use] | ||
fn deref(&self) -> &Self::Target { | ||
&self.d | ||
} | ||
} | ||
|
||
impl std::fmt::Debug for Datagram { | ||
impl<'a> From<&'a Datagram> for Datagram<&'a [u8]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a Borrow implementation instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think implementing Borrow
is possible.
Here is the trait definition:
pub trait Borrow<Borrowed: ?Sized> {
fn borrow(&self) -> &Borrowed;
}
https://doc.rust-lang.org/alloc/borrow/trait.Borrow.html
The problematic part is the return value, i.e. &Borrowed
. It would need to return a &Datagram<&'a [u8]>
. Given that the Datagram<&'a [u8]>
would need to be instantiated within the borrow
implementation, the reference would point to a temporary value.
I have created the following simplified playground as a showcase.
Thank you for the reviews @larseggert and @martinthomson.
Sounds good. I will address the comments here and then break this large pull request into smaller ones Back to draft for now. |
Description
This change is best summarized by the
process
function signature.On
main
branch theprocess
function looks as such:Datagram
. ThatDatagram
owns an allocation of the UDP payload, i.e. aVec<u8>
. Thus for each incoming UDP datagram, its payload is allocated in a newVec
.Output
. Most relevantly theOutput
variantOutput::Datagram(Datagram)
contains aDatagram
that again owns an allocation of the UDP payload, i.e. aVec<u8>
. Thus for each outgoing UDP datagram too, its payload is allocated in a newVec
.This commit changes the
process
function to:Datagram<&[u8]>
. But contrary to before,Datagram<&[u8]>
does not own an allocation of the UDP payload, but represents a view into a long-lived receive buffer containing the UDP payload.Output<&'a [u8]>
where theOutput::Datagram(Datagram<&'a [u8]>)
variant does not own an allocation of the UDP payload, but here as well represents a view into a long-lived write buffer the payload is written into. That write buffer lives outside ofneqo_transport::Connection
and is provided toprocess_into_buffer
aswrite_buffer: &'a mut Vec<u8>
. Note that bothwrite_buffer
andOutput
use the lifetime'a
, i.e. the latter is a view into the former.This change to the
process
function enables the following:neqo_transport
(e.g.neqo_bin
) has the OS write incoming UDP datagrams into a long-lived receive buffer (via e.g.recvmmsg
).neqo_transport::Connection::process_into_buffer
along with a long-lived write buffer.process_into_buffer
reads the UDP datagram from the long-lived receive buffer through theDatagram<&[u8]>
view and writes outgoing datagrams into the provided long-livedwrite_buffer
, returning a view into said buffer via aDatagram<&'a [u8]>
.process_into_buffer
can then pass the write buffer to the OS (e.g. viasendmsg
).To summarize a user can receive and send UDP datagrams, without allocation in the UDP IO path.
As an aside, the above is compatible with GSO and GRO, where a send and receive buffer contains a consecutive number of UDP datagram segments.
Performance impact
Early benchmarks are promising, showing e.g. a 10% improvement in the Download benchmark, and up to 40% improvement in the neqo-neqo-reno-pacing benchmark.
This pull request
Current
main
for comparisonhttps://github.com/mozilla/neqo/actions/runs/10850817785
Pull request status
This pull request is ready for review.
Replaces #2076.
Part of #1693.
Closes #1922.