Sigverify wraps each PacketBatch with Arc #3035

Open · wants to merge 6 commits into master
Conversation

apfitzge

Problem

  • Currently the SigVerify stage passes Arc<(Vec<PacketBatch>, Option<SigverifyTracerPacketStats>)> to banking_trace/banking_stage.
  • We want to support zero-copy for the new transaction type.
  • If we Arc::clone the current message type, the worst-case overhead per packet is very large: keeping even a single packet from a message alive keeps every batch in that message alive.
  • Wrapping each PacketBatch in its own Arc reduces this worst-case overhead considerably (see the sketch below).
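
A rough sketch of the shape change (a sketch only; the stub types and aliases below stand in for the real agave structs and are not names from the codebase):

use std::sync::Arc;

// Stand-ins for the real types, just to make the shapes concrete:
struct PacketBatch;
struct SigverifyTracerPacketStats;

// Before: one Arc over the whole message. A retained Arc::clone keeps
// every PacketBatch in the message alive.
type BeforeMsg = Arc<(Vec<PacketBatch>, Option<SigverifyTracerPacketStats>)>;

// After: each PacketBatch is individually Arc-wrapped, so a retained
// packet pins only its own batch.
type AfterMsg = Arc<(Vec<Arc<PacketBatch>>, Option<SigverifyTracerPacketStats>)>;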

Summary of Changes

  • Wrap each PacketBatch in an Arc before sending to BankingStage

Fixes #

core/src/sigverify.rs (review thread, resolved)
@@ -65,7 +65,7 @@ pub struct BankingTracer {
 #[cfg_attr(
     feature = "frozen-abi",
     derive(AbiExample),
-    frozen_abi(digest = "F5GH1poHbPqipU4DB3MczhSxHZw4o27f3C7QnMVirFci")
+    frozen_abi(digest = "GxZSv4cLjY97v6UospZnyfKurN6UciYqDvFgZJByP9KW")
apfitzge (Author)

No actual change in serialization because the Arc itself isn't serialized, only its contents.
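
One way to see why, as a minimal sketch (assuming serde's "rc" feature for the Arc impl and bincode purely as an example serializer; neither is claimed to be what the frozen-abi machinery uses):

use std::sync::Arc;

fn main() {
    // With serde's "rc" feature, Arc<T> serializes exactly as T.
    let plain = bincode::serialize(&vec![1u8, 2, 3]).unwrap();
    let arced = bincode::serialize(&Arc::new(vec![1u8, 2, 3])).unwrap();
    assert_eq!(plain, arced); // identical bytes: the Arc is invisible on the wire
}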

@apfitzge apfitzge marked this pull request as ready for review September 30, 2024 18:39
@@ -366,17 +366,17 @@ fn recv_send(

pub fn recv_packet_batches(
apfitzge (Author)

We do the Arc wrapping here; as far as I can tell, this is only called by the sigverify stage.
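
Roughly the shape of that wrapping, as an illustrative sketch (the actual signature of recv_packet_batches and its receiver types may differ):

use std::sync::Arc;

struct PacketBatch; // stand-in for the real type

// Take ownership of each received batch and wrap it in its own Arc
// before handing it to BankingStage; downstream clones are then cheap.
fn wrap_batches(batches: Vec<PacketBatch>) -> Vec<Arc<PacketBatch>> {
    batches.into_iter().map(Arc::new).collect()
}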

@apfitzge (Author)

@alessandrod it would be nice to profile this to know the overhead of the additional Arc wrapping/copying of PacketBatch that we're doing.

I think it should be relatively minor since (see the quick check below):

  1. the heap allocation is quite small (40 bytes [PacketBatch] + 16 bytes [Arc strong & weak counts] = 56 bytes)
  2. the copy is small (40 bytes)
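
A quick check of those numbers (64-bit target assumed; the 40-byte stand-in below mirrors the PacketBatch size quoted above):

use std::mem::size_of;

struct PacketBatchStandIn([u8; 40]); // sized like the real PacketBatch

fn main() {
    // Arc's heap block holds the strong and weak counts plus the payload.
    let refcounts = 2 * size_of::<usize>(); // 16 bytes on 64-bit
    let payload = size_of::<PacketBatchStandIn>(); // 40 bytes
    assert_eq!(refcounts + payload, 56); // heap allocation per wrapped batch
    // Moving the batch into Arc::new copies those 40 bytes once;
    // each subsequent Arc::clone copies only an 8-byte pointer.
}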

@apfitzge apfitzge self-assigned this Oct 1, 2024
@bw-solana left a comment

Couple minor comments, but LGTM.

Do we have any existing bencher or anything that will help understand the improvement from not copying? Or any performance data you've collected?

core/benches/banking_stage.rs (two review threads, resolved)

apfitzge commented Oct 1, 2024

> Do we have any existing bencher or anything that will help understand the improvement from not copying? Or any performance data you've collected?

I didn't include these as standard benches since we don't plan to copy, but I made a few-line change to compare the transaction-view benches with and without copying.
See below for details: between 0% and 81% loss of throughput from copying the slice to a Vec<u8>.
Results would probably differ slightly if we cloned a Packet directly, since that always copies the full 1232 bytes, but on the stack.

Benchmark Results
min sized transactions/TransactionView
                        time:   [20.421 µs 20.462 µs 20.499 µs]
                        thrpt:  [49.954 Melem/s 50.044 Melem/s 50.145 Melem/s]
                 change:
                        time:   [+427.14% +429.30% +431.42%] (p = 0.00 < 0.05)
                        thrpt:  [-81.182% -81.107% -81.030%]
                        Performance has regressed.

simple transfers/TransactionView
                        time:   [22.925 µs 23.001 µs 23.082 µs]
                        thrpt:  [44.364 Melem/s 44.521 Melem/s 44.667 Melem/s]
                 change:
                        time:   [+375.62% +377.80% +379.86%] (p = 0.00 < 0.05)
                        thrpt:  [-79.161% -79.071% -78.975%]
                        Performance has regressed.

packed transfers/TransactionView
                        time:   [279.81 µs 279.95 µs 280.14 µs]
                        thrpt:  [3.6553 Melem/s 3.6578 Melem/s 3.6596 Melem/s]
                 change:
                        time:   [+9.4727% +10.020% +10.533%] (p = 0.00 < 0.05)
                        thrpt:  [-9.5290% -9.1074% -8.6531%]
                        Performance has regressed.

packed noops/TransactionView
                        time:   [1.5705 ms 1.5728 ms 1.5755 ms]
                        thrpt:  [649.97 Kelem/s 651.05 Kelem/s 652.01 Kelem/s]
                 change:
                        time:   [+0.8979% +1.3427% +1.7919%] (p = 0.00 < 0.05)
                        thrpt:  [-1.7604% -1.3249% -0.8899%]
                        Change within noise threshold.

packed atls/TransactionView
                        time:   [158.94 µs 159.12 µs 159.34 µs]
                        thrpt:  [6.4265 Melem/s 6.4354 Melem/s 6.4426 Melem/s]
                 change:
                        time:   [+18.663% +19.145% +19.649%] (p = 0.00 < 0.05)
                        thrpt:  [-16.422% -16.069% -15.728%]
                        Performance has regressed.
Description of Changes for Benchmark

Normal benchmark code creates the view using a reference to the byte slice:

                let _ = TransactionView::try_new_unsanitized(black_box(bytes.as_ref())).unwrap();

The copying variant converts the byte slice to a Vec<u8>:

                let _ = TransactionView::try_new_unsanitized(black_box(bytes.to_vec())).unwrap();

and adds an impl of TransactionData for Vec<u8>:

impl TransactionData for Vec<u8> {
    #[inline]
    fn data(&self) -> &[u8] {
        &self
    }
}
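
For context, the two variants slot into a criterion bench along these lines (a sketch; the bench names and the serialized_transactions setup helper are hypothetical, not the repo's actual bench code):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_copy_vs_borrow(c: &mut Criterion) {
    let bytes: Vec<u8> = serialized_transactions(); // hypothetical setup helper
    c.bench_function("view_over_slice", |b| {
        // zero-copy variant: parse directly over the borrowed bytes
        b.iter(|| TransactionView::try_new_unsanitized(black_box(bytes.as_ref())).unwrap())
    });
    c.bench_function("view_over_vec", |b| {
        // copying variant: allocate and memcpy before parsing
        b.iter(|| TransactionView::try_new_unsanitized(black_box(bytes.to_vec())).unwrap())
    });
}

criterion_group!(benches, bench_copy_vs_borrow);
criterion_main!(benches);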

@bw-solana

> Benchmark Results [quoted from above]

Results are somewhat surprising to me. I would have naively expected performance improvement to be correlated with size of transactions (e.g. packed transfers improve more than simple transfers). Any idea why that's not the case?


apfitzge commented Oct 3, 2024

> Results are somewhat surprising to me. I would have naively expected performance improvement to be correlated with size of transactions (e.g. packed transfers improve more than simple transfers). Any idea why that's not the case?

Small transactions are (now) pretty damn fast to parse. Any overhead from copying (or, in this case, reaching out to the allocator to get a Vec) is large in comparison to the parsing time.

When we get to larger transactions, parsing does become fairly expensive, so the relative impact of the allocation/copy overhead is lower.
If we do no allocation and copy only the transaction's actual size into a pre-allocated buffer, minimally sized txs are still ~20% slower than the Arc-clone approach, which is more in line with my original expectation.

I hadn't stated this previously, but we could probably do pretty well by copying data into pre-allocated buffers.
BUT that requires us to do some odd things around our transaction container to re-use buffer space, and it makes the buffering of transactions more complicated than a simple map we can insert into (see the sketch below).
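
To make the "odd things" concrete, here is an illustrative sketch (not from the PR) of the bookkeeping that buffer re-use forces on the container, which a plain map-based container avoids entirely:

// Illustrative only: a fixed-slot pool for copying transactions into
// pre-allocated buffers.
const MAX_TX_SIZE: usize = 1232; // max serialized transaction size

struct BufferPool {
    slots: Vec<[u8; MAX_TX_SIZE]>,
    free: Vec<usize>, // indices of unused slots
}

impl BufferPool {
    fn with_capacity(n: usize) -> Self {
        Self { slots: vec![[0; MAX_TX_SIZE]; n], free: (0..n).collect() }
    }

    // Copy `data` into a free slot; fails when the pool is exhausted.
    fn acquire(&mut self, data: &[u8]) -> Option<usize> {
        let idx = self.free.pop()?;
        self.slots[idx][..data.len()].copy_from_slice(data);
        Some(idx)
    }

    // Every drop/eviction must remember to return its slot, or slots leak.
    fn release(&mut self, idx: usize) {
        self.free.push(idx);
    }
}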
