Sigverify wraps each PacketBatch with Arc #3035

Open · wants to merge 6 commits into master
Conversation

apfitzge

Problem

  • Currently the SigVerify stage passes Arc<(Vec<PacketBatch>, Option<SigverifyTracerPacketStats>)> to banking_trace/banking_stage.
  • We want to support zero-copy for the new transaction type.
  • If we Arc::clone the current message type, the worst-case overhead per packet is very large: keeping even a single packet from a message alive keeps every batch in that message alive.
  • Wrapping each PacketBatch in its own Arc reduces this worst-case overhead considerably (see the sketch below).
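
A rough sketch of the shape change (a sketch only; the stub types and aliases below stand in for the real agave structs and are not names from the codebase):

use std::sync::Arc;

// Stand-ins for the real types, just to make the shapes concrete:
struct PacketBatch;
struct SigverifyTracerPacketStats;

// Before: one Arc over the whole message. A retained Arc::clone keeps
// every PacketBatch in the message alive.
type BeforeMsg = Arc<(Vec<PacketBatch>, Option<SigverifyTracerPacketStats>)>;

// After: each PacketBatch is individually Arc-wrapped, so a retained
// packet pins only its own batch.
type AfterMsg = Arc<(Vec<Arc<PacketBatch>>, Option<SigverifyTracerPacketStats>)>;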

Summary of Changes

  • Wrap each PacketBatch in an Arc before sending to BankingStage

Fixes #

core/src/sigverify.rs (review thread, resolved)
@@ -65,7 +65,7 @@ pub struct BankingTracer {
 #[cfg_attr(
     feature = "frozen-abi",
     derive(AbiExample),
-    frozen_abi(digest = "F5GH1poHbPqipU4DB3MczhSxHZw4o27f3C7QnMVirFci")
+    frozen_abi(digest = "GxZSv4cLjY97v6UospZnyfKurN6UciYqDvFgZJByP9KW")
apfitzge (Author)

No actual change in serialization because the Arc itself isn't serialized, only its contents.
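
One way to see why, as a minimal sketch (assuming serde's "rc" feature for the Arc impl and bincode purely as an example serializer; neither is claimed to be what the frozen-abi machinery uses):

use std::sync::Arc;

fn main() {
    // With serde's "rc" feature, Arc<T> serializes exactly as T.
    let plain = bincode::serialize(&vec![1u8, 2, 3]).unwrap();
    let arced = bincode::serialize(&Arc::new(vec![1u8, 2, 3])).unwrap();
    assert_eq!(plain, arced); // identical bytes: the Arc is invisible on the wire
}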

@apfitzge apfitzge marked this pull request as ready for review September 30, 2024 18:39
@@ -366,17 +366,17 @@ fn recv_send(

pub fn recv_packet_batches(
apfitzge (Author)

We do the Arc wrapping here; as far as I can tell, this is only called by the sigverify stage.
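
Roughly the shape of that wrapping, as an illustrative sketch (the actual signature of recv_packet_batches and its receiver types may differ):

use std::sync::Arc;

struct PacketBatch; // stand-in for the real type

// Take ownership of each received batch and wrap it in its own Arc
// before handing it to BankingStage; downstream clones are then cheap.
fn wrap_batches(batches: Vec<PacketBatch>) -> Vec<Arc<PacketBatch>> {
    batches.into_iter().map(Arc::new).collect()
}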

@apfitzge (Author)

@alessandrod it would be nice to profile this to know the overhead of the additional Arc wrapping/copying of PacketBatch that we're doing.

I think it should be relatively minor since (see the quick check below):

  1. the heap allocation is quite small (40 bytes [PacketBatch] + 16 bytes [Arc strong & weak counts] = 56 bytes)
  2. the copy is small (40 bytes)
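
A quick check of those numbers (64-bit target assumed; the 40-byte stand-in below mirrors the PacketBatch size quoted above):

use std::mem::size_of;

struct PacketBatchStandIn([u8; 40]); // sized like the real PacketBatch

fn main() {
    // Arc's heap block holds the strong and weak counts plus the payload.
    let refcounts = 2 * size_of::<usize>(); // 16 bytes on 64-bit
    let payload = size_of::<PacketBatchStandIn>(); // 40 bytes
    assert_eq!(refcounts + payload, 56); // heap allocation per wrapped batch
    // Moving the batch into Arc::new copies those 40 bytes once;
    // each subsequent Arc::clone copies only an 8-byte pointer.
}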

@apfitzge apfitzge self-assigned this Oct 1, 2024
@bw-solana left a comment

Couple minor comments, but LGTM.

Do we have any existing bencher or anything that will help understand the improvement from not copying? Or any performance data you've collected?

core/benches/banking_stage.rs (two review threads, resolved)

apfitzge commented Oct 1, 2024

> Do we have any existing bencher or anything that will help understand the improvement from not copying? Or any performance data you've collected?

I didn't include these as standard benches since we don't plan to copy, but I made a few-line change to compare the transaction-view benches with and without copying.
See below for details: between 0% and 81% loss of throughput from copying the slice to a Vec<u8>.
Results would probably differ slightly if we cloned a Packet directly, since that always copies the full 1232 bytes, but on the stack.

Benchmark Results
min sized transactions/TransactionView
                        time:   [20.421 µs 20.462 µs 20.499 µs]
                        thrpt:  [49.954 Melem/s 50.044 Melem/s 50.145 Melem/s]
                 change:
                        time:   [+427.14% +429.30% +431.42%] (p = 0.00 < 0.05)
                        thrpt:  [-81.182% -81.107% -81.030%]
                        Performance has regressed.

simple transfers/TransactionView
                        time:   [22.925 µs 23.001 µs 23.082 µs]
                        thrpt:  [44.364 Melem/s 44.521 Melem/s 44.667 Melem/s]
                 change:
                        time:   [+375.62% +377.80% +379.86%] (p = 0.00 < 0.05)
                        thrpt:  [-79.161% -79.071% -78.975%]
                        Performance has regressed.

packed transfers/TransactionView
                        time:   [279.81 µs 279.95 µs 280.14 µs]
                        thrpt:  [3.6553 Melem/s 3.6578 Melem/s 3.6596 Melem/s]
                 change:
                        time:   [+9.4727% +10.020% +10.533%] (p = 0.00 < 0.05)
                        thrpt:  [-9.5290% -9.1074% -8.6531%]
                        Performance has regressed.

packed noops/TransactionView
                        time:   [1.5705 ms 1.5728 ms 1.5755 ms]
                        thrpt:  [649.97 Kelem/s 651.05 Kelem/s 652.01 Kelem/s]
                 change:
                        time:   [+0.8979% +1.3427% +1.7919%] (p = 0.00 < 0.05)
                        thrpt:  [-1.7604% -1.3249% -0.8899%]
                        Change within noise threshold.

packed atls/TransactionView
                        time:   [158.94 µs 159.12 µs 159.34 µs]
                        thrpt:  [6.4265 Melem/s 6.4354 Melem/s 6.4426 Melem/s]
                 change:
                        time:   [+18.663% +19.145% +19.649%] (p = 0.00 < 0.05)
                        thrpt:  [-16.422% -16.069% -15.728%]
                        Performance has regressed.
Description of Changes for Benchmark

Normal benchmark code creates the view using a reference to the byte slice:

                let _ = TransactionView::try_new_unsanitized(black_box(bytes.as_ref())).unwrap();

The copying variant converts the byte slice to a Vec<u8>:

                let _ = TransactionView::try_new_unsanitized(black_box(bytes.to_vec())).unwrap();

and adds an impl of TransactionData for Vec<u8>:

impl TransactionData for Vec<u8> {
    #[inline]
    fn data(&self) -> &[u8] {
        &self
    }
}
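
For context, the two variants slot into a criterion bench along these lines (a sketch; the bench names and the serialized_transactions setup helper are hypothetical, not the repo's actual bench code):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_copy_vs_borrow(c: &mut Criterion) {
    let bytes: Vec<u8> = serialized_transactions(); // hypothetical setup helper
    c.bench_function("view_over_slice", |b| {
        // zero-copy variant: parse directly over the borrowed bytes
        b.iter(|| TransactionView::try_new_unsanitized(black_box(bytes.as_ref())).unwrap())
    });
    c.bench_function("view_over_vec", |b| {
        // copying variant: allocate and memcpy before parsing
        b.iter(|| TransactionView::try_new_unsanitized(black_box(bytes.to_vec())).unwrap())
    });
}

criterion_group!(benches, bench_copy_vs_borrow);
criterion_main!(benches);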

@bw-solana

> Benchmark Results [quoted from above]

Results are somewhat surprising to me. I would have naively expected performance improvement to be correlated with size of transactions (e.g. packed transfers improve more than simple transfers). Any idea why that's not the case?


apfitzge commented Oct 3, 2024

> Results are somewhat surprising to me. I would have naively expected performance improvement to be correlated with size of transactions (e.g. packed transfers improve more than simple transfers). Any idea why that's not the case?

Small transactions are (now) pretty damn fast to parse. Any overhead from copying (or, in this case, reaching out to the allocator to get a Vec) is large in comparison to the parsing time.

When we get to larger transactions, parsing does become fairly expensive, so the relative impact of the allocation/copy overhead is lower.
If we do no allocation and copy only the transaction's actual size into a pre-allocated buffer, minimally sized txs are still ~20% slower than the Arc-clone approach, which is more in line with my original expectation.

I hadn't stated this previously, but we could probably do pretty well by copying data into pre-allocated buffers.
BUT that requires us to do some odd things around our transaction container to re-use buffer space, and it makes the buffering of transactions more complicated than a simple map we can insert into (see the sketch below).
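
To make the "odd things" concrete, here is an illustrative sketch (not from the PR) of the bookkeeping that buffer re-use forces on the container, which a plain map-based container avoids entirely:

// Illustrative only: a fixed-slot pool for copying transactions into
// pre-allocated buffers.
const MAX_TX_SIZE: usize = 1232; // max serialized transaction size

struct BufferPool {
    slots: Vec<[u8; MAX_TX_SIZE]>,
    free: Vec<usize>, // indices of unused slots
}

impl BufferPool {
    fn with_capacity(n: usize) -> Self {
        Self { slots: vec![[0; MAX_TX_SIZE]; n], free: (0..n).collect() }
    }

    // Copy `data` into a free slot; fails when the pool is exhausted.
    fn acquire(&mut self, data: &[u8]) -> Option<usize> {
        let idx = self.free.pop()?;
        self.slots[idx][..data.len()].copy_from_slice(data);
        Some(idx)
    }

    // Every drop/eviction must remember to return its slot, or slots leak.
    fn release(&mut self, idx: usize) {
        self.free.push(idx);
    }
}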
