
use generics instead of dynamic dispatch internally in tar::Builder #395

Open · wants to merge 2 commits into base: main

Conversation

@GabrielDertoni commented Jan 12, 2025

TL;DR: Using generics instead of dynamic dispatch allows for specialization and a performance win on modern Linux systems.

| bench | old time | new time | delta |
| --- | --- | --- | --- |
| cache=yes size=583M files=23k | 1.9 s ± 0.1 s | 1.7 s ± 0.1 s | -10.08% |
| cache=no size=583M files=23k | 7.3 s ± 0.2 s | 6.7 s ± 0.3 s | -8.95% |
| cache=yes size=284M files=1k | 500 ms ± 4 ms | 362 ms ± 28 ms | -38.12% |
| cache=no size=284M files=1k | 918 ms ± 25 ms | 889 ms ± 29 ms | -3.26% |

(look here for full benchmark details)

This PR is specifically targeted at allowing the specialization of Write implementations to trigger when using tar::Builder. In particular, the stdlib has a specialization on Linux for std::io::copy that allows it to use the copy_file_range syscall. Unfortunately, the code as-is cannot perform that specialization (or potentially others) since it relies on dynamic dispatch internally. This PR changes the underlying implementation to use generics instead of dynamic dispatch. This understandably comes at the potential cost of increased code size, but I'd argue that choice is already exposed to users: anyone using tar::Builder<T> directly instead of tar::Builder<dyn Write> should be willing to pay the code-size cost. Users looking for a reduced code footprint can still explicitly opt in by using tar::Builder<dyn Write> instead.
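To make the mechanism concrete, here is a minimal sketch (simplified names, not the crate's actual internals) of why type erasure blocks the stdlib fast path: io::copy can only pick the copy_file_range path when it can see the concrete writer type.

```rust
use std::fs::File;
use std::io::{self, Write};

// Before (simplified): the writer type is erased, so io::copy only sees
// `dyn Write` and falls back to the generic read/write loop.
struct ErasedBuilder<'a> {
    obj: Box<dyn Write + 'a>,
}

impl<'a> ErasedBuilder<'a> {
    fn append(&mut self, src: &mut File) -> io::Result<u64> {
        io::copy(src, &mut self.obj)
    }
}

// After (simplified): the concrete writer type flows through, so when
// `W` is e.g. `File`, the stdlib's Linux specialization of io::copy can
// use copy_file_range(2).
struct GenericBuilder<W: Write> {
    obj: W,
}

impl<W: Write> GenericBuilder<W> {
    fn append(&mut self, src: &mut File) -> io::Result<u64> {
        io::copy(src, &mut self.obj)
    }
}
```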

PS: I had to change one of the tests because this specialization behaves differently from what came before. Previously, since the pax_simple_write test was using a BufWriter, it would try to append_file the archive inside the archive itself. That used to work because the writes to the file were buffered, so reading the archive at that point would read nothing. With copy_file_range that is no longer true. Anyway, I am pretty sure this was a bug in the test code; if not, the correct fix would be to do some sort of inode tracking like tar does.
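For reference, a hedged reconstruction of that test scenario using the crate's public API (file names invented):

```rust
use std::fs::File;
use std::io::BufWriter;

fn main() -> std::io::Result<()> {
    let out = File::create("archive.tar")?;
    let mut builder = tar::Builder::new(BufWriter::new(out));
    // Appending the archive into itself: with a BufWriter nothing has been
    // flushed yet, so the old generic copy loop read zero bytes and the
    // append "succeeded" vacuously. With copy_file_range the kernel reads
    // the real, partially written on-disk contents, so the result changes.
    builder.append_file("archive.tar", &mut File::open("archive.tar")?)?;
    builder.finish()?;
    Ok(())
}
```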

@cgwalters (Collaborator)

> In particular, the stdlib has a specialization on Linux for std::io::copy that allows it to use the copy_file_range syscall.

And that would happen here when your tar output goes to a file and you're providing a File instance as input, e.g. via append()? OK. (Or a pipe/socket for either of those.)

OK, though I do kind of question how often this occurs, because most use cases for tar end up being compressed (i.e. the output side is not a fd).

Can you briefly explain your higher-level use case and what led you to try this optimization?

@GabrielDertoni (Author) commented Jan 17, 2025

Fair point that this is mostly used for compression purposes. However, this specialization also applies to writers like BufWriter<T>, whose larger internal buffers can be used directly, avoiding one unnecessary copy.
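A small self-contained sketch of that contrast (file names are placeholders): the first copy targets a concrete BufWriter<File>, where the specialized path applies; the second goes through dyn Write and takes the generic loop with an intermediate buffer.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Seek, Write};

fn main() -> io::Result<()> {
    let mut src = File::open("input.bin")?;

    // Concrete writer type: io::copy sees BufWriter<File> and can take the
    // specialized path (filling the internal buffer directly, or using
    // copy_file_range on Linux).
    let mut fast = BufWriter::new(File::create("out_fast.bin")?);
    io::copy(&mut src, &mut fast)?;
    fast.flush()?;

    src.rewind()?;

    // Type-erased writer: io::copy only sees `dyn Write` and falls back to
    // the generic loop with an intermediate stack buffer.
    let mut slow: Box<dyn Write> = Box::new(BufWriter::new(File::create("out_slow.bin")?));
    io::copy(&mut src, &mut slow)?;
    slow.flush()?;
    Ok(())
}
```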

I stumbled on this when someone told me they were trying to archive a Python environment and tarring was taking a significant portion of the time. I looked through the implementation and realized that many syscalls were being issued that could be significantly reduced. Although this optimization can dramatically reduce the number of syscalls in some cases, it seldom yields an equally large time reduction; still, the gain is not negligible.

This can be quite useful when you'd like to write to an archive temporarily, even if it is going to be compressed afterwards, or when you want to pipe the result into another program that handles the compression. I guess what surprised me is that since tar::Builder is a generic type, my performance expectation was that it would monomorphize and benefit where it can from that. So I found it quite odd that it uses dynamic dispatch internally.
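As a sketch of the piping use case mentioned above (assuming a zstd binary on PATH; paths and flags are illustrative): the Builder writes straight into the compressor's stdin, so the tar stream itself stays uncompressed and the stdlib's Linux fast paths for pipes (e.g. sendfile/splice) can still apply.

```rust
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Spawn an external compressor; zstd reads stdin when no input file is given.
    let mut child = Command::new("zstd")
        .args(["-q", "-o", "archive.tar.zst"])
        .stdin(Stdio::piped())
        .spawn()?;

    {
        // The Builder's writer is the pipe to zstd's stdin (ChildStdin
        // implements Write), so the archive bytes never touch the disk
        // uncompressed.
        let stdin = child.stdin.take().expect("stdin was piped");
        let mut builder = tar::Builder::new(stdin);
        builder.append_dir_all("data", "./data")?;
        builder.finish()?;
    } // dropping the writer closes the pipe so zstd can finish

    child.wait()?;
    Ok(())
}
```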

@cgwalters (Collaborator)

There are certainly use cases for uncompressed tar; over in OCI land we were discussing that uncompressed tar should be used when the embedded data is already compressed (AI models can be like this). This helps avoid a problem in OCI: because it checksums compressed data, reproducible caching can be hard to ensure.

Anyway though, the change sounds reasonable, but this crate is widely used and I personally try to be pretty conservative about what we merge. This type of change has the potential to cause regressions (probably mostly in code size, but certainly not exclusively).

Can you please provide a bit more benchmarking in a targeted use case?

> I stumbled on this when someone told me they were trying to archive a Python environment and tarring was taking a significant portion of the time. I looked through the implementation and realized that many syscalls were being issued that could be significantly reduced. Although this optimization can dramatically reduce the number of syscalls in some cases, it seldom yields an equally large time reduction; still, the gain is not negligible.

A bit more data? Are we talking 1%? 10%? As for syscall traffic, of course others have been looking at io-uring based work (see #355), but that's a much deeper change.

@GabrielDertoni (Author)

Hi, I understand you being conservative, and I appreciate it! Unfortunately I am not able to provide more specific benchmarks than the ones already in the original description, but the performance benefits are on the order shown in those benchmarks (up to 38%, but mostly around 10%). I am certainly looking at io-uring based alternatives myself, but I agree that's a much larger change and may not fit in this crate. This PR, though, I thought would benefit upstream as a sensible optimization without too many downsides. I'd be interested to know what other downsides you think this could cause downstream, and whether there is a specific example you are worried about. I'd also be happy to consider feature-flagging this if you think that would be more appropriate.

@NobodyXu

> I am certainly looking at io-uring based alternatives myself, but I agree that's a much larger change and may not fit in this crate.

IIRC there is a desire to extract a sans-IO core tar crate.
