File.Open given to the handler for Tar.Extract can't be used after the handler has returned #371
Comments
Finally had time to read this and digest the issue being described (sorry!) 😅 You're right, this is a little annoying; I've encountered similar issues with the

I'm not sure seeking tar files is really useful unless you know the offset for every file in the archive; you'd have to scan the whole archive once first to build that index, since I don't think tar keeps one like zip does.

It would be worth some experimenting to see if it's possible to get the byte offset reliably, especially using just the standard lib reader. If it doesn't do any/much buffering, then one possibility is to create a new method -- not sure what to call it, maybe

Ok, so now that I'm on the same page, I don't have a great solution to this either :( I've kind of come to accept the fact that tar archives can be slow to read in some cases -- and I should probably update the docs to mention that until we find a way to speed things up. But I'm very interested in speeding things up if possible.

I know tar files are the darling archive format of the Unix world, but frankly they're only good at one thing: streaming tape archives. They're not good if you need random access or structured file listings. :( In many ways, I do consider zip to be a superior format in that sense...
Sorry, it was a bit of a brain-dump of an issue!
I can turn on caching for this in rclone and it will cache the archive to disk, which would make the performance acceptable at the cost of disk usage. I could make this automatic for everything except .zips, and that would probably satisfy most rclone users.
You are right. It would require seeking the whole file to find out what files are within. I was thinking this might be useful if you had a big tar with only a few files (imagine a 1GB tar with 10 x 100MB files) and just wanted to extract one of them, but with a tar with lots of files, seeking to each tar header in the stream will be very slow! (A seek in an HTTP stream means re-opening it with a new Range request and typically takes about 1s). Most tars are compressed though, and seeking in a compressed tar is very hard. If the tar was compressed with gzip with sync points then it is easier, but that isn't standard.
I think this is possible, but not with
I guess one possibility is to integrate this: https://pkg.go.dev/github.com/google/[email protected]/stargz which reads and writes tars with individually compressed files and an index - that would seek perfectly! Yes, that is a bit of bradfitz magic :-) It's worth reading the explainer on the parent package. That could be another compression format for archiver to support - not sure how you'd deal with the overlap with the existing tar format, or if you can easily tell a stargz file from a normal .tar.gz.
Aye :-)
Thanks for the discussion! Stargz looks like an improvement, although it is definitely non-standard... I'd be surprised if it was encountered in any significant number of places in the wild. I might be open to a proposal for how that would look/work with this library, but might only be convinced to accept it if it is popular enough. Another option may be to simply not support the advanced/seeking feature for .tar (or at least compressed .tar) -- basically just support the functionality for file formats that are properly designed for such a thing. I know that kind of sucks, but so does the .tar format for seeking. 🙃
From an rclone point of view, only the

I guess I could do a backend specific command which would effectively run

I'm attracted to the stargz format as it would work well for reading and writing with the rclone backend, so from that point of view it would be useful if it was part of Archiver.

If you want I can have a go at a PR
Sure, I'd be curious what a stargz PR would look like if it's not too invasive 🙂 👍
So, in #426, I clarified the documentation to state that files should not be used after the handler returns (they should be closed). I gave this a lot more thought over the last few days and came to the same conclusion you started with, but I will still hope to see some enhancements in either the std lib or a tar package that can support seeking better once the offset of a target file is known.
What would you like to have changed?
Currently when you call Extract on a .zip file the handler method is given `File` objects with `Open` methods which can be called some time in the future, after the handler has returned. This is because the zip reader is upgraded into an `io.ReaderAt` so the files can be seeked and opened later.
However if you do the same with Tar, the `File.Open` methods are only valid as long as the handler hasn't returned. If you call them after the handler has returned they return 0 bytes.

One solution would be to run Extract on the archive again with just the file name required. This would be an O(n^2) solution though :-( In rclone's case this would mean reading the entire tar file off remote storage in its entirety for each file extracted, which is super inefficient. Rclone could use local caching for this which would help somewhat.
For plain `.tar` files, they should be quite seekable in theory, but I don't think this is something that we can get from the standard library. For compressed `.tar.gz` for example, one would need to keep the state of the decompressor at the start of each file in order to seek, which is starting to sound very complicated. You can create seekable gzip files (rclone does this in its compress backend) but that is complicated and non-standard.

I think for rclone's purposes compressed tar files would mean you've got to read the entire file just to see the directory entries - I don't think there is any way around that. Whereas in theory at least you should be able to skip the actual file data when scanning an uncompressed `.tar`. I looked through the source of `archive/tar` and if the reader is capable of seeking then it uses it to skip data where necessary, so that is good.

I think reading the seek position of the underlying reader for each file in an uncompressed `.tar` could be interesting, but that would need a fork of `archive/tar`.

Any ideas?