How delete entry from Tar archive? #73458

Closed
iSazonov opened this issue Aug 5, 2022 · 12 comments
Labels: area-System.Formats.Tar, needs-further-triage

@iSazonov
Contributor

iSazonov commented Aug 5, 2022

The PowerShell team has started developing a new PowerShell module with Tar support, as @carlossanlop requested in PowerShell/Microsoft.PowerShell.Archive#116.

While experimenting with the new Tar API, I discovered that the ZipArchive API explicitly supports deleting entries but the Tar API doesn't.

I looked at how the Unix tar utility is implemented. It performs in-place removal: it skips the entry to delete and writes the remaining entries over the same file.

I emulated the same behavior with the new API but failed, because additional characters appear in the file.
Note that 7-Zip and the .NET API can both read the resulting file. 7-Zip shows a new folder named 00000000000 at the top level:
[image]
whereas the source file looks like this:
[image]

The essence of the experiment: I open the archive and write every entry back as soon as I read it. (If I skip an entry, I get the same corrupted file with the same extra characters.)

            using (var sourceStream = File.Open(LiteralPath!, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
            using (var destinationStream = File.Open(LiteralPath!, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
            {
                using var reader = new TarReader(sourceStream);
                using var writer = new TarWriter(destinationStream);
                TarEntry? entry = null;
                try
                {
                    entry = reader.GetNextEntry();
                }
                catch
                {
                    // It is not tar archive. Go to next format detection.
                }

                if (entry is not null)
                {
                    do
                    {
                        writer.WriteEntry(entry);
                    }
                    while ((entry = reader.GetNextEntry()) is not null);

                    return;
                }
            }

Source file: q1.zip
Resulting file: q.zip

So the question is: do we need a new API for deleting entries from a Tar archive, or can in-place deletion be fixed?

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 5, 2022
@ghost

ghost commented Aug 5, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: iSazonov
Assignees: -
Labels:

area-System.IO, untriaged

Milestone: -

@jeffhandley jeffhandley added this to the 7.0.0 milestone Aug 9, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Aug 9, 2022
@ghost

ghost commented Aug 9, 2022

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: iSazonov
Assignees: carlossanlop
Labels:

area-System.IO.Compression

Milestone: 7.0.0

@carlossanlop
Member

Summary:

You can't delete entries in an existing TAR archive. What you can do is open the archive with a reader, create another, empty archive with a writer, and then rewrite the entries one by one from the reader into the writer, excluding the entry you wanted to "delete".

Explanation:

The TAR format spec was not designed with entry deletion in mind. It was designed merely to back up files. The files are stored sequentially in the archive until you reach either the end of the archive or a run of only null chars.

Deleting entries from a TAR archive would be a complicated task because there is no central directory (as in zip) to track the locations and sizes of all the archive entries. Zip's central directory lets you delete entries at any location in the archive because, when you're done, a new central directory is written with the new locations and sizes of the files, excluding the ones that were deleted.
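
To make the "no central directory" point concrete, here is a minimal sketch (not part of System.Formats.Tar) that locates entries the only way the format allows: by walking the 512-byte headers from the start of the archive. It assumes plain ustar entries, where the name sits in bytes 0..99 of the header and the size is ASCII octal in bytes 124..135; PAX/GNU metadata entries and very large sizes are not handled. `WalkHeaders` and `ParseOctal` are hypothetical helper names.

```csharp
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;
using System.Text;

// Build a small ustar archive in memory so the sketch is self-contained.
var tarStream = new MemoryStream();
using (var tarWriter = new TarWriter(tarStream, TarEntryFormat.Ustar, leaveOpen: true))
{
    foreach ((string fileName, int fileSize) in new[] { ("a.txt", 5), ("b.txt", 700) })
    {
        tarWriter.WriteEntry(new UstarTarEntry(TarEntryType.RegularFile, fileName)
        {
            DataStream = new MemoryStream(new byte[fileSize])
        });
    }
}

List<(string Name, long HeaderOffset, long Size)> found = WalkHeaders(tarStream.ToArray());
foreach (var item in found)
{
    Console.WriteLine($"{item.Name}: header at byte {item.HeaderOffset}, {item.Size} data bytes");
}

// Hypothetical helper: visits every header in order, because nothing in the
// archive records where a given entry starts.
static List<(string Name, long HeaderOffset, long Size)> WalkHeaders(byte[] tar)
{
    const int BlockSize = 512;
    var entries = new List<(string Name, long HeaderOffset, long Size)>();
    int offset = 0;
    while (offset + BlockSize <= tar.Length)
    {
        ReadOnlySpan<byte> header = tar.AsSpan(offset, BlockSize);
        if (header.IndexOfAnyExcept((byte)0) < 0)
        {
            break; // An all-zero block marks the end of the archive.
        }

        // Name: bytes 0..99, NUL-terminated. Size: bytes 124..135, ASCII octal.
        ReadOnlySpan<byte> nameField = header[..100];
        int nul = nameField.IndexOf((byte)0);
        string entryName = Encoding.ASCII.GetString(nul >= 0 ? nameField[..nul] : nameField);
        long dataSize = ParseOctal(header.Slice(124, 12));
        entries.Add((entryName, offset, dataSize));

        // The next header starts after the data, rounded up to a full block.
        long padded = (dataSize + BlockSize - 1) / BlockSize * BlockSize;
        offset += BlockSize + (int)padded;
    }
    return entries;
}

// Hypothetical helper: reads the octal digits of a header field, ignoring padding.
static long ParseOctal(ReadOnlySpan<byte> field)
{
    long value = 0;
    foreach (byte b in field)
    {
        if (b >= '0' && b <= '7')
        {
            value = value * 8 + (b - '0');
        }
    }
    return value;
}
```

Finding the second entry requires reading the first header to learn how far to skip, which is why any lookup or removal is linear in the size of the archive.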

@carlossanlop
Member

In your code snippet, the problem seems to be that you're opening the reader and the writer on the same archive path (LiteralPath!). You need to use two different paths.
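
A minimal sketch of the two-path approach, assuming it is acceptable to write a temporary file next to the original and then move it over the original once both streams are closed. `RewriteWithout` and `ListNames` are hypothetical helpers, and the `.tmp` suffix is an arbitrary choice for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;
using System.Text;

// Demo setup: create a three-entry archive in a fresh temp directory.
string demoDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(demoDir);
string demoArchive = Path.Combine(demoDir, "demo.tar");
using (FileStream seedStream = File.Create(demoArchive))
using (var seedWriter = new TarWriter(seedStream))
{
    foreach (string seedName in new[] { "keep1.txt", "drop.txt", "keep2.txt" })
    {
        seedWriter.WriteEntry(new PaxTarEntry(TarEntryType.RegularFile, seedName)
        {
            DataStream = new MemoryStream(Encoding.UTF8.GetBytes("contents of " + seedName))
        });
    }
}

RewriteWithout(demoArchive, "drop.txt");
List<string> remaining = ListNames(demoArchive);
Console.WriteLine(string.Join(", ", remaining));

// Hypothetical helper: reads from archivePath, writes to a different path,
// and only replaces the original after both streams are closed.
static void RewriteWithout(string archivePath, string nameToDrop)
{
    string tempPath = archivePath + ".tmp"; // assumed naming scheme
    using (FileStream source = File.OpenRead(archivePath))
    using (FileStream destination = File.Create(tempPath))
    using (var reader = new TarReader(source))
    using (var writer = new TarWriter(destination))
    {
        TarEntry? entry;
        while ((entry = reader.GetNextEntry()) is not null)
        {
            if (entry.Name != nameToDrop)
            {
                writer.WriteEntry(entry);
            }
        }
    }
    File.Move(tempPath, archivePath, overwrite: true);
}

// Hypothetical helper: lists entry names so the result can be checked.
static List<string> ListNames(string path)
{
    var names = new List<string>();
    using FileStream stream = File.OpenRead(path);
    using var reader = new TarReader(stream);
    while (reader.GetNextEntry() is TarEntry current)
    {
        names.Add(current.Name);
    }
    return names;
}
```

From the caller's point of view the archive is updated "in place", but the reader and the writer never share a file while entries are being copied.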

@carlossanlop
Member

carlossanlop commented Aug 9, 2022

Also, the 000000 seems to be a weird problem with 7-zip.

Details:

I used the code snippet that I'm sharing below to try to reproduce what you described. I used the PowerShell tar.gz file that you provided in the other issue: https://github.com/PowerShell/PowerShell/releases/download/v7.3.0-preview.6/powershell-7.3.0-preview.6-linux-x64.tar.gz.

I was able to generate a valid tar file that 7-zip can read:

  • I first tried to only insert the first entry in the original archive, but 7-zip didn't like an archive with just the ./ directory entry.
  • Then I tried inserting only the second entry in the original archive (.\ThirdPartyNotices.txt), but 7-zip also didn't like it because the initial .\ entry was missing.
  • I finally got 7-zip to like the resulting archive by inserting both the first and second entry: .\ and .\ThirdPartyNotices.txt.

I can see that 7-zip is weirdly showing an initial tar entry named 000000000. I don't know why, it's probably a bug on their side.

But you can find your entries inside. First, the .\ folder:

And second, the txt:

I also inspected the resulting archive using the HxD hex editor, and there are only two entries. There is no 000000000 entry, which confirms it's a 7-zip bug:

  • I highlighted the first entry in blue; it consists of the 512 bytes describing the .\ folder.
  • The subsequent 512 bytes describe the .\ThirdPartyNotices.txt file entry.
  • After that file, you find 1024 null chars, which mark the end of the archive, per the spec.

Code:

using System;
using System.Formats.Tar;
using System.IO;
using System.IO.Compression;

string sourcePath = "D:/powershell-7.3.0-preview.6-linux-x64.tar.gz";
string destinationPath = "D:/destination.tar.gz";

TarEntry entry1;
TarEntry entry2;
using (FileStream sourceStream = File.OpenRead(sourcePath))
{
    using (GZipStream decompressorStream = new(sourceStream, CompressionMode.Decompress))
    {
        using (TarReader reader = new(decompressorStream))
        {
            entry1 = reader.GetNextEntry(copyData: true);
            if (entry1 == null)
            {
                throw new Exception("null entry1");
            }
            Console.WriteLine($"{entry1.Format} - {entry1.EntryType} - {entry1.Name}");

            entry2 = reader.GetNextEntry(copyData: true);
            if (entry2 == null)
            {
                throw new Exception("null entry2");
            }
            Console.WriteLine($"{entry2.Format} - {entry2.EntryType} - {entry2.Name}");
        }
    }
}

if (File.Exists(destinationPath))
{
    File.Delete(destinationPath);
}

using (FileStream destinationStream = File.Create(destinationPath))
{
    using (GZipStream compressorStream = new(destinationStream, CompressionMode.Compress))
    {
        using (TarWriter writer = new(compressorStream, leaveOpen: false))
        {
            writer.WriteEntry(entry1);
            writer.WriteEntry(entry2);
        }
    }
}

using FileStream sourceStream2 = File.OpenRead(destinationPath);
using GZipStream decompressorStream2 = new(sourceStream2, CompressionMode.Decompress);
using TarReader reader2 = new(decompressorStream2);

TarEntry? entry;
while ((entry = reader2.GetNextEntry()) != null)
{
    Console.WriteLine($"{entry.Format} - {entry.EntryType} - {entry.Name}");
}

@carlossanlop
Member

Hope my replies answer your questions. Let me know if I can help with anything else.

@carlossanlop carlossanlop added the needs-author-action An issue or pull request that requires more info or actions from the author. label Aug 9, 2022
@ghost

ghost commented Aug 9, 2022

This issue has been marked needs-author-action and may be missing some important information.

@carlossanlop carlossanlop modified the milestones: 7.0.0, Future Aug 9, 2022
@iSazonov
Contributor Author

iSazonov commented Aug 10, 2022

> You can't delete entries in an existing TAR archive.
> The spec of the TAR format was not designed with entry deletions in mind.

Can you point to an RFC or another document where the design is defined? I assume that you are simply reasoning.

As I mentioned above, I looked at the Unix tar utility, which has a delete option.
It actually performs the removal in place.
Of course, it works at the buffer level, since entries have different sizes.
My example is not correct in this sense, but it is a first step in trying to emulate that behavior. Of course, I would prefer the delete operation to be implemented in .NET in the most efficient way.

As for the file corruption problem described in the OP, my main concern is that there is a bug in the implementation; I assume that some buffer is not cleared. After @adamsitnik's Stream API improvements, I'd expect this to work with two file descriptors and not break the file.
Tested on Preview 7.

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Aug 10, 2022
@iSazonov
Contributor Author

Is it possible to modify a tar entry so that it is considered "deleted" (skipped on read)?

@carlossanlop
Member

> Is it possible to modify a tar entry so that it is considered "deleted" (skipped on read)?

We can't delete entries in-place (within the same archive) for the following reasons:

  • There is no field you can mark on the entry header to indicate it's deleted (this answers your question).
  • You also can't just set the whole header and the variable-sized data section to null chars, and expect the tar tools to interpret that empty space as a deleted entry. What would most likely happen is that all tar tools would assume this empty space is where the archive ends. The spec expects entry headers to be contiguous, so once you finish reading the last byte of an entry header's data section, the next byte should be the first byte of the next entry header.
  • We would have to iterate through all the entries in an archive and find those that match the provided name. You can have duplicate entries in an archive. If the archive is huge (GBs in size) then expect this to take a long time.
  • When finding an entry to delete, we would have to rewrite the entire archive starting at the location of a deleted entry, on top of which we would write the bytes of all the subsequent entries. Expect this to take a long time too.
  • We would always rewrite immediately after each individual deletion. It would not be possible to first mark multiple entries for deletion then commit the operation, because of the restriction of having to rewrite all the subsequent bytes first.
  • We would have to be able to work with PAX and GNU metadata entries: Finding an entry by its name, then checking if it had a metadata entry preceding it (PAX Extended Attributes, or GNU LongLink and/or GNU LongPath), then delete that metadata entry as well. In the case of GNU, it's possible to have both Long* metadata entries preceding an entry.
  • None of this would work with unseekable streams, due to the inability to move the pointer backwards.
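
To illustrate the rewrite cost described in the list above, here is a rough sketch of what a raw in-place removal would have to do on a seekable stream: shift every byte after the removed entry towards the start, then truncate. This is not how System.Formats.Tar or GNU tar is implemented; `RemoveSpanInPlace`, `MakeFile` and `EntrySpan` are hypothetical helpers, and the offsets are computed under the assumption of plain ustar entries with no PAX/GNU metadata entries in front of them.

```csharp
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;

// Three plain ustar entries: a.txt (5 bytes), b.txt (600 bytes), c.txt (5 bytes).
var archive = new MemoryStream();
using (var tarWriter = new TarWriter(archive, TarEntryFormat.Ustar, leaveOpen: true))
{
    tarWriter.WriteEntry(MakeFile("a.txt", 5));
    tarWriter.WriteEntry(MakeFile("b.txt", 600));
    tarWriter.WriteEntry(MakeFile("c.txt", 5));
}

// Assumption: each entry occupies one 512-byte header plus its data padded to 512.
long victimStart = EntrySpan(5);                // b.txt starts right after a.txt
long victimEnd = victimStart + EntrySpan(600);  // and ends where c.txt begins

long lengthBefore = archive.Length;
long bytesMoved = RemoveSpanInPlace(archive, victimStart, victimEnd - victimStart);

// Read the result back to confirm that the archive is still well formed.
archive.Position = 0;
var names = new List<string>();
using (var tarReader = new TarReader(archive, leaveOpen: true))
{
    while (tarReader.GetNextEntry() is TarEntry current)
    {
        names.Add(current.Name);
    }
}
Console.WriteLine($"remaining: {string.Join(", ", names)}; moved {bytesMoved} of {lengthBefore} bytes");

static UstarTarEntry MakeFile(string name, int size) =>
    new UstarTarEntry(TarEntryType.RegularFile, name) { DataStream = new MemoryStream(new byte[size]) };

static long EntrySpan(long dataSize) => 512 + (dataSize + 511) / 512 * 512;

// Hypothetical helper: copies everything after the removed span over it, chunk
// by chunk, then truncates. Needs a seekable, writable stream.
static long RemoveSpanInPlace(Stream stream, long start, long length)
{
    byte[] buffer = new byte[64 * 1024];
    long readPos = start + length;
    long writePos = start;
    long moved = 0;
    while (true)
    {
        stream.Position = readPos;
        int n = stream.Read(buffer, 0, buffer.Length);
        if (n == 0)
        {
            break;
        }
        stream.Position = writePos;
        stream.Write(buffer, 0, n);
        readPos += n;
        writePos += n;
        moved += n;
    }
    stream.SetLength(writePos);
    return moved;
}
```

Every byte after the removed entry is read and written once per deletion, and both `Position` and `SetLength` require a seekable stream, which matches the cost and seekability concerns in the list.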

The tar tool man page aligns with the behavior I'm describing:

...
--delete
    delete from the archive (not on mag tapes!)
...
--occurrence[=NUMBER]
process only the NUMBERth occurrence of each file in the archive; this option is valid only in conjunction with one of the subcommands --delete, --diff, --extract or --list and when a list of files is given either on the command line or via the -T option; NUMBER defaults to 1
...

The GNU tar manual expands on it some more:

4.2.5 Removing Archive Members Using ‘--delete’

You can remove members from an archive by using the ‘--delete’ option. Specify the name of the archive with ‘--file’ (‘-f’) and then specify the names of the members to be deleted; if you list no member names, nothing will be deleted. The ‘--verbose’ option will cause tar to print the names of the members as they are deleted. As with ‘--extract’, you must give the exact member names when using ‘tar --delete’. ‘--delete’ will remove all versions of the named file from the archive.

The ‘--delete’ operation can run very slowly.

Unlike other operations, ‘--delete’ has no short form.

This operation will rewrite the archive.

You can only use ‘--delete’ on an archive if the archive device allows you to write to any point on the media, such as a disk; because of this, it does not work on magnetic tapes. Do not try to delete an archive member from a magnetic tape; the action will not succeed, and you will be likely to scramble the archive and damage your tape. There is no safe way (except by completely re-writing the archive) to delete files from most kinds of magnetic tape. See section Tapes and Other Archive Media.

Your other question was:

> Can you point RFC or another document where the design is defined? I assume that you are simply reasoning.

Here are the specs I consulted. None of them mention any expectation of having a tar implementation support deletion:

Note that the specs do not define everything in detail, and many behaviors had to be interpreted, or had to be based on the behavior of existing tools. So besides the specs, I also consulted many manuals: the two shared above (linux.die and GNU) as well as the IBM pax manual.

In conclusion, the "deletion" scenario, as I mentioned in my previous reply, can be achieved with our APIs by using a reader and a writer: iterate through the reader's entries and write them into the new archive using the writer, skipping the entries to "delete". This also takes care of the metadata entries for you and works when reading unseekable archives.
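
The reader-to-writer approach can be sketched as follows for gzip-compressed (unseekable) streams. `CopyExcluding` and `ReadNames` are hypothetical helpers. Passing `copyData: true` is a choice made for this sketch: it buffers each entry in memory, so the writer always receives a seekable data stream, at the cost of memory for large entries.

```csharp
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;
using System.IO.Compression;
using System.Text;

// Demo input: a gzip-compressed tar that contains a duplicated entry name.
var sourceGz = new MemoryStream();
using (var gzipSeed = new GZipStream(sourceGz, CompressionMode.Compress, leaveOpen: true))
using (var seedWriter = new TarWriter(gzipSeed))
{
    foreach (string seedName in new[] { "docs/readme.txt", "docs/old.txt", "docs/old.txt", "docs/new.txt" })
    {
        seedWriter.WriteEntry(new PaxTarEntry(TarEntryType.RegularFile, seedName)
        {
            DataStream = new MemoryStream(Encoding.UTF8.GetBytes("contents of " + seedName))
        });
    }
}

// "Delete" docs/old.txt by copying everything else into a second archive.
sourceGz.Position = 0;
var filteredGz = new MemoryStream();
using (var gzipIn = new GZipStream(sourceGz, CompressionMode.Decompress))
using (var gzipOut = new GZipStream(filteredGz, CompressionMode.Compress, leaveOpen: true))
{
    CopyExcluding(gzipIn, gzipOut, new HashSet<string> { "docs/old.txt" });
}

filteredGz.Position = 0;
List<string> keptNames = ReadNames(filteredGz);
Console.WriteLine(string.Join(", ", keptNames));

// Hypothetical helper: copies every entry whose name is not in namesToSkip.
static void CopyExcluding(Stream source, Stream destination, ISet<string> namesToSkip)
{
    using var reader = new TarReader(source, leaveOpen: true);
    using var writer = new TarWriter(destination, leaveOpen: true);
    TarEntry? entry;
    // copyData: true is an assumption of this sketch (see the note above).
    while ((entry = reader.GetNextEntry(copyData: true)) is not null)
    {
        if (!namesToSkip.Contains(entry.Name))
        {
            writer.WriteEntry(entry);
        }
    }
}

// Hypothetical helper: lists the entry names in a gzip-compressed tar stream.
static List<string> ReadNames(Stream gzippedTar)
{
    var names = new List<string>();
    using var gzip = new GZipStream(gzippedTar, CompressionMode.Decompress);
    using var reader = new TarReader(gzip);
    while (reader.GetNextEntry() is TarEntry current)
    {
        names.Add(current.Name);
    }
    return names;
}
```

Because the filter runs over every entry, both copies of the duplicated name are dropped in a single pass, and neither stream needs to be seekable.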

@iSazonov
Contributor Author

@carlossanlop Thanks for clarifying!

From the PowerShell point of view, users can always fall back to native utilities if they need specific functionality.
So I don't intend to insist on unnecessarily complicating the code here.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 11, 2022