How delete entry from Tar archive? #73458
Tagging subscribers to this area: @dotnet/area-system-io
Issue Details
The PowerShell team has started developing a new PowerShell module with Tar support, as @carlossanlop requested in PowerShell/Microsoft.PowerShell.Archive#116
While experimenting with the new Tar API, I discovered that the ZipArchive API explicitly supports deleting entries but the Tar API doesn't.
I looked at how the Unix tar utility is implemented. It does in-place removal: it skips the entry to delete and writes the remaining entries over the same file.
I emulated the same behavior with the new API but failed because additional characters appear in the file.
The essence of the experiment: I just opened the archive and wrote out every entry as soon as I read it (if I skipped an entry, I got the same result: a corrupted file with the same extra characters).
using (var sourceStream = File.Open(LiteralPath!, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
using (var destinationStream = File.Open(LiteralPath!, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
{
using var reader = new TarReader(sourceStream);
using var writer = new TarWriter(destinationStream);
TarEntry? entry = null;
try
{
entry = reader.GetNextEntry();
}
catch
{
// It is not tar archive. Go to next format detection.
}
if (entry is not null)
{
do
{
writer.WriteEntry(entry);
}
while ((entry = reader.GetNextEntry()) is not null);
return;
}
}
Source file
So the question is: do we need a new API for deleting entries from a Tar archive, or can we fix in-place deleting?
Summary: You can't delete entries in an existing TAR archive. What you can do is open the archive with a reader, create another empty archive with a writer, and then write each entry you read into the new archive, skipping the ones you want to "delete".
Explanation: The TAR format was not designed with entry deletion in mind; it was designed merely to back up files. Files are stored sequentially in the archive until you reach either the end of the archive or a run of only null characters. Deleting entries from a TAR archive would be a complicated task because there is no central directory (as in zip) to track the locations and sizes of all the archive entries. Zip's central directory lets you delete entries at any location in the archive because, when you're done, a new central directory is written with the updated locations and sizes of the files, excluding the ones that were deleted.
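To illustrate the sequential layout described above, here is a minimal sketch that walks an archive from the first entry to the last with `TarReader`. The class and method names (`TarListing`, `ListEntries`) are hypothetical helpers for this example, not part of `System.Formats.Tar`. The point is that the only way to learn what an archive contains is to read it in order, since there is no index to consult.

```csharp
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;

public static class TarListing
{
    // Hypothetical helper: returns the name and data length of each entry,
    // in the order the entries appear in the archive.
    public static List<(string Name, long Length)> ListEntries(Stream archiveStream)
    {
        var result = new List<(string Name, long Length)>();

        // leaveOpen: true so the caller keeps ownership of the stream.
        using var reader = new TarReader(archiveStream, leaveOpen: true);

        TarEntry? entry;
        // GetNextEntry returns null once the end of the archive is reached.
        while ((entry = reader.GetNextEntry()) is not null)
        {
            result.Add((entry.Name, entry.Length));
        }

        return result;
    }
}
```

Calling `TarListing.ListEntries(File.OpenRead(path))` on a small archive returns its entries in storage order; locating the last entry still requires reading past every entry before it.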
In your code snippet, the problem seems to be that you're opening both the reader and the writer on the same archive path (
Also, the
Details: I used the code snippet that I'm sharing below to try to reproduce what you described. I used the PowerShell tar.gz file that you provided in the other issue: https://github.com/PowerShell/PowerShell/releases/download/v7.3.0-preview.6/powershell-7.3.0-preview.6-linux-x64.tar.gz. I was able to generate a valid tar file that 7-Zip can read:
I can see that 7-zip is weirdly showing an initial tar entry named But you can find your entries inside. First, the And second, the txt: I also inspected the resulting archive using the HxD hex editor, and there are only two entries. There is no
Code:
using System;
using System.Formats.Tar;
using System.IO;
using System.IO.Compression;
string sourcePath = "D:/powershell-7.3.0-preview.6-linux-x64.tar.gz";
string destinationPath = "D:/destination.tar.gz";
TarEntry entry1;
TarEntry entry2;
using (FileStream sourceStream = File.OpenRead(sourcePath))
{
using (GZipStream decompressorStream = new(sourceStream, CompressionMode.Decompress))
{
using (TarReader reader = new(decompressorStream))
{
entry1 = reader.GetNextEntry(copyData: true);
if (entry1 == null)
{
throw new Exception("null entry1");
}
Console.WriteLine($"{entry1.Format} - {entry1.EntryType} - {entry1.Name}");
entry2 = reader.GetNextEntry(copyData: true);
if (entry2 == null)
{
throw new Exception("null entry2");
}
Console.WriteLine($"{entry2.Format} - {entry2.EntryType} - {entry2.Name}");
}
}
}
if (File.Exists(destinationPath))
{
File.Delete(destinationPath);
}
using (FileStream destinationStream = File.Create(destinationPath))
{
using (GZipStream compressorStream = new(destinationStream, CompressionMode.Compress))
{
using (TarWriter writer = new(compressorStream, leaveOpen: false))
{
writer.WriteEntry(entry1);
writer.WriteEntry(entry2);
}
}
}
using FileStream sourceStream2 = File.OpenRead(destinationPath);
using GZipStream decompressorStream2 = new(sourceStream2, CompressionMode.Decompress);
using TarReader reader2 = new(decompressorStream2);
TarEntry entry;
while ((entry = reader2.GetNextEntry()) != null)
{
Console.WriteLine($"{entry.Format} - {entry.EntryType} - {entry.Name}");
}
Hope my replies answer your questions. Let me know if I can help with anything else.
Can you point to an RFC or another document where the design is defined? I assume that you are simply reasoning. As I mentioned above, I looked at the Unix tar, which has a delete parameter. As for the file corruption problem described in the OP, my main concern is that there is a bug in the implementation; I assume that some buffer is not cleared. After @adamsitnik's Stream API improvements, I'd expect this to work with two file descriptors and not break the file.
Is it possible to modify a tar entry so that it is considered "deleted" (skipped at read)?
We can't delete entries in-place (within the same archive) for the following reasons:
The
The
Your other question was:
Here are the specs I consulted. None of them mention any expectation of having a tar implementation support deletion:
Note that the specs do not define everything in detail; many behaviors had to be interpreted or based on the behavior of existing tools. So besides the specs, I also consulted several manuals: the two shared above (linux.die and GNU) as well as the IBM pax manual.
In conclusion, the "deletion" scenario, as I mentioned in my previous reply, can be achieved with our APIs by using a reader and a writer: iterate through the reader's entries and write them into the new archive using the writer, skipping the entries to "delete". This also takes care of the metadata entries for you, and works when reading unseekable archives.
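The reader-plus-writer approach described in this reply could be sketched roughly as follows. This is an illustration under stated assumptions, not an API that `System.Formats.Tar` provides: `TarEntryRemover`, `RemoveEntries`, and the `.tmp` suffix for the intermediate file are all hypothetical choices made for this example. The key difference from the snippet in the original report is that the writer targets a separate file, so reads and writes never overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Formats.Tar;
using System.IO;

public static class TarEntryRemover
{
    // Hypothetical helper: copies every entry from sourcePath into a new
    // archive, skipping entries whose name is in entriesToSkip, and then
    // replaces the original file with the filtered copy.
    public static void RemoveEntries(string sourcePath, ISet<string> entriesToSkip)
    {
        // Assumption: a sibling temporary file is acceptable for the copy.
        string tempPath = sourcePath + ".tmp";

        using (FileStream source = File.OpenRead(sourcePath))
        using (FileStream destination = File.Create(tempPath))
        using (var reader = new TarReader(source))
        using (var writer = new TarWriter(destination))
        {
            TarEntry? entry;
            while ((entry = reader.GetNextEntry()) is not null)
            {
                if (entriesToSkip.Contains(entry.Name))
                {
                    continue; // "delete" the entry by not copying it
                }

                // Write immediately, while the reader is still positioned
                // on this entry's data.
                writer.WriteEntry(entry);
            }
        }

        // Both archives are closed at this point; swap the files.
        File.Move(tempPath, sourcePath, overwrite: true);
    }
}
```

For example, `TarEntryRemover.RemoveEntries("archive.tar", new HashSet<string> { "a.txt" })` would leave `archive.tar` containing every entry except `a.txt`. For a compressed archive, the same loop should apply with a `GZipStream` wrapped around each file stream, as in the reproduction code earlier in the thread.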
@carlossanlop Thanks for clarifying! From the PowerShell experience point of view, users can always fall back to native utilities if they need specific functionality.
The PowerShell team has started developing a new PowerShell module with Tar support, as @carlossanlop requested in PowerShell/Microsoft.PowerShell.Archive#116
While experimenting with the new Tar API, I discovered that the ZipArchive API explicitly supports deleting entries but the Tar API doesn't.
I looked at how the Unix tar utility is implemented. It does in-place removal: it skips the entry to delete and writes the remaining entries over the same file.
I emulated the same behavior with the new API but failed because additional characters appear in the file.
Notice that 7-Zip and the .NET API can read the resulting file. 7-Zip shows a new folder named 00000000000 at the top level, but the source file looks like (screenshots not preserved)
The essence of the experiment: I just opened the archive and wrote out every entry as soon as I read it (if I skipped an entry, I got the same result: a corrupted file with the same extra characters).
Source file: q1.zip
Resulting file: q.zip
So the question is: do we need a new API for deleting entries from a Tar archive, or can we fix in-place deleting?