Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipfile.write should check that it isn't appending an archive to itself #104527

Open
gabevenberg opened this issue May 16, 2023 · 2 comments
Open
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement

Comments

@gabevenberg
Copy link
Contributor

gabevenberg commented May 16, 2023

A common use pattern for zipfile is to recursively compress an entire directory into a single .zip file. A common implementation of this pattern looks like this:

#!/usr/bin/env python3

import zipfile, pathlib

rootpath=pathlib.Path('targetdir')

with zipfile.ZipFile('outputfile', 'w') as archive:
    for file_path in sorted(rootpath.rglob('*')):
        arcname=file_path.relative_to(rootpath)
        archive.write(file_path, arcname.as_posix())

However, if outputfile is a path that is a child of targetdir, this results in the operation hanging once the rglob operation eventually causes the archive to attempt to write outfile into outfile, causing the write operation to continue indefinitely until the filesystem runs out of space or the archive hits its max file size, like can be observed in this example:

#!/usr/bin/env python3

import zipfile, pathlib

rootpath=pathlib.Path('./')

with zipfile.ZipFile('./foo.zip', 'w') as archive:
    for file_path in sorted(rootpath.rglob('*')):
        arcname=file_path.relative_to(rootpath)
        archive.write(file_path, arcname.as_posix())

Needless to say, this is hardly an intuitive error path, and can cause difficulties with debugging.

Note that it is not simply third party libraries that allow this error to happen. Neither zipapp nor shutil.make_archive include a check making sure the output file is not a child of the target dir.

There are two ways I think this could be fixed:

  • make zipfile.write simply check that self.file and filename are not equal, raising a ValueError if they are
  • patch all users of zipfile to silently skip over outputfile when they are compressing.

I think the first, at the least, should be implemented, in order to provide an actual error message in this situation, instead of hanging for however long it takes for someone to notice a multi-gb zip file growing by the second. Ill be submitting a PR doing so shortly.

Linked PRs

@gabevenberg gabevenberg added the type-bug An unexpected behavior, bug, or error label May 16, 2023
@hugovk hugovk changed the title zipfile.write should check that it isnt apending an archive to itself. zipfile.write should check that it isn't appending an archive to itself May 17, 2023
@gpshead gpshead added the type-feature A feature request or enhancement label May 20, 2023
@arhadthedev arhadthedev added the stdlib Python modules in the Lib dir label Jun 26, 2023
pfmoore pushed a commit that referenced this issue Jun 26, 2023
…-106076)

zippapp will now avoid appending an archive to itself.
@serhiy-storchaka
Copy link
Member

  1. It is inefficient. realpath() is expensive, and it will be called twice for every added file and directory. samefile() or samestat() could be more efficient, but even they add an overhead. It should be measured.
  2. It is not comprehensive. It does not always work when an open file was passed to ZipFile constructor.
  3. It needs tests. Several tests for different corner cases.

It would be nice to have such feature, but we should estimate its cost and make it as small as possible. If it will still be significant, we only can add a warning in the documentation or make this feature optional and off by default.

@gabevenberg
Copy link
Contributor Author

  1. It is inefficient. realpath() is expensive, and it will be called twice for every added file and directory. samefile() or samestat() could be more efficient, but even they add an overhead. It should be measured.

    1. It is not comprehensive. It does not always work when an open file was passed to ZipFile constructor.

    2. It needs tests. Several tests for different corner cases.

It would be nice to have such feature, but we should estimate its cost and make it as small as possible. If it will still be significant, we only can add a warning in the documentation or make this feature optional and off by default.

Im open to make new PRs, I was a student when I discovered the issue and made the PR. Are there any corner cases your thinking of specifically? And how would you recommend timing the impact? What time impact would you consider acceptable? What level should the check be made at, at zipfile.write itself (given it is public) or just the consumers of it in the standard library?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement
Projects
Status: No status
Development

No branches or pull requests

4 participants