-
Notifications
You must be signed in to change notification settings - Fork 980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract .zip files with recorded file timestamps (#15268) #15269
base: develop2
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have read #15268, but I still have some concerns, I'd need to understand better what would be failing because of unzipping with the current timestamps. Unzipping a zip downloaded from Github with the OS sets the modified time to the current time, it doesn't set the modified time to anything else. We have had problems in the past because the timestamps from a downloaded zip were in the future, while the current behavior (which seems standard one) of assigning the current time, has given no problems so far.
for file_ in zip_info: | ||
extracted_size += file_.file_size | ||
print_progress(extracted_size, uncompress_size) | ||
try: | ||
z.extract(file_, full_path) | ||
file_path = os.path.join(full_path, file_.filename) | ||
ts = time.mktime(file_.date_time + (0, 0, -1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this 0, 0, -1? This would need some comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file_.date_time
provides six timestamp fields, but time.mktime()
needs three additional fields to make a struct_time
.
I can add a comment to this effect.
except Exception as e: | ||
output.error("Error extract %s\n%s" % (file_.filename, str(e))) | ||
for file_path, ts in file_timestamps: | ||
os.utime(file_path, (ts, ts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the z.extract()
will not automatically handle the zip timestamp? It doesn't make much sense that this has to be handled manually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beats me. Seems like quite the oversight for the API.
Apparently not even .extractall()
gets you that.
Hi @memsharded,
Reproducible builds. Remember that the SOURCE_DATE_EPOCH mechanism I'm developing uses the latest timestamp present in the source archive.
The specific source timestamps used are less important than that they be consistent every time the package is built.
We can clamp the timestamps to the current time, but isn't this issue also possible with tarballs? I didn't see any code to guard against it in That aside, extracting an archive with current timestamps is definitely not normal, expected behavior. No archiving program that I'm aware of does this by default. |
16b797e
to
4997dd7
Compare
Rebased the commit. @memsharded, do you have any further concerns? This just brings the zip extractor into parity with the tar one; there isn't anything novel here. |
Thanks for the ping. I have reviewed this again, and I think this was not merged yet because it is still a bit concerning to me. For example from https://stackoverflow.com/questions/9813243/extract-files-from-zip-file-and-retain-mod-date
In other words, why is it that the Python I am asking for extra feedback from the team. |
Couldn't that be an issue in tarballs as well? Guarding against future dates may be beneficial, though in my experience, that's more a protection against user error than anything having to do with security.
The .zip format is not great, but given that it's what GitHub serves up when you download a repo snapshot---which I've had to do for many projects that don't put out proper releases---it needs to be handled on par with tarballs.
That's a fair point, but keep in mind that the Python folks have a much bigger set of use cases to deal with than we do. |
The problem is that this will happen. Conan usage is very large, and in many different use cases, and users doing very different things. My concern is that there will be users building things from zips created elsewhere, that might not be fully perfect zipped, but it is working for them, and if we introduce this, their builds will start to fail, and they will be blocked until the next release we revert this. This has happened us in the past for even more unlikely changes, the Hyrum's law has hit us long time ago...
Sure, I am not concerned about Github, I am concerned about the myriad of other source origins, including zip files created in an archaic workstation decades ago.
Even more concerning to me :) If they haven't made this even an opt-in, this is an even stronger signal! I have been reviewing this with the team, and we have realized there is already a |
Builds will fail with timestamps not being preserved---packages using GNU Autotools are an easy example of this. And as I mentioned before, this completely wrecks reproducible builds. If breakage can occur both ways, I think you're better off handling timestamps in a manner consistent with other tools. If wonky timestamps are a concern, then add a layer of sanitization. If some users are unknowingly depending on incorrect timestamp handling to let their builds work, we're doing them no favors by keeping them ignorant.
How often is such a scenario going to arise, compared to GitHub and the like? This is optimizing for an edge case, instead of the common one.
You can read the documentation they've provided. The only security issue noted is files being created outside the extraction area (look for "Never extract archives from untrusted sources ..."). If they intended to give a signal, I'm pretty sure they would have written it down.
If I can enable it in This behavior not being the default is going to be hella awkward for #14480, and reproducible build support generally. If the source archive is a .zip file, then the build will just break until timestamps are preserved. (Again, this is how I first ran into the issue; my alpha implementation of And lastly, there is an additional implementation issue with .zip timestamps: these are specified in an undefined local time, not UTC. So to keep things general, you'd need an optional time zone parameter. (See the |
I am afraid this is not how the commitment to stability works. The current implementation has been working without issues, not breaking users, for some time and at large scale. It is not that breakage "can occur", the current implementation doesn't fail. Yes, it might not produce reproducible builds, but it doesn't crash or abort because of some zip archive format or timestamps. If we were getting reports of things like that, we would fix it, but we haven't. But doing this change, and having some user reporting some breakage should be considered a regression and reverted, we will not tell them: "you should fix your zip archives", because maybe they don't control those archives or other reasons.
Well, we have many thousands of users already. I have already been there a few times, and we are the ones getting the calls from upset users because the latest Conan release broke their builds in unexpected ways. Often it will not even be visible in Github, because the larger the company the less likely they are to use Github to report and they will escalate the issue in other ways, requiring even doing video calls, sometimes taking hours to pair-debug and understand that their build failed because of that apparently correct change. Maybe it is only a 0.1% of cases, but that still means quite a few users being affected and reporting to us. We don't break users just because they are not the majority.
It is not an optimization, it is the Python default, which has massive usage. Not doing always the timestamp thing isn't an optimization, because it is working already correctly at scale. The edge case is adding the extra functionality to keep those timestamps in zip format.
One of the consequences of using a package manager is that reproducible builds are not that critical. Once a binary is build from some source for a given configuration, the devops best practices recommend that it never should be built from source again. So while it is great to be able to do reproducible builds, the usage of a package manager that tracks the sources unicity with recipe revisions and binary unicity with package-id reduces the needs to have reproducible builds. I know reproducible builds is important for some users, but beyond all the support tickets and pull requests we process we also do several calls per week with different users from many companies, from startups to many fortune 100 companies. And reproducible builds is very far away from being a common concern, not to say the vast majority of users doesn't really care. I am not saying it is not important, just summarizing the feedback we see at scale from many different users, and we need to evaluate and balance the needs, use cases and constraints of all users.
But this would more easily break again reproducible builds, because the conf can change more easily. While defining what is necessary in the
We can consider this, but this doesn't feel right at first sight either, as if the value can change, it is actually possible to get a package successfully built in one run, because the conf is not yet active, and then failing when activating the conf. I'll discuss this possibility with the team. |
It is frustrating to articulate concrete issues with the current handling of .zip file timestamps, and see them dismissed on grounds of breakage that might occur under extremely narrow circumstances that do not even represent typical Conan usage. It's one thing to bend over backwards to accommodate unusual corner cases, but quite another to do so at the expense of reasonable operation for everyone else.
Do I need to add an example of build breakage to #15268? I can easily put one together.
You could provide options for sanitizing/normalizing those archives. But then, how far into the weeds do you want to go to support non-standard files? Is any arbitrary .zip supposed to work? Remember that some .zip files use compression algorithms that are only implemented on Windows, for example. Repacking is going to be necessary in some circumstances.
It would be one thing if the current timestamp handling were documented, or otherwise part of an interface contract. Otherwise, it's within the realm of things that can change, possibly without warning. I've had stuff break when I pull in a new version of Conan. But at least in the last instance, I didn't squawk about it, because I was using internal date-formatting routines---and relying on those was the mistake, not Conan shifting its internals around.
The "common case" I was referring to was that timestamps are preserved when a source archive is unpacked. Why Python made their zipfile API that way, I do not know, but that is not relevant to the point of whether ignoring timestamps is reasonable in a source-archive context---it is not. The most you could argue here is that I'm the first user to actually complain about it.
How do you know that your package-build toolchain hasn't been compromised? Reproducible builds allow you the option of remaking the infrastructure from the ground up and testing whether you still get the same output. A package manager provides many benefits for a build toolchain, but it doesn't touch the "Trusting Trust" problem.
To be sure, it's ahead of where the industry is right now. One factor in making it mainstream is having tools that facilitate that use case. As an example, we're in a very different place now with Docker et al. compared to "golden image" workflows from back in the day.
Note that reproducible builds are only applicable to the specific circumstances of a user or user org. There is no way that a conanfile can be written such that everyone gets the same build, because it doesn't specify the build environment (what container, what compiler, what flags?) that has a significant effect on the outputs. So there is little benefit in having the conanfile nail down the timestamps. But it is important that whatever timestamps you do get, they should remain consistent over time. A build six months from now should be the same as a build today. You do need to have the same config, of course (like I would probably set the config option to always interpret .zip timestamps w.r.t. UTC. The alternative would be to use the time zone of my org's HQ, but that has Daylight Savings, and it's conceptually simpler to not deal with that. (Note that leaving the time zone unconfigured would presumably default to using the system/local time zone, as that is what DOS timestamps officially refer to. But that then leaves the door open to inconsistencies when running on systems in a different time zone, or even misconfigured time zones.)
In my context, if you don't start with a |
Changelog: (Fix): Extract .zip files with recorded file timestamps
Proposed fix for #15268.
Timestamps need to be set after directories have been populated, because directory timestamps are updated whenever a new file is added to them.
Note that the Windows side of things is not yet addressed. I'm not sure what, if anything, needs to be done differently there. If I can get confirmation that the same approach is workable, I can update this change to use the same
file_timestamps
array for both sides.