[2020-resolver] Pip downloads lots of different versions of the same package #8713
Comments
Thanks for filing this issue @jcugat! pip does indeed try to use multiple versions of the same package. This is because of conflicting requirements in the dependency graph it's working with, and is a part of the proper dependency resolution process (the old resolver did not do things correctly). It basically tries to use a specific version of boto3, and then, when it realizes that version creates a conflict, it backtracks that choice and tries the next version. We are working on what would be a good way to convey this behavior change, and figuring out a good way to present this to the users -- would you have any suggestions/inputs to that end? |
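To make the backtracking described above concrete, here is a minimal sketch of a depth-first resolver over a toy index. Everything in it is an assumption for illustration: the `INDEX` data, the integer versions, and the `resolve` function are hypothetical, and this is not resolvelib's actual algorithm. The `tried` list records each candidate that had to be inspected, which is the point where pip would need to download a distribution to learn its dependencies.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy index: package -> {version: {dependency: allowed versions}}.
# Versions are plain ints for brevity; this is not real PyPI data.
INDEX: Dict[str, Dict[int, Dict[str, Set[int]]]] = {
    "app": {1: {"boto3": {1, 2, 3}, "spam": {1}}},
    "boto3": {3: {"spam": {2}}, 2: {"spam": {2}}, 1: {"spam": {1}}},
    "spam": {2: {}, 1: {}},
}


def resolve(
    todo: List[Tuple[str, Set[int]]],
    pinned: Dict[str, int],
    tried: List[Tuple[str, int]],
) -> Optional[Dict[str, int]]:
    """Return a consistent set of pins, or None if the requirements conflict."""
    if not todo:
        return pinned
    (name, allowed), rest = todo[0], todo[1:]
    if name in pinned:
        # Already chosen: the existing pin must satisfy this requirement too.
        return resolve(rest, pinned, tried) if pinned[name] in allowed else None
    for version in sorted(INDEX[name], reverse=True):  # newest first
        if version not in allowed:
            continue
        tried.append((name, version))  # a real installer would download here
        deps = list(INDEX[name][version].items())
        result = resolve(deps + rest, {**pinned, name: version}, tried)
        if result is not None:
            return result
        # A conflict appeared further down: backtrack and try an older version.
    return None


tried: List[Tuple[str, int]] = []
result = resolve([("app", {1})], {}, tried)
print(result)
print([version for name, version in tried if name == "boto3"])
```

In this toy data only the oldest boto3 is compatible with the pinned spam, so every newer boto3 is inspected and discarded first, which mirrors the long run of downloads reported in this issue.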
What I found really surprising is that pip needs to download all those packages to do the backtracking. Doesn't it have enough info from the required versions to go directly from |
Unfortunately no, that's not how the algorithms work. In theory, intermediate versions could have different dependencies that alter the possibilities - and we have to download to find the dependencies. It's one of the frustrating "this might happen so we have to allow for it, even though in practice nobody¹ ever does this, so it's a waste of time" features of Python packaging that it's difficult to explain well to people who don't have to deal with the silly edge cases... Obligatory XKCD: https://xkcd.com/1172/ ¹ Except that one guy with a package with a really weird workflow, who yells when you assume you can simplify things 🙁 |
@jcugat, pip downloads the distributions only to retrieve the dependency information. Say some project has dependency requirements (e.g. spam>42) conflicting with those of boto3>1.12.32 (e.g. spam<42); there's no way pip can know what spam version boto3 from 1.12.33 to 1.14.34 requires by just looking at boto3 1.14.35. It seems that there's no way to lower the complexity of dependency resolution (which is NP-hard). Concerning each download, however, there could be a faster way than downloading whole distributions (e.g. only the metadata part of a wheel like |
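As an illustration of where that dependency information lives, the sketch below reads the `Requires-Dist` headers from the `METADATA` file inside a wheel, using only the standard library. The `wheel_requirements` helper and the in-memory `demo` wheel are hypothetical names made up for this example; it reads a wheel that is already available locally and does not show any partial-download technique.

```python
import io
import zipfile
from email.parser import Parser
from typing import List


def wheel_requirements(wheel_file) -> List[str]:
    """Return the Requires-Dist entries stored in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_file) as wheel:
        # A wheel keeps its metadata at <name>-<version>.dist-info/METADATA.
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = wheel.read(metadata_name).decode("utf-8")
    # METADATA uses an email-header style format, so the email parser works.
    return Parser().parsestr(text).get_all("Requires-Dist") or []


# Build a tiny fake wheel in memory so the example is self-contained.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "demo-1.0.dist-info/METADATA",
        "Metadata-Version: 2.1\n"
        "Name: demo\n"
        "Version: 1.0\n"
        "Requires-Dist: spam (<42)\n"
        "Requires-Dist: eggs\n",
    )
demo.seek(0)

requirements = wheel_requirements(demo)
print(requirements)
```

The metadata is a small text file, but it sits inside the archive, so a client that can only fetch whole files has to download the entire wheel to read it. Source distributions do not have this file in a reliable form at all.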
I have a vague intention to one day look at pip maintaining some sort of persistent cache of dependency data. But it's low priority (and that guy I mentioned would no doubt come along with a use case that invalidates the idea 😉). As @McSinyx says, though, the problem of dependency resolution is fundamentally hard, so there are limits to how much we can do (if we exclude the option of "get the wrong answers", which is what the old resolver did, and which proved not to be what people wanted 😉) |
I wondered a while ago whether Core Metadata could add a field for packages to declare they follow Semantic Versioning. pip could backtrack much more efficiently with that assumption. But then the question would be what do we do if a package declaring it does not actually follow the rules. |
Ok, that makes a lot of sense now, and I see the complexity and what pip is trying to do in those cases. But from a user's perspective, it was very surprising at first, as if pip were stuck in an infinite loop downloading all possible versions of boto3. The issue could be solved with better output from pip during dependency resolution, similar to what's being tracked in #8683 or #8346. |
From a graph exploration perspective -- the algorithm in resolvelib is not the most optimized and we can certainly do a lot more of "tree trimming" tricks there. The issue is that it's gonna be non-trivial for us to implement those and that we're trying to do this work with limited funded developer time. I'd love to spend a few months pulling my hair out trying to figure out why my implementation of some optimization doesn't work BUT I'm pretty sure it is a bigger priority to get something good-enough out of the door instead of getting to perfect. |
@pradyunsg: I don't have a problem with the resolver needing to introspect the dependencies for several versions of a package before deciding on a version to install. In fact, I agree with the reasoning behind it. What I do have a problem with is having to download the full versions to figure out the dependencies. Consider running the following: Granted, this is an extreme example. But considering that there are systems with limited resources (a raspberry pi for instance), this could very well be a problem, especially if there are several libraries that need to be installed all with various interdependencies between them. Ideally, the dependencies should be retrievable as metadata about the package version and only download the full package if that piece of metadata is missing. But considering that adding it as metadata probably isn't trivial, it's probably a future "nice to have". I don't have an easy solution to this, just thought you should keep it in mind. Also, I did want to say thank you for working on the resolver. It sorely needed updating. |
That’s unfortunately how Python packaging works. There are proposals for better methods, but pip does not have a choice at the current time. Feel free to join the conversations on discuss.python.org if you are interested in improving the situation. |
Indeed. It's definitely something that we do have in mind. There are planned changes that would address those concerns. I don't think we can make any promises on the timeline of those since they're volunteer-driven. Two that come to mind are:
In particular, I think the important parts for this issue, in the short term, are that we get the output correct, and communicate about these changes as best as possible (through documentation of the changes + workarounds, signal boosting etc). |
It happens here as well, please take a look!
|
Another example of an endless loop caused by the use of the new resolver: https://github.com/ansible-community/molecule/blob/master/Dockerfile -- if you fail to see the |
Apparently the new resolver can get into endless loops of building packages. Related: pypa/pip#8713
I've opened this discussion in discuss.python to suggest possible "low-tech" improvements to the user experience in this situation. TL;DR - my suggestion is based on the presumption that these interim packages are no longer important to the user, therefore pip should "clean up" - this can be a prompt to the user, a message printed explaining how the user can do this, or it being done automatically. I don't believe the solution here is only to point the user to a documentation page explaining why pip has done this. What is needed is, in order of priority, 1) enable the user to remove these packages (if it does not affect their environment), 2) explain to them why this has happened. |
Has something like the following been considered? (Apologies if this idea is based on wrong assumptions about how pip does its dependency resolution.) |
That was the point behind @uranusjr's comment here and @pradyunsg's follow-up. Basically, we could do a lot better with infrastructure/standards changes, and we are making (slow - we're all volunteers) progress with such changes. Once they happen, pip should take advantage of them (issues like this will be reminders to do that). The question right now is what we can do in pip in the context of current infrastructure. |
Hilarious. Also totally agree with @c7hm4r
That would surely be more efficient than filling up your disk with dozens and dozens of totally unnecessary previous versions of a package. |
In fact, keeping them is not so bad. Think about using pip on 100-200 projects, all tested with multiple versions of Python, some with maintenance branches. You end up installing a wide range of versions of the same package. IMHO what it needs to do is to index their metadata and store the last use of a package. If it removed the oldest packages (30d?) every week, that would be OK. |
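The pruning idea in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions: `prune_cache` is a hypothetical helper, not a pip feature, and it uses each file's modification time as a stand-in for "last use", whereas the suggestion was to record last use explicitly.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def prune_cache(
    cache_dir: Path, max_age_days: float = 30, now: Optional[float] = None
) -> List[Path]:
    """Delete cached files older than max_age_days and return what was removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    # sorted() materialises the listing before anything is deleted.
    for path in sorted(cache_dir.rglob("*")):
        # Assumption: mtime approximates the last time the file was used.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


# Demo on a throwaway directory with one stale and one fresh file.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "old-1.0.whl").write_text("stale")
    (cache / "new-2.0.whl").write_text("fresh")
    forty_days_ago = time.time() - 40 * 86400
    os.utime(cache / "old-1.0.whl", (forty_days_ago, forty_days_ago))
    removed_names = [path.name for path in prune_cache(cache)]
    remaining_names = sorted(path.name for path in cache.iterdir())

print(removed_names, remaining_names)
```

A weekly run of something like this would bound the cache size without throwing away versions that several projects still install regularly.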
@ssbarnea I think we are talking about different things. I considered a scenario where packages are installed into a virtual env according to a requirements.txt. The problem was then that PIP first downloaded many versions of a single package into /tmp (i.e. a RAM disk). Once the operation completed, it would have deleted nearly all of those versions. So the open question is whether there is a purpose in accumulating all the versions first – which I doubt – instead of immediately deleting a package version when it is clear that PIP needs another version of the package.
You seem to talk about using a single environment for many different projects. Also I think you assume that PIP downloads the superfluous package versions into persistent storage. (Even in a scenario of 200 Python projects I would think that many of the historic minor versions of many of the packages are not needed. But let's not discuss that.)
I agree that the final solution should be a metadata web API, but probably little is needed to fix the OOM issue first.
|
AFAIK, pip download cache is per user and not per system or per project. My cache is now ~9.5GB 🤔, you can find it running I would indeed be worried about cluttering I do consider this bug fixed for now and I think you should better create a new one that would be very clear about the problem, we are already going in weird directions and I would be sad to see maintainers having to lock the topic to reduce noise. There are 3 different issues debated in the last messages: footprint of cache, footprint of /tmp and amount of downloads from the index. While strategy on one affects others, a bug should be only about one specific issue. |
Thank you to the maintainers for explaining the decision in calm and kind words and for all your hard volunteer work on this project 🙏🏻 I'd like to reinforce what a couple of people mentioned above: while the behavior today is overall a good thing and the more extreme cases like boto3 and azure will tend to come out in the wash, the messaging could be improved. The fact that we all arrived here after googling due to our confusion about pip's behavior is proof that it's unclear what pip is doing and why. As an example, rather than hundreds of lines like
it could be nice to see a single line like Backtracking boto3 to find a compatible version...
# Later...
Backtracking boto3 to find a compatible version... using boto3-1.7.34-py2.py3-none-any.whl |
We are definitely in agreement with that. However, it's hard to work out how to do that - the resolver mechanism is complex (both in terms of the code, and algorithmically) and working out both how to capture the necessary progress information, and when to report it and how to summarise it, is not easy. If anyone wants to look at the code and come up with suggestions, that would be most welcome. Otherwise be assured that we do want to improve the messages, but we can't promise how soon we'll be able to do so. |
We have the messaging you're suggesting, already being logged. Copying from an example above:
|
Both aws and azure managed to put themselves into a corner by using questionable packaging methods for their PyPI uploads. It is clearly not pip's fault that these packages create installation problems. In fact, I do like displaying each version on a separate line; it raises awareness about dependencies being too loosely specified somewhere in the chain. The trick is to narrow down the ranges in order to speed up the process. |
I'm 8 hours into a "pip install" run - does it ever actually give up? :-) |
Would it be possible to cache the dependency graph of all versions of a package on the pypi side, and provide it to pip clients as one file? |
PyPI doesn't have that information. For wheels it could (but doesn't currently) extract it from the wheel. For sdists, though, it can't know the information without running a build, and there's no infrastructure set up to do that. Also, it's possible for the dependency graph to change based on the target architecture, so "dependencies for all versions" isn't enough, you need "dependencies for all versions on all architectures". |
And, I should note that this is part of a PyPI revamp that the PSF's Packaging-WG is seeking funding for. https://github.com/psf/fundable-packaging-improvements/blob/master/FUNDABLES.md#revamp-pypi-api |
This issue can become a nightmare after merging #9631. |
same here, a real nightmare |
Just faceplanted into this when I found my CI running for 3 hours... I like how #9631's reasoning for let's not merge this is "people will be angry" and not "we'll break a whole bunch of shit without giving a good alternative" |
I'm 45 hours into a |
is it madness to let it keep going?
I think you could have stopped at the one hour mark 😄
Do you have access to the output? Is it stuck at resolution computation
or still downloading?
Or is there something else I should do?
If you could figure out which package causes the conflict you can
contact upstream for better compatibility.
|
Thank you McSinyx!
Here is the output:
It seems that numerous versions of this package took up the most time, although there was plenty of time spent on others before it: It seems to get slower and slower with each package/version that it tries. Is that right? And if so, why? Thanks again! FYI - I still haven't stopped it because why not. At 64 hours now...[laughs to keep from crying] 😆 |
Are you trying to install the Object Detection API? If so, the reason you are stuck at Downloading kaggle-1.5.10.tar.gz (59 kB) is that you have a problem with your installation of protocol buffers. Make sure you correctly add Google protobuf to your environment path, on your User path. I have experienced the exact same problem, and adding Google protobuf to the path correctly solved it. |
because you have a problem with your installation of protocol buffers...
If this is the case, IMHO pip should behave better by failing
instead of continuing to try similar solutions. As for how,
I don't know, since if I understand correctly this fallback mechanism
is also used for favoring binary distributions prioritized after
sdists that fail to build.
|
Thank you for your help. I am indeed trying to install Tensorflow's Object Detection API ( |
here's a good reference for you.. https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html |
First of all, apologies if this was already reported or it's expected behaviour. I tried searching previous issues but couldn't find anything; the only similar issue might be #8683, but its output is different.
What did you want to do?
From a completely empty virtualenv:
Output
Additional information
I was not expecting to download all those different versions of boto3, since previously pip only downloaded a single one:
I also tested this with the latest master version of pip and the same issue happens.
Output from pipdeptree: