Discussion: what to do with the "import name -> package name" mapping from conda-forge #92

Open
ericdill opened this issue Mar 5, 2023 · 18 comments

Comments

@ericdill
Owner

ericdill commented Mar 5, 2023

Hi Team,

depfinder has some code in reports.py that does a pretty good job mapping from "importable module" to "most likely package that has that module". Turns out that the code that enables this behavior in depfinder relies on a part of the bot that has been disabled for a little over a year. That part of the bot generates the files in libcfgraph/import_maps. The import map generation was disabled because it was generating json files that were over 100MB in size. And that was over a year ago. So that brings us to my question of what should we do about this?

  • on the one hand, having depfinder spit out "these are the packages that you need to install given your imports" is really great.
  • on the other hand, the information that depfinder is using to do this is woefully out of date (~13 months old) and so is old information better than no information?

Bringing this functionality back into the bot is not something I'm particularly keen to solve right now. It seems like a problem that's very well suited to "use a database for this", but since CF doesn't have access to databases, we're left having to do this with files and git.

If we don't have anyone interested in bringing this functionality back into the bot then my vote would be to disable this feature. We can reconsider bringing it back once the conda-forge bot is providing updated information.

What do you think @beckermr @CJ-Wright @mariusvniekerk?

What are the downsides to disabling this in depfinder?

@ericdill
Owner Author

ericdill commented Mar 5, 2023

@ocefpaf too!

@beckermr
Collaborator

beckermr commented Mar 5, 2023

Good questions. The bot itself also has code that produces a "ranked hub authorities file" that is related to this. I don't understand that relationship. It might be good to flesh that out a bit maybe?

@ericdill
Owner Author

ericdill commented Mar 5, 2023

| idx | file | bot | depfinder |
|---|---|---|---|
| 1 | import_maps_meta.json | file that contains the upper limit of characters in the import_maps/*.json file names | uses to determine which file to go download when looking for the import name -> package artifact relationship |
| 2 | import_maps/*.json | files that used to be produced by the bot. contain mapping of import name to packages that provide the import | depfinder grabs these files to produce the mapping of "all possible packages that could provide this import" |
| 3 | .file_listing.json | bot produces this file as a sorted list of all of the artifacts that are currently on conda-forge | depfinder uses this file to make a mapping of full_package_string : package_name, e.g., 21cmfast-3.0.2-py36h1af98f8_1 : 21cmfast |
| 4 | ranked_hubs_authorities.json | file produced by the bot that attempts to score packages based on the number of other packages that depend on them, among other things | used by depfinder to determine the "most likely package name" given an import |

ok so how does depfinder use these files?

A. given import_name, figure out which import_maps/* file needs to be downloaded, then download that file. Open up that file, grab all of the artifacts that provide import_name. This step uses rows 1 & 2 above
B. Download .file_listing.json (row 3 above) and make a mapping of full_package_string to package_name. For each of the artifacts pulled out in step A, figure out their package_name from the mapping that we make in this step (step B).
C. Given 1 or more package_name's from step B, grab the first one that appears in the ranked list in ranked_hubs_authorities.json (row 4 in the table above)
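
Steps A through C can be sketched roughly as follows. This is a minimal illustration using in-memory data rather than the downloaded files, not depfinder's actual code. The function names, the shape of the import map (import name to list of artifact strings), and the prefix-based shard naming are assumptions made for the example; the only convention taken as given is that conda artifact strings follow name-version-build.

```python
from typing import Dict, Iterable, List, Optional


def shard_filename(import_name: str, num_letters: int) -> str:
    """Step A (file choice): hypothetical shard naming, assuming the
    import_maps files are keyed by the first `num_letters` characters
    of the top-level module name."""
    top_level = import_name.split(".")[0]
    return f"{top_level[:num_letters]}.json"


def package_name(artifact: str) -> str:
    """Step B: conda artifacts are named <name>-<version>-<build>,
    so the package name is everything before the last two dashes."""
    return artifact.rsplit("-", 2)[0]


def first_ranked(candidates: Iterable[str], ranked: List[str]) -> Optional[str]:
    """Step C: return the first entry of the ranked list that is a candidate."""
    wanted = set(candidates)
    for name in ranked:
        if name in wanted:
            return name
    return None


def resolve(
    import_name: str,
    import_map: Dict[str, List[str]],  # assumed shape: import name -> artifacts
    file_listing: List[str],           # assumed shape: list of artifact strings
    ranked: List[str],                 # package names, best first
) -> Optional[str]:
    top_level = import_name.split(".")[0]
    artifacts = import_map.get(top_level, [])              # step A
    name_by_artifact = {a: package_name(a) for a in file_listing}
    candidates = {name_by_artifact[a] for a in artifacts
                  if a in name_by_artifact}                # step B
    return first_ranked(candidates, ranked)                # step C


# Toy data with made-up package names, only to exercise the three steps.
toy_import_map = {"foo": ["foo-core-1.0-py_0", "foo-legacy-0.9-py_0"]}
toy_listing = ["foo-core-1.0-py_0", "foo-legacy-0.9-py_0"]
toy_ranked = ["numpy", "foo-core", "foo-legacy"]
best = resolve("foo.utils", toy_import_map, toy_listing, toy_ranked)
```

With the toy data, both artifacts provide `foo`, and `foo-core` wins because it appears earlier in the ranked list.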

Does this help @beckermr ?

@beckermr
Collaborator

beckermr commented Mar 5, 2023

Helps a bit but files in libcfgraph are not made by the bot. So I think depfinder uses two services.

@ericdill
Owner Author

ericdill commented Mar 5, 2023

oh. weird. ok. i guess cf-scripts only writes to cf-graph-countyfair?

in that case, seems like the pypi_name_mapping github action produces import_name_priority_mapping.json that I could use instead

the above file is a data structure that looks like this:

[
  {"import_name": "ATE", "ranked_conda_names": ["semi-ate"]}, 
  {"import_name": "AWSIoTPythonSDK", "ranked_conda_names": ["awsiotpythonsdk"]}, 
  ...
]
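
A structure like this could be turned into a lookup table along these lines. This is only a sketch: the helper names are made up, the sample is the two entries shown above, and it assumes the first entry of ranked_conda_names is the preferred package. It returns None for unknown imports, since the file may not cover every import name.

```python
import json
from typing import Dict, List, Optional

# The two sample entries shown above, as they might appear in the JSON file.
SAMPLE = json.loads("""
[
  {"import_name": "ATE", "ranked_conda_names": ["semi-ate"]},
  {"import_name": "AWSIoTPythonSDK", "ranked_conda_names": ["awsiotpythonsdk"]}
]
""")


def build_priority_index(entries: List[dict]) -> Dict[str, List[str]]:
    """Index the list of records by import name for constant-time lookup."""
    return {e["import_name"]: e["ranked_conda_names"] for e in entries}


def best_conda_name(index: Dict[str, List[str]], import_name: str) -> Optional[str]:
    """Return the top-ranked conda package for an import, or None if unknown.

    Assumption: the list is ordered best-first. Only the top-level module
    name is looked up, so "ATE.sub" is treated like "ATE".
    """
    ranked = index.get(import_name.split(".")[0], [])
    return ranked[0] if ranked else None


index = build_priority_index(SAMPLE)
```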

so what writes to libcfgraph then? oh there's a circleci action that updates libcfgraph i guess? what does libcfgraph do?

@beckermr
Collaborator

beckermr commented Mar 5, 2023

Right there is a circleci action that writes to libcfgraph. libcfgraph collects info about every package into a single repo. It is used by a bunch of conda-forge stuff including the mamba solver for run exports and our scanning service to try and detect harmful files in packages.

@beckermr
Collaborator

beckermr commented Mar 5, 2023

IDK if the import name priority mapping is complete or only covers nodes that are ambiguous. Also note that grayskull uses some of this data too. :/

@beckermr
Collaborator

beckermr commented Mar 5, 2023

As usual, the answer is to fix libcfgraph and just keep the status quo. We don't have the resources to pay down debt, but we can service it.

@beckermr
Collaborator

beckermr commented Mar 6, 2023

@ericdill New import to pkg maps are appearing here: https://github.com/regro/libcfgraph/tree/master/import_to_pkg_maps

These only have the package name and not the full artifact. They should be a lot smaller.

@ocefpaf
Collaborator

ocefpaf commented Mar 6, 2023

@ericdill I'm a bit late to this discussion, but my opinion is the same as before we added this to depfinder. It is a nice feature to have, but I'd rather have it as a plugin, an optional extra, or a separate module than inside depfinder itself, in order to reduce the maintenance burden here.

@beckermr
Collaborator

beckermr commented Mar 6, 2023

Agreed. We should ship a package of simple apis for pulling this metadata.

@beckermr
Collaborator

beckermr commented Mar 6, 2023

This has the nice side effect that if the data is moved to another device we can easily move everything over.

@ericdill
Owner Author

ericdill commented Mar 6, 2023

oh that's a nice idea. would we make that new package part of the regro org?

@ericdill
Owner Author

ericdill commented Mar 6, 2023

> is the same as before

thanks @ocefpaf . i had forgotten the previous discussion. glad you recall!

@beckermr
Collaborator

beckermr commented Mar 6, 2023

> oh that's a nice idea. would we make that new package part of the regro org?

Sure. That's the best spot. Something like conda-forge-tick-data would be fine.

@beckermr
Collaborator

So grayskull doesn't pull from the bot data for these maps anymore. It maintains its own list of differences. They may have come from the bot at one time, but now it is separated.

@beckermr
Collaborator

The data used by depfinder is now wrapped into this package: https://github.com/regro/conda-forge-metadata

Here is how to use it

from conda_forge_metadata.autotick_bot import map_import_to_package


def test_map_import_to_package():
    assert map_import_to_package("numpy") == "numpy"
    assert map_import_to_package("numpy.linalg") == "numpy"

    # something bespoke
    assert map_import_to_package("eastlake") == "des-eastlake"

    assert map_import_to_package("scipy") == "scipy"

@jaimergp

The reduced-size mapping is now erroring out too: regro/libcfgraph#14

:D
