Create PURL services CLI tool and library #247

Closed
pombredanne opened this issue Dec 13, 2023 · 118 comments

@pombredanne
Member

To best support using various PURL-based services, I would like to have a command-line client tool and a library that serve as a client API and expose these services for integration elsewhere.

@johnmhoran
Member

@pombredanne @AyanSinhaMahapatra

I've looked at the SCTK fetch_thirdparty.py example, but I have to admit that I don't understand what a complete command for that utility would look like or how it might apply to the current issue. Examples of how to run the fetch_thirdparty example would be helpful for me to explore how that works. (I've looked but found no documentation/examples for that utility.)

In addition, the description above of the current issue seems rather vague. What does it mean to create a client API tool to access PURL services? Examples of PURL services we want to handle, and some descriptions of user input and output, would be particularly helpful.

The only exposure I've had so far with the PurlDB is the experimentation I've done since last Friday evening with the new validate endpoint.

  • Is that an example of a service you want this CLI tool/library to handle?
  • Would we want, for example, a command-line function that enables a user to input -- through the command line or perhaps a .txt or .xlsx file -- the 1+ PURLs that he or she wants to vet with the new validate endpoint? Maybe with options of terminal output and a .xlsx workbook?

@johnmhoran
Member

@pombredanne Now that we've (initially) addressed the validate endpoint with our new CLI, what additional "services" do you want me to focus on, and how can I identify them and begin to understand how users use those services?

johnmhoran added a commit that referenced this issue Dec 28, 2023
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Dec 28, 2023
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
@johnmhoran
Member

@pombredanne As noted last week, I'm blocked for now from additional CLI work until we can add the missing details to your initial description of this issue, i.e., ID the additional services, commands and use cases we want to include.

@pombredanne
Member Author

pombredanne commented Jan 3, 2024

So the next steps after validate (and after adding tests to validate) would be to use the latest and new fetchcode as a library to add two new sub commands:

  • versions: given a PURL, return the list of all known versions for this purl
  • meta: given a PURL, return a mapping of metadata fetched from the API for this PURL

After this I would like to see these:

  • urls: given a PURL, return a list of [{URL type: URL}, ...] as in [{"homepage_url": "https://example.com"}, {"vcs_url": "...."}] and various download URLs. Use the packageurl library for this (purl2url), updating it as needed, and also use scancode-toolkit packagedcode or code in dejacode. Optionally validate that each URL exists with a HEAD request (ask @chinyeungli on how to do this).
  • scan (purlcli: purl2scan #277): given a PURL, fetch the URLs, call a scancode.io API to run a scan with a scan_package pipeline, then return the scan results. Either wait for the scan to complete or poll until completion. Later implement the same with PurlDB, which has code in the "priority queue" to handle this.
  • git (purl2git: From PURL get Git versions and tags #258): takes a PURL and returns a new PURL for the corresponding git repo and the tag matching the PURL version. Do this for the implemented Maven, npm and Debian package types.
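
For the optional URL existence check mentioned in the urls bullet, a HEAD request can be sent with the standard library alone. This is only a minimal sketch: url_exists is a hypothetical helper name (not part of purlcli or fetchcode), the timeout is an arbitrary choice, and some servers reject HEAD requests outright, so a False result is a hint rather than proof.

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request to ``url`` gets a non-error response.

    Hypothetical helper for illustration; not part of purlcli or fetchcode.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError; ValueError covers
        # malformed URLs such as a missing scheme; OSError covers timeouts.
        return False
```

A caller could then annotate each entry of the urls output with the result of url_exists(url) instead of dropping unreachable URLs.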

johnmhoran added a commit that referenced this issue Jan 4, 2024
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
@johnmhoran
Member

@pombredanne Re the first bullet above -- a versions subcommand based on fetchcode -- are you looking for this sort of output, or perhaps just a list of versions as strings? (This is an excerpt from the pkg:pypi/scancode-toolkit output.)

        purl_versions = [
            [
                PackageVersion(
                    value="2.0.0",
                    release_date=datetime.datetime(
                        2017, 6, 23, 8, 35, 20, 322426, tzinfo=tzutc()
                    ),
                ),
                PackageVersion(
                    value="2.0.0rc3",
                    release_date=datetime.datetime(
                        2017, 6, 16, 16, 24, 2, 443222, tzinfo=tzutc()
                    ),
                ),
. . .

@johnmhoran
Member

Compare just the versions as strings, e.g.,

results_values = ['2.0.0', '2.0.0rc3', '2.0.1', '2.1.0', . . . '32.0.5rc3', '32.0.6', '32.0.7', '32.0.8']

@johnmhoran
Member

@pombredanne Do we want to get version data for both one PURL and for multiple PURLs, depending on the user's need? (Just as we do with validating either a single PURL or a list of PURLs.)

Also: What should the output look like: a list of string versions, or JSON (and if so, what would it look like)?

@pombredanne
Member Author

Always start with a single PURL. Expanding to a list is easy.

The output could be either:

  1. something like {"purl": "... input purl", "versions": ["1.1", "2.3" , ....]}
  2. or maybe better: {"purl": "... input purl", "versions": [{"purl": "pkg:[email protected]", "version": "1.1.2"}, {"purl": "pkg:[email protected]", "version": "1.1.3"}]}

This will account for multiple PURLs in both cases.

Eventually the output will need to account for the input in some header instead, much like in a ScanCode scan, but that is for the future and nothing urgent for now.
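
The second output shape can be assembled from plain version strings. A rough sketch, assuming the input PURL carries no version, qualifiers or subpath, so that appending @version is safe; build_versions_output is a hypothetical name, and real code would more likely build each PURL with the packageurl library.

```python
import json


def build_versions_output(purl, versions):
    """Build option 2: one {"purl", "version"} mapping per known version.

    Assumes ``purl`` has no version, qualifiers or subpath (illustration only).
    """
    return {
        "purl": purl,
        "versions": [
            {"purl": f"{purl}@{version}", "version": version}
            for version in versions
        ],
    }


output = build_versions_output("pkg:pypi/scancode-toolkit", ["2.0.0", "2.0.0rc3"])
print(json.dumps(output, indent=2))
```

Handling several input PURLs would then be a list of these mappings, one per input.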

@pombredanne
Member Author

pombredanne commented Jan 5, 2024

Manage objects internally, and deal with simple/plain serialized Python data at the end only.
Adding the release date of each version works too BTW, just make sure you use an ISO timestamp like it is done in our other APIs.
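
One way to read this: keep PackageVersion-style objects while processing and convert them to plain dicts only when writing the output. The sketch below uses a stand-in dataclass, since the real PackageVersion lives in fetchcode; serialize_version is a hypothetical helper name, and the output key names are assumptions.

```python
import datetime
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackageVersion:
    """Stand-in for fetchcode's PackageVersion, for illustration only."""

    value: str
    release_date: Optional[datetime.datetime] = None


def serialize_version(package_version):
    """Convert a version object to plain data with an ISO 8601 timestamp."""
    release_date = package_version.release_date
    return {
        "version": package_version.value,
        "release_date": release_date.isoformat() if release_date else None,
    }


version = PackageVersion(
    value="2.0.0",
    release_date=datetime.datetime(
        2017, 6, 23, 8, 35, 20, 322426, tzinfo=datetime.timezone.utc
    ),
)
print(serialize_version(version))
```

Versions without a known release date serialize to None rather than an empty string, which keeps the JSON unambiguous.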

@johnmhoran
Member

Thanks @pombredanne . 👍

@johnmhoran
Member

I'm working on the versions command (see above comments).

  • Some queries for a PURL that does not exist in the relevant repo (e.g., pkg:pypi/ogdendunes) return an error message like Error while fetching 'https://pypi.org/pypi/ogdendunes/json': 404 and result in an empty list (from my code). That error message seems to be generated by this fetchcode package_versions.py function.

  • Other queries for a PURL with a similar but no exact name match (e.g., pkg:pypi/foobar -- yes, there are a number of PyPI packages with foobar in the name) result in an empty list (from my code) but no error message from fetchcode.

My CLI code detects the empty list and displays a message in the terminal (There was an error with your '{purl}' query. Make sure that '{purl}' actually exists in the relevant repository.) -- but I'd like to prevent the fetchcode 404 error message from also being displayed in the terminal as is currently the case.

Is there some way to do this?

@johnmhoran
Member

More info:

The fetchcode error is displayed in the terminal each time one of these two variables is defined in the code (they produce an empty list):

results = list(versions(purl))
results = list(router.process(purl))

These, otoh, do not invoke a fetchcode error displayed in the terminal, and each produces a generator object.

test01 = versions(purl)
test02 = router.process(purl)
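
This difference is what Python generators do in general: calling a generator function runs none of its body, so the fetch (and any error it prints) only happens once something iterates over the result, which list() does. A self-contained illustration with a stand-in function, not fetchcode's actual code:

```python
def fake_versions(purl, log):
    """Stand-in generator: records the 'fetch' only when iteration starts."""
    log.append(f"fetching {purl}")
    found = []  # simulate a lookup that finds nothing
    yield from found


log = []
lazy = fake_versions("pkg:pypi/ogdendunes", log)
print(log)  # nothing has run yet, so the log is still empty

results = list(lazy)
print(results, log)  # the body ran during list()
```

So the two assignments without list() only postpone the request; the message would appear as soon as the generator is consumed.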

@johnmhoran
Member

Actually, I should be able to use 'validate' and display a message to the user for each PURL for which 'validate' returns "exists": false ....

@pombredanne
Member Author

@johnmhoran I would not worry too much about the CLI output for now, as long as the JSON is correct.
If fetchcode displays an error message, then that's an issue there, not here ... @TG1999 @keshav-space

@TG1999
Contributor

TG1999 commented Jan 9, 2024

johnmhoran added a commit that referenced this issue Jan 11, 2024
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Jan 11, 2024
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Jan 11, 2024
johnmhoran added a commit that referenced this issue Jan 11, 2024
johnmhoran added a commit that referenced this issue Jan 11, 2024
@johnmhoran
Member

@pombredanne @JonoYang I'm close to being ready to commit and push my latest purlcli.py and test_purlcli.py. All 42 tests pass (3 test classes, 1 for each current command/service, e.g., class TestPURLCLI_validate(object), and each is parametrized, thus my use of object as argument per my research -- TestCase and FileBasedTesting seem to be incompatible with @pytest.mark.parametrize()).

I ran make test, expecting just 1 failure as in the past, but this time, 2 failed.

FAILED minecode/tests/test_maven.py::MavenEnd2EndTest::test_visit_and_map_with_index - AssertionError: Lists differ: [{'ur[31 chars]ven2/cnuernber/dtype-next/0.4.2/dtype-next-0.4[49087 chars]one}] != [{'ur[31 chars]ven2/.index/nexus-maven-repository-index.532.g[49087 chars]one}]

FAILED minecode/tests/test_ls.py::ParseDirectoryListingTest::test_parse_listing_from_lslr - AssertionError: Lists differ: [{'pa[1527 chars] '2023-01', 'target': None}, {'path': 'dists/e[974 chars]one}] != [{'pa[1527 chars] '2024-01', 'target': None}, {'path': 'dists/e[974 chars]one}]

No idea why, no reason to think this results from my work, but who knows? test_visit_and_map_with_index has failed with make test since I first cloned the repo. test_parse_listing_from_lslr is a new failure.

Unless you suggest otherwise, I'm going to vet my code and tests for a final cleanup, commit and push. ;-)

johnmhoran added a commit that referenced this issue Jan 17, 2024
@johnmhoran
Member

Just committed and pushed.

@JonoYang
Member

@johnmhoran I wouldn't mind the test_parse_listing_from_lslr for now. This test fails every so often due to changes in file dates when the test is run. I will make a PR to revisit this test or remove it.

johnmhoran added a commit that referenced this issue Feb 16, 2024
johnmhoran added a commit that referenced this issue Feb 16, 2024
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Feb 17, 2024
@johnmhoran
Member

@JonoYang I see that the PR I just merged has a failed test. Many of the commands I'm adding return dynamic values (URLs and versions) that change over time; in this case the code_view_url and download_url for pkg:rubygems/rails have changed in the last day.

This sort of issue will arise going forward for many of these URLs, and for versions where new versions are added, for example. I will take a look at all of these tests after my break to see how I can remove those key-value pairs from the tests, but for some, like metadata and urls and versions, this dynamic data is an important part of the output. Please let me know if you have any suggestions and/or tried-and-true approaches.

@JonoYang
Member

@johnmhoran

In scancode.io/scanpipe/tests/test_pipelines.py, we have the method PipelinesIntegrationTest.assertPipelineResultEqual() (https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/test_pipelines.py#L459), where we normalize the Package UID and related fields (https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/test_pipelines.py#L426) and remove other fields that can contain dynamic data (https://github.com/nexB/scancode.io/blob/main/scanpipe/tests/test_pipelines.py#L384) before comparing the results to expected results.

You can also look into mocking API responses for certain tests to return a set of expected data that you define. An example of a test using mock is https://github.com/nexB/purldb/blob/1272a61ce885d600ee79969de05d0e1c34b283e4/packagedb/tests/test_package_managers.py#L72
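
As a rough illustration of the mocking approach, the sketch below replaces a network helper with a unittest.mock object so the test always sees the same data. fetch_json and get_version_values are hypothetical stand-ins defined here for the example; they are not fetchcode or purldb functions. The URL pattern follows the PyPI JSON URL seen earlier in this thread. The fetcher is passed in as a parameter to keep the example self-contained; mock.patch on a module attribute would work similarly when the function signature cannot change.

```python
import json
import urllib.request
from unittest import mock


def fetch_json(url):
    """Hypothetical network helper: download and decode a JSON document."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def get_version_values(name, fetch=fetch_json):
    """Return the release versions of a PyPI-style payload, sorted as strings."""
    data = fetch(f"https://pypi.org/pypi/{name}/json")
    return sorted(data.get("releases", {}))


def test_get_version_values_with_mock():
    canned = {"releases": {"2.0.1": [], "2.0.0": []}}
    # The mock stands in for the network call and returns fixed data.
    fake_fetch = mock.Mock(return_value=canned)
    assert get_version_values("example", fetch=fake_fetch) == ["2.0.0", "2.0.1"]
    fake_fetch.assert_called_once_with("https://pypi.org/pypi/example/json")
```

With the response pinned this way, a new upstream release no longer changes the expected output of the test.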

@johnmhoran
Member

Thank you @JonoYang -- I'll give these a close look. The _without_keys looks particularly interesting. I'm currently doing something similar with def streamline_headers(headers) from cli_test_utils.py, with which I remove the tool_version key-value pair from the headers section since it's also a value that changes.
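
A generic version of that idea, stripping volatile keys at any nesting depth before comparing actual and expected output, might look like the sketch below. without_keys is a hypothetical helper written for this example (it is not the scancode.io _without_keys method or streamline_headers), and the sample data is made up apart from the tool_version and download_url key names mentioned above.

```python
def without_keys(data, exclude_keys):
    """Return a copy of ``data`` with ``exclude_keys`` removed at every level."""
    if isinstance(data, dict):
        return {
            key: without_keys(value, exclude_keys)
            for key, value in data.items()
            if key not in exclude_keys
        }
    if isinstance(data, list):
        return [without_keys(item, exclude_keys) for item in data]
    return data


# Made-up sample results that differ only in volatile values.
actual = {
    "headers": [{"tool_name": "purlcli", "tool_version": "0.2.0"}],
    "packages": [{"purl": "pkg:rubygems/rails", "download_url": "https://example.com/a"}],
}
expected = {
    "headers": [{"tool_name": "purlcli", "tool_version": "0.1.0"}],
    "packages": [{"purl": "pkg:rubygems/rails", "download_url": "https://example.com/b"}],
}
volatile = {"tool_version", "download_url"}
assert without_keys(actual, volatile) == without_keys(expected, volatile)
```

The trade-off is that the removed fields are no longer checked at all, so commands whose main output is the dynamic data (urls, versions) are probably better served by mocked responses.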

@johnmhoran
Member

@JonoYang @pombredanne One error I have been unable to catch or otherwise prevent is the error and traceback (interrupting the purlcli command's flow) thrown by fetchcode's package_versions.py when a query for a pkg:deb/debian/* (e.g., pkg:deb/debian/2ping) intermittently encounters a server error. Example (edited to remove additional input PURLs in the command):

(venv) Mon Mar 04, 2024 05:48 PM  /home/jmh/dev/nexb/purldb jmh (247-purlcli-update-validate-and-versions)
$ python -m purldb_toolkit.purlcli versions --purl pkg:deb/debian/2ping --output -
Error while fetching 'https://sources.debian.org/api/src/2ping': 503
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jmh/dev/nexb/purldb/purldb-toolkit/src/purldb_toolkit/purlcli.py", line 1074, in <module>
    purlcli()
  File "/home/jmh/dev/nexb/purldb/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/jmh/dev/nexb/purldb/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/jmh/dev/nexb/purldb/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jmh/dev/nexb/purldb/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jmh/dev/nexb/purldb/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/jmh/dev/nexb/purldb/purldb-toolkit/src/purldb_toolkit/purlcli.py", line 848, in get_versions
    purl_versions = get_versions_details(purls, output, file, unique, command_name)
  File "/home/jmh/dev/nexb/purldb/purldb-toolkit/src/purldb_toolkit/purlcli.py", line 955, in get_versions_details
    for package_version in list(versions(purl)):
  File "/home/jmh/dev/nexb/purldb/venv/lib/python3.10/site-packages/fetchcode/package_versions.py", line 205, in get_deb_versions_from_purl
    for release in response["versions"]:
TypeError: 'NoneType' object is not subscriptable

(venv) Mon Mar 04, 2024 05:53 PM  /home/jmh/dev/nexb/purldb jmh (247-purlcli-update-validate-and-versions)
$

I think the fix will be needed in fetchcode/package.py itself -- perhaps in the first of these 2 package_versions.py functions (below) involved in the error/traceback, adding something like the following might simply return None, which I could catch in the purlcli.py code. What do you think/suggest?

(Meanwhile I will invest no more time on this intermittent error and return to my refactoring/cleaning with the goal of converting my current draft PR #305 to review status ASAP -- that will represent the first 4 commands (metadata, urls, validate and versions) being ready for review. Subject to addressing additional comments, I'll turn to the 5th and 6th targeted commands scan and git -- see #247 (comment) above.)

    if not response:
        logger.error(f"Failed to fetch {url}")
        return

See
https://github.com/nexB/fetchcode/blob/master/src/fetchcode/package_versions.py#L189-L208
and
https://github.com/nexB/fetchcode/blob/master/src/fetchcode/package_versions.py#L512-L524
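
To make the proposed guard concrete, here is a self-contained sketch of a version generator that stops cleanly when the fetch returns nothing. get_json and deb_versions are stand-ins written for this example and do not reproduce fetchcode's real functions; only the response["versions"] access and the API URL come from the traceback above, and the "version" key inside each release is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def get_json(url):
    """Stand-in fetcher: pretend the server answered 503 and return None."""
    logger.error("Error while fetching %r: 503", url)
    return None


def deb_versions(name, fetch=get_json):
    """Yield version strings, or nothing at all if the fetch failed."""
    url = f"https://sources.debian.org/api/src/{name}"
    response = fetch(url)
    if not response:
        logger.error("Failed to fetch %s", url)
        return  # ends the generator; callers simply see no items
    for release in response["versions"]:
        yield release["version"]


print(list(deb_versions("2ping")))  # an empty list instead of a TypeError
```

Because the function is a generator, the bare return produces an empty sequence rather than None, so the calling code would check for an empty list instead of catching None.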

@JonoYang
Member

JonoYang commented Mar 5, 2024

@johnmhoran

> I think the fix will be needed in fetchcode/package.py itself -- perhaps in the first of these 2 package_versions.py functions (below) involved in the error/traceback, adding something like the following might simply return None, which I could catch in the purlcli.py code. What do you think/suggest?

That makes sense to me. @keshav-space What do you think is the best way to handle a 503 error when getting debian packages? The exception in question is in #247 (comment)

@keshav-space
Member

@johnmhoran

> I think the fix will be needed in fetchcode/package.py itself -- perhaps in the first of these 2 package_versions.py functions (below) involved in the error/traceback, adding something like the following might simply return None, which I could catch in the purlcli.py code. What do you think/suggest?
>
> That makes sense to me. @keshav-space What do you think is the best way to handle a 503 error when getting debian packages? The exception in question is in #247 (comment)

@JonoYang @johnmhoran
Strange, but I'm not able to reproduce the mentioned error locally for pkg:deb/debian/2ping.

>>> from fetchcode.package_versions import versions
>>> r=versions("pkg:deb/debian/2ping")
>>> list(r)
[PackageVersion(value='4.5-1.2', release_date=None), PackageVersion(value='4.5-1.1', release_date=None), PackageVersion(value='4.5-1', release_date=None), PackageVersion(value='4.3-1', release_date=None), PackageVersion(value='3.2.1-1+deb9u1', release_date=None), PackageVersion(value='2.1.1-1', release_date=None), PackageVersion(value='2.0-1', release_date=None)]

Nevertheless, we should be a bit more skeptical while processing the upstream data. In the event that we're unable to get proper metadata or version information, we should simply return an empty list. @johnmhoran, please enter an issue for this in fetchcode.

@johnmhoran
Member

@keshav-space @JonoYang Am I correct in thinking that for now there is nothing I can do on my purlcli end to handle this?

@johnmhoran
Member

@JonoYang @pombredanne Earlier today I pushed an update to my open PR #305 that includes substantial updates to the first four commands -- metadata, urls, validate and versions.

  • My inclination, subject to addressing comments on that PR when those come in, is to turn next to the remaining two identified commands: scan and git (initially described at #247 (comment), with supplemental issues of their own).
  • In addition, the description of the urls command also mentions updating work to be done on the packageurl library for this (purl2url).

Please let me know what priority I should assign to these two new commands and the purl2url updating (and the purl2url task needs enough detail/description that I can turn to that when the time comes).

@keshav-space
Member

> @keshav-space @JonoYang Am I correct in thinking that for now there is nothing I can do on my purlcli end to handle this?

@johnmhoran You can catch these exceptions in purlcli, but ultimately, they should be handled in fetchcode.

@johnmhoran
Member

@keshav-space Can you explain how I can catch them in purlcli? And who will take care of modifying fetchcode to catch them there? Is that falling to me? I did not write and am not familiar with the underlying code.

@keshav-space
Member

> @keshav-space Can you explain how I can catch them in purlcli? And who will take care of modifying fetchcode to catch them there? Is that falling to me? I did not write and am not familiar with the underlying code.

@johnmhoran IMO this doesn't need immediate handling, and we should not try to handle this from purlcli. We already have the issue that you entered in fetchcode, and we will fix it there later on.

@johnmhoran
Member

Ok sounds good -- thanks @keshav-space

johnmhoran added a commit that referenced this issue Mar 7, 2024
Reference: #305
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Mar 7, 2024
Reference: #305
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
johnmhoran added a commit that referenced this issue Mar 7, 2024
Reference: #305
Reference: #247

Signed-off-by: John M. Horan <[email protected]>
@johnmhoran
Member

johnmhoran commented Mar 21, 2024

Current status summary:

  • The metadata command has been implemented and tests added.
  • The validate command has been implemented and tests added.
  • The versions command has been implemented and tests added.
  • The urls command has been implemented and tests added.
  • The scan command is tracked separately in purlcli: purl2scan #277
  • The git command is tracked separately in Add git/vcs subcommand to CLI #371

@pombredanne
Member Author

This is now completed.

The main tool has been released at https://pypi.org/project/purldb-toolkit/ which is a new sub project and package for this:
https://github.com/nexB/purldb/tree/21c24f4a47a03c2f47a8661c23f5330a0ecf10ab/purldb-toolkit

This purlcli acts as a client to the REST API endpoint(s) to expose the new PURL services. It serves both as a tool and as an example of how to use the services programmatically.

We also added support for:

To back this feature, we also created a new PurlDB API endpoint that can validate a PURL. It takes a PURL, checks whether it is a valid PackageURL, and optionally checks whether the referenced package exists in the real world. v3.0.0...main#diff-56868aeeeef335bde38d62cfb44dc6e518ae311bc6bcdf298bb5e7bc73cd1afcR765
See also https://github.com/nexB/purldb/blob/21c24f4a47a03c2f47a8661c23f5330a0ecf10ab/packagedb/serializers.py#L408

We now have these sub commands implemented and tested that are clients to the REST API endpoint(s) and also reuse the other libraries (packageurl and fetchcode) extensively:

To test this feature:

  1. Install the tool with pip:
    python3 -m venv purlcli
    source purlcli/bin/activate
    pip install --upgrade pip
    pip install purldb-toolkit
  2. Then for instructions run:
    purlcli --help

See also https://github.com/nexB/purldb/blob/main/purldb-toolkit/README.rst for command line utility help

5 participants