Ecosystem license strings are not always valid SPDX license expressions #80

Open
goneall opened this issue Oct 30, 2024 · 5 comments
Labels: bug, enhancement, good first issue

@goneall
Contributor

goneall commented Oct 30, 2024

When running parlay ecosystems enrich [file], the resulting SPDX document occasionally contains a concluded license (in SPDX format) that does not validate.

From looking at the code, it looks like the license data originates from the ecosyste.ms packages API.

My guess is that ecosyste.ms is just pulling the raw data from the package metadata files and passing it along.

Some of the package managers, such as NPM, do a great job of complying with SPDX license IDs and expressions. Others, such as Maven, accept just about any string.

I'm not sure if this is an issue we should tackle in Parlay or in the upstream ecosyste.ms packages API. Since Parlay is producing SBOMs which specify SPDX license expressions (for both SPDX and CDX) and the upstream doesn't claim to comply with the standard, it may be best to fix it in this library.

Here's some code I wrote that fixes this issue downstream in a Python application for the SPDX standard.

The basic approach is to:

  • Detect if the license is valid using a license expression parser
  • If it is not valid, create a LicenseRef- or ExtractedLicenseInfo in the SBOM to capture the original string
  • Replace the concluded license with the LicenseRef-
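
A minimal sketch of the three steps above, not the linked downstream code. Everything named here is an assumption made for illustration: `KNOWN_IDS` is a tiny hand-picked subset of the SPDX license list, `is_valid_expression` is a rough stand-in for a real license expression parser (it ignores `WITH` exceptions and parenthesis balancing), and `extracted` is a plain dict standing in for the SBOM's extracted licensing info.

```python
import re

# Assumption: a tiny illustrative subset of SPDX license IDs. A real check
# would use the full SPDX license list or a proper expression parser.
KNOWN_IDS = {"MIT", "Apache-2.0", "BSD-3-Clause", "LGPL-2.1-only", "MPL-1.1"}


def is_valid_expression(text):
    """Rough check: known IDs joined by AND/OR, e.g. 'MIT OR Apache-2.0'."""
    tokens = text.replace("(", " ").replace(")", " ").split()
    if len(tokens) % 2 == 0:
        return False  # empty string or a dangling operator
    for position, token in enumerate(tokens):
        if position % 2 == 0:
            if token not in KNOWN_IDS:
                return False
        elif token not in ("AND", "OR"):
            return False
    return True


def to_license_ref(raw):
    """Build a LicenseRef- ID using only letters, digits, '.' and '-'."""
    idstring = re.sub(r"[^A-Za-z0-9.-]+", "-", raw).strip("-")
    return "LicenseRef-" + (idstring or "unknown")


def normalize_concluded_license(raw, extracted):
    """Return a valid concluded license, recording the raw string if needed."""
    if is_valid_expression(raw):
        return raw
    ref = to_license_ref(raw)
    extracted[ref] = raw  # keep the original text so nothing is lost
    return ref


extracted = {}
print(normalize_concluded_license("MIT OR Apache-2.0", extracted))
print(normalize_concluded_license("The Apache Software License, Version 2.0", extracted))
print(extracted)
```

Note that two different raw strings can collapse to the same `LicenseRef-` ID under this sanitization, so a real implementation would probably need a collision check.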

Since I have a downstream solution, no urgency on a solution - but if this is something the maintainers would like to fix, I'd be happy to help out. I'm not much of a Golang programmer (I've written less than 10 lines of code so far), but I can help with the algorithms and SPDX spec.

@mcombuechen added the bug, enhancement, and good first issue labels on Nov 5, 2024
@mcombuechen
Collaborator

Hey @goneall
thanks for pointing this out. It sounds like a straightforward fix, but really the data source (ecosyste.ms) should sanitize license identifiers. I can see how that's not necessarily feasible, though: there might always be licenses that don't fit in the official SPDX set.
As I'm a bit swamped at the moment, I won't have time to implement this myself. More likely in a few weeks.

@goneall
Contributor Author

goneall commented Nov 5, 2024

Thanks @mcombuechen for the reply - no hurry from my perspective as I have a fix downstream from Parlay.

I'd be happy to help from a review / spec perspective - since I'm very new to Golang (having written a total of 5 lines of code so far), it would probably be more work for you to review / correct my attempt at a PR than to implement it yourself ;).

@paulrosca-snyk
Contributor

Hello @goneall
For checking if the licenses are valid SPDX license IDs (i.e., part of the standard licenses recognised by SPDX) is there a preferred approach? For example, the spdx/tools-golang library does define a list of valid licenses but doesn't expose any utility function for accessing it. Would a contribution from us in that regard be welcome?
Alternatively, there's also the github/go-spdx library, which we could use to do the validation, or we could even keep a copy of the licenses list in this repo and use that.

@goneall
Contributor Author

goneall commented Dec 18, 2024

> For checking if the licenses are valid SPDX license IDs (i.e., part of the standard licenses recognised by SPDX) is there a preferred approach?

From looking at the 2 implementations, I would go with the github/go-spdx library. It looks like they are keeping the license list more current.

After looking at the spdx/tools-golang implementation, I added an issue with some suggestions on keeping the license ID list more current.

The ideal solution would be to implement the suggested solution to fetch the latest license information in the tools-golang library and use that. If you're not concerned about running Parlay without network access, fetching and parsing the SPDX licenses JSON file and SPDX exceptions JSON file may not be too much effort.
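
As a rough illustration of the fetch-and-parse idea (shown in Python rather than Go, and not taken from either library), the sketch below extracts license and exception IDs from the two SPDX JSON files. The URLs and the field names `licenses`, `licenseId`, `isDeprecatedLicenseId`, `exceptions`, and `licenseExceptionId` reflect my understanding of the published files and should be checked against the current data. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed locations of the published SPDX license list data files.
LICENSES_URL = "https://spdx.org/licenses/licenses.json"
EXCEPTIONS_URL = "https://spdx.org/licenses/exceptions.json"


def parse_license_ids(licenses_json, include_deprecated=False):
    """Return the set of license IDs found in the licenses JSON text."""
    data = json.loads(licenses_json)
    return {
        entry["licenseId"]
        for entry in data["licenses"]
        if include_deprecated or not entry.get("isDeprecatedLicenseId", False)
    }


def parse_exception_ids(exceptions_json):
    """Return the set of exception IDs found in the exceptions JSON text."""
    data = json.loads(exceptions_json)
    return {entry["licenseExceptionId"] for entry in data["exceptions"]}


def fetch_text(url, timeout=10):
    """Download a URL and decode it as UTF-8 (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline usage with a small inline sample in the same shape:
sample = json.dumps({
    "licenses": [
        {"licenseId": "MIT", "isDeprecatedLicenseId": False},
        {"licenseId": "GPL-2.0", "isDeprecatedLicenseId": True},
    ]
})
print(parse_license_ids(sample))
```

With network access, `parse_license_ids(fetch_text(LICENSES_URL))` would give the current list. A tool that must also run offline could fall back to a vendored copy of the same files.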

@pooja0805

@mcombuechen We recently used Parlay to enrich the SBOM generated by Trivy on a Cassandra container image and observed discrepancies in the license information provided for Java packages. After cross-checking with GitHub and Maven repositories, here are a few examples of the inconsistencies we found:

  • ST4: The package is licensed under BSD, but Parlay reports it as DSDP.
  • javassist: This package has three licenses: Apache 2.0, LGPL-2.1, and MPL-1.1, but Parlay outputs SSLP-1.0.
  • bcpkix-jdk15on: Licensed under MIT, but Parlay shows MirOS.

It would be very helpful if Parlay could include the URL or source of the license information. This would allow users to verify and trace back the data more easily.
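
One possible shape for this, sketched in Python on a plain dict that mimics an SPDX 2.x JSON package: the raw upstream string and where it came from are written into the package's `licenseComments` field. The function name, the note wording, and the example URL are all hypothetical. This is not something Parlay does today.

```python
def record_license_source(package, raw_license, source_url):
    """Append a provenance note to the package's licenseComments field."""
    note = f"License string {raw_license!r} reported by {source_url}"
    existing = package.get("licenseComments")
    # Preserve any comment that is already present.
    package["licenseComments"] = f"{existing}\n{note}" if existing else note
    return package


# Hypothetical package entry and source URL, for illustration only.
pkg = {"name": "javassist", "licenseConcluded": "NOASSERTION"}
record_license_source(pkg, "MPL-1.1", "https://example.invalid/packages/javassist")
print(pkg["licenseComments"])
```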

Could you please look into this issue?

Thank you!
