Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(spdx): use the hasExtractedLicensingInfos field for licenses that are not listed in the SPDX #8077

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

DmitriyLewen
Copy link
Contributor

@DmitriyLewen DmitriyLewen commented Dec 10, 2024

Description

For cases where the SPDX license list does not contain the detected license, we should use the hasExtractedLicensingInfos field.
See #7721 for more details.

Before:

➜ trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"name": "libsasl2-2",' -A 8 
      "name": "libsasl2-2",
      "SPDXID": "SPDXRef-Package-eb685f3647d2fa92",
...
      "licenseConcluded": "BSD-4-Clause AND OpenSSL AND SSLeay AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND RSA-MD AND BSD-4-clause-KTH AND text:--BSD-4-clause AND IBM-as-is AND BSD-3-clause-JANET AND BSD-3-clause-PADL AND MIT-OpenVision AND OpenLDAP AND FSFULLR AND MIT-CMU AND MIT-Export AND BSD-2.2-clause AND text:--IBM-as-is",
      "licenseDeclared": "BSD-4-Clause AND OpenSSL AND SSLeay AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND RSA-MD AND BSD-4-clause-KTH AND text:--BSD-4-clause AND IBM-as-is AND BSD-3-clause-JANET AND BSD-3-clause-PADL AND MIT-OpenVision AND OpenLDAP AND FSFULLR AND MIT-CMU AND MIT-Export AND BSD-2.2-clause AND text:--IBM-as-is",

After:

➜  ./trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"name": "libsasl2-2",' -A 8
      "name": "libsasl2-2",
      "SPDXID": "SPDXRef-Package-eb685f3647d2fa92",
...
      "licenseConcluded": "BSD-4-Clause AND OpenSSL AND LicenseRef-ea55a018594e6bf5 AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND LicenseRef-65f52c543487922d AND LicenseRef-a02a4c35e67d69da AND LicenseRef-148e27c4af509b1f AND LicenseRef-6f1acaa4d7f6c64a AND LicenseRef-f3daae6e2ead6f1 AND LicenseRef-90cfa1054f767716 AND LicenseRef-d336b55b7573f941 AND LicenseRef-7955e7acaf9b5cc3 AND LicenseRef-17a620271d638d1c AND LicenseRef-157e923866327e93 AND LicenseRef-5a72bdaf3db99c53 AND LicenseRef-d5845310a6a8e792",
      "licenseDeclared": "BSD-4-Clause AND OpenSSL AND LicenseRef-ea55a018594e6bf5 AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND LicenseRef-65f52c543487922d AND LicenseRef-a02a4c35e67d69da AND LicenseRef-148e27c4af509b1f AND LicenseRef-6f1acaa4d7f6c64a AND LicenseRef-f3daae6e2ead6f1 AND LicenseRef-90cfa1054f767716 AND LicenseRef-d336b55b7573f941 AND LicenseRef-7955e7acaf9b5cc3 AND LicenseRef-17a620271d638d1c AND LicenseRef-157e923866327e93 AND LicenseRef-5a72bdaf3db99c53 AND LicenseRef-d5845310a6a8e792",
➜  ./trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"licenseId": "LicenseRef-a02a4c35e67d69da",' -A 2
      "licenseId": "LicenseRef-a02a4c35e67d69da",
      "extractedText": "NOASSERTION",
      "name": "BSD-4-clause-KTH"
➜  ./trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"licenseId": "LicenseRef-148e27c4af509b1f",' -A 2
      "licenseId": "LicenseRef-148e27c4af509b1f",
      "extractedText": "BSD-4-clause and IBM-as-is",
      "name": "NOASSERTION"

test runs for cron action:

Related issues

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options)
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

return fmt.Sprintf("(%s)", license)
}), " AND ")

normalizedLicense, err := expression.Normalize(license, licensing.NormalizeLicense, expression.NormalizeForSPDX)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to override the expression.NormalizeForSPDX function, but we use With as a delimiter to parse licenses:

%token<token> IDENT OR AND WITH

So there are problems with overwriting licenses with the wrong exception

@DmitriyLewen
Copy link
Contributor Author

@goneall I created this PR to use the hasExtractedLicensingInfos field (as you wrote in #7716).
It would be great if you took a look if you have time

@DmitriyLewen DmitriyLewen marked this pull request as ready for review December 11, 2024 07:03
@DmitriyLewen DmitriyLewen self-assigned this Dec 11, 2024
Copy link

@goneall goneall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this code will fix the issue and provide compliance SPDX license expressions.

I had a few suggestions to produce a bit more understandable results and a few coding suggestions.

}
if text {
otherLicense.ExtractedText = license
} else {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {

I would always include the license as the license name - not just when the text is present

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second thought, this may be appropriate if the license text is the actual text of the license. In most cases, the metadata for packages includes text that are supposed to be a license name or identifier in which case it should also be in the name. If we know the license text is really the text, then the existing code is OK.

If, however, the license text is what is found in the package metadata files and they are not the actual text, I would add the same field as the name PLUS add a LicenseComment to explain - something like `otherLicense.LicenseComment = "The license text represents text found in package metadata and may not represent the full text of the license"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases (e.g. license field from python METADATA files) when we can't understand that license is name/id/text.
We tried to detect text of license:

var licenseTextKeywords = []string{
"http://",
"https://",
"(c)",
"as-is",
";",
"hereby",
"permission to use",
"permission is",
"use in source",
"use, copy, modify",
"using",
}
func isLicenseText(str string) bool {
for _, keyword := range licenseTextKeywords {
if strings.Contains(str, keyword) {
return true
}
}
return false
}

So Trivy split license name/id and license text(text:// prefix).
That is why i used both LicenseName and ExtractedText fields:
https://github.com/DmitriyLewen/trivy/blob/f851f9bb18411db838fc65e0c6c4351b04953f8f/pkg/sbom/spdx/marshal_private_test.go#L92-L112

Another question - i used NOASSERTION for LicenseName and ExtractedText fields (see link above), because these fields are mandatory (https://github.com/spdx/tools-golang/blob/f6e45fdb9e4e0c993105f798bee5f8aa8ea70f84/spdx/v2/v2_3/other_license.go#L12-L20).
Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a LicenseComment to explain - something like `otherLicense.LicenseComment = "The license text represents text found in package metadata and may not represent the full text of the license"

I liked this idea 👍
Added in 041ab21

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Trivy split license name/id and license text(text:// prefix).
That is why i used both LicenseName and ExtractedText fields

Makes sense - I wasn't sure how the text was captured. Since you are capturing the text and name distinctly, your approach should work. BTW, it's a bit tricky to find the start and end of the license text and even trickier to match it to know licenses - something we've been working on in the SPDX java libraries for about 10 years and still don't have it perfected ;).

i used NOASSERTION for LicenseName and ExtractedText fields (see link above), because these fields are mandatory (https://github.com/spdx/tools-golang/blob/f6e45fdb9e4e0c993105f798bee5f8aa8ea70f84/spdx/v2/v2_3/other_license.go#L12-L20).
Is this correct?

For the license name, this is OK. The spec isn't very clear on how these should be treated, so many people make up a name based on the text. Unlike other parts of the spec, NOASSERTION is not required if you don't know - but in this case I think it would be fine to use NOASSERTION - for the name.

For the text, I would put in whatever string was actually found - even if it is the name. The definition of the field is the license text found - so if someone puts in "This software is licensed under mylicense" - you can put that exact string in the text field even though it is not technically the text of "mylicense". I wouldn't use "NOASSERTION" for the text field.

Copy link
Contributor Author

@DmitriyLewen DmitriyLewen Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Thanks for your opinion!
Updated text field in 659f992

"VSFTPD-OPENSSL-EXCEPTION",
"WXWINDOWS-EXCEPTION-3.1",
"X11VNC-OPENSSL-EXCEPTION",
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment may deserve to be a separate suggestion, but in reading the code I would recommend building the license and exception IDs from the JSON files maintained by the SPDX legal team. The license list is updated every 3 months with new IDs and maintaining these in code can be a challenge to keep up and maintain. What I do in the code I maintain is attempt to access the current JSON files on the website https://spdx.org/licenses/licenses.json and https://spdx.org/licenses/exceptions.json. If I can not access the website or if the user specified not to use the online version, I'll use a cached version of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases when users run multiple times.
Downloading these files for each run is not good.
But we can save licenses.json and exceptions.json files in the cache dir and use them.
The files contain releaseDate field, so we can update this file only when releaseDate + 3 months has expired.

The license list is updated every 3 months

How strictly is this rule followed?

Anyway let's move this discussion into another issue/pr.


What I do in the code I maintain is attempt to access the current JSON files on the website https://spdx.org/licenses/licenses.json and https://spdx.org/licenses/exceptions.json

I found that https://spdx.org/licenses/exceptions.json and https://github.com/spdx/license-list-data/blob/592c2dcb8497c6fe829eea604045f77d3bce770b/json/exceptions.json are different (see harbour-exception).
Which file would be more correct to use?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How strictly is this rule followed?

Not very strictly. There is, however, a license list version field which is reliably incremented on release.

I found that https://spdx.org/licenses/exceptions.json and https://github.com/spdx/license-list-data/blob/592c2dcb8497c6fe829eea604045f77d3bce770b/json/exceptions.json are different (see harbour-exception).
Which file would be more correct to use?

The lists at https://spdx.org/licenses - these will always be the latest released version. The github repo master will have the latest in development version which may not be stable. The github repo is tagged with release versions, so if you go to the tag for the latest release in github, it will match what is on the website.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, update exception list in 659f992

}
return NormalizeLicense(c.Licenses)
otherLicense.LicenseIdentifier = LicenseRefPrefix + "-" + licenseID
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion - you could simplify this by including the "-" in the LicenseRefPrefix definition and just use LicenseRefPrefix + licenseID The string "LicenseRef" will likely never be used without the trailing "-".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is our internal decision - we don't use - suffixes in constants :)

Comment on lines 369 to 371
// SpdxLicenseExceptions contains all supported SPDX Exceptions
// cf. https://spdx.org/licenses/exceptions.json
// used `awk -F'"' '/"licenseExceptionId":/ {print toupper("\"" $4 "\"," )}' exceptions.json ` command
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we need to keep it up-to-date, it should be done by mage spdx or something like that. I think we should create a separate file for the list and add

// Code generated by "mage spdx", DO NOT EDIT.
// source: https://spdx.org/licenses/exceptions.json

to the header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to check exceptions.json file in tests?
I mean the same as for mage docs:generate

This will help keep the file up-to-date, but can be noisy for PRs when a new version of file is released.
On the other hand, we can add a separate action to check the file's relevance once a week.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using curl and awk through go generate is the easiest way, but some environments don't have curl, and CLI flags might be different. Ideally, we should do that in Go.

Copy link
Collaborator

@knqyf263 knqyf263 Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to check exceptions.json file in tests?

No, we don't need it for now. We can update the file when we notice that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's what i planned
this command (using awk and other commands) is just a quick way to get all exceptions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can add a separate action to check the file's relevance once a week.

This sounds better. Also, we don't need to fail the test. We can notify it on Microsoft Teams.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated PR.
Take a look, when you have time, please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix(spdx): use hasExtractedLicensingInfos for licenses not in the SPDX license list
3 participants