Introduce indexed embedded CPE dictionary #1897
This is odd... 🤔 (link)
Building snapshot artifacts
# create a config with the dist dir overridden
echo "dist: ./snapshot" > ./.tmp/goreleaser.yaml
cat .goreleaser.yaml >> ./.tmp/goreleaser.yaml
# build release snapshots
./.tmp/goreleaser release --clean --skip-publish --skip-sign --snapshot --config ./.tmp/goreleaser.yaml
make: ./.tmp/goreleaser: Command not found
make: *** [Makefile:329: snapshot] Error 127
Error: Process completed with exit code 2.
From a suggestion in the community Slack, I tried to run the quality gate locally to make sure this change wouldn't cause any regression in matching quality. I'm about 60% sure I did this correctly. 😆 I used this in tools:
- name: syft
  # note: we want to use a fixed version of syft for capturing all results (NOT "latest")
  version: v0.74.1
  produces: SBOM
  refresh: false
- name: syft
  # note: we want to use a fixed version of syft for capturing all results (NOT "latest")
  version: cpe-dict
  produces: SBOM-new
  refresh: false
- name: grype
  version: latest
  takes: SBOM
- name: grype
  version: latest
  takes: SBOM-new

And I ran the gate, which reported: Quality gate passed!
...kg/cataloger/common/cpe/dictionary/index-generator/testdata/official-cpe-dictionary_v2.3.xml
Overall looks great! Looking forward to more accurate CPEs being generated -- left one comment around adding unit tests around specific expectations during indexing.
Signed-off-by: Dan Luhring <[email protected]>
Makefile
@@ -338,7 +343,7 @@ release:
 	@.github/scripts/trigger-release.sh

 .PHONY: ci-release
-ci-release: ci-check clean-dist $(CHANGELOG)
+ci-release: ci-check clean-dist $(CHANGELOG) cpe-index
actually, generating new data during the release step isn't the right place for this. Though it answers the question of how this gets refreshed, it introduces potentially breaking changes right at the release, bypassing testing.
Yeah, I think we can just take that step out, lean on the index that's checked into the repo, and periodically regenerate what's in the repo. I've seen other projects take a similar approach with generated/embedded data. How does that sound to you?
I want to get this change in, but I'm going to take one more pass at validating the impact downstream with grype. Additionally I added one more comment about how the data is getting refreshed that will need to get addressed. I don't have a specific suggestion yet.
It looks like there wasn't sufficient sample coverage in the labeled data we use in the grype quality gate to demonstrate performance improvement. But! the gate run did show that from the existing samples, there was no degradation in performance (which is half the battle to prove). I did an ad-hoc analysis with:
where each package was selected based on its existence in the generated CPE index, there are CVEs that map to the CPE, and affected versions exist in the package repository to use for cataloging... so the chances of matching against these packages were high. Here's what I found
Which is a clear win 🎉! I deferred looking specifically at the jenkins plugins and rust crates until we can incorporate this analysis approach into the quality gate. This is slightly different from our typical analysis since we want to ignore unmatched labels (partial evaluation), which is flagged as invalid by the current gates. In the future we'll allow for images under test with both modes of analysis.
@wagoodman Thanks for the analysis. Glad it's looking all positive! In case it helps with future coverage testing, one bucket of false positives that this PR solves (and I'm really looking forward to) is matching Ruby's bundled openssl gem with OpenSSL itself. The presence of this CPE in the index — What's the next step here?
Signed-off-by: Alex Goodman <[email protected]>
nice! The packages I selected were random samples so I happened to not run across that case was all (openssl wasn't in there for certain). That should be an obvious win here 🙌 . Additionally I just updated the branch to account for how this data gets generated -- right now by a weekly workflow that would show up as a PR.
…ses it Signed-off-by: Alex Goodman <[email protected]>
great work 🚀 I added one more test to check that the generated data is wired to the function that uses it.
Thanks a million 🙇 ❤️, very excited about this one
* Introduce indexed embedded CPE dictionary
Signed-off-by: Dan Luhring <[email protected]>
* Don't generate cpe-index on make snapshot
Signed-off-by: Dan Luhring <[email protected]>
* Add unit tests for individual addEntry funcs
Signed-off-by: Dan Luhring <[email protected]>
* migrate CPE index build to go generate and add periodic workflow
Signed-off-by: Alex Goodman <[email protected]>
* add test to ensure generated cpe index is wired up to function that uses it
Signed-off-by: Alex Goodman <[email protected]>
---------
Signed-off-by: Dan Luhring <[email protected]>
Signed-off-by: Alex Goodman <[email protected]>
Co-authored-by: Alex Goodman <[email protected]>
This PR tries out a new approach to CPE generation (that was discussed briefly in Slack) in a limited capacity.
In this PR, we leverage the information in NVD's CPE Dictionary to understand what CPEs are actually defined — and, thus usable when querying NVD's CVE data — and then attempt to relate CPE dictionary entries to identifiable ecosystem packages that Syft's SCA logic detects.
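The description doesn't spell out how a dictionary entry gets related to an ecosystem package. One plausible mechanism, shown here purely as an assumption, is to inspect a reference URL attached to the dictionary entry and derive the ecosystem and package name from it. The function name `packageFromReference`, the ecosystem labels, and the host/path patterns below are all hypothetical and are not taken from the PR.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// packageFromReference is a hypothetical helper: it tries to map a reference
// URL from a CPE dictionary entry to an (ecosystem, package name) pair.
// The host and path patterns are assumptions made for illustration only.
func packageFromReference(ref string) (ecosystem, name string, ok bool) {
	u, err := url.Parse(ref)
	if err != nil {
		return "", "", false
	}
	path := strings.Trim(u.Path, "/")
	switch u.Host {
	case "www.npmjs.com":
		if strings.HasPrefix(path, "package/") {
			if rest := strings.TrimPrefix(path, "package/"); rest != "" {
				return "npm", rest, true
			}
		}
	case "rubygems.org":
		if strings.HasPrefix(path, "gems/") {
			if rest := strings.TrimPrefix(path, "gems/"); rest != "" {
				return "rubygems", rest, true
			}
		}
	}
	// Anything else is not recognized as a package registry page.
	return "", "", false
}

func main() {
	for _, ref := range []string{
		"https://www.npmjs.com/package/example-package",
		"https://rubygems.org/gems/openssl",
		"https://example.com/some/project",
	} {
		eco, name, ok := packageFromReference(ref)
		fmt.Println(ref, eco, name, ok)
	}
}
```

Entries whose references don't point at a recognizable registry page would simply be left out of the index under this scheme, which fits the "limited capacity" framing above.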
We capture this knowledge in a concise JSON file that we embed into Syft at build time. (This embedded file is currently ~95 KB.)
Finally, when Syft is adding CPEs to each package it discovered, it first looks for an entry in the embedded dictionary index and uses the CPE from the dictionary if one is available. If no entry can be found, Syft falls back to the existing CPE generation logic. The net effect is that Syft's existing generation logic is still used the majority of the time, but Syft's CPE quality greatly improves when a hit is found in the embedded dictionary.
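The lookup-then-fallback flow described above can be sketched roughly as follows. This is a minimal illustration, not Syft's actual code: the JSON layout, the `cpeIndex` type, and the `loadIndex`, `cpesFor`, and `guessCPEs` functions are all hypothetical, and a string constant stands in for the file that a real build would embed (for example with a `//go:embed` directive).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexJSON stands in for the embedded index file. The layout
// (ecosystem -> package name -> CPE) is an assumption for illustration.
const indexJSON = `{
  "ecosystems": {
    "npm": {
      "example-package": "cpe:2.3:a:example_vendor:example-package:*:*:*:*:*:node.js:*:*"
    }
  }
}`

// cpeIndex is a hypothetical in-memory form of the index.
type cpeIndex struct {
	Ecosystems map[string]map[string]string `json:"ecosystems"`
}

// loadIndex decodes the raw JSON into a cpeIndex.
func loadIndex(raw string) (cpeIndex, error) {
	var idx cpeIndex
	err := json.Unmarshal([]byte(raw), &idx)
	return idx, err
}

// guessCPEs is a placeholder for the pre-existing generation logic;
// here it only produces a single naive candidate.
func guessCPEs(name string) []string {
	return []string{fmt.Sprintf("cpe:2.3:a:%s:%s:*:*:*:*:*:*:*:*", name, name)}
}

// cpesFor prefers the dictionary entry and falls back to guessing on a miss.
func cpesFor(idx cpeIndex, ecosystem, name string) []string {
	if cpe, ok := idx.Ecosystems[ecosystem][name]; ok {
		return []string{cpe}
	}
	return guessCPEs(name)
}

func main() {
	idx, err := loadIndex(indexJSON)
	if err != nil {
		panic(err)
	}
	fmt.Println(cpesFor(idx, "npm", "example-package")) // dictionary hit
	fmt.Println(cpesFor(idx, "npm", "unknown-package")) // fallback guess
}
```

Keeping the fallback inside one function means callers don't need to know whether a CPE came from the dictionary or from a guess.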
Example
Using Chainguard's Node.js image, here's what Syft does before this change:
... and after this change:
It's important to note that in the "before" state, Syft makes several guesses at the CPE for the package, but none of its guesses is correct. With the new approach, Syft gets the CPE right on the first try. This can be confirmed by setting the CPE version to * and querying NVD's CVE data.
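For anyone repeating that check, here is a small sketch of setting the version field of a CPE 2.3 formatted string to `*`. The helper name `wildcardVersion` and the sample CPE are made up for illustration, and the colon splitting is deliberately naive: it does not handle CPEs that contain escaped colons (`\:`).

```go
package main

import (
	"fmt"
	"strings"
)

// wildcardVersion returns the CPE with its version field replaced by "*".
// A CPE 2.3 formatted string has 13 colon-separated components:
// cpe, 2.3, part, vendor, product, version, update, edition, language,
// sw_edition, target_sw, target_hw, other.
// Naive sketch: escaped colons ("\:") inside a field are not supported.
func wildcardVersion(cpe string) (string, error) {
	parts := strings.Split(cpe, ":")
	if len(parts) != 13 || parts[0] != "cpe" || parts[1] != "2.3" {
		return "", fmt.Errorf("not a CPE 2.3 formatted string: %q", cpe)
	}
	parts[5] = "*" // index 5 is the version component
	return strings.Join(parts, ":"), nil
}

func main() {
	// Hypothetical CPE used only to exercise the helper.
	in := "cpe:2.3:a:example_vendor:example-package:1.2.3:*:*:*:*:node.js:*:*"
	out, err := wildcardVersion(in)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The resulting string can then be used as the CPE match criterion when searching NVD's CVE data.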