Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce indexed embedded CPE dictionary #1897

Merged
merged 5 commits into from
Jul 21, 2023
Merged

Conversation

luhring
Copy link
Contributor

@luhring luhring commented Jun 27, 2023

This PR tries out a new approach to CPE generation (that was discussed briefly in Slack) in a limited capacity.

In this PR, we leverage the information in NVD's CPE Dictionary to understand what CPEs are actually defined — and, thus usable when querying NVD's CVE data — and then attempt to relate CPE dictionary entries to identifiable ecosystem packages that Syft's SCA logic detects.

We capture this knowledge in a concise JSON file that we embed into Syft at build time. (This embedded file is currently ~95 KB.)

Finally, when Syft is adding CPEs to each package it discovered, it first looks for entries in the embedded dictionary index, and uses the CPE from the dictionary if available. If no entry can be found, Syft falls back to today's existing CPE generation logic. The net effect is that Syft's existing generation logic is still used the majority of the time, but Syft's CPE quality greatly improves when a hit is found in the embedded dictionary.

Example

Using Chainguard's Node.js image, here's what Syft does before this change:

$ syft cgr.dev/chainguard/node -qo json | jq '.artifacts[] | select(.name == "ansi-regex").cpes'
[
  "cpe:2.3:a:ansi-regex:ansi-regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi-regex:ansi_regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi_regex:ansi-regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi_regex:ansi_regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi:ansi-regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi:ansi_regex:5.0.1:*:*:*:*:*:*:*"
]

... and after this change:

$ go run ./cmd/syft cgr.dev/chainguard/node -qo json | jq '.artifacts[] | select(.name == "ansi-regex").cpes'
[
  "cpe:2.3:a:ansi-regex_project:ansi-regex:5.0.1:*:*:*:*:node.js:*:*"
]

It's important to note that in the "before" state, Syft makes several guesses at the CPE for the package, but none of its guesses are correct. And with the new approach, Syft gets the CPE right in the first try. This can be confirmed by setting the CPE version to * and querying NVD's CVE data.

@luhring
Copy link
Contributor Author

luhring commented Jul 9, 2023

This is odd... 🤔 (link)

Building snapshot artifacts�
# create a config with the dist dir overridden
echo "dist: ./snapshot" > ./.tmp/goreleaser.yaml
cat .goreleaser.yaml >> ./.tmp/goreleaser.yaml
# build release snapshots
./.tmp/goreleaser release --clean --skip-publish --skip-sign --snapshot --config ./.tmp/goreleaser.yaml
make: ./.tmp/goreleaser: Command not found
make: *** [Makefile:329: snapshot] Error [12](https://github.com/anchore/syft/actions/runs/5497306836/jobs/10017970119?pr=1897#step:4:13)7
Error: Process completed with exit code 2.

@luhring
Copy link
Contributor Author

luhring commented Jul 9, 2023

From a suggestion in the community Slack, I tried to run the quality gate locally to make sure this change wouldn't cause any regression in matching quality. I'm about 60% sure I did this correctly. 😆

I used this in .yardstick.yaml in Grype:

tools:

  - name: syft
    # note: we want to use a fixed version of syft for capturing all results (NOT "latest")
    version: v0.74.1
    produces: SBOM
    refresh: false

  - name: syft
    # note: we want to use a fixed version of syft for capturing all results (NOT "latest")
    version: cpe-dict
    produces: SBOM-new
    refresh: false

  - name: grype
    version: latest
    takes: SBOM

  - name: grype
    version: latest
    takes: SBOM-new

And I ran make capture and then make validate. I didn't notice any differences in the output, and got:

Quality gate passed!

@luhring luhring marked this pull request as ready for review July 9, 2023 23:14
Copy link
Contributor

@wagoodman wagoodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great! Looking forward to more accurate CPEs being generated -- left one comment around adding unit tests around specific expectations during indexing.

Makefile Outdated
@@ -338,7 +343,7 @@ release:
@.github/scripts/trigger-release.sh

.PHONY: ci-release
ci-release: ci-check clean-dist $(CHANGELOG)
ci-release: ci-check clean-dist $(CHANGELOG) cpe-index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, generating new data during the release step isn't the right place for this. Though it answers the question of how this gets refreshed, it introduces potentially breaking changes right at the release, bypassing testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can just take that step out, lean on the index that's checked into the repo, and periodically regenerate what's in the repo. I've seen other projects take a similar approach with generated/embedded data. How does that sound to you?

Copy link
Contributor

@wagoodman wagoodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to get this change in, but I'm going to take one more pass at validating the impact downstream with grype. Additionally I added one more comment about how the data is getting refreshed that will need to get addressed. I don't have a specific suggestion yet.

@wagoodman wagoodman self-assigned this Jul 20, 2023
@wagoodman
Copy link
Contributor

wagoodman commented Jul 21, 2023

It looks like there wasn't sufficient sample coverage in the labeled data we use in the grype quality gate to demonstrate performance improvement. But! the gate run did show that from the existing samples, there was no degradation in performance (which is half the battle to prove).

I did an ad-hoc analysis with:

  • python packages: 22
  • npm packages: 50
  • gem packages: 31

where each package was selected based on it's existence in the generated CPE index, there are CVEs that map to the CPE, and affected versions exist in the package repository to use for cataloging... so chances for matching against these packages was high. Here's what I found

  • Additional TPs found due to cpe fix: 14
  • Additional FNs due to cpe fix: 0
  • FPs eliminated due to cpe fix: 0
  • FPs introduced due to cpe fix: 0

Which is a clear win 🎉 ! I deferred looking specifically at the jenkins plugins and rust crates until we can incorporate this analysis approach into the quality gate. This is slightly different than our typical analysis since we want to ignore unmatched labels (partial evaluation) which is flagged as invalid by the current gates. In the future we'll allow for images under test with both modes of analysis.

@luhring
Copy link
Contributor Author

luhring commented Jul 21, 2023

@wagoodman Thanks for the analysis. Glad it's looking all positive!

In case it helps with future coverage testing, one bucket of false positives that this PR solves (and I'm really looking forward to) is matching Ruby's bundled openssl gem with OpenSSL itself. The presence of this CPE in the index — "cpe:2.3:a:ruby-lang:openssl:*:*:*:*:*:*:*:*" — saves Grype from incorrectly matching the gem to loads of OpenSSL vulns that don't at all affect Ruby's wrapper.

What's the next step here?

@wagoodman
Copy link
Contributor

wagoodman commented Jul 21, 2023

one bucket of false positives that this PR solves is matching Ruby's bundled openssl gem with OpenSSL itself

nice! The packages I selected were random samples so I happened to not run across that case was all (openssl wasn't in there for certain). That should be an obvious win here 🙌 .

Additionally I just updated the branch to account for how this data gets generated -- right now by a weekly workflow that would show up as a PR.

Let me try out that workflow as a sanity check then I think [edit: I forgot this is a fork, I'll test it on the first run this monday] this is ready to be merged.

Copy link
Contributor

@wagoodman wagoodman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work 🚀 I added one more test to check that the generated data is wired to the function that uses it.

@wagoodman wagoodman enabled auto-merge (squash) July 21, 2023 13:50
@wagoodman wagoodman merged commit 99d172f into anchore:main Jul 21, 2023
@luhring luhring deleted the cpe-dict branch July 21, 2023 14:25
@luhring
Copy link
Contributor Author

luhring commented Jul 21, 2023

Thanks a million 🙇 ❤️ — very excited about this one

@kzantow kzantow added the enhancement New feature or request label Jul 31, 2023
GijsCalis pushed a commit to GijsCalis/syft that referenced this pull request Feb 19, 2024
* Introduce indexed embedded CPE dictionary

Signed-off-by: Dan Luhring <[email protected]>

* Don't generate cpe-index on make snapshot

Signed-off-by: Dan Luhring <[email protected]>

* Add unit tests for individual addEntry funcs

Signed-off-by: Dan Luhring <[email protected]>

* migrate CPE index build to go generate and add periodic workflow

Signed-off-by: Alex Goodman <[email protected]>

* add test to ensure generated cpe index is wired up to function that uses it

Signed-off-by: Alex Goodman <[email protected]>

---------

Signed-off-by: Dan Luhring <[email protected]>
Signed-off-by: Alex Goodman <[email protected]>
Co-authored-by: Alex Goodman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants