Similar Packages Should Be Aggregated #1162

Open
cpendery opened this issue Aug 16, 2022 · 3 comments
Labels: bug, format:cyclonedx


@cpendery
Contributor

What happened:
Producing an SBOM with Syft creates near-duplicate packages: they are derived from different sources, but all other information about the dependency is the same. We see this primarily with the java-cataloger.

Ex:

  1. A pom.xml and an archive file exist for the same package. Scanning these yields duplicates for every dependency in it, even though it's the same package, because the entries have different sources (path, virtualPath).
  2. Two archive files exist, each for a different package, and the two packages share a dependency. Scanning these yields duplicates for the shared dependency, since it is produced once per archive, but again the only difference is the sources (path, virtualPath).

What you expected to happen:
I'd expect these packages in the SBOM to be aggregated into a single object where all the possible sources are included as an array, rather than having multiple copies of the same package where the source is the only difference.

How to reproduce it (as minimally and precisely as possible):

docker pull elasticsearch:8.3.3 && syft docker:elasticsearch:8.3.3 -o cyclonedx-json | jq '.components[].purl' | grep 'pkg:maven/org.slf4j/[email protected]'

This shows that the same purl appears multiple times. Looking at the full SBOM output shows separate items that are identical except for some metadata attributes, which could be combined.
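To illustrate the kind of aggregation being asked for, here is a minimal post-processing sketch in Python (not Syft's implementation). It groups CycloneDX components by purl and collects the distinct paths for each. The function name `aggregate_by_purl`, the sample purls, and the property name `syft:metadata:virtualPath` are assumptions made for illustration; check the property names in your own output before relying on them.

```python
def aggregate_by_purl(components, path_property="syft:metadata:virtualPath"):
    """Map each purl to the list of distinct path values seen for it.

    `path_property` is an assumed CycloneDX property name; adjust it to
    whatever your SBOM actually contains.
    """
    merged = {}
    for component in components:
        purl = component.get("purl")
        if not purl:
            continue  # components without a purl cannot be grouped this way
        paths = merged.setdefault(purl, [])
        for prop in component.get("properties", []):
            if prop.get("name") == path_property:
                value = prop.get("value")
                if value not in paths:
                    paths.append(value)
    return merged


# Hypothetical components shaped like entries of a CycloneDX "components" array.
sample_components = [
    {
        "purl": "pkg:maven/org.example/demo@1.0.0",
        "properties": [
            {"name": "syft:metadata:virtualPath", "value": "/app/a.jar:demo"}
        ],
    },
    {
        "purl": "pkg:maven/org.example/demo@1.0.0",
        "properties": [
            {"name": "syft:metadata:virtualPath", "value": "/app/b.jar:demo"}
        ],
    },
    {"purl": "pkg:maven/org.example/other@2.0.0", "properties": []},
]

aggregated = aggregate_by_purl(sample_components)
```

On a real SBOM, the input would be the `components` array loaded from the JSON file produced by the command above.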

Anything else we need to know?:

Environment:

  • Output of syft version: v0.53.2
  • OS (e.g: cat /etc/os-release or similar): MacOS

cpendery added the bug label Aug 16, 2022
@spiffcs
Contributor

spiffcs commented Aug 16, 2022

Wow, great find, thanks so much @cpendery!

@wagoodman
Contributor

wagoodman commented Oct 12, 2022

Changing VirtualPath on the java metadata from a string to a slice of strings is very tempting. However, doing this would be inconsistent with how we merge packages today (which is by considering location). I was looking forward to the overall paradigm change in your PR #1249, but after more consideration I don't think we should make this change (see #1249 (comment) for more details).

Can I ask more about the use cases that motivated package merging? Maybe we can find a good workaround for your needs (for instance syft docker:elasticsearch:8.3.3 -o cyclonedx-json | jq '[.components[].purl] | unique', though I think that this example was more to illustrate the symptoms and less about a specific need to get unique pURLs).

@aaa912

aaa912 commented Oct 20, 2022

@wagoodman, thanks for the update and additional comments. Following up on the initial issue from @cpendery: the use case here is to eliminate what appear to be duplicate vulnerabilities (or at least very similar vulnerabilities) being reported after running grype on the SBOM generated by syft. The pURL, vulnerability id and data source are all the same, and the only difference is the virtual path.

With the original example (using syft 0.59.0 and grype 0.51.0):
docker pull elasticsearch:8.3.3 && syft docker:elasticsearch:8.3.3 -o cyclonedx-json > syft_output.json

Looking at the syft output (and choosing an example package), we can see that the same pURL pkg:maven/org.apache.httpcomponents/[email protected] shows up 5 times with different virtual paths.

Then we can run grype:
grype sbom:./syft_output.json -o json > grype_output.json

Using some very rough jq, we can try to count some of the vulnerabilities that appear to be duplicates (except for the virtual path). For the example pURL pkg:maven/org.apache.httpcomponents/[email protected], each vulnerability shows up 5 times from the nvd data source and 5 times from the GitHub data source, but we would like it to show up only once:
cat grype_output.json | jq ".matches[] | {artifactPurl: .artifact.purl, artifactArchiveDigests: .artifact.metadata.archiveDigests, vulnerabilityId: .vulnerability.id, vulnerabilityDataSource: .vulnerability.dataSource}" | jq -s "group_by(.artifactPurl, .artifactArchiveDigests, .vulnerabilityId, .vulnerabilityDataSource) | map(.[]+{count: length}) | unique_by(.artifactPurl, .artifactArchiveDigests, .vulnerabilityId, .vulnerabilityDataSource)"

As you mentioned, it is possible to filter out some of the apparent duplicates before or after running syft or grype. Still, it would be very nice to get the vulnerability and the package (along with the multiple virtual paths where the package is used) as part of the output without additional processing or merging, for this case where the pURL (and almost everything else) is the same and the only difference is the virtual path.
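As a rough illustration of the output shape described here, the following Python sketch merges grype matches that share a pURL, vulnerability id and data source, and keeps the distinct virtual paths in a list. It is a post-processing workaround under stated assumptions, not how grype or syft behave. The fields `artifact.purl`, `vulnerability.id` and `vulnerability.dataSource` are the ones used in the jq command above; the location of the virtual path at `artifact.metadata.virtualPath`, the function name `merge_matches`, and all sample values are assumptions.

```python
def merge_matches(matches):
    """Collapse matches that differ only by virtual path into one entry each."""
    merged = {}
    for match in matches:
        artifact = match.get("artifact", {})
        vulnerability = match.get("vulnerability", {})
        key = (
            artifact.get("purl"),
            vulnerability.get("id"),
            vulnerability.get("dataSource"),
        )
        entry = merged.setdefault(
            key,
            {
                "purl": key[0],
                "vulnerabilityId": key[1],
                "dataSource": key[2],
                "virtualPaths": [],
            },
        )
        # Assumed field location; adjust if your grype output differs.
        path = (artifact.get("metadata") or {}).get("virtualPath")
        if path and path not in entry["virtualPaths"]:
            entry["virtualPaths"].append(path)
    return list(merged.values())


# Hypothetical matches shaped like entries of grype's "matches" array.
sample_matches = [
    {
        "artifact": {
            "purl": "pkg:maven/org.example/demo@1.0.0",
            "metadata": {"virtualPath": "/app/a.jar:demo"},
        },
        "vulnerability": {"id": "CVE-0000-0001", "dataSource": "https://example.test/1"},
    },
    {
        "artifact": {
            "purl": "pkg:maven/org.example/demo@1.0.0",
            "metadata": {"virtualPath": "/app/b.jar:demo"},
        },
        "vulnerability": {"id": "CVE-0000-0001", "dataSource": "https://example.test/1"},
    },
]

merged_matches = merge_matches(sample_matches)
```

Grouping on the data source as well keeps the nvd and GitHub records separate, so each would appear once rather than once per virtual path.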

Will try to follow up on this in the next community meeting.

Projects
Status: Backlog
5 participants