-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only include rules with diagnostics in SARIF metadata #13268
Only include rules with diagnostics in SARIF metadata #13268
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the follow up.
I'm not sure if reducing the rules to the reported tools is the right behavior. According to the spec
A toolComponent object MAY contain a property named rules whose value is an array of zero or more unique (§3.7.3) reportingDescriptor objects (§3.49) each of which provides information about an analysis rule supported by the tool component.
The way I understand this is that the rules
table should list all rules supported by the tool and not just rules seen in this analysis run. Do you have a resource that states if it is okay for the rules table to only include the reported rules?
@MichaReiser Thanks for the reply! I have found two resources that might be relevant:
To summarize, including only the rules relevant to the results or including all rules is okay, but IMO the latter would result in a large output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. These two sections are extremely helpful.
From the Appendix E:
Metadata should be included in a log file in the following circumstances:
Our challenge is that we can't make assumptions about the use case and I don't fully understand the following scenario:
The log file is intended to be uploaded to a result management system which requires information about every rule specified by every result, and which might not have prior knowledge of the rules specified by the results in this log file.
What do you make of this scenario?
Either way. I'm slightly leaning towards moving forward with this until someone comes up with a use case where they need all metadata.
@@ -27,6 +28,13 @@ impl Emitter for SarifEmitter { | |||
.map(SarifResult::from_message) | |||
.collect::<Result<Vec<_>>>()?; | |||
|
|||
let mut rule_ids = HashSet::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simplify this to
let mut rule_ids = HashSet::new(); | |
let unique_rules: HashSet<_> = results.iter().filter_map(|result| result.rule).collect(); | |
let mut rules: Vec<SarifRule> = unique_rules.into_iter().map(SarifRule::from).collect(); | |
rules.sort_by(|a, b| a.code.cmp(&b.code)); |
and then use rules
on line 46. This avoids allocating unnecessary String
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is fantastic!
(This PR is my second time writing Rust code. Learned a lot!)
To answer this question, let me analyze the second section (from spec) step by step.
Regarding "display metadata related to each result", my understanding is that rules are primarily prepared for the results.
From "requires information about every rule specified by every result", again, I think rules are mainly used as additional details about the corresponding results.
I'm currently using Ruff through MegaLinter, which provides a webhook mechanism to send the SARIF output. Therefore, I'm concerned about the size of the HTTP request body (of the webhook). |
I wouldn't be too concerned about that. I would suspect that the JSON file, even with all rule metadata, is still less than 1 MB, and HTTP compression should bring that down to 100 KB. |
Given this code snippet: # test.py
import os
def greet():
print('hello {name}'.format()) I just performed a quick test: # Using the official Ruff
$ ruff check --output-format=sarif test.py > official_result.json
$ du -sh official_result.json
2.0M official_result.json
# Using my local Ruff
$ ./target/debug/ruff check --output-format=sarif test.py > local_result.json
$ du -sh local_result.json
8.0K local_result.json Overall, I agree with you that the size of the request body is not a big deal (although smaller is perhaps better). To my understanding of the spec, as described above, I think the rules are primarily used in conjunction with results. I'm not sure if there are scenarios where all rules are needed. |
Thanks for making this contribution. This does sound reasonable to me and we can always revert back to the previous behavior if there's an actual use case for it. |
## Summary Follow-up from #13268, this PR updates the test case to use `assert_snapshot` now that the output is limited to only include the rules with diagnostics. ## Test Plan `cargo insta test`
This PR can be considered as a follow-up of #13180.
Motivation
The current Ruff's SARIF output contains all the 800+ rules, the vast majority of which have no relevance to the errors in results. IMO, this results in a large output and a lot of noise.
Proposal
Per the SARIF example, other linters such as ESLint only emits rules relevant to the errors in results. I think this could be an approach worth considering.