-
-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add lint backend to run semgrep #18593
Conversation
Thanks for writing this up!
Based on your description: yes, I think that it does make sense to have a
Both of these come come down (I think) to whether the
With approach 1, the answer to you first question ("Relatedly") would be that the On the other hand, with approach 2,
I think that because of this point (Pants already applying semantic meaning to files), and to avoid having multiple owning targets for files, I'd lean toward approach 2: inferred dependencies on actual Pants targets.
We can definitely punch a hole in the default excludes for this (I think that a |
Thanks for the input. I'll look at it more detail over the next week or so (pretty busy with other things, unfortunately). |
I've been chipping away at this over the last little while. I've pushed a work-in-progress state as a 'back-up', still need to work through tests and the issues they discover, and a few other "small" issues (in the sense of having done 80% of the work, so only got the remaining 80% to go). It's not ready for any sort of review. |
Let me know if you want further effort to make this easier to review: I'm hoping I've structured this as a 'standard' plugin, with the |
Okay, I've simplified this a lot: the new target types are unnecessary for a first attempt. This now just finds config files by globbing (similar to other linting backends), and runs the linting. The diff still looks big, but most of that is the lock file, and more than half the rest is tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good first start. I may not get the semgrep-ness of it all, but the Pants aspects look good.
Honestly, I think we should start really considering outsourcing plugins outside of core "Pants", and this would be perfect, but I also know we're just not "there" yet.
# addition/exclusion), but that can only infer 'full' dependencies and it is wrong (e.g. JVM | ||
# things break) for real code files to depend on this sort of non-code linter config; requires | ||
# dependency scopes or similar (https://github.com/pantsbuild/pants/issues/12794) | ||
spec = Path(address.spec_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be PurePath
? Generally we prefer the Pure
variants since rules shouldn't be doing file I/O
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this got resolved?
It's the only one I feel strongly about.
Admittedly we should issue a lint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, dear, sorry, I fixed but pushed to the wrong branch! Thanks for flagging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Admittedly we should issue a lint.
@@ -68,9 +69,12 @@ def create( | |||
*, | |||
strip_chroot_path: bool = False, | |||
report: Digest = EMPTY_DIGEST, | |||
strip_formatting: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"strip formatting" doesn't really imply stripping colors to me.
Call a spade a spade? strip_colors
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the name of the arg we use elsewhere too:
strip_formatting: bool = False, |
I'd prefer to keep the current pattern for local consistency, and you can rename globally as separate work if you feel strongly about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(In addition, I think the colours.strip_color
function strips other terminal formatting escape codes, like bold and underline too, despite its name.)
name = "Semgrep" | ||
options_scope = "semgrep" | ||
help = softwrap( | ||
"""\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to escape the newline for softwrap'ed long strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Accidentally pushed to the wrong branch. Fixed in #19685)
5 months later, I'm going to merge this. Thanks everyone for helping me work through this, and for encouraging cutting it down to be minimal. It's |
Two fixes from #18593 that I pushed to the wrong branch, and didn't notice before merging that PR: - Use `PurePath` instead of `Path`: #18593 (comment) - Avoid escaping the newline for `softwrap`: #18593 (comment)
This refreshes the new semgrep backend (added in #18593, in 2.19) to use the current latest published version (`1.46.0`, instead of `1.37.0`). The newer version of Semgrep is slightly stricter about the formatting of the global settings file, so this inserts a dummy `anonymous_user_id` value.
…21126) Three things: Allow configuring the config location; because of how the digest is built, running `semgrep` via pants doesn't actually allow you to configure this by passing `--config`, like the tool does. There is now a `config_name` option which will be auto-discovered and hierarchically applied to each relevant partition. This will potentially evolve into a `config_paths` option which will allow users to specify a list of globs. `semgrep` searches for rules recursively within a rules directory, so reflect that in the globs. This isn't obvious in the documentation, but it's the way it behaves, and also is explicit in the code. https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/cli/src/semgrep/config_resolver.py#L402 Finally, I flattened the glob for `.semgrepignore`. Currently, the glob recursively matches all matching file names, but ultimately `semgrep` just reads only what's in the current dir. In my testing of pants+semgrep I did indeed notice the same behavior, that the `.semgrepignore`s in lower dirs were not being respected. An integration test confirmed the same.[1] Supporting multi-level `.semgrepignores` is going to be a significant effort, since it would involve actually parsing and interpreting the config files, and also navigating the pitfalls of managing overlays of ignore patterns. Instead, I chose to force it to being a singular file path. Here're some historical discussions about .semgrepignore: huonw#1, #18593 (comment) https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/src/targeting/Semgrepignore.ml#L124 https://semgrep.dev/docs/ignoring-files-folders-code#reference-summary [1] The integration test which proves hierarchical semgrepignore doesn't work: ```py def test_semgrepignore_not_partitioned(rule_runner: RuleRunner) -> None: rule_runner.write_files({ f"{DIR}/BUILD": "", f"{DIR}/.semgrep.yml": RULES, f"{DIR}/.semgrepignore": FILE, f"{DIR}/a/.semgrepignore": FILE, f"{DIR}/a/BUILD": SINGLE_FILE_BUILD, f"{DIR}/a/{FILE}": BAD_FILE, }) tgt = rule_runner.get_target(Address(f"{DIR}/a", target_name="f")) # run with these two semgrepignores in place results = run_semgrep(rule_runner, [tgt]) assert len(results) == 1 result = results[0] # doesn't get ignored assert "Ran 1 rule on 1 file: 1 finding" in result.stderr # run with a root semgrepignore rule_runner.write_files({ ".semgrepignore": FILE, }) results = run_semgrep(rule_runner, [tgt]) assert len(results) == 1 result = results[0] # finally, it's ignored assert "Ran 1 rule on 0 files: 0 findings" in result.stderr ``` --------- Co-authored-by: Huon Wilson <[email protected]>
This adds an
pants.backend.experimental.tools.semgrep
backend that runs the Semgrep (https://semgrep.dev) OSS engine (https://semgrep.dev/docs/getting-started/) as a linter.Background
Semgrep is a multilanguage linter, that allows defining in-repo per-language rules similar to a PyLint plugin for Python, but a source-like syntax for defining the rules (rather than manual AST traversal). It has explicit support for many languages, and even a
generic
language that works on any (text) file.We use this for a whole bunch of checks like:
logging.info(...)
in Python, rather than using the appropriate method on alogging.Logger
instanceA random open source example: https://github.com/aiven/terraform-provider-aiven/blob/ae42e99db2ee1e56a327df7bd6d8465996951842/.semgrep.yml
What's implemented
The backend implemented here is minimal, but enough to be functional:
pants lint --only=semgrep ...
in our repo ~identical to running it externally (except for more caching). Here's some features:.semgrep.yml
,.semgrep.yaml
, or in a.semgrep/
directorysource
, relying on Semgrep's own file walking to apply the right rules to the right files within each sandbox.a/b/c/some_file.py
will be linted using all of the rules configured in//
,a/
,a/b/
anda/b/c/
, if available)..semgrepignore
(at the top level) is obeyed (NB. there's noskip_semgrep
per-target field).--semgrep-force
for disabling caching: Semgrep supports downloading and running external rules (e.g.semgrep -c p/python
includes a bunch of Python-specific rules). However, these are mutable and cannot be pinned, so caching results may be misleading. With this flag,pants lint --only=semgrep --semgrep-force --semgrep-args='-c=p/python' ::
will do a whole fresh run, as desired.Limitations/future work
There's various bits of potential improvements that I already know about, but I think that this backend is useful even without them, and that they can be iterated on.
Rules can have auto-fix-ups configured, and applied via
semgrep scan --autofix ...
, which would fit neatly intopants fix ...
. I started writing this before Ruff integration in pantsbuild only fails on auto-fixable errors #18430 was fixed, so this doesn't support that.More precise identification of which files need to be run against which rules, to increase cache reuse. Currently, every single file that's a descendent of a directory with a
semgrep_rule_source()
target is put into a semgrep partition/sandbox, but this could be tightened by parsing/doing dependency inference on the actual rules:python
rules, there's no need to run it against Go/PHP/yaml/... files in that directory..semgrepignore
file)This doesn't include support for testing rules: https://semgrep.dev/docs/writing-rules/testing-rules/ (this may require rule YAML files to have explicit targets)
Ignoring/disabling semgrep for specific files/targets works okay, but is a bit awkward:
.semgrepignore
files, but only the one in the root is used (Semgrep doesn't support nested.semgrepignore
files:.semgrepignore
not processed outside of working directory semgrep/semgrep#5669)skip_semgrep
field similar toskip_pylint
etc., but this would need to be added to every target with a source field, some howThis appears to be significantly slower than running semgrep outside of pants. For instance, the table below shows that the time for running within pants (13 invocations of semgrep linting subsets of the repo) is significantly slower than running once on all files outside of pants. Part of this is set-up time of semgrep (each invocation takes at least 0.6s just to start semgrep), but that doesn't seem to explain it all. However, semgrep is fast enough that I think this is fine for now, and the finer-grained caching will recover this time for very large repos.
-j1
-j8
(* The pants timing is estimated by checking
--stats-log -ldebug
logs: I'm usingPANTS_SOURCE=.... pants ...
which doesn't usepantsd
, and thus using the raw timings would be unfairly penalising pants set-up time that would disappear.)Depending on where this PR goes, I'll file follow-up issues for some of these.