Add lint backend to run semgrep #18593

huonw · 2023-03-26T06:20:12Z

This adds an pants.backend.experimental.tools.semgrep backend that runs the Semgrep (https://semgrep.dev) OSS engine (https://semgrep.dev/docs/getting-started/) as a linter.

Background

Semgrep is a multilanguage linter, that allows defining in-repo per-language rules similar to a PyLint plugin for Python, but a source-like syntax for defining the rules (rather than manual AST traversal). It has explicit support for many languages, and even a generic language that works on any (text) file.

We use this for a whole bunch of checks like:

every GitHub Actions job has a timeout configured in the workflow YAML file
erroring if we call logging.info(...) in Python, rather than using the appropriate method on a logging.Logger instance

A random open source example: https://github.com/aiven/terraform-provider-aiven/blob/ae42e99db2ee1e56a327df7bd6d8465996951842/.semgrep.yml

What's implemented

The backend implemented here is minimal, but enough to be functional: pants lint --only=semgrep ... in our repo ~identical to running it externally (except for more caching). Here's some features:

Automatic configuration finding, by looking for files called .semgrep.yml, .semgrep.yaml, or in a .semgrep/ directory
Flexible 'targetless'-ness (kinda):
- semgrep is invoked against all targets with a source, relying on Semgrep's own file walking to apply the right rules to the right files within each sandbox.
Multiple/scoped configuration:
- Each target is linted using all the configurations in its ancestor directories (a/b/c/some_file.py will be linted using all of the rules configured in //, a/, a/b/ and a/b/c/, if available).
- Targets are partitioned by the sets of rules that apply to them.
Ignore support: .semgrepignore (at the top level) is obeyed (NB. there's no skip_semgrep per-target field).
--semgrep-force for disabling caching: Semgrep supports downloading and running external rules (e.g. semgrep -c p/python includes a bunch of Python-specific rules). However, these are mutable and cannot be pinned, so caching results may be misleading. With this flag, pants lint --only=semgrep --semgrep-force --semgrep-args='-c=p/python' :: will do a whole fresh run, as desired.

Limitations/future work

There's various bits of potential improvements that I already know about, but I think that this backend is useful even without them, and that they can be iterated on.

Rules can have auto-fix-ups configured, and applied via semgrep scan --autofix ..., which would fit neatly into pants fix .... I started writing this before Ruff integration in pantsbuild only fails on auto-fixable errors #18430 was fixed, so this doesn't support that.
More precise identification of which files need to be run against which rules, to increase cache reuse. Currently, every single file that's a descendent of a directory with a semgrep_rule_source() target is put into a semgrep partition/sandbox, but this could be tightened by parsing/doing dependency inference on the actual rules:
- by language: if a directory only has python rules, there's no need to run it against Go/PHP/yaml/... files in that directory.
- by path: rules can define inclusion/exclusion rules based on file paths (and similarly the global ignores implied by a .semgrepignore file)
This doesn't include support for testing rules: https://semgrep.dev/docs/writing-rules/testing-rules/ (this may require rule YAML files to have explicit targets)
Ignoring/disabling semgrep for specific files/targets works okay, but is a bit awkward:
- currently this rule makes sure to include any .semgrepignore files, but only the one in the root is used (Semgrep doesn't support nested .semgrepignore files: .semgrepignore not processed outside of working directory semgrep/semgrep#5669)
- it'd be good to have a skip_semgrep field similar to skip_pylint etc., but this would need to be added to every target with a source field, some how
This appears to be significantly slower than running semgrep outside of pants. For instance, the table below shows that the time for running within pants (13 invocations of semgrep linting subsets of the repo) is significantly slower than running once on all files outside of pants. Part of this is set-up time of semgrep (each invocation takes at least 0.6s just to start semgrep), but that doesn't seem to explain it all. However, semgrep is fast enough that I think this is fine for now, and the finer-grained caching will recover this time for very large repos.

method real CPU

pants* 4s? 24s?

standalone -j1 2.5s 2.5s

standalone -j8 1.6s 4.4s

(* The pants timing is estimated by checking --stats-log -ldebug logs: I'm using PANTS_SOURCE=.... pants ... which doesn't use pantsd, and thus using the raw timings would be unfairly penalising pants set-up time that would disappear.)

Depending on where this PR goes, I'll file follow-up issues for some of these.

stuhood · 2023-03-29T03:47:03Z

Thanks for writing this up!

Does it make sense to have tool configuration as explicit targets? This generally doesn't seem to be how other linters/tools work (e.g. black/mypy/... configuration in pyproject.toml doesn't require an explicit target).

Based on your description: yes, I think that it does make sense to have a semgrep_rule_source target type, which could be tailored for the matching file types.

Relatedly, is it okay for pants lint path/to/dir:: to run against all targets within path/to/dir, but using the configuration semgrep_rule_sources targets from the parent directories (e.g. in path/ or path/to)?

Semgrep supports many languages (https://semgrep.dev/docs/supported-languages/), so it seems rather tedious to define a set of rules for "run Semgrep against Python", "run Semgrep against Go", "run Semgrep against YAML", .... It looks like semgrep reasons in terms of file extensions (https://semgrep.dev/docs/writing-rules/rule-syntax/#language-extensions-and-tags), so maybe just PathGlobs(..)/FilespecMatcher-style globbing to match, is appropriate...

Both of these come come down (I think) to whether the semgrep_rule_source target:

directly uses globs without trying to find file owners, or
infers dependencies on the Pants targets which own files.

With approach 1, the answer to you first question ("Relatedly") would be that the semgrep_rule_source target in the parent directory would be picked up and run. If the semgrep_rule_source target was actually a target generator (semgrep_rule_sources), then it could run for only the files in the child directories.

On the other hand, with approach 2, pants lint path/to/dir:: would do nothing to discover a dependent target in path/ or path/to: you would need to use pants --changed-dependents=.. lint, or some other dependent-walking step to find them.

that feels a bit weird, e.g. a files() target that pulls in a .py file would be semgrep'd as Python, even though pants doesn't agree (and vice versa a python_source(source="foo.secret_python_extension") file would not be semgrep'd as Python). I don't think this is a major issue, but might cause confusion. Thoughts?

NB. Other 'languages' supported by semgrep aren't necessarily natively supported by pants, but could still be sensibly linted (e.g. a .php file could be imported by files()). language: generic rules apply to all files (mediated by any path inclusion/exclusions): https://semgrep.dev/docs/writing-rules/generic-pattern-matching/

I think that because of this point (Pants already applying semantic meaning to files), and to avoid having multiple owning targets for files, I'd lean toward approach 2: inferred dependencies on actual Pants targets.

How to manage the default config .semgrep/ directory vs. pants_ignore's default .*/ value?

We can definitely punch a hole in the default excludes for this (I think that a !.semgrep/ would be sufficient...?) While that would be backwards incompatible, anyone using semgrep and pants together would likely be interested to know that this backend existed.

huonw · 2023-03-30T00:01:23Z

Thanks for the input. I'll look at it more detail over the next week or so (pretty busy with other things, unfortunately).

huonw · 2023-04-14T11:30:27Z

I've been chipping away at this over the last little while. I've pushed a work-in-progress state as a 'back-up', still need to work through tests and the issues they discover, and a few other "small" issues (in the sense of having done 80% of the work, so only got the remaining 80% to go).

It's not ready for any sort of review.

huonw · 2023-05-05T09:48:30Z

Let me know if you want further effort to make this easier to review: I'm hoping I've structured this as a 'standard' plugin, with the subsystem.py, tailor.py and rules.py reviewable somewhat independently (and in vaguely that order, I guess).

huonw · 2023-08-28T11:00:14Z

Okay, I've simplified this a lot: the new target types are unnecessary for a first attempt. This now just finds config files by globbing (similar to other linting backends), and runs the linting.

The diff still looks big, but most of that is the lock file, and more than half the rest is tests.

thejcannon

Good first start. I may not get the semgrep-ness of it all, but the Pants aspects look good.

Honestly, I think we should start really considering outsourcing plugins outside of core "Pants", and this would be perfect, but I also know we're just not "there" yet.

thejcannon · 2023-08-28T18:11:16Z

src/python/pants/backend/tools/semgrep/rules.py

+        # addition/exclusion), but that can only infer 'full' dependencies and it is wrong (e.g. JVM
+        # things break) for real code files to depend on this sort of non-code linter config; requires
+        # dependency scopes or similar (https://github.com/pantsbuild/pants/issues/12794)
+        spec = Path(address.spec_path)


Should this be PurePath? Generally we prefer the Pure variants since rules shouldn't be doing file I/O

I don't think this got resolved?

It's the only one I feel strongly about.

Admittedly we should issue a lint.

Oh, dear, sorry, I fixed but pushed to the wrong branch! Thanks for flagging.

#19685

Admittedly we should issue a lint.

#19686

thejcannon · 2023-08-28T18:12:22Z

src/python/pants/core/goals/lint.py

@@ -68,9 +69,12 @@ def create(
        *,
        strip_chroot_path: bool = False,
        report: Digest = EMPTY_DIGEST,
+        strip_formatting: bool = False,


"strip formatting" doesn't really imply stripping colors to me.

Call a spade a spade? strip_colors?

This is the name of the arg we use elsewhere too:

pants/src/python/pants/core/goals/check.py

Line 55 in b1fdb7b

strip_formatting: bool = False,

I'd prefer to keep the current pattern for local consistency, and you can rename globally as separate work if you feel strongly about it.

(In addition, I think the colours.strip_color function strips other terminal formatting escape codes, like bold and underline too, despite its name.)

kaos · 2023-08-28T19:19:34Z

src/python/pants/backend/tools/semgrep/subsystem.py

+    name = "Semgrep"
+    options_scope = "semgrep"
+    help = softwrap(
+        """\


no need to escape the newline for softwrap'ed long strings.

(Accidentally pushed to the wrong branch. Fixed in #19685)

huonw · 2023-08-28T23:42:12Z

5 months later, I'm going to merge this. Thanks everyone for helping me work through this, and for encouraging cutting it down to be minimal. It's experimental, so it should be lower impact to make improvements (or pull it out) in future.

Two fixes from #18593 that I pushed to the wrong branch, and didn't notice before merging that PR: - Use `PurePath` instead of `Path`: #18593 (comment) - Avoid escaping the newline for `softwrap`: #18593 (comment)

This refreshes the new semgrep backend (added in #18593, in 2.19) to use the current latest published version (`1.46.0`, instead of `1.37.0`). The newer version of Semgrep is slightly stricter about the formatting of the global settings file, so this inserts a dummy `anonymous_user_id` value.

…21126) Three things: Allow configuring the config location; because of how the digest is built, running `semgrep` via pants doesn't actually allow you to configure this by passing `--config`, like the tool does. There is now a `config_name` option which will be auto-discovered and hierarchically applied to each relevant partition. This will potentially evolve into a `config_paths` option which will allow users to specify a list of globs. `semgrep` searches for rules recursively within a rules directory, so reflect that in the globs. This isn't obvious in the documentation, but it's the way it behaves, and also is explicit in the code. https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/cli/src/semgrep/config_resolver.py#L402 Finally, I flattened the glob for `.semgrepignore`. Currently, the glob recursively matches all matching file names, but ultimately `semgrep` just reads only what's in the current dir. In my testing of pants+semgrep I did indeed notice the same behavior, that the `.semgrepignore`s in lower dirs were not being respected. An integration test confirmed the same.[1] Supporting multi-level `.semgrepignores` is going to be a significant effort, since it would involve actually parsing and interpreting the config files, and also navigating the pitfalls of managing overlays of ignore patterns. Instead, I chose to force it to being a singular file path. Here're some historical discussions about .semgrepignore: huonw#1, #18593 (comment) https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/src/targeting/Semgrepignore.ml#L124 https://semgrep.dev/docs/ignoring-files-folders-code#reference-summary [1] The integration test which proves hierarchical semgrepignore doesn't work: ```py def test_semgrepignore_not_partitioned(rule_runner: RuleRunner) -> None: rule_runner.write_files({ f"{DIR}/BUILD": "", f"{DIR}/.semgrep.yml": RULES, f"{DIR}/.semgrepignore": FILE, f"{DIR}/a/.semgrepignore": FILE, f"{DIR}/a/BUILD": SINGLE_FILE_BUILD, f"{DIR}/a/{FILE}": BAD_FILE, }) tgt = rule_runner.get_target(Address(f"{DIR}/a", target_name="f")) # run with these two semgrepignores in place results = run_semgrep(rule_runner, [tgt]) assert len(results) == 1 result = results[0] # doesn't get ignored assert "Ran 1 rule on 1 file: 1 finding" in result.stderr # run with a root semgrepignore rule_runner.write_files({ ".semgrepignore": FILE, }) results = run_semgrep(rule_runner, [tgt]) assert len(results) == 1 result = results[0] # finally, it's ignored assert "Ran 1 rule on 0 files: 0 findings" in result.stderr ``` --------- Co-authored-by: Huon Wilson <[email protected]>

huonw added 10 commits March 26, 2023 14:44

Basics working

c041769

Switch to more explicit names

f4bde7f

Refine env vars, refine settings

8955c50

require glob matches

a2fa07b

rm unused imports

56dccf9

Glob better

955538e

issue link

68fc026

partitioning by config

8be9b21

revert spurious changes

594d1ea

stop using semgrep in pants itself

790fe51

tdyas added the category:new feature label Mar 28, 2023

huonw added 14 commits April 1, 2023 12:09

Add target types

4f33a5d

mypy fixes

89c5b4d

tailor semgrep_rule_source(s)

25733f3

register target types

2bb0672

implement dependency inference: NB. full dependency

323f693

Run Semgrep using FieldSet/targets, partitioning by config

d08611f

avoid generating partitions with no configs

798ad6c

ignore files, tweaks

a381711

allow ignore files to work by not passig all args

987b484

remove now-unused glob files

613fc83

Add --force option

d0f5d4e

register tailor rules

b86ba09

Start sketching tests

3efeac6

minor clean-up

a405f7b

huonw added 2 commits April 19, 2023 21:26

basic test passing

1fad45e

a bunch more tests, xfail --force test

f71db25

huonw added 4 commits May 5, 2023 19:15

Merge remote-tracking branch 'upstream/main' into feature/semgrep

5f7fdaa

pathlib for ancestor_targets

195d090

Remove unnecessary lockfile juggling

9c68431

Remove fancy ignore file handling for now

ace84a6

huonw mentioned this pull request May 5, 2023

WIP: Flag nested .semgrepignore files that aren't used huonw/pants#1

Draft

huonw added 9 commits August 27, 2023 13:53

Merge remote-tracking branch 'upstream/main' into feature/semgrep

409e73c

Tweaks for upstream changes

25c4f48

Rewrite to just use pathglobs, no new targets

385666a

Remove now-dead code

5db7e97

Test some pure code

16f22c4

Tweak integration tests

ed0687a

Restore accidental deletion

4593fb2

Update semgrep lockfile

db6bc2e

Tweak subsystem

7e7135f

huonw force-pushed the feature/semgrep branch from e468533 to 7e7135f Compare August 28, 2023 09:45

huonw requested review from thejcannon and kaos August 28, 2023 11:01

thejcannon approved these changes Aug 28, 2023

View reviewed changes

kaos reviewed Aug 28, 2023

View reviewed changes

huonw merged commit cb63bba into pantsbuild:main Aug 28, 2023

This was referenced Aug 29, 2023

Add missed fixes for semgrep backend #19685

Merged

Demonstrate using semgrep to detect pathlib.Path (and textwrap.dedent) #19686

Closed

huonw mentioned this pull request Oct 26, 2023

Update semgrep to current latest, by default #20102

Merged

huonw mentioned this pull request Jul 4, 2024

Allow customizing semgrep configurations; correct rule matching glob #21126

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add lint backend to run semgrep #18593

Add lint backend to run semgrep #18593

huonw commented Mar 26, 2023 •

edited

Loading

stuhood commented Mar 29, 2023 •

edited

Loading

huonw commented Mar 30, 2023

huonw commented Apr 14, 2023

huonw commented May 5, 2023

huonw commented Aug 28, 2023

thejcannon left a comment

thejcannon Aug 28, 2023

thejcannon Aug 28, 2023

huonw Aug 29, 2023

huonw Aug 29, 2023

thejcannon Aug 28, 2023

huonw Aug 28, 2023

huonw Aug 28, 2023

kaos Aug 28, 2023

huonw Aug 29, 2023

huonw commented Aug 28, 2023

method	real	CPU
pants*	4s?	24s?
standalone `-j1`	2.5s	2.5s
standalone `-j8`	1.6s	4.4s

Add lint backend to run semgrep #18593

Add lint backend to run semgrep #18593

Conversation

huonw commented Mar 26, 2023 • edited Loading

Background

What's implemented

Limitations/future work

stuhood commented Mar 29, 2023 • edited Loading

huonw commented Mar 30, 2023

huonw commented Apr 14, 2023

huonw commented May 5, 2023

huonw commented Aug 28, 2023

thejcannon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

huonw commented Aug 28, 2023

huonw commented Mar 26, 2023 •

edited

Loading

stuhood commented Mar 29, 2023 •

edited

Loading