Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lint backend to run semgrep #18593

Merged
merged 54 commits into from
Aug 28, 2023
Merged

Add lint backend to run semgrep #18593

merged 54 commits into from
Aug 28, 2023

Conversation

huonw
Copy link
Contributor

@huonw huonw commented Mar 26, 2023

This adds an pants.backend.experimental.tools.semgrep backend that runs the Semgrep (https://semgrep.dev) OSS engine (https://semgrep.dev/docs/getting-started/) as a linter.

Background

Semgrep is a multilanguage linter, that allows defining in-repo per-language rules similar to a PyLint plugin for Python, but a source-like syntax for defining the rules (rather than manual AST traversal). It has explicit support for many languages, and even a generic language that works on any (text) file.

We use this for a whole bunch of checks like:

  • every GitHub Actions job has a timeout configured in the workflow YAML file
  • erroring if we call logging.info(...) in Python, rather than using the appropriate method on a logging.Logger instance

A random open source example: https://github.com/aiven/terraform-provider-aiven/blob/ae42e99db2ee1e56a327df7bd6d8465996951842/.semgrep.yml

What's implemented

The backend implemented here is minimal, but enough to be functional: pants lint --only=semgrep ... in our repo ~identical to running it externally (except for more caching). Here's some features:

  • Automatic configuration finding, by looking for files called .semgrep.yml, .semgrep.yaml, or in a .semgrep/ directory
  • Flexible 'targetless'-ness (kinda):
    • semgrep is invoked against all targets with a source, relying on Semgrep's own file walking to apply the right rules to the right files within each sandbox.
  • Multiple/scoped configuration:
    • Each target is linted using all the configurations in its ancestor directories (a/b/c/some_file.py will be linted using all of the rules configured in //, a/, a/b/ and a/b/c/, if available).
    • Targets are partitioned by the sets of rules that apply to them.
  • Ignore support: .semgrepignore (at the top level) is obeyed (NB. there's no skip_semgrep per-target field).
  • --semgrep-force for disabling caching: Semgrep supports downloading and running external rules (e.g. semgrep -c p/python includes a bunch of Python-specific rules). However, these are mutable and cannot be pinned, so caching results may be misleading. With this flag, pants lint --only=semgrep --semgrep-force --semgrep-args='-c=p/python' :: will do a whole fresh run, as desired.

Limitations/future work

There's various bits of potential improvements that I already know about, but I think that this backend is useful even without them, and that they can be iterated on.

  1. Rules can have auto-fix-ups configured, and applied via semgrep scan --autofix ..., which would fit neatly into pants fix .... I started writing this before Ruff integration in pantsbuild only fails on auto-fixable errors #18430 was fixed, so this doesn't support that.

  2. More precise identification of which files need to be run against which rules, to increase cache reuse. Currently, every single file that's a descendent of a directory with a semgrep_rule_source() target is put into a semgrep partition/sandbox, but this could be tightened by parsing/doing dependency inference on the actual rules:

    • by language: if a directory only has python rules, there's no need to run it against Go/PHP/yaml/... files in that directory.
    • by path: rules can define inclusion/exclusion rules based on file paths (and similarly the global ignores implied by a .semgrepignore file)
  3. This doesn't include support for testing rules: https://semgrep.dev/docs/writing-rules/testing-rules/ (this may require rule YAML files to have explicit targets)

  4. Ignoring/disabling semgrep for specific files/targets works okay, but is a bit awkward:

  5. This appears to be significantly slower than running semgrep outside of pants. For instance, the table below shows that the time for running within pants (13 invocations of semgrep linting subsets of the repo) is significantly slower than running once on all files outside of pants. Part of this is set-up time of semgrep (each invocation takes at least 0.6s just to start semgrep), but that doesn't seem to explain it all. However, semgrep is fast enough that I think this is fine for now, and the finer-grained caching will recover this time for very large repos.

    method real CPU
    pants* 4s? 24s?
    standalone -j1 2.5s 2.5s
    standalone -j8 1.6s 4.4s

    (* The pants timing is estimated by checking --stats-log -ldebug logs: I'm using PANTS_SOURCE=.... pants ... which doesn't use pantsd, and thus using the raw timings would be unfairly penalising pants set-up time that would disappear.)

Depending on where this PR goes, I'll file follow-up issues for some of these.

@stuhood
Copy link
Member

stuhood commented Mar 29, 2023

Thanks for writing this up!

  • Does it make sense to have tool configuration as explicit targets? This generally doesn't seem to be how other linters/tools work (e.g. black/mypy/... configuration in pyproject.toml doesn't require an explicit target).

Based on your description: yes, I think that it does make sense to have a semgrep_rule_source target type, which could be tailored for the matching file types.

  • Relatedly, is it okay for pants lint path/to/dir:: to run against all targets within path/to/dir, but using the configuration semgrep_rule_sources targets from the parent directories (e.g. in path/ or path/to)?
  • Semgrep supports many languages (https://semgrep.dev/docs/supported-languages/), so it seems rather tedious to define a set of rules for "run Semgrep against Python", "run Semgrep against Go", "run Semgrep against YAML", .... It looks like semgrep reasons in terms of file extensions (https://semgrep.dev/docs/writing-rules/rule-syntax/#language-extensions-and-tags), so maybe just PathGlobs(..)/FilespecMatcher-style globbing to match, is appropriate...

Both of these come come down (I think) to whether the semgrep_rule_source target:

  1. directly uses globs without trying to find file owners, or
  2. infers dependencies on the Pants targets which own files.

With approach 1, the answer to you first question ("Relatedly") would be that the semgrep_rule_source target in the parent directory would be picked up and run. If the semgrep_rule_source target was actually a target generator (semgrep_rule_sources), then it could run for only the files in the child directories.

On the other hand, with approach 2, pants lint path/to/dir:: would do nothing to discover a dependent target in path/ or path/to: you would need to use pants --changed-dependents=.. lint, or some other dependent-walking step to find them.

  1. that feels a bit weird, e.g. a files() target that pulls in a .py file would be semgrep'd as Python, even though pants doesn't agree (and vice versa a python_source(source="foo.secret_python_extension") file would not be semgrep'd as Python). I don't think this is a major issue, but might cause confusion. Thoughts?
  2. NB. Other 'languages' supported by semgrep aren't necessarily natively supported by pants, but could still be sensibly linted (e.g. a .php file could be imported by files()). language: generic rules apply to all files (mediated by any path inclusion/exclusions): https://semgrep.dev/docs/writing-rules/generic-pattern-matching/

I think that because of this point (Pants already applying semantic meaning to files), and to avoid having multiple owning targets for files, I'd lean toward approach 2: inferred dependencies on actual Pants targets.

  • How to manage the default config .semgrep/ directory vs. pants_ignore's default .*/ value?

We can definitely punch a hole in the default excludes for this (I think that a !.semgrep/ would be sufficient...?) While that would be backwards incompatible, anyone using semgrep and pants together would likely be interested to know that this backend existed.

@huonw
Copy link
Contributor Author

huonw commented Mar 30, 2023

Thanks for the input. I'll look at it more detail over the next week or so (pretty busy with other things, unfortunately).

@huonw
Copy link
Contributor Author

huonw commented Apr 14, 2023

I've been chipping away at this over the last little while. I've pushed a work-in-progress state as a 'back-up', still need to work through tests and the issues they discover, and a few other "small" issues (in the sense of having done 80% of the work, so only got the remaining 80% to go).

It's not ready for any sort of review.

@huonw
Copy link
Contributor Author

huonw commented May 5, 2023

Let me know if you want further effort to make this easier to review: I'm hoping I've structured this as a 'standard' plugin, with the subsystem.py, tailor.py and rules.py reviewable somewhat independently (and in vaguely that order, I guess).

@huonw
Copy link
Contributor Author

huonw commented Aug 28, 2023

Okay, I've simplified this a lot: the new target types are unnecessary for a first attempt. This now just finds config files by globbing (similar to other linting backends), and runs the linting.

The diff still looks big, but most of that is the lock file, and more than half the rest is tests.

@huonw huonw requested review from thejcannon and kaos August 28, 2023 11:01
Copy link
Member

@thejcannon thejcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good first start. I may not get the semgrep-ness of it all, but the Pants aspects look good.

Honestly, I think we should start really considering outsourcing plugins outside of core "Pants", and this would be perfect, but I also know we're just not "there" yet.

# addition/exclusion), but that can only infer 'full' dependencies and it is wrong (e.g. JVM
# things break) for real code files to depend on this sort of non-code linter config; requires
# dependency scopes or similar (https://github.com/pantsbuild/pants/issues/12794)
spec = Path(address.spec_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be PurePath? Generally we prefer the Pure variants since rules shouldn't be doing file I/O

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this got resolved?

It's the only one I feel strongly about.

Admittedly we should issue a lint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, dear, sorry, I fixed but pushed to the wrong branch! Thanks for flagging.

#19685

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly we should issue a lint.

#19686

@@ -68,9 +69,12 @@ def create(
*,
strip_chroot_path: bool = False,
report: Digest = EMPTY_DIGEST,
strip_formatting: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"strip formatting" doesn't really imply stripping colors to me.

Call a spade a spade? strip_colors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the name of the arg we use elsewhere too:

strip_formatting: bool = False,

I'd prefer to keep the current pattern for local consistency, and you can rename globally as separate work if you feel strongly about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(In addition, I think the colours.strip_color function strips other terminal formatting escape codes, like bold and underline too, despite its name.)

name = "Semgrep"
options_scope = "semgrep"
help = softwrap(
"""\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to escape the newline for softwrap'ed long strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Accidentally pushed to the wrong branch. Fixed in #19685)

@huonw
Copy link
Contributor Author

huonw commented Aug 28, 2023

5 months later, I'm going to merge this. Thanks everyone for helping me work through this, and for encouraging cutting it down to be minimal. It's experimental, so it should be lower impact to make improvements (or pull it out) in future.

@huonw huonw merged commit cb63bba into pantsbuild:main Aug 28, 2023
huonw added a commit that referenced this pull request Aug 29, 2023
Two fixes from #18593 that I pushed to the wrong branch, and didn't
notice before merging that PR:

- Use `PurePath` instead of `Path`:
#18593 (comment)
- Avoid escaping the newline for `softwrap`:
#18593 (comment)
huonw added a commit that referenced this pull request Oct 31, 2023
This refreshes the new semgrep backend (added in #18593, in 2.19) to use
the current latest published version (`1.46.0`, instead of `1.37.0`).

The newer version of Semgrep is slightly stricter about the formatting
of the global settings file, so this inserts a dummy `anonymous_user_id`
value.
huonw added a commit that referenced this pull request Jul 23, 2024
…21126)

Three things:

Allow configuring the config location; because of how the digest is 
built, running `semgrep` via pants doesn't actually  allow you to 
configure this by passing `--config`, like the tool does. There is now
a `config_name` option which will be auto-discovered and hierarchically
applied to each relevant partition. This will potentially evolve into
a `config_paths` option which will allow users to specify a list of
globs.

`semgrep` searches for rules recursively within a rules directory, so 
reflect that in the globs. This isn't obvious in the documentation, but
it's the way it behaves, and also is explicit in the code.

https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/cli/src/semgrep/config_resolver.py#L402


Finally, I flattened the glob for `.semgrepignore`. Currently, the glob
recursively matches all matching file names, but ultimately `semgrep` 
just reads only what's in the current dir. In my testing of
pants+semgrep
I did indeed notice the same behavior, that the `.semgrepignore`s in
lower
dirs were not being respected. An integration test confirmed the
same.[1]
Supporting multi-level `.semgrepignores` is going to be
a significant effort, since it would involve actually parsing and
interpreting
the config files, and also navigating the pitfalls of managing overlays
of
ignore patterns. Instead, I chose to force it to being a singular file 
path.

Here're some historical discussions about .semgrepignore:
huonw#1,
#18593 (comment)

https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/src/targeting/Semgrepignore.ml#L124
https://semgrep.dev/docs/ignoring-files-folders-code#reference-summary

[1] The integration test which proves hierarchical semgrepignore doesn't
work:
```py
def test_semgrepignore_not_partitioned(rule_runner: RuleRunner) -> None:
    rule_runner.write_files({
        f"{DIR}/BUILD": "",
        f"{DIR}/.semgrep.yml": RULES,
        f"{DIR}/.semgrepignore": FILE,
        
        f"{DIR}/a/.semgrepignore": FILE,
        f"{DIR}/a/BUILD": SINGLE_FILE_BUILD,
        f"{DIR}/a/{FILE}": BAD_FILE,
    })
    tgt = rule_runner.get_target(Address(f"{DIR}/a", target_name="f"))

    # run with these two semgrepignores in place
    results = run_semgrep(rule_runner, [tgt])
    assert len(results) == 1
    result = results[0]
    # doesn't get ignored
    assert "Ran 1 rule on 1 file: 1 finding" in result.stderr

    # run with a root semgrepignore
    rule_runner.write_files({
        ".semgrepignore": FILE,
    })
    results = run_semgrep(rule_runner, [tgt])
    assert len(results) == 1
    result = results[0]
    # finally, it's ignored
    assert "Ran 1 rule on 0 files: 0 findings" in result.stderr
```

---------

Co-authored-by: Huon Wilson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants