Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow customizing semgrep configurations; correct rule matching glob #21126

Merged
merged 13 commits into from
Jul 23, 2024

Conversation

purajit
Copy link
Contributor

@purajit purajit commented Jul 3, 2024

Three things:

Allow configuring the config location; because of how the digest is
built, running semgrep via pants doesn't actually allow you to
configure this by passing --config, like the tool does. There is now
a config_name option which will be auto-discovered and hierarchically
applied to each relevant partition. This will potentially evolve into
a config_paths option which will allow users to specify a list of globs.

semgrep searches for rules recursively within a rules directory, so
reflect that in the globs. This isn't obvious in the documentation, but
it's the way it behaves, and also is explicit in the code.
https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/cli/src/semgrep/config_resolver.py#L402

Finally, I flattened the glob for .semgrepignore. Currently, the glob
recursively matches all matching file names, but ultimately semgrep
just reads only what's in the current dir. In my testing of pants+semgrep
I did indeed notice the same behavior, that the .semgrepignores in lower
dirs were not being respected. An integration test confirmed the same.[1]
Supporting multi-level .semgrepignores is going to be
a significant effort, since it would involve actually parsing and interpreting
the config files, and also navigating the pitfalls of managing overlays of
ignore patterns. Instead, I chose to force it to being a singular file
path.

Here're some historical discussions about .semgrepignore: huonw#1, #18593 (comment)
https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/src/targeting/Semgrepignore.ml#L124
https://semgrep.dev/docs/ignoring-files-folders-code#reference-summary

[1] The integration test which proves hierarchical semgrepignore doesn't work:

def test_semgrepignore_not_partitioned(rule_runner: RuleRunner) -> None:
    rule_runner.write_files({
        f"{DIR}/BUILD": "",
        f"{DIR}/.semgrep.yml": RULES,
        f"{DIR}/.semgrepignore": FILE,
        
        f"{DIR}/a/.semgrepignore": FILE,
        f"{DIR}/a/BUILD": SINGLE_FILE_BUILD,
        f"{DIR}/a/{FILE}": BAD_FILE,
    })
    tgt = rule_runner.get_target(Address(f"{DIR}/a", target_name="f"))

    # run with these two semgrepignores in place
    results = run_semgrep(rule_runner, [tgt])
    assert len(results) == 1
    result = results[0]
    # doesn't get ignored
    assert "Ran 1 rule on 1 file: 1 finding" in result.stderr

    # run with a root semgrepignore
    rule_runner.write_files({
        ".semgrepignore": FILE,
    })
    results = run_semgrep(rule_runner, [tgt])
    assert len(results) == 1
    result = results[0]
    # finally, it's ignored
    assert "Ran 1 rule on 0 files: 0 findings" in result.stderr

@purajit purajit force-pushed the 20240703-semgrep-configurable branch from a8b06ff to f1f0e3c Compare July 3, 2024 23:03
@purajit purajit force-pushed the 20240703-semgrep-configurable branch from f1f0e3c to c55500a Compare July 4, 2024 07:54
@purajit
Copy link
Contributor Author

purajit commented Jul 4, 2024

Thoughts on working towards deprecating all default values for the
config globs?

The default values were deprecated in 1.38.0, which is coming up on 1
year now (Aug 2023). When I run the semgrep CLI for these versions I
even get the message

Versions prior to 1.43.0 will cease to work with Semgrep.dev after March 11th 2024, please upgrade.

Newer versions even actively warn against it.
https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/CHANGELOG.md?plain=1#L2400-L2402
https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/src/osemgrep/core/Migration.ml#L12-L14

The main argument to keeping it would be that the issue brought up in
the changelog doesn't apply here - people are only running semgrep
through pants within repos. However, I don't think pants should be
adding too much magic on top of a tool; it should mostly just help
integrate the system together and leave everything else up to the
underlying tool.

Would definitely require an actual deprecation process, though I dunno
how strict that is given semgrep is experimental. Definitely out of scope
of this PR, but I'm willing to do a follow-up.

@purajit
Copy link
Contributor Author

purajit commented Jul 4, 2024

And a second thought:

We might want to entirely do away with the multi-level config setup. It's
adds complexity purely because the pants backend defines this config
differently from semgrep.

semgrep doesn't behave hierarchically by nature; the pants backend works
because it calculates and injects the explicit --config params as needed.
It's also going to make the future of this backend more complicated for
two reasons:

  1. Multiple config paths can be provided, and they can be directories or files. Currently we don't support files; the hierarchical nature adds a few steps to this process.
  2. It can even be a URL, or name of a registry, which is currently unsupported in
    the pants backend, and it would make it increasingly confusing if we said "files and dirs automatically get grouped hierarchically, but not URLs or registries", and we'd just be diverging from the main tool.

Some of the complexity can already be seen in this PR with the glob management.

Copy link
Member

@kaos kaos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I don't know this tool, but the changes look sane to me.

Being a experimental backend, we don't need deprecation cycles to change defaults etc.

docs/notes/2.23.x.md Outdated Show resolved Hide resolved
@kaos kaos requested a review from huonw July 4, 2024 08:25
@purajit
Copy link
Contributor Author

purajit commented Jul 4, 2024

Nice. I don't know this tool, but the changes look sane to me.

Being a experimental backend, we don't need deprecation cycles to change defaults etc.

Gotcha, will work on that next PR!

@purajit purajit marked this pull request as ready for review July 4, 2024 09:43
@@ -51,6 +57,20 @@ class SemgrepSubsystem(PythonToolBase):
register_lockfile = True
default_lockfile_resource = ("pants.backend.tools.semgrep", "semgrep.lock")

config_paths = StrListOption(
default=[".semgrep"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this is going, but I wonder if we should go a bit further!

It feels weird to (a) just use this as a directory name, and (b) not allow customising the single-file search.

I see three options:

  • old code: no customisation, just attempt to hard-code the now-deprecated auto-search rules that semgrep has (**/.semgrep.yml,**/.semgrep.yaml, **/.semgrep/*.yml,**/.semgrep/*.yaml, with the magic around .semgrep/ directories)
  • this PR: partial customisation, allow changing the name of the directory in the last two globs there (and fix the recursive search within the folder, but that's not the thrust of the topic here)
  • proposal: put all the glob definitions into pants.toml
Suggested change
default=[".semgrep"],
default=["**/.semgrep.yml", "**/.semgrep.yaml", "**/.semgrep"],

And have the code append "", "**/*.yml" and "**/*.yaml" when expanding globs.

That is, run something equivalent to: PathGlobs(["**/.semgrep.yml", "**/.semgrep.yaml", "**/.semgrep", "**/.semgrep.yml/**/*.yml", "**/.semgrep.yaml/**/*.yml", "**/.semgrep/**/*.yml", "**/.semgrep.yml/**/*.yaml", "**/.semgrep.yaml/**/*.yaml", "**/.semgrep/**/*.yaml"])

(The fact this includes entries like **/.semgrep and "**/.semgrep.yml/**/*.yml" is a bit weird, but I think it's okay? They'll generally just be ignored.)

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I was going to mention this in the pr but must've overwritten it.

I didn't do that precisely for the same reason you mentioned - didn't want to add too much at once and then deal with the globs. I was thinking of first just allowing customizing the dir, and then allowing files in the next pr.

I even had this named config_dir at first, and then changed the name so that allowing files could just be a simple release note rather than a breaking change.

I can add it to this pr itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the way this config works

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this; this looks like a good incremental improvement, and I'm definitely sympathic to trying to keep the changes smaller.

didn't want to add too much at once

For the next steps, if you keep working on this, I'd still be interested in moving most of the path/globbing logic into the option in pants.toml.

For example, evolving to config_paths that takes a list of paths/globs (or maybe inspired by root_patterns https://www.pantsbuild.org/2.21/docs/using-pants/key-concepts/source-roots), instead of the single config_name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily a big deal, but I wanted to keep it restricted to config_name since it allows us to treat the repo structure more uniformly. I also wonder how many people configure semgrep in completely different ways for each project.

I don't mind following up with that though. Probably at the same time as deprecating defaults, since that'll make things easier too.

Comment on lines 69 to 70
ignore_config_path = StrOption(
default=".semgrepignore",
Copy link
Contributor

@huonw huonw Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... sorry about forcing you to re-investigate everything about these ignore files. This was known in the original addition of the backend, and cut out of scope: #18593 (comment)

I even referenced huonw#1 "so I don't forget about it", but look how far that's gone. 🙈

In any case, I think these two things are true, and still true as of this PR:

  • pants runs semgrep in the sandbox root (i.e. working directory is the root of the repository)
  • semgrep only looks for .semgrepignore in its working directory

Given this, running semgrep via pants will only ever successfully read //.semgrepignore. Thus, someone who sets ignore_config_path = "some/sub/directory/.semgrepignore will just have semgrep not using the ignore. It'll be included in the sandbox, but not used!

(The thinking behind the globbing behaviour was "future proofing": if a new version of semgrep started reading the nested .semgrepignores, it'll automatically work with Pants, because Pants is including them in the sandboxes.)


Two questions:

  1. Are you changing the old behaviour (of including all files called .semgrepignore) because it was causing problems? Can you be specific about what the problems are?
  2. Do you have an example of how someone would customise this path (and why)?

Copy link
Contributor Author

@purajit purajit Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha, I'm on my phone right now but I'll look at those PRs later.

No, it wasn't causing problems. Basically when I was adding the description for the semgrepignore customization, I initially had a note that it was hierarchical. That got me curious, so I tried to use it, and realized it did not work as expected. Probably should've done a git blame to see why it was done, but I decided to mention explicitly that it had to be at the root-based path (since this is different from the rules dir behavior), and remove the relevant code.

Shall I add it back with a note about context? Though I do think as it stands pants taking this on will be a huge burden in taking over tool responsibility stuff. The simplest way I can think to get it done would be to to inject :includes into the root semgrepignore during the pants run. Is there any precedent to alter file digests like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yeah, just realized semgrepignore is not actually customizable, so should just remove this and go back to the hard-coded value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😂 hilarious that we ended up with nearly the same integration tests for showing this behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this configurability

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I do think as it stands pants taking this on will be a huge burden in taking over tool responsibility stuff. The simplest way I can think to get it done would be to to inject :includes into the root semgrepignore during the pants run. Is there any precedent to alter file digests like that?

Yes, I agree that Pants should try to avoid having too much "smarts" tailored to each tool.

My thinking with the continuing work related to .semgrepignore (i.e. huonw#1) was just to make it more obvious when something is ignored, to reduce confusion. Not "invent" a new file.

Comment on lines 75 to 82
# Rules like foo/bar/.semgrep/baz.yaml and foo/bar/.semgrep/baz/qux.yaml should apply to the
# project at foo/bar
config_directory = path.parent
for config_path in config_paths:
if config_path not in path.parts:
continue
config_directory = PurePath(*path.parts[: path.parts.index(config_path)])
break
Copy link
Contributor

@huonw huonw Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nested loop jumps out to me as a bit concerning, due to performance:

  • this loop was previously approximately O(len(all_paths))
  • I think it's now O(len(all_paths) * average number of parts of each path * len(config_paths)) = O(sum(len(path.parts) for path in all_paths) * len(config_paths))

That is, after this PR, it's proportional to the number of config files and the number of parts (e.g. a, b etc. in a/b/...) of all files in the whole repository. I imagine the number of config files is often 1, but could easily be more (with a .semgrep directory)... but the number of path components could be very much higher than len(all_paths), depending how deeply nested the repo is!

I haven't yet thought of any alternatives, and how much effort to spend here will be guided by the top-level conversation about globbing etc.

Copy link
Contributor Author

@purajit purajit Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I'm comfortable forcing just one configuration option here.

For the iteration, I was really hoping for glob utils that could give me paths as match groups based on the nature of your glob, in which case this would be 0 work and would go back to an O(1) loop. I didn't see any, did I miss anything?

If not, I could improve it by doing files and dirs separately. This fn can take those two as separate args and

  • for config files, just use the parent
  • for config dirs, walk through the ancestry line. That would make it O(semgrep dir depth). I'm assuming that semgrep directories are not deeply nested, and that there are fewer path parts after the config dir than before. That should typically be a non-factor and amortize out to be constant. In the case of our monorepo, we just have a single dir level to store vendored rules.

Copy link
Contributor Author

@purajit purajit Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, did a bit of restructure based on the above.

This loop will be much simpler once we deprecate the default option and stop supporting a mix of file configs and folder configs.

Currently, the only issue is that in the default case, I have two separate globs for files and dirs, which would result in two separate glob expansions. Once we deprecate the default case, that goes away. And a lot of other bits of logic will get easier too.

@huonw
Copy link
Contributor

huonw commented Jul 4, 2024

Woohoo, thanks for taking the time to contribute!

semgrep searches for rules recursively within a rules directory, so
reflect that in the globs. This isn't obvious in the documentation, but
it's the way it behaves, and also is explicit in the code.

Nice catch 👍

Thoughts on working towards deprecating all default values for the
config globs?

Sounds good, given your research with semgrep itself moving away from the conventions 👍

Being a experimental backend, we don't need deprecation cycles to change defaults etc.

If it's easy to deprecate for a release and thus make it easier for people to upgrade, that'd be good, even with the experimental tag. If it's not easy, that's okay.

I'm not at a "real" computer to be able to find the details to help you, but #18390 is an example where we deprecated the default value for an option and then removed it (#18627).

In any case, doing it as follow-up sounds good.

We might want to entirely do away with the multi-level config setup. It's
adds complexity purely because the pants backend defines this config
differently from semgrep.

Do you have thoughts on what would be better to do?

For instance, ignoring the details of Pants for now, what semgrep invocations should happen for a repo like?

a/.semgrep.yml
a/x.py
b/.semgrep.yml
b/y.py

It can even be a URL, or name of a registry, which is currently unsupported in
the pants backend

This is somewhat explicitly not supported, because reading external non-versioned resources interacts badly with caching: see discussion on the force option.

@purajit
Copy link
Contributor Author

purajit commented Jul 4, 2024

For instance, ignoring the details of Pants for now, what semgrep invocations should happen for a repo like?

I think this gets a bit complicated - depending on how people organize their repos there could be a few answers. And I admit I'm making a lot of assumptions here and potentially projecting my opinions onto others.

In a world without pants, I'd say people are probably chdir'ing to each subproject root and running semgrep, and that people likely do not have nested semgrep configs. Since semgrep natively does not support this nesting, it would be quite a bit of effort for people to write tools/queries/wrappers to fetch nested rule dirs and add then as CLI arguments.

Also given the nature of semgrep, I would assume people have a consistent set of rules they're universally applying, and ignoring files/dirs when needed needed, rather than having configs residing in specific directories, and that at most this will apply to high-level subprojects and no deeper.

I would personally prefer to enforce that people continue using semgrep in the "expected" way, and that they handle the cases where the merge repos and manage it as part of the migration. Supporting the subproject-specific runs feels like supporting an anti-pattern, but I'm not sure if there's already been a broader discussion about this general circumstance (tools meant to be run on a monolith that that may be run in different contexts or directories depending on specific repo setups).

This is somewhat explicitly not supported, because reading external non-versioned resources interacts badly with caching

Will check that discussion out. Not important to me to allow them since I don't use them for the same reason (even without caching, I don't want to be pulling random files from the internet); I just noticed the gap and found it interesting and wondered what the future plans are. In other tools that have something similar, is the explicit decision to always leave it unsupported? Or to try to enforce things with explicit known checksums, like with known_versions?

@huonw
Copy link
Contributor

huonw commented Jul 8, 2024

(Thanks for waiting. I haven't forgotten about this, but just a bit busy at work for a few days. Please feel free to ping me if I haven't re-reviewed by Thursday (any timezone 😄 ).)

@purajit
Copy link
Contributor Author

purajit commented Jul 8, 2024

(Thanks for waiting. I haven't forgotten about this, but just a bit busy at work for a few days. Please feel free to ping me if I haven't re-reviewed by Thursday (any timezone 😄 ).)

No worries! It was the weekend, and I anyway have a couple changes to work on. And also realized I had a bunch of typos because I was half-awake on my phone while leaving some of the comments, fixed those :P

@huonw
Copy link
Contributor

huonw commented Jul 10, 2024

Also given the nature of semgrep, I would assume people have a consistent set of rules they're universally applying, and ignoring files/dirs when needed needed, rather than having configs residing in specific directories, and that at most this will apply to high-level subprojects and no deeper.

I agree that this is likely the default and most common (it's certainly how I use it in my work repo)... but...

I would personally prefer to enforce that people continue using semgrep in the "expected" way, and that they handle the cases where the merge repos and manage it as part of the migration. Supporting the subproject-specific runs feels like supporting an anti-pattern, but I'm not sure if there's already been a broader discussion about this general circumstance (tools meant to be run on a monolith that that may be run in different contexts or directories depending on specific repo setups).

... my learning from other tools (e.g. black) is that "partitioning by config" is desirable. For instance:

Thus, I'm pretty reticent about removing existing partitioning logic.

I just noticed the gap and found it interesting and wondered what the future plans are. In other tools that have something similar, is the explicit decision to always leave it unsupported? Or to try to enforce things with explicit known checksums, like with known_versions?

Yeah, some sort of checksum/content-addressing. I don't know of any other tools similar to semgrep that pants already supports. Arguably installing dependencies from PyPI is like this, but we encourage people to use a lockfile that encodes checksums etc, to be reproducible. I don't know of a lockfile for semgrep configs (i.e. "I want this semgrep.dev rule-set of this particular version")?

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is looking better now too!

docs/notes/2.23.x.md Outdated Show resolved Hide resolved
Comment on lines 69 to 70
ignore_config_path = StrOption(
default=".semgrepignore",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I do think as it stands pants taking this on will be a huge burden in taking over tool responsibility stuff. The simplest way I can think to get it done would be to to inject :includes into the root semgrepignore during the pants run. Is there any precedent to alter file digests like that?

Yes, I agree that Pants should try to avoid having too much "smarts" tailored to each tool.

My thinking with the continuing work related to .semgrepignore (i.e. huonw#1) was just to make it more obvious when something is ignored, to reduce confusion. Not "invent" a new file.

async def find_all_semgrep_configs() -> AllSemgrepConfigs:
all_paths = await path_globs_to_paths(
PathGlobs([f"**/{file_glob}" for file_glob in _RULES_FILES_GLOBS])
async def find_all_semgrep_configs(semgrep: SemgrepSubsystem) -> AllSemgrepConfigs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this new approach a lot more! Thanks.

@@ -51,6 +57,20 @@ class SemgrepSubsystem(PythonToolBase):
register_lockfile = True
default_lockfile_resource = ("pants.backend.tools.semgrep", "semgrep.lock")

config_paths = StrListOption(
default=[".semgrep"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this; this looks like a good incremental improvement, and I'm definitely sympathic to trying to keep the changes smaller.

didn't want to add too much at once

For the next steps, if you keep working on this, I'd still be interested in moving most of the path/globbing logic into the option in pants.toml.

For example, evolving to config_paths that takes a list of paths/globs (or maybe inspired by root_patterns https://www.pantsbuild.org/2.21/docs/using-pants/key-concepts/source-roots), instead of the single config_name.

Co-authored-by: Huon Wilson <[email protected]>
@purajit purajit requested a review from huonw July 10, 2024 23:47
@purajit
Copy link
Contributor Author

purajit commented Jul 11, 2024

Yeah, I ditched the idea of undoing the config-by-partition.

I don't know of a lockfile for semgrep configs (i.e. "I want this semgrep.dev rule-set of this particular version")?

Nothing like that really exists today; at work we just vendor out all the rules we want. rules typically don't need
updates or bug fixes so we're comfortable just leaving them in that state until we explicitly refresh them.

Do you have any other comments you want me to address in this PR? Or are we good now? I'll follow up with
two PRs - one for deprecating the default values, and one for expanding config_name.

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is very close, other than the test.

Also, the pull request description becomes the final commit message. It looks like the current description may be out of date with where this PR ended up (with .semgrepignore especially). Could you update it? Thanks

@@ -51,6 +51,18 @@ class SemgrepSubsystem(PythonToolBase):
register_lockfile = True
default_lockfile_resource = ("pants.backend.tools.semgrep", "semgrep.lock")

config_name = StrOption(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, one more thing. I think there's no test that currently exercises this option and its new code-path. Could you expand src/python/pants/backend/tools/semgrep/rules_integration_test.py to include one? Let me know if you'd like some assistance choosing where to add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working on it

@purajit
Copy link
Contributor Author

purajit commented Jul 15, 2024

The .semgrepignore part should still be accurate - the glob is flattened (https://github.com/pantsbuild/pants/pull/21126/files#diff-543817713d41666b0c246395e84c35bdb1a60facbf4c1f9948b444c24d274265R183 only captures the root-level file). I can update it with the links to your PR to maintain the trail?

@purajit purajit requested a review from huonw July 18, 2024 07:29
Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet! thanks for waiting

@huonw huonw merged commit 71749c0 into pantsbuild:main Jul 23, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants