-
-
Notifications
You must be signed in to change notification settings - Fork 644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow customizing semgrep configurations; correct rule matching glob #21126
Allow customizing semgrep configurations; correct rule matching glob #21126
Conversation
a8b06ff
to
f1f0e3c
Compare
f1f0e3c
to
c55500a
Compare
Thoughts on working towards deprecating all default values for the The default values were deprecated in 1.38.0, which is coming up on 1
Newer versions even actively warn against it. The main argument to keeping it would be that the issue brought up in Would definitely require an actual deprecation process, though I dunno |
And a second thought: We might want to entirely do away with the multi-level config setup. It's
Some of the complexity can already be seen in this PR with the glob management. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I don't know this tool, but the changes look sane to me.
Being a experimental backend, we don't need deprecation cycles to change defaults etc.
Gotcha, will work on that next PR! |
@@ -51,6 +57,20 @@ class SemgrepSubsystem(PythonToolBase): | |||
register_lockfile = True | |||
default_lockfile_resource = ("pants.backend.tools.semgrep", "semgrep.lock") | |||
|
|||
config_paths = StrListOption( | |||
default=[".semgrep"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like where this is going, but I wonder if we should go a bit further!
It feels weird to (a) just use this as a directory name, and (b) not allow customising the single-file search.
I see three options:
- old code: no customisation, just attempt to hard-code the now-deprecated auto-search rules that
semgrep
has (**/.semgrep.yml
,**/.semgrep.yaml
,**/.semgrep/*.yml
,**/.semgrep/*.yaml
, with the magic around.semgrep/
directories) - this PR: partial customisation, allow changing the name of the directory in the last two globs there (and fix the recursive search within the folder, but that's not the thrust of the topic here)
- proposal: put all the glob definitions into
pants.toml
default=[".semgrep"], | |
default=["**/.semgrep.yml", "**/.semgrep.yaml", "**/.semgrep"], |
And have the code append ""
, "**/*.yml"
and "**/*.yaml"
when expanding globs.
That is, run something equivalent to: PathGlobs(["**/.semgrep.yml", "**/.semgrep.yaml", "**/.semgrep", "**/.semgrep.yml/**/*.yml", "**/.semgrep.yaml/**/*.yml", "**/.semgrep/**/*.yml", "**/.semgrep.yml/**/*.yaml", "**/.semgrep.yaml/**/*.yaml", "**/.semgrep/**/*.yaml"])
(The fact this includes entries like **/.semgrep
and "**/.semgrep.yml/**/*.yml"
is a bit weird, but I think it's okay? They'll generally just be ignored.)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, I was going to mention this in the pr but must've overwritten it.
I didn't do that precisely for the same reason you mentioned - didn't want to add too much at once and then deal with the globs. I was thinking of first just allowing customizing the dir, and then allowing files in the next pr.
I even had this named config_dir
at first, and then changed the name so that allowing files could just be a simple release note rather than a breaking change.
I can add it to this pr itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the way this config works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this; this looks like a good incremental improvement, and I'm definitely sympathic to trying to keep the changes smaller.
didn't want to add too much at once
For the next steps, if you keep working on this, I'd still be interested in moving most of the path/globbing logic into the option in pants.toml
.
For example, evolving to config_paths
that takes a list of paths/globs (or maybe inspired by root_patterns
https://www.pantsbuild.org/2.21/docs/using-pants/key-concepts/source-roots), instead of the single config_name
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily a big deal, but I wanted to keep it restricted to config_name
since it allows us to treat the repo structure more uniformly. I also wonder how many people configure semgrep in completely different ways for each project.
I don't mind following up with that though. Probably at the same time as deprecating defaults, since that'll make things easier too.
ignore_config_path = StrOption( | ||
default=".semgrepignore", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... sorry about forcing you to re-investigate everything about these ignore files. This was known in the original addition of the backend, and cut out of scope: #18593 (comment)
I even referenced huonw#1 "so I don't forget about it", but look how far that's gone. 🙈
In any case, I think these two things are true, and still true as of this PR:
- pants runs
semgrep
in the sandbox root (i.e. working directory is the root of the repository) - semgrep only looks for
.semgrepignore
in its working directory
Given this, running semgrep via pants will only ever successfully read //.semgrepignore
. Thus, someone who sets ignore_config_path = "some/sub/directory/.semgrepignore
will just have semgrep not using the ignore. It'll be included in the sandbox, but not used!
(The thinking behind the globbing behaviour was "future proofing": if a new version of semgrep started reading the nested .semgrepignore
s, it'll automatically work with Pants, because Pants is including them in the sandboxes.)
Two questions:
- Are you changing the old behaviour (of including all files called
.semgrepignore
) because it was causing problems? Can you be specific about what the problems are? - Do you have an example of how someone would customise this path (and why)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha, I'm on my phone right now but I'll look at those PRs later.
No, it wasn't causing problems. Basically when I was adding the description for the semgrepignore customization, I initially had a note that it was hierarchical. That got me curious, so I tried to use it, and realized it did not work as expected. Probably should've done a git blame to see why it was done, but I decided to mention explicitly that it had to be at the root-based path (since this is different from the rules dir behavior), and remove the relevant code.
Shall I add it back with a note about context? Though I do think as it stands pants taking this on will be a huge burden in taking over tool responsibility stuff. The simplest way I can think to get it done would be to to inject :include
s into the root semgrepignore during the pants run. Is there any precedent to alter file digests like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yeah, just realized semgrepignore is not actually customizable, so should just remove this and go back to the hard-coded value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😂 hilarious that we ended up with nearly the same integration tests for showing this behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this configurability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I do think as it stands pants taking this on will be a huge burden in taking over tool responsibility stuff. The simplest way I can think to get it done would be to to inject :includes into the root semgrepignore during the pants run. Is there any precedent to alter file digests like that?
Yes, I agree that Pants should try to avoid having too much "smarts" tailored to each tool.
My thinking with the continuing work related to .semgrepignore
(i.e. huonw#1) was just to make it more obvious when something is ignored, to reduce confusion. Not "invent" a new file.
# Rules like foo/bar/.semgrep/baz.yaml and foo/bar/.semgrep/baz/qux.yaml should apply to the | ||
# project at foo/bar | ||
config_directory = path.parent | ||
for config_path in config_paths: | ||
if config_path not in path.parts: | ||
continue | ||
config_directory = PurePath(*path.parts[: path.parts.index(config_path)]) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This nested loop jumps out to me as a bit concerning, due to performance:
- this loop was previously approximately O(
len(all_paths)
) - I think it's now O(
len(all_paths)
* average number ofparts
of each path *len(config_paths)
) = O(sum(len(path.parts) for path in all_paths)
*len(config_paths)
)
That is, after this PR, it's proportional to the number of config files and the number of parts (e.g. a
, b
etc. in a/b/...
) of all files in the whole repository. I imagine the number of config files is often 1, but could easily be more (with a .semgrep
directory)... but the number of path components could be very much higher than len(all_paths)
, depending how deeply nested the repo is!
I haven't yet thought of any alternatives, and how much effort to spend here will be guided by the top-level conversation about globbing etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
I'm comfortable forcing just one configuration option here.
For the iteration, I was really hoping for glob utils that could give me paths as match groups based on the nature of your glob, in which case this would be 0 work and would go back to an O(1)
loop. I didn't see any, did I miss anything?
If not, I could improve it by doing files and dirs separately. This fn can take those two as separate args and
- for config files, just use the parent
- for config dirs, walk through the ancestry line. That would make it
O(semgrep dir depth)
. I'm assuming that semgrep directories are not deeply nested, and that there are fewer path parts after the config dir than before. That should typically be a non-factor and amortize out to be constant. In the case of our monorepo, we just have a single dir level to store vendored rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right, did a bit of restructure based on the above.
This loop will be much simpler once we deprecate the default option and stop supporting a mix of file configs and folder configs.
Currently, the only issue is that in the default case, I have two separate globs for files and dirs, which would result in two separate glob expansions. Once we deprecate the default case, that goes away. And a lot of other bits of logic will get easier too.
Woohoo, thanks for taking the time to contribute!
Nice catch 👍
Sounds good, given your research with semgrep itself moving away from the conventions 👍
If it's easy to deprecate for a release and thus make it easier for people to upgrade, that'd be good, even with the experimental tag. If it's not easy, that's okay. I'm not at a "real" computer to be able to find the details to help you, but #18390 is an example where we deprecated the default value for an option and then removed it (#18627). In any case, doing it as follow-up sounds good.
Do you have thoughts on what would be better to do? For instance, ignoring the details of Pants for now, what semgrep invocations should happen for a repo like?
This is somewhat explicitly not supported, because reading external non-versioned resources interacts badly with caching: see discussion on the |
I think this gets a bit complicated - depending on how people organize their repos there could be a few answers. And I admit I'm making a lot of assumptions here and potentially projecting my opinions onto others. In a world without pants, I'd say people are probably chdir'ing to each subproject root and running semgrep, and that people likely do not have nested semgrep configs. Since semgrep natively does not support this nesting, it would be quite a bit of effort for people to write tools/queries/wrappers to fetch nested rule dirs and add then as CLI arguments. Also given the nature of semgrep, I would assume people have a consistent set of rules they're universally applying, and ignoring files/dirs when needed needed, rather than having configs residing in specific directories, and that at most this will apply to high-level subprojects and no deeper. I would personally prefer to enforce that people continue using semgrep in the "expected" way, and that they handle the cases where the merge repos and manage it as part of the migration. Supporting the subproject-specific runs feels like supporting an anti-pattern, but I'm not sure if there's already been a broader discussion about this general circumstance (tools meant to be run on a monolith that that may be run in different contexts or directories depending on specific repo setups).
Will check that discussion out. Not important to me to allow them since I don't use them for the same reason (even without caching, I don't want to be pulling random files from the internet); I just noticed the gap and found it interesting and wondered what the future plans are. In other tools that have something similar, is the explicit decision to always leave it unsupported? Or to try to enforce things with explicit known checksums, like with |
(Thanks for waiting. I haven't forgotten about this, but just a bit busy at work for a few days. Please feel free to ping me if I haven't re-reviewed by Thursday (any timezone 😄 ).) |
No worries! It was the weekend, and I anyway have a couple changes to work on. And also realized I had a bunch of typos because I was half-awake on my phone while leaving some of the comments, fixed those :P |
I agree that this is likely the default and most common (it's certainly how I use it in my work repo)... but...
... my learning from other tools (e.g.
Thus, I'm pretty reticent about removing existing partitioning logic.
Yeah, some sort of checksum/content-addressing. I don't know of any other tools similar to semgrep that pants already supports. Arguably installing dependencies from PyPI is like this, but we encourage people to use a lockfile that encodes checksums etc, to be reproducible. I don't know of a lockfile for semgrep configs (i.e. "I want this semgrep.dev rule-set of this particular version")? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is looking better now too!
ignore_config_path = StrOption( | ||
default=".semgrepignore", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I do think as it stands pants taking this on will be a huge burden in taking over tool responsibility stuff. The simplest way I can think to get it done would be to to inject :includes into the root semgrepignore during the pants run. Is there any precedent to alter file digests like that?
Yes, I agree that Pants should try to avoid having too much "smarts" tailored to each tool.
My thinking with the continuing work related to .semgrepignore
(i.e. huonw#1) was just to make it more obvious when something is ignored, to reduce confusion. Not "invent" a new file.
async def find_all_semgrep_configs() -> AllSemgrepConfigs: | ||
all_paths = await path_globs_to_paths( | ||
PathGlobs([f"**/{file_glob}" for file_glob in _RULES_FILES_GLOBS]) | ||
async def find_all_semgrep_configs(semgrep: SemgrepSubsystem) -> AllSemgrepConfigs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this new approach a lot more! Thanks.
@@ -51,6 +57,20 @@ class SemgrepSubsystem(PythonToolBase): | |||
register_lockfile = True | |||
default_lockfile_resource = ("pants.backend.tools.semgrep", "semgrep.lock") | |||
|
|||
config_paths = StrListOption( | |||
default=[".semgrep"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this; this looks like a good incremental improvement, and I'm definitely sympathic to trying to keep the changes smaller.
didn't want to add too much at once
For the next steps, if you keep working on this, I'd still be interested in moving most of the path/globbing logic into the option in pants.toml
.
For example, evolving to config_paths
that takes a list of paths/globs (or maybe inspired by root_patterns
https://www.pantsbuild.org/2.21/docs/using-pants/key-concepts/source-roots), instead of the single config_name
.
Co-authored-by: Huon Wilson <[email protected]>
Yeah, I ditched the idea of undoing the config-by-partition.
Nothing like that really exists today; at work we just vendor out all the rules we want. rules typically don't need Do you have any other comments you want me to address in this PR? Or are we good now? I'll follow up with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is very close, other than the test.
Also, the pull request description becomes the final commit message. It looks like the current description may be out of date with where this PR ended up (with .semgrepignore
especially). Could you update it? Thanks
@@ -51,6 +51,18 @@ class SemgrepSubsystem(PythonToolBase): | |||
register_lockfile = True | |||
default_lockfile_resource = ("pants.backend.tools.semgrep", "semgrep.lock") | |||
|
|||
config_name = StrOption( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, one more thing. I think there's no test that currently exercises this option and its new code-path. Could you expand src/python/pants/backend/tools/semgrep/rules_integration_test.py
to include one? Let me know if you'd like some assistance choosing where to add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
working on it
The |
…rajit/pants into 20240703-semgrep-configurable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet! thanks for waiting
Three things:
Allow configuring the config location; because of how the digest is
built, running
semgrep
via pants doesn't actually allow you toconfigure this by passing
--config
, like the tool does. There is nowa
config_name
option which will be auto-discovered and hierarchicallyapplied to each relevant partition. This will potentially evolve into
a
config_paths
option which will allow users to specify a list of globs.semgrep
searches for rules recursively within a rules directory, soreflect that in the globs. This isn't obvious in the documentation, but
it's the way it behaves, and also is explicit in the code.
https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/cli/src/semgrep/config_resolver.py#L402
Finally, I flattened the glob for
.semgrepignore
. Currently, the globrecursively matches all matching file names, but ultimately
semgrep
just reads only what's in the current dir. In my testing of pants+semgrep
I did indeed notice the same behavior, that the
.semgrepignore
s in lowerdirs were not being respected. An integration test confirmed the same.[1]
Supporting multi-level
.semgrepignores
is going to bea significant effort, since it would involve actually parsing and interpreting
the config files, and also navigating the pitfalls of managing overlays of
ignore patterns. Instead, I chose to force it to being a singular file
path.
Here're some historical discussions about .semgrepignore: huonw#1, #18593 (comment)
https://github.com/semgrep/semgrep/blob/45699ace83dd61f19d6ea1e7452f32a0c1996735/src/targeting/Semgrepignore.ml#L124
https://semgrep.dev/docs/ignoring-files-folders-code#reference-summary
[1] The integration test which proves hierarchical semgrepignore doesn't work: