Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fuzz: Add a roundtrip regex fuzz harness #959

Closed
wants to merge 1 commit into from

Conversation

nathaniel-brough
Copy link
Contributor

This change adds an optional dependency on 'arbitrary' for regex-syntax. This allows us to generate arbitrary high-level intermediate representations (HIR). Using this generated HIR we convert this back to a regex string and exercise the regex matching code under src. Using this approach we can generate arbitrary well-formed regex strings, allowing the fuzzer to penetrate deeper into the regex code.

@nathaniel-brough nathaniel-brough marked this pull request as draft February 20, 2023 17:58
This change adds an optional dependency on 'arbitrary' for
regex-syntax. This allows us to generate arbitrary high-level
intermediate representations (HIR). Using this generated HIR
we convert this back to a regex string and exercise the regex
matching code under src. Using this approach we can generate
arbitrary well-formed regex strings, allowing the fuzzer to
penetrate deeper into the regex code.
@nathaniel-brough nathaniel-brough marked this pull request as ready for review February 20, 2023 18:08
@BurntSushi
Copy link
Member

A similar thing was done in #848, and like there, it added arbitrary as an optional dependency of regex-syntax. I do not want to do that. Is there another way to make this work?

@nathaniel-brough
Copy link
Contributor Author

A similar thing was done in #848, and like there, it added arbitrary as an optional dependency of regex-syntax. I do not want to do that. Is there another way to make this work?

Oh true, not sure how I missed that 🤦. I think a structured fuzzer using optional derivations on arbitrary is the most straightforward and easiest to maintain method for fuzzing deep into regex-syntax. But you make some valid points, and your resistance is justified. There are at least a couple of alternatives;

  1. Wrap every AST/HIR type new types and then manually implement arbitrary for each node type in the AST/HIR. This is kind of a daunting task, and might not be worth the effort to develop and maintain.
  2. Use a custom mutator. This would allow for some structure-aware fuzzing but requires a little more manual effort and isn't as expressive or performant but it could be made to work without any intrusive changes to any of the regex crates.

I'll try and write a structure-aware fuzzer using custom mutators, I'll ping your once I'm ready for another review :)

@BurntSushi
Copy link
Member

There are also perhaps hackier solutions to this problem. For example, maybe it's possible for the fuzzer build to copy the entire regex-syntax crate wholesale into its own compilation unit, and then it could derive Arbitrary for whatever type it wants. I think as long as the build process itself handles everything and there are no manual changes required to keep its copy of regex-syntax up to date, then that might work? It could be a little gnarly to make work, but presumably once it's done it's done.

@nathaniel-brough
Copy link
Contributor Author

nathaniel-brough commented Feb 20, 2023

Yeah that's true. Well I'll try a custom mutator first and see how far that get's me.

This might not satisfy your concerns. But perhaps another hacked solution would be remap the name of arbitrary -> private-internal-use-only-arbitrary. It would still technically be public. But it'd be private by convention. Someone would have to enable that feature knowing that it might be unstable. Something like;

# Cargo.toml
private-internal-use-only-arbitrary = { version = "1.0", package = "arbitrary", optional=true }

@BurntSushi
Copy link
Member

BurntSushi commented Feb 20, 2023

Yeah I know convention might help things here, but I still don't want to do it. I want to keep regex-syntax dependency free. If I can help it, I just do not want a third party dependency on the critical path. It has all sorts of effects that are difficult to see without practice with this sort of thing. It might impact Cargo dependency resolution. It will show up in audit tools. In effect, decisions made by arbitrary may become inherited by regex, and I don't want that. It's a supply chain problem, and making arbitrary a dependency just so we can get better fuzzing does not seem like the right answer to me. If it were a dev-dependency, that'd be something I'd be okay with because it doesn't infect normal builds as much. There's a clear separation.

Apologies for being hand wavy, but this is just a hard blocker for me. Particularly since I know it's possible for hackier solutions to exist, such as the one I mentioned above. It's undoubtedly annoying, but it's an internal and hopefully one-time cost to be paid.

@nathaniel-brough
Copy link
Contributor Author

Yeah I know convention might help things here, but I still don't want to do it. I want to keep regex-syntax dependency free. If I can help it, I just do not want a third party dependency on the critical path. It has all sorts of effects that are difficult to see without practice with this sort of thing. It might impact Cargo dependency resolution. It will show up in audit tools. In effect, decisions made by arbitrary may become inherited by regex, and I don't want that. It's a supply chain problem, and making arbitrary a dependency just so we can get better fuzzing does not seem like the right answer to me. If it were a dev-dependency, that'd be something I'd be okay with because it doesn't infect normal builds as much. There's a clear separation.

No worries, that all sounds reasonable to me. Regarding the dev-dependencies approach this currently isn't compatible with cargo-fuzz which requires access to a public API. There is an issue related to that rust-fuzz/cargo-fuzz#156. I might have a look at what it would take to solve rust-fuzz/cargo-fuzz#156 as that would likely enable the dev-dependency solution.

I think using a custom mutator (example here) might be a good option as that it is a fairly good way to keep fuzzer on track by giving you control over the mutator. It just requires some effort to reconstruct the underlying structure rather than just mutating raw bytes. But on the plus side it doesn't require any intrusive changes to any of the regex code.

Apologies for being hand wavy, but this is just a hard blocker for me. Particularly since I know it's possible for hackier solutions to exist, such as the one I mentioned above. It's undoubtedly annoying, but it's an internal and hopefully one-time cost to be paid.

Not a problem at all. I don't have any experience managing core libraries. So if you say something isn't a great I idea I'm inclined to trust your intuition :)

@BurntSushi
Copy link
Member

Also note if you're going down a path that requires dealing with the HIR, it might make sense to wait until #656 is done. That will bring in some changes to the Hir. They aren't earth shattering, so if you think it will be easy to migrate then I think it's fine. Just wanted to give you a heads up. You can see the current revisions for the new HIR API here: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-syntax/regex_syntax/hir/index.html

@nathaniel-brough
Copy link
Contributor Author

Good to know I'll take a look :)

@BurntSushi
Copy link
Member

I'm going to close this PR out because I think it's largely a duplicate of #848, and it's unlikely to get merged in its current form. If you have other ideas for how to get a fuzzer like this working without adding new dependencies to regex-syntax, then I'd be happy to entertain them. For my part, it still seems that something like this could work:

$ cp regex-syntax/**/*.rs fuzz/fuzz_targets/regex_syntax/
$ patch diff-that-adds-derive-arbitrary.diff
$ cargo fuzz ...

It's super hacky, but it seems like it could feasibly work given that the patch is just about adding some derive clauses in places.

@nathaniel-brough
Copy link
Contributor Author

@BurntSushi So I was thinking about this a little bit more, what would your thoughts be on creating a separate branch for fuzzing. There would be some overhead involved e.g. periodically syncing main->fuzzing. But at least to me this feels less cumbersome than dealing with checked in patch files. Anyway food for thought :) I might take another swing at this later this week, and just experiment to see how the patch approach would work vs just a seperate branch.

@BurntSushi
Copy link
Member

I caved to just adding an optional dependency on arbitrary in my ag/regex-automata branch. That will be merged to master for the 1.9 release. There is another PR with more discussion, but I'm on mobile or else I would link it. It's a recent PR.

@jasikpark
Copy link

#848 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants