fuzz: Add a roundtrip regex fuzz harness #959
This change adds an optional dependency on 'arbitrary' for regex-syntax. This allows us to generate arbitrary high-level intermediate representations (HIR). Using this generated HIR we convert this back to a regex string and exercise the regex matching code under src. Using this approach we can generate arbitrary well-formed regex strings, allowing the fuzzer to penetrate deeper into the regex code.
A similar thing was done in #848, and like there, it added |
Oh true, not sure how I missed that 🤦. I think a structured fuzzer using optional derivations on arbitrary is the most straightforward and easiest to maintain method for fuzzing deep into regex-syntax. But you make some valid points, and your resistance is justified. There are at least a couple of alternatives;
I'll try to write a structure-aware fuzzer using custom mutators. I'll ping you once I'm ready for another review :) |
There are also perhaps hackier solutions to this problem. For example, maybe it's possible for the fuzzer build to copy the entire regex-syntax crate wholesale into its own compilation unit, and then it could derive |
Yeah, that's true. Well, I'll try a custom mutator first and see how far that gets me. This might not satisfy your concerns, but perhaps another hacked solution would be to remap the name of
|
Yeah I know convention might help things here, but I still don't want to do it. I want to keep Apologies for being hand wavy, but this is just a hard blocker for me. Particularly since I know it's possible for hackier solutions to exist, such as the one I mentioned above. It's undoubtedly annoying, but it's an internal and hopefully one-time cost to be paid. |
No worries, that all sounds reasonable to me. Regarding the I think using a custom mutator (example here) might be a good option, as it is a fairly good way to keep the fuzzer on track by giving you control over the mutation step. It just requires some effort to reconstruct the underlying structure rather than just mutating raw bytes. But on the plus side, it doesn't require any intrusive changes to any of the regex code.
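As a rough illustration of what "reconstruct the underlying structure rather than just mutating raw bytes" could look like, here is a small self-contained sketch. It is not wired into libFuzzer's custom mutator hook, and it is not the approach from the linked example. The toy grammar (lowercase letters, each optionally followed by `*`) and the names `Item`, `parse`, `serialize` and `mutate` are all assumptions made up for this sketch.

```rust
/// One element of the toy grammar: a lowercase letter, optionally starred.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    ch: u8,
    starred: bool,
}

/// Reconstruct the structure from raw bytes. Returns None if the bytes
/// are not a valid sentence of the toy grammar.
fn parse(data: &[u8]) -> Option<Vec<Item>> {
    let mut items = Vec::new();
    let mut i = 0;
    while i < data.len() {
        let ch = data[i];
        if !ch.is_ascii_lowercase() {
            return None;
        }
        i += 1;
        let starred = data.get(i) == Some(&b'*');
        if starred {
            i += 1;
        }
        items.push(Item { ch, starred });
    }
    Some(items)
}

/// Turn the structure back into the bytes the fuzz target will see.
fn serialize(items: &[Item]) -> Vec<u8> {
    let mut out = Vec::new();
    for it in items {
        out.push(it.ch);
        if it.starred {
            out.push(b'*');
        }
    }
    out
}

/// Structure-aware mutation: parse, change one item, serialize.
/// Unparseable or empty input is replaced with a default seed pattern,
/// so the output is always well-formed in the toy grammar.
fn mutate(data: &[u8], seed: u32) -> Vec<u8> {
    let mut items = match parse(data) {
        Some(v) if !v.is_empty() => v,
        _ => vec![Item { ch: b'a', starred: false }],
    };
    let idx = (seed as usize / 3) % items.len();
    match seed % 3 {
        // Rotate the letter to the next one in the alphabet.
        0 => items[idx].ch = b'a' + ((items[idx].ch - b'a' + 1) % 26),
        // Toggle the repetition operator.
        1 => items[idx].starred = !items[idx].starred,
        // Duplicate the chosen item in place.
        _ => {
            let dup = items[idx].clone();
            items.insert(idx, dup);
        }
    }
    serialize(&items)
}
```

With this sketch, `mutate(b"ab*", 4)` toggles the star on the second item and yields `ab`, while garbage input such as `b"\xff("` falls back to the default item before being mutated. The trade-off is the one described above: the parse and serialize steps are extra work, but nothing in the library under test has to change.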
Not a problem at all. I don't have any experience managing core libraries, so if you say something isn't a great idea, I'm inclined to trust your intuition :) |
Also note if you're going down a path that requires dealing with the HIR, it might make sense to wait until #656 is done. That will bring in some changes to the |
Good to know, I'll take a look :) |
I'm going to close this PR out because I think it's largely a duplicate of #848, and it's unlikely to get merged in its current form. If you have other ideas for how to get a fuzzer like this working without adding new dependencies to
It's super hacky, but it seems like it could feasibly work given that the patch is just about adding some |
@BurntSushi So I was thinking about this a little bit more: what would your thoughts be on creating a separate branch for fuzzing? There would be some overhead involved, e.g. periodically syncing main->fuzzing. But at least to me this feels less cumbersome than dealing with checked-in patch files. Anyway, food for thought :) I might take another swing at this later this week, and just experiment to see how the patch approach would work vs. just a separate branch. |
I caved to just adding an optional dependency on |