Add 3 structure-aware fuzzers #848

addisoncrump · 2022-03-08T22:48:35Z

These are the fuzzers used to find CVE-2022-24713.

Effectively, it just marks all AST members as Arbitrary when the fuzzer feature is enabled, then generates ASTs which are then converted back to statements in regex. Then, the fuzzer parses the generated string and executes against an input.

5225225 · 2022-03-14T19:11:54Z

I was using

fuzz_target!(|data: (&str, &str, [bool; 7])| {
    let (
        pattern,
        input,
        [case_insensitive, multi_line, dot_matches_new_line, swap_greed, ignore_whitespace, unicode, octal],
    ) = data;

    let r = regex::RegexBuilder::new(pattern)
        .case_insensitive(case_insensitive)
        .multi_line(multi_line)
        .dot_matches_new_line(dot_matches_new_line)
        .swap_greed(swap_greed)
        .ignore_whitespace(ignore_whitespace)
        .unicode(unicode)
        .octal(octal)
        .dfa_size_limit(1024)
        .size_limit(1024)
        .build();

    if let Ok(re) = r {
        re.is_match(input);
    }
});

Looks like the only option that shows up as an option in the RegexBuilder that doesn't show up in the flags is octal. Might be worth fuzzing that option as well?

BurntSushi · 2022-03-14T19:16:51Z

Thanks so much!

I am not sure if I am a fan of adding dependencies to regex-syntax, even optional ones. Because they now become part of the API of the crate and require careful management when it comes to release maintenance, particularly given that it create a public dependency on another crate.

I'd really strongly prefer that we don't introduce new dependencies to regex-syntax. Adding a dev-dependency would be quite all right though. Is there a way to structure the fuzzer to make that possible I wonder?

addisoncrump · 2022-03-14T19:20:02Z

Looks like the only option that shows up as an option in the RegexBuilder that doesn't show up in the flags is octal. Might be worth fuzzing that option as well?

@5225225, I saw your PR earlier (it was actually what inspired me to write this one!) and would be happy to add some of the testing capabilities yours demonstrates.

I'd really strongly prefer that we don't introduce new dependencies to regex-syntax. Adding a dev-dependency would be quite all right though. Is there a way to structure the fuzzer to make that possible I wonder?

@BurntSushi, let me see what I can do. 🙂

addisoncrump · 2022-03-14T19:43:34Z

@BurntSushi seems that this option isn't really available: rust-fuzz/cargo-fuzz#256

Unfortunately, it would be necessary to make it a feature. However, it is possible to test if it's being used in a fuzzer environment, so we can set up something like the following:

#[cfg(any(all(feature = "fuzzer", not(fuzzing)), all(feature = "arbitrary", not(fuzzing))))]
compile_error!("You cannot enable the fuzzer feature without fuzzing!");

This prevents people from pulling in the fuzzer features or arbitrary without fuzzing. Does this suffice?

CityOfLight77 · 2022-06-05T00:39:34Z

@VTCAKAVSMoACE with this fuzzers how long it takes to find cve-2022-24713? I'm tested locally from your repo I can't trigger the crash. I have 50ish corpus and using regex dictionary

addisoncrump · 2022-06-14T15:41:32Z

@VTCAKAVSMoACE with this fuzzers how long it takes to find cve-2022-24713? I'm tested locally from your repo I can't trigger the crash. I have 50ish corpus and using regex dictionary

Sorry for the late response. You need to set a timeout of a second or two, and you do not want to use a corpus.

BurntSushi · 2023-05-01T21:43:11Z

I've given this some thought, and I'm just not okay with adding a public dependency on arbitrary for regex-syntax. I could stomach it as a dev-dependency, but even one that's optional and difficult to enable is too risky for me. And doing it just to enable some convenient testing does not feel like the right trade off to me. The problem it being in the dependency chain is that it's still considered part of the main set of dependencies and I believe there are various things that can impact building regex-syntax based on things that arbitrary does. (Perhaps the setting of rust-version, IIRC.)

BurntSushi · 2023-05-01T21:51:46Z

I'm going to close this out for largely similar reasons as stated here: #959 (comment)

addisoncrump · 2023-05-04T08:17:57Z

Hm, this seems a bit of a shame to me. Grammar fuzzing discovers hard-to-find bugs very fast in regex (e.g. reproducing the ReDoS in <1 minute); it seems a bit more than convenient testing. I'll think about alternative implementation strategies here since I think it's important, but I don't have much time to do this at the moment. I will try to nerd snipe some my friends into this 😉

BurntSushi · 2023-05-04T12:01:20Z

Yeah I agree it's a shame. And honestly, my suggested work-around sucks. It blows chunks. I don't like it and I don't want it, haha.

This really does feel like a flaw in the cargo fuzz tooling. Basically, whenever it needs some additional infrastructure for testing, it mandates that it be added as a non-dev dependency. That's just not good at all. It might be acceptable for some crates to do that, but this is really what dev-dependencies are supposed to be for. So I really think cargo fuzz needs to figure out a way to support them.

I'll see if I can open up an issue with them and see if I can get anywhere that way.

BurntSushi · 2023-05-04T12:20:33Z

I'm also reconsidering the other work-around suggested, which was mirroring the full Ast type in the fuzzer targets, and making them impl Arbitrary and then defining a From impl. All of the Ast types have a public representation, so this is possible to do. It is painful. But the Ast types very rarely change. So it might be feasible. At least, I think I like this better than my idea of copying and patching the entire regex-syntax crate.

BurntSushi · 2023-05-04T12:27:52Z

I've added a request here: rust-fuzz/cargo-fuzz#338

Manishearth · 2023-05-04T14:35:08Z

I closed the cargo-fuzz issue as unimplementable, but repeating some of my suggestions from there:

For this specific case, I'd really just recommend adding an arbitrary feature the way a lot of crates have serde features: other people may also want to include your types in their fuzzers. I recognize that this is similar to what has been proposed here and has already been rejected.

The hack for getting this to work is to instead manually define fuzz targets under examples/ using the libfuzzer crate, and figure out the sanitizer flags cargo fuzz passes in and pass them yourself using RUSTFLAGS=foo cargo rustc --examples bar -- <more flags>. That's basically all cargo-fuzz does: it creates a crate that depends on libfuzzer, sets up boilerplate, and then builds with the appropriate flags.

BurntSushi · 2023-05-04T14:39:45Z

@Manishearth gotya, thanks. I think the RUSTFLAGS hack might not work, but I'm not sure. In particular, this project is hooked up to OSS-fuzz, and I'm not sure whether it's possible in that context to say, "do RUSTFLAGS=foo cargo rustc ... instead of cargo fuzz."

addisoncrump · 2023-05-04T14:46:06Z

Regarding OSS-Fuzz: this is something that you would put into the config file. Quite possible 🙂

BurntSushi · 2023-05-04T14:46:34Z

Ah! There's a lot that I don't know apparently. Haha.

BurntSushi · 2023-05-04T15:01:57Z

OK, so I'm going to move forward with just adding arbitrary as an optional dependency to regex-syntax. And it will be an advertised feature. Things that changed my mind:

There's just not really any good alternative given the cargo fuzz setup today.
This drastically speeds up fuzzing. And given that regex 1.9 is going to be a complete rewrite, getting it under a good fuzzer before a release sounds like a very responsible thing to do.
I reviewed arbitrary itself and it looks like a solid crate with good dependency hygiene itself.
I added it as an optional dependency to regex-syntax locally and played around with it. Everything still builds on Rust 1.60.
If this ends up not working out, I can always back out the change because I can do new breaking change releases of regex-syntax at a higher frequency than regex itself.

addisoncrump · 2023-05-04T15:18:24Z

So, what do you need from me? 🙂 Rebase and test?

BurntSushi · 2023-05-04T15:25:06Z

Ah right, nope, you'll all set. I'm rolling this into #978. (I have other fuzz changes in that PR too, in order to improve coverage across various things.)

addisoncrump · 2023-05-04T15:33:10Z

Okay. I think potentially something worthwhile is to also perform differential testing between the rewrite and the old version. This should be fairly quick to set up and I'll create a separate repository for this since it doesn't seem appropriate here.

BurntSushi · 2023-05-04T15:37:50Z

@addisoncrump Sounds great. What does it entail? (Or feel free to avoid indulging my curiosity and just show me whenever the new repo is ready. :-))

addisoncrump · 2023-05-04T15:39:03Z

I should be done in like 5 minutes? I'll just push it here since this PR is encapsulated by other changes.

addisoncrump · 2023-05-04T15:44:38Z

Aha, yup, this immediately finds some differences. Test with cargo fuzz run -s none diff_new_fuzzer and enjoy your failing testcases 😉

addisoncrump · 2023-05-04T15:47:29Z

Note that currently, it uses generated ASTs from the old implementation. If there are new features present in the rewrite, this won't test them, or it will test them wrong.

BurntSushi · 2023-05-04T15:52:33Z

Yeah unfortunately not only are there new features, but there are bug fixes in match semantics too.

I'll take a look at some of the failing cases to confirm they fall in those buckets, but this might make differential testing in this way tricky. (I've done my own form of differential non-fuzz testing by ensuring that the old implementation passes the new test suite and that the new implementation passes the old test suite.) The new test suite is much bigger than the old one, and tests each individual internal engine much more thoroughly. It's not fuzz testing, but it's not nothing either. :)

addisoncrump · 2023-05-04T16:24:04Z

And the differential fuzzer also just repro'd the old CVE on the new implementation with: (?:){250000}?... Seems to be in parse-time, again, too.

BurntSushi · 2023-05-04T16:30:42Z

@addisoncrump Yeah I caught that one a few days ago. I might not have pushed my branch since then. But it's in the new NFA compiler, not parser. Happened for a different reason.

BurntSushi · 2023-05-05T15:19:09Z

@addisoncrump Hmmm. Probably the simplest thing is to copy the properties from this section of UNICODE.md: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl12-properties

But that doesn't include each of the general category or script values. For example, neither Letter (a general category) nor Greek (a script) are listed there.

Otherwise there really isn't anything in regex-syntax (internal or otherwise) that makes such a thing convenient because it isn't an operation that is ever really needed. Technically everything you need are in the Unicode property name/alias tables, but you'd have to dig around in there. But none of that is exposed in a way that lets your fuzz target take advantage of it.

One thing worth mentioning is that the AST parser doesn't check whether the thing inside a \p{...} is valid or not. That's only done when the AST is translated to an HIR. The AST parser might do some basic validation in terms of which characters are allowed in \p{...}. I can't remember. But it specifically does not, for example, reject \p{foo}:

$ regex-cli debug ast '\p{foo}'                                        
parse time:  54.929µs

Class(
    Unicode(
        ClassUnicode {
            span: Span(Position(o: 0, l: 1, c: 1), Position(o: 7, l: 1, c: 8)),
            negated: false,
            kind: Named(
                "foo",
            ),
        },
    ),
)

addisoncrump · 2023-05-05T15:29:33Z

I'll find a way 😁 But not this evening.

addisoncrump · 2023-05-06T01:12:36Z

So, I looked a little further. I believe that using -use_cmp=0 will also improve the performance of the fuzzers that generate ASTs by arbitrary, as "cmp" will attempt to identify subsequences in the input to replace with the strings that are discovered dynamically. This will unfortunately poison the corpus with undesirable values now that the unicode class arbitrary is implemented, and even in previous versions this wouldn't work because of how arbitrary generates these values.

@Manishearth you might also wish to know this; maybe worth documenting in the rust fuzz book. I believe -use_cmp=1 (enabled by default) may be actively harmful for fuzz campaigns which utilise arbitrary as it will cause libfuzzer to suggest useless mutations for arbitrary-generated data. It will only be useful if libfuzzer is able to find data which is used as-is without transformation (such as arbitrary => fixed string); in the case of the PR above, this never happens, so the inferred mutation injects data which is not useful for the fuzz campaign.