byte regex can produce empty matches between UTF-8 code units #484
Comments
To make my current intention for this bug clear, my plan at this point is to see if this can be fixed in the compiler when that part of the crate gets refactored.
I've looked into this as part of #656. And it has in particular motivated the ability to configure lower level matching iterators to change whether character advancement is "one Unicode scalar value" or "one byte." So we could fix this as I hinted at above by simply moving the byte oriented iterators over to using "one Unicode scalar value" instead of "one byte." The problem here is that we really don't want to enforce that behavior all the time. So, we could either presume this behavior under the current "Unicode" flag, or we could add a new flag, or we could mark this as I'm currently leaning towards I do not want to rely on the current Unicode flag, since I'd like to keep that in sync with the syntactic We also could add a new flag that changes how the iterators work. But it would likely have to be opt-in. I think we could do that at any point, so I'd prefer to wait until there is a solid use case for it.
OK, so I think that overall this behavior by default is correct and consistent. Namely, the As a separate issue, we might consider exposing the "allow invalid UTF-8" flag in a
Consider this program:
its output is
Also, consider this program, which is a different manifestation of the same underlying bug:
its output is:
In particular, the empty pattern matches everything, including the locations between UTF-8 code units and otherwise invalid UTF-8.
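To make "locations between UTF-8 code units" concrete, here is a minimal sketch that uses only the standard library and does not call the regex crate at all. The helper names `all_offsets` and `boundary_offsets` are hypothetical and exist only for illustration. It lists every offset at which an empty match could in principle be reported for a haystack, next to the subset of those offsets that fall on a UTF-8 code point boundary.

```rust
/// Every offset in `0..=haystack.len()` is a candidate position for an
/// empty match when the haystack is treated as plain bytes.
fn all_offsets(haystack: &str) -> Vec<usize> {
    (0..=haystack.len()).collect()
}

/// Only the offsets that fall on a UTF-8 code point boundary.
fn boundary_offsets(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .collect()
}

fn main() {
    // U+2603 SNOWMAN is encoded as the three bytes E2 98 83.
    let haystack = "☃";
    println!("byte offsets:     {:?}", all_offsets(haystack));
    println!("boundary offsets: {:?}", boundary_offsets(haystack));
}
```

Offsets 1 and 2 sit inside the encoding of a single code point, which is the kind of position this issue is about.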
A related note here is that `find_iter` is implemented slightly differently in `bytes::Regex` when compared with `Regex`. Namely, upon observing an empty match, the iterator forcefully advances its current position by a single character. For Unicode regexes, a character is a Unicode codepoint. For byte oriented regexes, a character is any single byte. The problem here is that the `bytes::Regex` iterator always assumes the byte oriented definition, even when Unicode mode is enabled for the entire regex (which is the default).

We could fix part of this issue by making the `bytes::Regex` iterator respect the value of the `unicode` flag when set via `bytes::RegexBuilder`. Namely, we could make the iterator advance one Unicode codepoint in the case of an empty match when Unicode mode is enabled for the entire regex. The problem here is the behavior in the second example, when Unicode mode is enabled, but we match at invalid UTF-8 boundaries. In that case, "skipping ahead one Unicode codepoint" doesn't really make sense, because it kind of assumes valid UTF-8. This is why the `bytes::Regex` iterator works the way it does. The intention was to rely on the matching semantics themselves to preserve the UTF-8 guarantee.

I guess ideally, the empty regex shouldn't match at locations that aren't valid UTF-8 boundaries when Unicode mode is enabled. This would completely fix the entire issue. I'm not entirely sure what the best way to implement this would be though.
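The two advancement policies described above can be sketched with the standard library alone. This is an illustrative model, not the regex crate's implementation: the `Advance` enum and the functions `next_position` and `empty_match_offsets` are hypothetical names, the matcher is assumed to report an empty match at every position it is asked about, and falling back to a one-byte step on invalid UTF-8 is an assumption made here so that the loop always makes progress.

```rust
/// How far the iterator moves after observing an empty match.
#[derive(Clone, Copy)]
enum Advance {
    Byte,
    Codepoint,
}

/// Returns the position to resume searching from after an empty match at `at`.
fn next_position(haystack: &[u8], at: usize, mode: Advance) -> usize {
    match mode {
        Advance::Byte => at + 1,
        Advance::Codepoint => {
            // A UTF-8 encoded scalar value is 1 to 4 bytes long. The first
            // prefix that validates is exactly one encoded scalar value.
            for len in 1..=4 {
                let end = at + len;
                if end > haystack.len() {
                    break;
                }
                if std::str::from_utf8(&haystack[at..end]).is_ok() {
                    return end;
                }
            }
            // Assumption: on invalid UTF-8 (or at the end), step one byte.
            at + 1
        }
    }
}

/// Offsets at which an always-matching empty pattern would be reported.
fn empty_match_offsets(haystack: &[u8], mode: Advance) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut at = 0;
    while at <= haystack.len() {
        offsets.push(at);
        at = next_position(haystack, at, mode);
    }
    offsets
}

fn main() {
    let valid = "a☃".as_bytes(); // 61 E2 98 83
    println!("{:?}", empty_match_offsets(valid, Advance::Byte));
    println!("{:?}", empty_match_offsets(valid, Advance::Codepoint));

    let truncated = b"\xE2\x98"; // first two bytes of ☃ only
    println!("{:?}", empty_match_offsets(truncated, Advance::Codepoint));
}
```

On valid UTF-8 the codepoint policy skips the interior offsets of `☃`. On the truncated input it has no codepoint to skip, so it still lands on every byte offset, which is the case that advancing by codepoint alone cannot fix.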
This bug was initially reported as a bug in ripgrep in BurntSushi/ripgrep#937.