syntax: remove guarantees in the HIR related to 'u' flag
Basically, we never should have guaranteed that a particular HIR would
(or wouldn't) be used if the 'u' flag was present (or absent). Such a
guarantee generally results in too little flexibility, particularly when
it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do
and would require rejiggering some of the HIR types. For example, we
would probably need a literal that is an enum of `&str` or `&[u8]` that
correctly preserves the Unicode flag. This in turn comes with a bigger
complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for
any kind of HIR regardless of what the flags are. I feel somewhat
justified in this position because part of the point of the HIR is to
erase all of the regex flags so that callers no longer need to worry
about them. That is, the erasure is what provides a simplification for
everyone downstream.
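
As a minimal sketch of what "prepared for any kind of HIR" can look like on
the caller's side, the Rust below uses a hypothetical stand-in `Class` enum
(it is not the actual regex-syntax type) and a hypothetical helper
`is_ascii_only`. The helper handles both variants rather than inferring one
of them from the `u` flag:

```rust
/// Hypothetical stand-in for an HIR character class. This is NOT the
/// regex-syntax type; it only mirrors the idea of two variants.
/// Each range is an inclusive (start, end) pair.
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum Class {
    Unicode(Vec<(char, char)>),
    Bytes(Vec<(u8, u8)>),
}

/// Reports whether every range in the class ends inside ASCII.
///
/// Both variants are matched explicitly. The caller makes no assumption
/// about which variant it gets based on the flags used in the pattern.
pub fn is_ascii_only(class: &Class) -> bool {
    match class {
        Class::Unicode(ranges) => ranges.iter().all(|&(_, end)| end.is_ascii()),
        Class::Bytes(ranges) => ranges.iter().all(|&(_, end)| end.is_ascii()),
    }
}
```

With this shape, a `Bytes` class and a `Unicode` class covering `a-z` are
treated the same way, so the caller never needs to know which flags produced
the class.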

Closes #1088
BurntSushi committed Oct 9, 2023
1 parent 17d9c1c commit 536cf70
Showing 2 changed files with 14 additions and 5 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,9 @@ TBD
 * [BUG #1046](https://github.com/rust-lang/regex/issues/1046):
 Fix a bug that could result in incorrect match spans when using a Unicode word
 boundary and searching non-ASCII strings.
+* [BUG(regex-syntax) #1088](https://github.com/rust-lang/regex/issues/1088):
+Remove guarantees in the API that connect the `u` flag with a specific HIR
+representation.


 1.9.6 (2023-09-30)
16 changes: 11 additions & 5 deletions regex-syntax/src/hir/mod.rs
@@ -797,13 +797,18 @@ impl core::fmt::Debug for Literal {
 /// The high-level intermediate representation of a character class.
 ///
 /// A character class corresponds to a set of characters. A character is either
-/// defined by a Unicode scalar value or a byte. Unicode characters are used
-/// by default, while bytes are used when Unicode mode (via the `u` flag) is
-/// disabled.
+/// defined by a Unicode scalar value or a byte.
 ///
 /// A character class, regardless of its character type, is represented by a
 /// sequence of non-overlapping non-adjacent ranges of characters.
 ///
+/// There are no guarantees about which class variant is used. Generally
+/// speaking, the Unicode variant is used whenever a class needs to contain
+/// non-ASCII Unicode scalar values. But the Unicode variant can be used even
+/// when Unicode mode is disabled. For example, at the time of writing, the
+/// regex `(?-u:a|\xc2\xa0)` will compile down to HIR for the Unicode class
+/// `[a\u00A0]` due to optimizations.
+///
 /// Note that the `Bytes` variant may be produced even when it exclusively matches
 /// valid UTF-8. This is because a `Bytes` variant represents an intention by
 /// the author of the regular expression to disable Unicode mode, which in turn
@@ -1326,8 +1331,9 @@ impl ClassUnicodeRange {
     }
 }

-/// A set of characters represented by arbitrary bytes (where one byte
-/// corresponds to one character).
+/// A set of characters represented by arbitrary bytes.
+///
+/// Each byte corresponds to one character.
 #[derive(Clone, Debug, Eq, PartialEq)]
 pub struct ClassBytes {
     set: IntervalSet<ClassBytesRange>,