Documentation for regular expressions. (#1195)
* Documentation for regular expressions.

* Update for regex documentation and improved matching detection of regex names.

* Fixes for line endings in msvc-2022.

* Pass through regex docs

Mainly turn code lists into tables

* Small fixes after update from Herb.

* Fix for regression tests.

* Quote tweak

Signed-off-by: Herb Sutter <[email protected]>

---------

Signed-off-by: Herb Sutter <[email protected]>
Co-authored-by: Herb Sutter <[email protected]>
MaxSagebaum and hsutter authored Aug 6, 2024
1 parent 630e61a commit c618ed5
Showing 6 changed files with 330 additions and 34 deletions.
82 changes: 82 additions & 0 deletions docs/cpp2/metafunctions.md


### For computational and functional types


#### `regex`

A `regex` type has data members that are regular expression objects. This metafunction replaces all of the type's data members named `regex` or `regex_*` with regular expression objects of the same type. For example:

``` cpp title="Regular expression example" hl_lines="1 3 4 16 17 19 27 30 31"
name_matcher: @regex type
= {
    regex         := R"((\w+) (\w+))";  // for example: Margaret Hamilton
    regex_no_case := R"(/(ab)+/i)";     // case insensitive match of "ab"+
}

main: (args) = {
    m: name_matcher = ();

    data: std::string = "Donald Duck";
    if args.ssize() >= 2 {
        data = args[1];
    }

    // regex.match succeeds only if the pattern matches the entire string, from start to end
    result := m.regex.match(data);
    if result.matched {
        // We found a match; reverse the order of the substrings
        std::cout << "Hello (result.group(2))$, (result.group(1))$!\n";
    }
    else {
        std::cout << "I only know names of the form: <name> <family name>.\n";
    }

    // regex.search finds a match anywhere within the target string
    std::cout << "Case insensitive match: "
                 "(m.regex_no_case.search(\"blubabABblah\").group(0))$\n";
}
// Prints:
// Hello Duck, Donald!
// Case insensitive match: abAB
```

The `@regex` metafunction currently supports most of [Perl regex syntax](https://perldoc.perl.org/perlre), except for Unicode characters and the syntax tokens associated with them. See [Supported regular expression features](../notes/regex_status.md) for a list of regex options.

Each regex object has the type `cpp2::regex::regular_expression`, which is defined in `include/cpp2regex.h2`. The member functions are:

``` cpp title="Member functions for regular expressions"
// .match() succeeds only if the pattern matches the entire string, from start to end
// .search() finds a match anywhere within the target string

match : (this, str: std::string_view) -> search_return;
search: (this, str: std::string_view) -> search_return;

match : (this, str: std::string_view, start) -> search_return;
search: (this, str: std::string_view, start) -> search_return;

match : (this, str: std::string_view, start, length) -> search_return;
search: (this, str: std::string_view, start, length) -> search_return;

match : <Iter> (this, start: Iter, end: Iter) -> search_return;
search: <Iter> (this, start: Iter, end: Iter) -> search_return;
```
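
For readers more familiar with the standard library, the difference between `.match()` and `.search()` resembles the difference between `std::regex_match` and `std::regex_search`. The following is only a rough sketch in standard C++ using `std::regex` (ECMAScript grammar), not cpp2's `regular_expression` type; the helper names `matches_whole` and `first_match` are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>

// Whole-string match: succeeds only if `pattern` covers all of `text`.
// This mirrors the documented behavior of .match() in spirit only.
bool matches_whole(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Match anywhere: returns the first matched substring, or "" if none.
// This mirrors the documented behavior of .search() in spirit only.
std::string first_match(const std::string& text, const std::string& pattern) {
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern))) {
        return m.str(0);
    }
    return "";
}
```

With the pattern `(\w+) (\w+)`, `matches_whole` accepts `"Donald Duck"` but rejects `"Hello Donald Duck!"`, while `first_match` still finds `"Hello Donald"` inside the longer string.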

The return type `search_return` is defined in `cpp2::regex::regular_expression`. It has these members:

``` cpp title="Members of a regular expression result"
matched: bool;
pos: int;

// Functions to access groups by number
group_number: (this) -> size_t;
group: (this, g: int) -> std::string;
group_start: (this, g: int) -> int;
group_end: (this, g: int) -> int;

// Functions to access groups by name
group: (this, g: bstring<CharT>) -> std::string;
group_start: (this, g: bstring<CharT>) -> int;
group_end: (this, g: bstring<CharT>) -> int;
```
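
To make the idea of numbered groups and their offsets concrete, here is a small sketch in standard C++ with `std::smatch`, used as a stand-in rather than the cpp2 result type. The struct `name_parts` and the function `split_name` are hypothetical. The sketch reports the end offset as one past the last character; whether cpp2's `group_end` uses the same convention is not stated above, so treat that as an assumption of the sketch.

```cpp
#include <regex>
#include <string>

// Hypothetical result record for this illustration only.
struct name_parts {
    bool        matched      = false;
    std::string given;              // text of capture group 1
    std::string family;             // text of capture group 2
    int         family_start = -1;  // offset of group 2 in the input
    int         family_end   = -1;  // one past the end of group 2 (sketch's convention)
};

// Splits "<name> <family name>" using the same pattern as the example above.
name_parts split_name(const std::string& text) {
    static const std::regex pattern(R"((\w+) (\w+))");
    name_parts out;
    std::smatch m;
    if (std::regex_match(text, m, pattern)) {
        out.matched      = true;
        out.given        = m.str(1);
        out.family       = m.str(2);
        out.family_start = static_cast<int>(m.position(2));
        out.family_end   = out.family_start + static_cast<int>(m.length(2));
    }
    return out;
}
```

For the input `"Donald Duck"`, group 1 is `Donald`, group 2 is `Duck`, and group 2 spans offsets 7 to 11 under the one-past-the-end convention.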



### Helpers and utilities


233 changes: 233 additions & 0 deletions docs/notes/regex_status.md
# Supported regular expression features

The listings are taken from the [Perl regex docs](https://perldoc.perl.org/perlre). Regular expressions are applied via the [`regex` metafunction](../cpp2/metafunctions.md#regex).


## Currently supported or planned features


### Modifiers

| Modifier | Notes | Status |
| --- | --- | --- |
| **`i`** | Do case-insensitive pattern matching. For example, "A" will match "a" under `/i`. | <span style="color:green">Supported</span> |
| **`m`** | Treat the string being matched against as multiple lines. That is, change `^` and `$` from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string. | <span style="color:green">Supported</span> |
| **`s`** | Treat the string as single line. That is, change `.` to match any character whatsoever, even a newline, which normally it would not match. | <span style="color:green">Supported</span> |
| **`x` and `xx`** | Extend your pattern's legibility by permitting whitespace and comments. For details see: [Perl regex docs: `/x` and `/xx`](https://perldoc.perl.org/perlre#/x-and-/xx). | <span style="color:green">Supported</span> |
| **`n`** | Prevent the grouping metacharacters `(` and `)` from capturing. This modifier will stop `$1`, `$2`, etc. from being filled in. | <span style="color:green">Supported</span> |
| **`c`** | Keep the current position during repeated matching. | <span style="color:gray">Planned</span> |
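
As an illustration of what the `i` modifier changes, the sketch below uses standard C++ `std::regex` with the `std::regex::icase` flag as a stand-in. It is not the cpp2 implementation, and the helper name `search_with_case_option` is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `text` matched by `pattern`, or "" if none.
// When `ignore_case` is true, the pattern is compiled with std::regex::icase,
// which plays the role of Perl's /i modifier in this sketch.
std::string search_with_case_option(const std::string& text,
                                    const std::string& pattern,
                                    bool ignore_case) {
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (ignore_case) {
        flags = flags | std::regex::icase;
    }
    const std::regex re(pattern, flags);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return "";
}
```

Searching `"blubabABblah"` for `(ab)+` yields `abAB` when case is ignored and only `ab` when it is not.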


### Escape sequences __(Complete)__

| Escape sequence | Notes | Status |
| --- | --- | --- |
| **`\t`** | Tab (HT, TAB) | <span style="color:green">Supported</span> |
| **`\n`** | Newline (LF, NL) | <span style="color:green">Supported</span> |
| **`\r`** | Return (CR) | <span style="color:green">Supported</span> |
| **`\f`** | Form feed (FF) | <span style="color:green">Supported</span> |
| **`\a`** | Alarm (bell) (BEL) | <span style="color:green">Supported</span> |
| **`\e`** | Escape (think troff) (ESC) | <span style="color:green">Supported</span> |
| **`\x{}`, `\x00`** | Character whose ordinal is the given hexadecimal number | <span style="color:green">Supported</span> |
| **`\o{}`, `\000`** | Character whose ordinal is the given octal number | <span style="color:green">Supported</span> |


### Quantifiers __(Complete)__

| Quantifier | Notes | Status |
| --- | --- | --- |
| **`*`** | Match 0 or more times | <span style="color:green">Supported</span> |
| **`+`** | Match 1 or more times | <span style="color:green">Supported</span> |
| **`?`** | Match 1 or 0 times | <span style="color:green">Supported</span> |
| **`{n}`** | Match exactly n times | <span style="color:green">Supported</span> |
| **`{n,}`** | Match at least n times | <span style="color:green">Supported</span> |
| **`{,n}`** | Match at most n times | <span style="color:green">Supported</span> |
| **`{n,m}`** | Match at least n but not more than m times | <span style="color:green">Supported</span> |
| | | |
| **`*?`** | Match 0 or more times, not greedily | <span style="color:green">Supported</span> |
| **`+?`** | Match 1 or more times, not greedily | <span style="color:green">Supported</span> |
| **`??`** | Match 0 or 1 time, not greedily | <span style="color:green">Supported</span> |
| **`{n}?`** | Match exactly n times, not greedily (redundant) | <span style="color:green">Supported</span> |
| **`{n,}?`** | Match at least n times, not greedily | <span style="color:green">Supported</span> |
| **`{,n}?`** | Match at most n times, not greedily | <span style="color:green">Supported</span> |
| **`{n,m}?`** | Match at least n but not more than m times, not greedily | <span style="color:green">Supported</span> |
| | | |
| **`*+`** | Match 0 or more times and give nothing back | <span style="color:green">Supported</span> |
| **`++`** | Match 1 or more times and give nothing back | <span style="color:green">Supported</span> |
| **`?+`** | Match 0 or 1 time and give nothing back | <span style="color:green">Supported</span> |
| **`{n}+`** | Match exactly n times and give nothing back (redundant) | <span style="color:green">Supported</span> |
| **`{n,}+`** | Match at least n times and give nothing back | <span style="color:green">Supported</span> |
| **`{,n}+`** | Match at most n times and give nothing back | <span style="color:green">Supported</span> |
| **`{n,m}+`** | Match at least n but not more than m times and give nothing back | <span style="color:green">Supported</span> |
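
The difference between greedy and non-greedy quantifiers is easiest to see side by side. The sketch below uses standard C++ `std::regex` (ECMAScript grammar) as a stand-in for the cpp2 engine; the helper name `first_span` is hypothetical. The possessive forms such as `*+` are not part of the ECMAScript grammar, so this sketch cannot demonstrate them.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `text` matched by `pattern`, or "" if none.
std::string first_span(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return "";
}
```

On the input `<b>bold</b>`, the greedy pattern `<.+>` consumes the whole string, while the non-greedy pattern `<.+?>` stops at the first `>` and returns only `<b>`. A bounded quantifier such as `a{2,3}` applied to `aaaa` takes as many repetitions as its upper bound allows, giving `aaa`.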


### Character Classes and other Special Escapes __(Complete)__

| Feature | Notes | Status |
| --- | --- | --- |
| **`[`...`]`** | Match a character according to the rules of the bracketed character class defined by the "...". Example: `[a-z]` matches "a" or "b" or "c" ... or "z" | <span style="color:green">Supported</span> |
| **`[[:`...`:]]`** | Match a character according to the rules of the POSIX character class "..." within the outer bracketed character class. Example: `[[:upper:]]` matches any uppercase character. | <span style="color:green">Supported</span> |
| **`\g1`** or **`\g{-1}`** | Backreference to a specific or previous group. The number may be negative indicating a relative previous group and may optionally be wrapped in curly brackets for safer parsing. | <span style="color:green">Supported</span> |
| **`\g{name}`** | Named backreference | <span style="color:green">Supported</span> |
| **`\k<name>`** | Named backreference | <span style="color:green">Supported</span> |
| **`\k'name'`** | Named backreference | <span style="color:green">Supported</span> |
| **`\k{name}`** | Named backreference | <span style="color:green">Supported</span> |
| **`\w`** | Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks) | <span style="color:green">Supported</span> |
| **`\W`** | Match a non-"word" character | <span style="color:green">Supported</span> |
| **`\s`** | Match a whitespace character | <span style="color:green">Supported</span> |
| **`\S`** | Match a non-whitespace character | <span style="color:green">Supported</span> |
| **`\d`** | Match a decimal digit character | <span style="color:green">Supported</span> |
| **`\D`** | Match a non-digit character | <span style="color:green">Supported</span> |
| **`\v`** | Vertical whitespace | <span style="color:green">Supported</span> |
| **`\V`** | Not vertical whitespace | <span style="color:green">Supported</span> |
| **`\h`** | Horizontal whitespace | <span style="color:green">Supported</span> |
| **`\H`** | Not horizontal whitespace | <span style="color:green">Supported</span> |
| **`\1`** | Backreference to a specific capture group or buffer. '1' may actually be any positive integer. | <span style="color:green">Supported</span> |
| **`\N`** | Any character but \n. Not affected by /s modifier | <span style="color:green">Supported</span> |
| **`\K`** | Keep the stuff left of the \K, don't include it in $& | <span style="color:green">Supported</span> |
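
A numbered backreference such as `\1` matches the same text that the referenced group captured. As a small sketch, again in standard C++ `std::regex` rather than cpp2, the hypothetical helper `first_repeated_word` finds a word that is immediately repeated.

```cpp
#include <regex>
#include <string>

// Returns the first word that appears twice in a row, separated by one
// space, or "" if there is none. `\1` refers back to capture group 1, and
// the `\b` assertions keep the match from starting or ending mid-word.
std::string first_repeated_word(const std::string& text) {
    static const std::regex pattern(R"(\b(\w+) \1\b)");
    std::smatch m;
    if (std::regex_search(text, m, pattern)) {
        return m.str(1);
    }
    return "";
}
```

For `"this is is a test"` the helper returns `is`. The leading `\b` prevents a false hit on the `is` at the end of `this`.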


### Assertions

| Assertion | Notes | Status |
| --- | --- | --- |
| **`\b`** | Match a \w\W or \W\w boundary | <span style="color:green">Supported</span> |
| **`\B`** | Match except at a \w\W or \W\w boundary | <span style="color:green">Supported</span> |
| **`\A`** | Match only at beginning of string | <span style="color:green">Supported</span> |
| **`\Z`** | Match only at end of string, or before newline at the end | <span style="color:green">Supported</span> |
| **`\z`** | Match only at end of string | <span style="color:green">Supported</span> |
| **`\G`** | Match only at pos() (e.g. at the end-of-match position of prior m//g) | <span style="color:gray">Planned</span> |


### Capture groups __(Complete)__

| Feature | Status |
| --- | --- |
| **`(`...`)`** | <span style="color:green">Supported</span> |


### Quoting metacharacters __(Complete)__

| Feature | Status |
| --- | --- |
| **For `^.[]$()*{}?+|\`** | <span style="color:green">Supported</span> |


### Extended Patterns

| Extended pattern | Notes | Status |
| --- | --- | --- |
| **`(?<NAME>pattern)`** | Named capture group | <span style="color:green">Supported</span> |
| **`(?#text)`** | Comments | <span style="color:green">Supported</span> |
| **`(?adlupimnsx-imnsx)`** | Modification for surrounding context | <span style="color:green">Supported</span> |
| **`(?^alupimnsx)`** | Modification for surrounding context | <span style="color:green">Supported</span> |
| **`(?:pattern)`** | Clustering, does not generate a group index. | <span style="color:green">Supported</span> |
| **`(?adluimnsx-imnsx:pattern)`** | Clustering with modifiers applied to the cluster; does not generate a group index. | <span style="color:green">Supported</span> |
| **`(?^aluimnsx:pattern)`** | Clustering with modifiers applied to the cluster; does not generate a group index. | <span style="color:green">Supported</span> |
| **`(?`<code>&#124;</code>`pattern)`** | Branch reset | <span style="color:green">Supported</span> |
| **`(?'NAME'pattern)`** | Named capture group | <span style="color:green">Supported</span> |
| **`(?(condition)yes-pattern`<code>&#124;</code>`no-pattern)`** | Conditional patterns. | <span style="color:gray">Planned</span> |
| **`(?(condition)yes-pattern)`** | Conditional patterns. | <span style="color:gray">Planned</span> |
| **`(?>pattern)`** | Atomic patterns. (Disable backtrack.) | <span style="color:gray">Planned</span> |
| **`(*atomic:pattern)`** | Atomic patterns. (Disable backtrack.) | <span style="color:gray">Planned</span> |


### Lookaround Assertions

| Lookaround assertion | Notes | Status |
| --- | --- | --- |
| **`(?=pattern)`** | Positive look ahead. | <span style="color:green">Supported</span> |
| **`(*pla:pattern)`** | Positive look ahead. | <span style="color:green">Supported</span> |
| **`(*positive_lookahead:pattern)`** | Positive look ahead. | <span style="color:green">Supported</span> |
| **`(?!pattern)`** | Negative look ahead. | <span style="color:green">Supported</span> |
| **`(*nla:pattern)`** | Negative look ahead. | <span style="color:green">Supported</span> |
| **`(*negative_lookahead:pattern)`** | Negative look ahead. | <span style="color:green">Supported</span> |
| **`(?<=pattern)`** | Positive look behind. | <span style="color:gray">Planned</span> |
| **`(*plb:pattern)`** | Positive look behind. | <span style="color:gray">Planned</span> |
| **`(*positive_lookbehind:pattern)`** | Positive look behind. | <span style="color:gray">Planned</span> |
| **`(?<!pattern)`** | Negative look behind. | <span style="color:gray">Planned</span> |
| **`(*nlb:pattern)`** | Negative look behind. | <span style="color:gray">Planned</span> |
| **`(*negative_lookbehind:pattern)`** | Negative look behind. | <span style="color:gray">Planned</span> |
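
Lookahead assertions test what follows the current position without consuming it. The sketch below shows the `(?=pattern)` and `(?!pattern)` forms in standard C++ `std::regex` (ECMAScript grammar), as a stand-in for the cpp2 engine. The helper names are hypothetical, and the `(*pla:...)` style spellings are not available in `std::regex`.

```cpp
#include <regex>
#include <string>

// Returns the first word that is immediately followed by a comma.
// The comma is checked by the positive lookahead but is not part of the match.
std::string word_before_comma(const std::string& text) {
    static const std::regex pattern(R"(\w+(?=,))");
    std::smatch m;
    if (std::regex_search(text, m, pattern)) {
        return m.str(0);
    }
    return "";
}

// Returns the first whole word that is NOT immediately followed by a comma.
std::string first_word_not_before_comma(const std::string& text) {
    static const std::regex pattern(R"(\b\w+\b(?!,))");
    std::smatch m;
    if (std::regex_search(text, m, pattern)) {
        return m.str(0);
    }
    return "";
}
```

On the input `"Duck, Donald"`, the positive lookahead version returns `Duck` and the negative lookahead version returns `Donald`.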


### Special Backtracking Control Verbs

| Backtracking control verb | Notes | Status |
| --- | --- | --- |
| **`(*SKIP) (*SKIP:NAME)`** | Start next search here. | <span style="color:gray">Planned</span> |
| **`(*PRUNE) (*PRUNE:NAME)`** | No backtracking over this point. | <span style="color:gray">Planned</span> |
| **`(*MARK:NAME) (*:NAME)`** | Place a named mark. | <span style="color:gray">Planned</span> |
| **`(*THEN) (*THEN:NAME)`** | Like PRUNE. | <span style="color:gray">Planned</span> |
| **`(*COMMIT) (*COMMIT:arg)`** | Stop searching. | <span style="color:gray">Planned</span> |
| **`(*FAIL) (*F) (*FAIL:arg)`** | Fail the pattern/branch. | <span style="color:gray">Planned</span> |
| **`(*ACCEPT) (*ACCEPT:arg)`** | Accept the pattern/subpattern. | <span style="color:gray">Planned</span> |


## Not planned (mainly because of Unicode or Perl specifics)

### Modifiers

| Modifier | Notes | Status |
| --- | --- | --- |
| `p` | Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching. | <span style="color:darkred">Not planned</span> |
| `a`, `d`, `l`, and `u` | These modifiers affect which character-set rules (Unicode, etc.) are used, as described in the Perl regex docs under "Character set modifiers". | <span style="color:darkred">Not planned</span> |
| `g` | globally match the pattern repeatedly in the string | <span style="color:darkred">Not planned</span> |
| `e` | evaluate the right-hand side as an expression | <span style="color:darkred">Not planned</span> |
| `ee` | evaluate the right side as a string then eval the result | <span style="color:darkred">Not planned</span> |
| `o` | pretend to optimize your code, but actually introduce bugs | <span style="color:darkred">Not planned</span> |
| `r` | perform non-destructive substitution and return the new value | <span style="color:darkred">Not planned</span> |


### Escape sequences

| Escape sequence | Notes | Status |
| --- | --- | --- |
| `\cK` | control char (example: VT) | <span style="color:darkred">Not planned</span> |
| `\N{name}` | named Unicode character or character sequence | <span style="color:darkred">Not planned</span> |
| `\N{U+263D}` | Unicode character (example: FIRST QUARTER MOON) | <span style="color:darkred">Not planned</span> |
| `\l` | lowercase next char (think vi) | <span style="color:darkred">Not planned</span> |
| `\u` | uppercase next char (think vi) | <span style="color:darkred">Not planned</span> |
| `\L` | lowercase until \E (think vi) | <span style="color:darkred">Not planned</span> |
| `\U` | uppercase until \E (think vi) | <span style="color:darkred">Not planned</span> |
| `\Q` | quote (disable) pattern metacharacters until \E | <span style="color:darkred">Not planned</span> |
| `\E` | end either case modification or quoted section, think vi | <span style="color:darkred">Not planned</span> |


### Character Classes and other Special Escapes

| Character class or escape | Notes | Status |
| --- | --- | --- |
| `(?[...])` | Extended bracketed character class | <span style="color:darkred">Not planned</span> |
| `\pP` | Match P, named property. Use \p{Prop} for longer names | <span style="color:darkred">Not planned</span> |
| `\PP` | Match non-P | <span style="color:darkred">Not planned</span> |
| `\X` | Match Unicode "eXtended grapheme cluster" | <span style="color:darkred">Not planned</span> |
| `\R` | Linebreak | <span style="color:darkred">Not planned</span> |


### Assertions

| Assertion | Notes | Status |
| --- | --- | --- |
| `\b{}` | Match at Unicode boundary of specified type | <span style="color:darkred">Not planned</span> |
| `\B{}` | Match where corresponding \b{} doesn't match | <span style="color:darkred">Not planned</span> |

### Extended Patterns


| Extended pattern | Notes | Status |
| --- | --- | --- |
| `(?{ code })` | Perl code execution. | <span style="color:darkred">Not planned</span> |
| `(*{ code })` | Perl code execution. | <span style="color:darkred">Not planned</span> |
| `(??{ code })` | Perl code execution. | <span style="color:darkred">Not planned</span> |
| `(?PARNO)` `(?-PARNO)` `(?+PARNO)` `(?R)` `(?0)` | Recursive subpattern. | <span style="color:darkred">Not planned</span> |
| `(?&NAME)` | Recursive subpattern. | <span style="color:darkred">Not planned</span> |


### Script runs

| Script runs | Notes | Status |
| --- | --- | --- |
| `(*script_run:pattern)` | All chars in pattern need to be of the same script. | <span style="color:darkred">Not planned</span> |
| `(*sr:pattern)` | All chars in pattern need to be of the same script. | <span style="color:darkred">Not planned</span> |
| `(*atomic_script_run:pattern)` | Without backtracking. | <span style="color:darkred">Not planned</span> |
| `(*asr:pattern)` | Without backtracking. | <span style="color:darkred">Not planned</span> |