Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing performance #56

Closed
dalance opened this issue Jan 16, 2023 · 19 comments
Closed

Parsing performance #56

dalance opened this issue Jan 16, 2023 · 19 comments
Assignees

Comments

@dalance
Copy link

dalance commented Jan 16, 2023

At veryl, parsing performance is about 64KiB/s now:

     Running benches/benchmark.rs (target/release/deps/benchmark-dbdacafbf8409011)
Benchmarking throughput/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 19.1s, or reduce sample count to 20.
throughput/parse        time:   [190.85 ms 195.85 ms 202.96 ms]
                        thrpt:  [62.594 KiB/s 64.866 KiB/s 66.565 KiB/s]
                 change:
                        time:   [+37.181% +40.784% +45.521%] (p = 0.00 < 0.05)
                        thrpt:  [-31.281% -28.969% -27.104%]
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe

In my expexience, the performance is affected by the number of terminals.
For example, the performance is degraded by increasing reserverd keyword.

The flamegraph is below. It shows almost all time is consumed at regex::pikevm in parol's lexer.
This issue may be the same as rust-lang/regex#787.
BurntSushi seems to try to implement new regex engine to improve it.
If my prediction is correct, it is required that removeing regex from lexer or waiting the new regex engine to improve the performance of parol.

flamegraph

@jsinger67
Copy link
Owner

jsinger67 commented Jan 16, 2023

@dalance
First of all thank you for reporting this 👍

Removing regex from parol would be a major drawback, because users possibly use all features they've got from the underlying regex crate in their terminals. This is actually a big selling point.
I could of course generate own DFAs from the terminals but it would be hard to resemble all features of a fully equipped regex engine.
What I could imagine is a better support for keywords in parol's input syntax.
Keywords are usually no regexes but plain strings (except the word boundaries attached). For those I could generate either a special regex part with the form \b(k1|k2|k2|..)\b or I could generate a trie for them and use an own scanning algorithm using the trie as DFA. The latter needs interception of the parol_runtime::TokenStream which is basically a regex::iterator and thus not easily to intercept.

For veryl's grammar the generated regex is now exactly this:

(?P<G1>\r\n|\r|\n)|(?P<G2>[\s--\r\n]+)|(?P<G5>(?:(?:(?://.*(?:\r\n|\r|\n|$))|(?:(?ms)/\u{2a}.*?\u{2a}/))\s*)+)|(?P<G6>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*[eE][+-]?[0-9]+(?:_[0-9]+)*)|(?P<G7>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*)|(?P<G8>[0-9]+(?:_[0-9]+)*'[bodh][0-9a-fA-FxzXZ]+(?:_[0-9a-fA-FxzXZ]+)*)|(?P<G9>[0-9]+(?:_[0-9]+)*)|(?P<G10>'[01xzXZ])|(?P<G11>\-:)|(?P<G12>\->)|(?P<G13>\+:)|(?P<G14>\+=|-=|\*=|/=|%=|&=|\|=|\^=|<<=|>>=|<<<=|>>>=)|(?P<G15>\*\*)|(?P<G16>/|%)|(?P<G17>\+|-)|(?P<G18><<<|>>>|<<|>>)|(?P<G19><=|>=|<|>)|(?P<G20>===|==\?|!==|!=\?|==|!=)|(?P<G21>&&)|(?P<G22>\|\|)|(?P<G23>&)|(?P<G24>\^~|\^|~\^)|(?P<G25>\|)|(?P<G26>~&|~\||!|~)|(?P<G27>::)|(?P<G28>:)|(?P<G29>,)|(?P<G30>\$)|(?P<G31>\.\.)|(?P<G32>\.)|(?P<G33>=)|(?P<G34>\#)|(?P<G35>\{)|(?P<G36>\[)|(?P<G37>\()|(?P<G38>\})|(?P<G39>\])|(?P<G40>\))|(?P<G41>;)|(?P<G42>\*)|(?P<G43>\balways_comb\b)|(?P<G44>\balways_ff\b)|(?P<G45>\bassign\b)|(?P<G46>\basync_high\b)|(?P<G47>\basync_low\b)|(?P<G48>\bas\b)|(?P<G49>\bbit\b)|(?P<G50>\bcase\b)|(?P<G51>\bdefault\b)|(?P<G52>\belse\b)|(?P<G53>\benum\b)|(?P<G54>\bexport\b)|(?P<G55>\bf32\b)|(?P<G56>\bf64\b)|(?P<G57>\bfor\b)|(?P<G58>\bfunction\b)|(?P<G59>\bi32\b)|(?P<G60>\bi64\b)|(?P<G61>\bif_reset\b)|(?P<G62>\bif\b)|(?P<G63>\bimport\b)|(?P<G64>\binout\b)|(?P<G65>\binput\b)|(?P<G66>\binst\b)|(?P<G67>\binterface\b)|(?P<G68>\bin\b)|(?P<G69>\blocalparam\b)|(?P<G70>\blogic\b)|(?P<G71>\bmodport\b)|(?P<G72>\bmodule\b)|(?P<G73>\bnegedge\b)|(?P<G74>\boutput\b)|(?P<G75>\bpackage\b)|(?P<G76>\bparameter\b)|(?P<G77>\bposedge\b)|(?P<G78>\bref\b)|(?P<G79>\brepeat\b)|(?P<G80>\breturn\b)|(?P<G81>\bstep\b)|(?P<G82>\bstruct\b)|(?P<G83>\bsync_high\b)|(?P<G84>\bsync_low\b)|(?P<G85>\btri\b)|(?P<G86>\bu32\b)|(?P<G87>\bu64\b)|(?P<G88>\bvar\b)|(?P<G89>[a-zA-Z_][0-9a-zA-Z_]*)|(?P<G90>.)

As you can see all these inner \b could be factored out in first place. Like this:

(?P<G1>\r\n|\r|\n)|(?P<G2>[\s--\r\n]+)|(?P<G5>(?:(?:(?://.*(?:\r\n|\r|\n|$))|(?:(?ms)/\u{2a}.*?\u{2a}/))\s*)+)|(?P<G6>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*[eE][+-]?[0-9]+(?:_[0-9]+)*)|(?P<G7>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*)|(?P<G8>[0-9]+(?:_[0-9]+)*'[bodh][0-9a-fA-FxzXZ]+(?:_[0-9a-fA-FxzXZ]+)*)|(?P<G9>[0-9]+(?:_[0-9]+)*)|(?P<G10>'[01xzXZ])|(?P<G11>\-:)|(?P<G12>\->)|(?P<G13>\+:)|(?P<G14>\+=|-=|\*=|/=|%=|&=|\|=|\^=|<<=|>>=|<<<=|>>>=)|(?P<G15>\*\*)|(?P<G16>/|%)|(?P<G17>\+|-)|(?P<G18><<<|>>>|<<|>>)|(?P<G19><=|>=|<|>)|(?P<G20>===|==\?|!==|!=\?|==|!=)|(?P<G21>&&)|(?P<G22>\|\|)|(?P<G23>&)|(?P<G24>\^~|\^|~\^)|(?P<G25>\|)|(?P<G26>~&|~\||!|~)|(?P<G27>::)|(?P<G28>:)|(?P<G29>,)|(?P<G30>\$)|(?P<G31>\.\.)|(?P<G32>\.)|(?P<G33>=)|(?P<G34>\#)|(?P<G35>\{)|(?P<G36>\[)|(?P<G37>\()|(?P<G38>\})|(?P<G39>\])|(?P<G40>\))|(?P<G41>;)|(?P<G42>\*)|(\b(?P<G43>always_comb)|(?P<G44>always_ff)|(?P<G45>assign)|(?P<G46>async_high)|(?P<G47>async_low)|(?P<G48>as)|(?P<G49>bit)|(?P<G50>case)|(?P<G51>default)|(?P<G52>else)|(?P<G53>enum)|(?P<G54>export)|(?P<G55>f32)|(?P<G56>f64)|(?P<G57>for)|(?P<G58>function)|(?P<G59>i32)|(?P<G60>i64)|(?P<G61>if_reset)|(?P<G62>if)|(?P<G63>import)|(?P<G64>inout)|(?P<G65>input)|(?P<G66>inst)|(?P<G67>interface)|(?P<G68>in)|(?P<G69>localparam)|(?P<G70>logic)|(?P<G71>modport)|(?P<G72>module)|(?P<G73>negedge)|(?P<G74>output)|(?P<G75>package)|(?P<G76>parameter)|(?P<G77>posedge)|(?P<G78>ref)|(?P<G79>repeat)|(?P<G80>return)|(?P<G81>step)|(?P<G82>struct)|(?P<G83>sync_high)|(?P<G84>sync_low)|(?P<G85>tri)|(?P<G86>u32)|(?P<G87>u64)|(?P<G88>var)\b)|(?P<G89>[a-zA-Z_][0-9a-zA-Z_]*)|(?P<G90>.)

(Capture groups G43-G88 are enclosed in a pair of parentheses with word boundaries inside)

I think I will do some measurements to be able to decide further steps.

@dalance
Copy link
Author

dalance commented Jan 17, 2023

According to my previous measurements, complexity of each regex is almost not affected to the performance.
For example, removing all \b from keyword, simplifing comment's regex.

The most affected event in my development was introducing operator precedence at veryl-lang/veryl@96943d4.

from

OperatorTerm: "[all operators concatenated by `|`]"

to

Operator11Term: "\*\*"
Operator10Term: "/|%"
...

Benchmark result is

from

     Running benches/benchmark.rs (target/release/deps/benchmark-9adb330bf6a525e1)
Benchmarking throughput/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.6s, or reduce sample count to 60.
throughput/parse        time:   [74.871 ms 74.962 ms 75.059 ms]
                        thrpt:  [122.64 KiB/s 122.80 KiB/s 122.95 KiB/s]
Found 11 outliers among 100 measurements (11.00%)
  10 (10.00%) high mild
  1 (1.00%) high severe

to

     Running benches/benchmark.rs (target/release/deps/benchmark-9adb330bf6a525e1)
Benchmarking throughput/parse: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, or reduce sample count to 50.
throughput/parse        time:   [86.840 ms 87.008 ms 87.193 ms]
                        thrpt:  [105.57 KiB/s 105.80 KiB/s 106.00 KiB/s]
                 change:
                        time:   [+15.800% +16.069% +16.354%] (p = 0.00 < 0.05)
                        thrpt:  [-14.055% -13.844% -13.644%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

At first, I wrote all operators as dependent term like below.
Afterwards, I found it degrades performance. So I changed it to regex concatenation.

PlusTerm: '+'
MinusTerm: '-'
...

Operator: PlusTerm | MinusTerm | ...

@jsinger67
Copy link
Owner

I added some benches to parol_runtime where I tokenize a file (generated by parol generate for the veryl grammar) that has a size of 58.525 Bytes.
I only measured the throughput of the regex and the thin wrapper used in parol_runtime.

The average time to tokenize the input text for the originally generated regex (tokenize_1) about 100ms.
58.525 Bytes/0.1s is about 585 kByte per second.

The average time to tokenize the input text with the manually modified regex that factors out the word boundaries (tokenize_1) about 115ms.
58.525 Bytes/0.115s is about 509 kByte per second.

The output of bench on my machine was:

Benchmarking tokenize_1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.5s, or reduce sample count to 40.
tokenize_1              time:   [98.493 ms 100.74 ms 103.11 ms]
                        change: [-14.168% -10.130% -5.8496%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking tokenize_2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.8s, or reduce sample count to 40.
tokenize_2              time:   [112.57 ms 115.74 ms 119.09 ms]
                        change: [-11.023% -7.4467% -3.7361%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

My conclusion is that the regex with factored out word boundaries doesn't improve the performance. It is quite the contrary, the performance is about 15% worse.

Generally I consider this performance as not too bad, although I don't have a performance comparison at hand.

In all these measurements no parsing is taken into account. This might be a subject of further performance analyses.

@jsinger67
Copy link
Owner

jsinger67 commented Jan 22, 2023

I'm thinking about using logos as an alternative scanner framework.
I have to evaluate a lot to ensure such functionalities like

Edit (Jan. 29th 2023):

  • regex compatibility
    There are some restrictions like
    • No support for
      • anchors, like word boundaries, ^, $
      • Non-greedy matching of + and *
      • No explicit repetition ranges {m, n}?
    • Further investigation is necessary to evaluate the impact
  • look-ahead support
    • Works together with TokenStream
  • scanner switching
    • Should work, needs to be evaluated
  • the actual performance gain
    • Is huged (75ms vs. 0.3ms, about 250! times faster in a single example)

To be able to measure this all at least a

  • re-implementation of parol_runtime is necessary.
    • Heavy impact on code generation in parol
    • A prototype of parol_runtime exists

I will investigate a bit more to be able to decide reasonably.

@jsinger67 jsinger67 self-assigned this Jan 22, 2023
@jsinger67
Copy link
Owner

Here is my decision with a request for comments.

I will basically keep the current implementation to maintain compatibility and provide a sort of comfort mode in wich all regex provided features can be used. This could be useful for those that don't need very high speed of input processing.

But I will introduce a new mode that could be enabled by

  • a builder configuration or
  • a new directive (e.g. ' %fast_mode' ) in par files

Maybe both ways or either of them.

This new mode (call it fast mode for now) will modify/extend the parol input grammar slightly basically around terminal symbols.
This mode will provide only a restricted set of regex features and may drop the support for %block_comment directive, because this directive utilizes extended regex features like non-greedy matching.
The extend of the supported regex features and whether or not %block_comment directive is dropped depends on the precise scanner library that'll be used.

In this regard I haven't made a decision yet.
I consider two options

  • using logos as scanner framework
  • writing my own lexer generator

What's keeping me from jumping to logos right away is that a significant number of pull requests with useful features have been open for a while, so there could be a risk of this repository going dormant.

I also had a look at rflex.
It's approach is actually nice but it's updated two years ago and uses an own regex parser, i.e. it doesn't use regex-syntax for that. Thus possible extensions are regarded as difficult and error prone. But I didn't rule out rflex completely yet.

I would appreciate feedback to my decisions a lot. Let me know if this is a viable way or whether you have any further ideas or comments.

@dalance
Copy link
Author

dalance commented Feb 13, 2023

Thank you for your investigation.
I'm interesting in the performance gain because the performance of language server affects user experience.

I think the restriction about regex is acceptable.

As a future work, the combination of simplified scanner and full regex scanner may be better.
For example, at the first, tring the simplified scanner, tring the full scanner after failing it.

@BurntSushi
Copy link

If y'all are indeed getting bitten by rust-lang/regex#787, then can you use (?-u:\b) instead of \b? The former is not Unicode-aware.

@jsinger67
Copy link
Owner

Actually the word boundary doesn't seem to be the problem.
The regex generated by parol in @dalance's veryl crate actually consists of 90 top level alternations where a lot of them come from keywords of his language. @BurntSushi, do you think RegexSet could come to our rescue?

@BurntSushi
Copy link

Unlikely. If y'all can give me a program that just depends on regex and approximates the slowness y'all are observing, I could take a closer look.

@jsinger67
Copy link
Owner

Ok, thanks. I can prepare one and let you know.

@jsinger67
Copy link
Owner

@BurntSushi
You can find the example program here.
regex_perf_test
I measure a throughput of about 465 KiB/s with the release version.
Thanks in advance for your effort.

@BurntSushi
Copy link

Thanks, I'm indeed able to reproduce it. The main problem is that in the current version of the regex crate, this regex is big enough that it forces the regex impl into using one of the slower internal engines (specifically, the PikeVM). I am working on a big internal rewrite of the regex crate, and that particular issue is fixed. However, resolving the capture groups is still pretty slow. Throughput does still improve by about an order of magnitude to ~5 MB/s.

With that said, you can make this go quite a bit faster with some lower level APIs. While regex-automata 0.2 isn't fully baked yet, it can handle this use case if you're okay with spending a few milliseconds compiling a full DFA for this. The key insight here is that your regex, while big, has a simplistic structure: it's just a big alternation and capture groups are presumably being used to determine which branch in that alternation matched. Your instincts about using RegexSet for this are correct, but the RegexSet API is crippled. With the regex-automata crate, it supports a much richer "regex set" API where you can iterate over matches, and each match includes not only the position but which pattern matched. So using this API, you don't need capture groups at all. You just need to split each branch out into its own pattern. That's what I did here:

use std::time::Instant;

use regex_automata::{dfa::regex::Regex, SyntaxConfig};

/// A input file taken from `veryl`. Its size is 157.200 Bytes.
const LEXER_INPUT: &str = include_str!("./input.vl");

const PATTERNS: &[&str] = &[
    r"(?P<G1>\r\n|\r|\n)",
    r"(?P<G2>[\s--\r\n]+)",
    r"(?P<G5>(?:(?:(?://.*(?:\r\n|\r|\n|$))|(?:(?ms)/\u{2a}.*?\u{2a}/))\s*)+)",
    r"(?P<G6>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*[eE][+-]?[0-9]+(?:_[0-9]+)*)",
    r"(?P<G7>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*)",
    r"(?P<G8>[0-9]+(?:_[0-9]+)*'[bodh][0-9a-fA-FxzXZ]+(?:_[0-9a-fA-FxzXZ]+)*)",
    r"(?P<G9>[0-9]+(?:_[0-9]+)*)",
    r"(?P<G10>'[01xzXZ])",
    r"(?P<G11>\-:)",
    r"(?P<G12>\->)",
    r"(?P<G13>\+:)",
    r"(?P<G14>\+=|-=|\*=|/=|%=|&=|\|=|\^=|<<=|>>=|<<<=|>>>=)",
    r"(?P<G15>\*\*)",
    r"(?P<G16>/|%)",
    r"(?P<G17>\+|-)",
    r"(?P<G18><<<|>>>|<<|>>)",
    r"(?P<G19><=|>=|<|>)",
    r"(?P<G20>===|==\?|!==|!=\?|==|!=)",
    r"(?P<G21>&&)",
    r"(?P<G22>\|\|)",
    r"(?P<G23>&)",
    r"(?P<G24>\^~|\^|~\^)",
    r"(?P<G25>\|)",
    r"(?P<G26>~&|~\||!|~)",
    r"(?P<G27>::)",
    r"(?P<G28>:)",
    r"(?P<G29>,)",
    r"(?P<G30>\$)",
    r"(?P<G31>\.\.)",
    r"(?P<G32>\.)",
    r"(?P<G33>=)",
    r"(?P<G34>\#)",
    r"(?P<G35>\{)",
    r"(?P<G36>\[)",
    r"(?P<G37>\()",
    r"(?P<G38>\})",
    r"(?P<G39>\])",
    r"(?P<G40>\))",
    r"(?P<G41>;)",
    r"(?P<G42>\*)",
    r"(?P<G43>\balways_comb\b)",
    r"(?P<G44>\balways_ff\b)",
    r"(?P<G45>\bassign\b)",
    r"(?P<G46>\basync_high\b)",
    r"(?P<G47>\basync_low\b)",
    r"(?P<G48>\bas\b)",
    r"(?P<G49>\bbit\b)",
    r"(?P<G50>\bcase\b)",
    r"(?P<G51>\bdefault\b)",
    r"(?P<G52>\belse\b)",
    r"(?P<G53>\benum\b)",
    r"(?P<G54>\bexport\b)",
    r"(?P<G55>\bf32\b)",
    r"(?P<G56>\bf64\b)",
    r"(?P<G57>\bfor\b)",
    r"(?P<G58>\bfunction\b)",
    r"(?P<G59>\bi32\b)",
    r"(?P<G60>\bi64\b)",
    r"(?P<G61>\bif_reset\b)",
    r"(?P<G62>\bif\b)",
    r"(?P<G63>\bimport\b)",
    r"(?P<G64>\binout\b)",
    r"(?P<G65>\binput\b)",
    r"(?P<G66>\binst\b)",
    r"(?P<G67>\binterface\b)",
    r"(?P<G68>\bin\b)",
    r"(?P<G69>\blocalparam\b)",
    r"(?P<G70>\blogic\b)",
    r"(?P<G71>\bmodport\b)",
    r"(?P<G72>\bmodule\b)",
    r"(?P<G73>\bnegedge\b)",
    r"(?P<G74>\boutput\b)",
    r"(?P<G75>\bpackage\b)",
    r"(?P<G76>\bparameter\b)",
    r"(?P<G77>\bposedge\b)",
    r"(?P<G78>\bref\b)",
    r"(?P<G79>\brepeat\b)",
    r"(?P<G80>\breturn\b)",
    r"(?P<G81>\bstep\b)",
    r"(?P<G82>\bstruct\b)",
    r"(?P<G83>\bsync_high\b)",
    r"(?P<G84>\bsync_low\b)",
    r"(?P<G85>\btri\b)",
    r"(?P<G86>\bu32\b)",
    r"(?P<G87>\bu64\b)",
    r"(?P<G88>\bvar\b)",
    r"(?P<G89>[a-zA-Z_][0-9a-zA-Z_]*)",
    r"(?P<G90>.)",
];

fn main() {
    let regex = Regex::builder()
        .syntax(SyntaxConfig::new().unicode(false).utf8(false))
        .build_many(PATTERNS)
        .unwrap();
    let find_iter = regex.find_leftmost_iter(LEXER_INPUT.as_bytes());
    let mut token_count = 0usize;
    let now = Instant::now();
    for _ in find_iter {
        token_count += 1;
    }
    let elapsed_time = now.elapsed();
    println!(
        "Tokenizating {} tokens took {} milliseconds.",
        token_count,
        elapsed_time.as_millis()
    );
}

And now:

$ ./target/release/regex_perf_test_orig
Tokenizating 62400 tokens took 601 milliseconds.
$ ./target/release/regex_perf_test     
Tokenizating 62400 tokens took 1 milliseconds.

I've also added this regex to my benchmark suite (which I will publish in the next few months):

[[bench]]
model = "count-captures"
name = "ascii"
regex = { path = "wild/parol-veryl.txt", per-line = "alternate" }
haystack = { path = "wild/parol-veryl.vl" }
count = 124_800
engines = [
  'rust/regex/meta',
  'rust/regexold',
]

[[bench]]
model = "count-captures"
name = "unicode"
regex = { path = "wild/parol-veryl.txt", per-line = "alternate" }
unicode = true
haystack = { path = "wild/parol-veryl.vl" }
count = 124_800
engines = [
  'rust/regex/meta',
  'rust/regexold',
]

[[bench]]
model = "count"
name = "multi-patternid-ascii"
regex = { path = "wild/parol-veryl.txt", per-line = "pattern" }
haystack = { path = "wild/parol-veryl.vl" }
count = 62_400
engines = [
  'rust/regex/meta',
  'rust/regex/dense',
]

[[bench]]
model = "count-captures"
name = "multi-captures-ascii"
regex = { path = "wild/parol-veryl.txt", per-line = "pattern" }
haystack = { path = "wild/parol-veryl.vl" }
count = 124_800
engines = [
  'rust/regex/meta',
]

And results are:

benchmark                               rust/regex/dense   rust/regex/meta    rust/regexold
---------                               ----------------   ---------------    -------------
wild/parol-veryl/ascii                  -                  5.3 MB/s (1.00x)   241.1 KB/s (22.70x)
wild/parol-veryl/multi-captures-ascii   -                  21.5 MB/s (1.00x)  -
wild/parol-veryl/multi-patternid-ascii  90.9 MB/s (1.00x)  63.0 MB/s (1.44x)  -
wild/parol-veryl/unicode                -                  4.7 MB/s (1.00x)   239.3 KB/s (20.02x)

Where rust/regexold is the status quo and rust/regex/meta is what will replace the regex crate. So if you do nothing, you'll eventually get a nice perf boost here. If you switch to the lower level meta engine with the better regex set API, you'll get to that ~63 MB/s thoughput range without slower regex compilation times. But the fastest is building the DFA ahead of time.

I'll also note that the throughput is somewhat limited here because of the number of matches. Since the match count is so high, a lot of time is spent just churning through regex engine calls. So in theory, you could probably hand-roll something that is faster than this.

Apologies for the somewhat scattershot status quo. Things will be better once rust-lang/regex#656 is done. But thank you for the great test case!

@jsinger67
Copy link
Owner

These are actually good news! 👍
First I tend to use regex_automata::RegexSet in parol.
Your guess that I used the capture groups to determine the matched token type is right.
And using the RegexSet will render them superfluous in the end which will contribute to the overall performance probably too.
Also the effort to change the regex value generation to use RegexSet is minimal.
Again, thanks a lot!

@jsinger67
Copy link
Owner

@dalance
I'm currently testing the changes suggested by @BurntSushi, especially with your grammar in mind, but also with regard to compatibility.
My overall impression is really good and the performance got a huge gain in comparison with the former implementation.
All you have to do is to switch the word boundaries in your grammar to ASCII word boundaries by changing the plain \b to (?-u:\b). That's all. What I can't exactly assess is the impact on your language because you are one of the users that rely on Unicode support a lot, I think. Maybe you can comment on this.
I'll keep you informed about the progress.

@dalance
Copy link
Author

dalance commented Feb 16, 2023

In my language, ASCII word boundaries are acceptable because unicode is not allowed as keyword and identifier.
All non-ASCII unicode is consumed by the regex of comment or string literal.
I'm looking forward to the progress.

@jsinger67
Copy link
Owner

@dalance
I just now released new versions of parol (0.17.0) and parol_runtime (0.13.0) to crates.io.
Please feel free to test these new versions regarding tokenization performance.
If it is acceptable I would like to close this issue.

@dalance
Copy link
Author

dalance commented Feb 17, 2023

Thanks!
I got about x5 performance gain ( 67KiB/s -> 372KiB/s ).
In the flamegraph, call_semantic_action_for_production_number is dominant.
I think this tend is proper as parser.

The detailed information is below

Click to expand
  • parol 0.16.0
throughput/parse        time:   [219.88 ms 220.19 ms 220.53 ms]
                        thrpt:  [67.270 KiB/s 67.375 KiB/s 67.469 KiB/s]
                 change:
                        time:   [-2.1403% -1.9295% -1.7294%] (p = 0.00 < 0.05)
                        thrpt:  [+1.7599% +1.9675% +2.1871%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  11 (11.00%) high mild
  1 (1.00%) high severe

flamegraph

  • parol 0.17.0
throughput/parse        time:   [39.733 ms 39.790 ms 39.853 ms]
                        thrpt:  [372.24 KiB/s 372.83 KiB/s 373.37 KiB/s]
                 change:
                        time:   [-81.968% -81.929% -81.889%] (p = 0.00 < 0.05)
                        thrpt:  [+452.15% +453.37% +454.58%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

flamegraph

@dalance
Copy link
Author

dalance commented Feb 17, 2023

The above result contains Regex::new in an inner loop.
I removed it, and got 454KiB/s.
The current dominant seems to be move_node of id_tree.
I think it is reasonable.

flamegraph

@jsinger67
Copy link
Owner

Thank you for your thorough analyses.
If you don't need the full parse tree in the end, you can enable the feature trim_parse_tree of the crate parol_runtime. This should tweak the parsing performance a bit.
Nevertheless I will have a look at the general parser performance. The use of id_tree is one candidate.
Feel free to open another issue about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

3 participants