Parsing performance #56
@dalance As you can see, all these inner capture groups (G43 to G88) are enclosed in a pair of parentheses with word boundaries inside. I think I will do some measurements to be able to decide further steps.
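For illustration, here is a minimal sketch (not from the thread) of the two regex shapes under discussion: one `\b…\b` pair inside each keyword group, as in G43 to G88, versus a single boundary pair factored around the whole keyword alternation. The two-keyword alternation is abbreviated; the real scanner has 46 such groups.

```rust
use regex::Regex;

fn main() {
    // Generated shape: one word-boundary pair inside each keyword group.
    let per_keyword =
        Regex::new(r"(?P<G43>\balways_comb\b)|(?P<G44>\balways_ff\b)").unwrap();
    // Factored shape: a single boundary pair around the whole alternation.
    let factored = Regex::new(r"\b(?:always_comb|always_ff)\b").unwrap();

    assert!(per_keyword.is_match("always_ff @ (posedge)"));
    assert!(factored.is_match("always_ff @ (posedge)"));
}
```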
According to my previous measurements, the complexity of each individual regex has almost no effect on the performance. The change that affected performance most in my development was introducing operator precedence at veryl-lang/veryl@96943d4, which changed the grammar from … to …. The benchmark result went from … to …. At first, I wrote all operators as dependent terms, like below: …
I added some benches. The average time to tokenize the input text with the originally generated regex (tokenize_1) is about 100 ms. The average time to tokenize the input text with the manually modified regex that factors out the word boundaries (tokenize_2) is about 115 ms. The output of the bench on my machine was: …

My conclusion is that the regex with factored-out word boundaries doesn't improve the performance. Quite the contrary: it is about 15% worse. Generally, I consider this performance not too bad, although I don't have a performance comparison at hand. None of these measurements take parsing into account; that might be a subject of further performance analyses.
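For reference, a minimal sketch (not from the thread) of what such a bench could look like with criterion. The `tokenize` stand-in, the bench name, and the `input.vl` path are placeholders, and the file would live under `benches/` with `harness = false` in Cargo.toml:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical stand-in for the real tokenizer under test.
fn tokenize(input: &str) -> usize {
    input.split_whitespace().count()
}

fn bench_tokenize(c: &mut Criterion) {
    // Assumed path to the veryl source file used as lexer input.
    let input = include_str!("./input.vl");
    c.bench_function("tokenize_1", |b| {
        b.iter(|| tokenize(black_box(input)))
    });
}

criterion_group!(benches, bench_tokenize);
criterion_main!(benches);
```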
I'm thinking about using logos as an alternative scanner framework.

Edit (Jan. 29th, 2023): To be able to measure this, at least a prototype is needed. I will investigate a bit more to be able to decide reasonably.
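For context, a minimal sketch of how a few of the tokens above could be expressed in logos (assuming a logos 0.13-style API; the token names and the rule subset are illustrative only, not an actual parol or veryl design):

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\r\n]+")] // skip whitespace between tokens
enum Token {
    // Keywords are plain string tokens; logos compiles them into a trie,
    // so no word-boundary regex is needed. Literal tokens take priority
    // over the identifier regex, so keywords are not swallowed by it.
    #[token("always_comb")]
    AlwaysComb,
    #[token("module")]
    Module,
    #[regex(r"[a-zA-Z_][0-9a-zA-Z_]*")]
    Identifier,
    #[regex(r"[0-9]+(_[0-9]+)*")]
    Integer,
}

fn main() {
    let mut lex = Token::lexer("module foo always_comb 42");
    while let Some(tok) = lex.next() {
        // `tok` is Ok(Token) for a match, or Err(()) for unrecognized input.
        println!("{:?} -> {:?}", tok, lex.slice());
    }
}
```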
Here is my decision, with a request for comments. I will basically keep the current implementation to maintain compatibility and provide a sort of comfort mode in which all regex-provided features can be used. This could be useful for those who don't need very high speed of input processing. But I will introduce a new mode that could be enabled by … maybe both ways, or either of them. This new mode (call it fast mode for now) will modify/extend the … In this regard I haven't made a decision yet.

What's keeping me from jumping to logos is … I also had a look at rflex. I would appreciate feedback on my decisions a lot. Let me know if this is a viable way or whether you have any further ideas or comments.
Thank you for your investigation. I think the restriction on regex is acceptable. As future work, a combination of the simplified scanner and the full regex scanner may be better.
If y'all are indeed getting bitten by rust-lang/regex#787, then can you use `(?-u:\b)`, the ASCII-only word boundary, instead?
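Concretely (a minimal sketch, not from the thread), the suggestion amounts to rewriting each keyword pattern like this:

```rust
use regex::Regex;

fn main() {
    // \b is Unicode-aware by default, which is what rust-lang/regex#787 is
    // about: it can force the regex crate onto its slower internal engines.
    let unicode_wb = Regex::new(r"\balways_comb\b").unwrap();
    // (?-u:\b) is the ASCII-only word boundary and avoids that fallback.
    let ascii_wb = Regex::new(r"(?-u:\b)always_comb(?-u:\b)").unwrap();

    assert!(unicode_wb.is_match("always_comb begin"));
    assert!(ascii_wb.is_match("always_comb begin"));
}
```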
Actually, the word boundary doesn't seem to be the problem.
Unlikely. If y'all can give me a program that just depends on `regex` and reproduces the problem, then I can investigate.
Ok, thanks. I can prepare one and let you know.
@BurntSushi |
Thanks, I'm indeed able to reproduce it. The main problem is that in the current version of the regex crate, this regex is big enough that it forces the regex impl into using one of the slower internal engines (specifically, the PikeVM). I am working on a big internal rewrite of the regex crate, and that particular issue is fixed there. However, resolving the capture groups is still pretty slow; throughput does still improve by about an order of magnitude, to ~5 MB/s. With that said, you can make this go quite a bit faster with some lower level APIs:

```rust
use std::time::Instant;
use regex_automata::{dfa::regex::Regex, SyntaxConfig};
/// An input file taken from `veryl`. Its size is 157,200 bytes.
const LEXER_INPUT: &str = include_str!("./input.vl");
const PATTERNS: &[&str] = &[
r"(?P<G1>\r\n|\r|\n)",
r"(?P<G2>[\s--\r\n]+)",
r"(?P<G5>(?:(?:(?://.*(?:\r\n|\r|\n|$))|(?:(?ms)/\u{2a}.*?\u{2a}/))\s*)+)",
r"(?P<G6>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*[eE][+-]?[0-9]+(?:_[0-9]+)*)",
r"(?P<G7>[0-9]+(?:_[0-9]+)*\.[0-9]+(?:_[0-9]+)*)",
r"(?P<G8>[0-9]+(?:_[0-9]+)*'[bodh][0-9a-fA-FxzXZ]+(?:_[0-9a-fA-FxzXZ]+)*)",
r"(?P<G9>[0-9]+(?:_[0-9]+)*)",
r"(?P<G10>'[01xzXZ])",
r"(?P<G11>\-:)",
r"(?P<G12>\->)",
r"(?P<G13>\+:)",
r"(?P<G14>\+=|-=|\*=|/=|%=|&=|\|=|\^=|<<=|>>=|<<<=|>>>=)",
r"(?P<G15>\*\*)",
r"(?P<G16>/|%)",
r"(?P<G17>\+|-)",
r"(?P<G18><<<|>>>|<<|>>)",
r"(?P<G19><=|>=|<|>)",
r"(?P<G20>===|==\?|!==|!=\?|==|!=)",
r"(?P<G21>&&)",
r"(?P<G22>\|\|)",
r"(?P<G23>&)",
r"(?P<G24>\^~|\^|~\^)",
r"(?P<G25>\|)",
r"(?P<G26>~&|~\||!|~)",
r"(?P<G27>::)",
r"(?P<G28>:)",
r"(?P<G29>,)",
r"(?P<G30>\$)",
r"(?P<G31>\.\.)",
r"(?P<G32>\.)",
r"(?P<G33>=)",
r"(?P<G34>\#)",
r"(?P<G35>\{)",
r"(?P<G36>\[)",
r"(?P<G37>\()",
r"(?P<G38>\})",
r"(?P<G39>\])",
r"(?P<G40>\))",
r"(?P<G41>;)",
r"(?P<G42>\*)",
r"(?P<G43>\balways_comb\b)",
r"(?P<G44>\balways_ff\b)",
r"(?P<G45>\bassign\b)",
r"(?P<G46>\basync_high\b)",
r"(?P<G47>\basync_low\b)",
r"(?P<G48>\bas\b)",
r"(?P<G49>\bbit\b)",
r"(?P<G50>\bcase\b)",
r"(?P<G51>\bdefault\b)",
r"(?P<G52>\belse\b)",
r"(?P<G53>\benum\b)",
r"(?P<G54>\bexport\b)",
r"(?P<G55>\bf32\b)",
r"(?P<G56>\bf64\b)",
r"(?P<G57>\bfor\b)",
r"(?P<G58>\bfunction\b)",
r"(?P<G59>\bi32\b)",
r"(?P<G60>\bi64\b)",
r"(?P<G61>\bif_reset\b)",
r"(?P<G62>\bif\b)",
r"(?P<G63>\bimport\b)",
r"(?P<G64>\binout\b)",
r"(?P<G65>\binput\b)",
r"(?P<G66>\binst\b)",
r"(?P<G67>\binterface\b)",
r"(?P<G68>\bin\b)",
r"(?P<G69>\blocalparam\b)",
r"(?P<G70>\blogic\b)",
r"(?P<G71>\bmodport\b)",
r"(?P<G72>\bmodule\b)",
r"(?P<G73>\bnegedge\b)",
r"(?P<G74>\boutput\b)",
r"(?P<G75>\bpackage\b)",
r"(?P<G76>\bparameter\b)",
r"(?P<G77>\bposedge\b)",
r"(?P<G78>\bref\b)",
r"(?P<G79>\brepeat\b)",
r"(?P<G80>\breturn\b)",
r"(?P<G81>\bstep\b)",
r"(?P<G82>\bstruct\b)",
r"(?P<G83>\bsync_high\b)",
r"(?P<G84>\bsync_low\b)",
r"(?P<G85>\btri\b)",
r"(?P<G86>\bu32\b)",
r"(?P<G87>\bu64\b)",
r"(?P<G88>\bvar\b)",
r"(?P<G89>[a-zA-Z_][0-9a-zA-Z_]*)",
r"(?P<G90>.)",
];
fn main() {
    // Compile all patterns into one multi-pattern DFA-backed regex. Unicode
    // and UTF-8 modes are disabled, so the word boundaries are ASCII-only.
    let regex = Regex::builder()
        .syntax(SyntaxConfig::new().unicode(false).utf8(false))
        .build_many(PATTERNS)
        .unwrap();
    // Iterate over leftmost matches; no capture groups are resolved here.
    let find_iter = regex.find_leftmost_iter(LEXER_INPUT.as_bytes());
    let mut token_count = 0usize;
    let now = Instant::now();
    for _ in find_iter {
        token_count += 1;
    }
    let elapsed_time = now.elapsed();
    println!(
        "Tokenizing {} tokens took {} milliseconds.",
        token_count,
        elapsed_time.as_millis()
    );
}
```

And now:
I've also added this regex to my benchmark suite (which I will publish in the next few months):

And the results are:

I'll also note that the throughput is somewhat limited here because of the number of matches. Since the match count is so high, a lot of time is spent just churning through regex engine calls. So in theory, you could probably hand-roll something that is faster than this; a sketch of that idea follows below. Apologies for the somewhat scattershot status quo. Things will be better once rust-lang/regex#656 is done. But thank you for the great test case!
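To illustrate the hand-rolling idea (a sketch of the general technique, not anything proposed in the thread): scan identifier characters with a plain byte loop, then classify the word against a keyword table, so no regex engine ever runs on the hot keyword/identifier path:

```rust
// Scan a maximal run of ASCII identifier characters starting at `start`,
// returning the index one past the end of the run.
fn scan_word(input: &[u8], start: usize) -> usize {
    let mut i = start;
    while i < input.len() && (input[i].is_ascii_alphanumeric() || input[i] == b'_') {
        i += 1;
    }
    i
}

fn main() {
    // A few of the keywords from the pattern list above; a real lexer would
    // use a perfect hash or a match over lengths instead of a linear scan.
    const KEYWORDS: &[&str] = &["always_comb", "always_ff", "module", "var"];

    let input = b"module foo";
    let end = scan_word(input, 0);
    let word = std::str::from_utf8(&input[..end]).unwrap();
    if KEYWORDS.contains(&word) {
        println!("keyword: {word}");
    } else {
        println!("identifier: {word}");
    }
}
```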
This is actually good news! 👍
@dalance |
In my language, ASCII word boundaries are acceptable, because Unicode is not allowed in keywords and identifiers.
@dalance |
Thanks! The detailed information is below: …
Thank you for your thorough analyses. |
At veryl, parsing performance is about 64 KiB/s now:

In my experience, the performance is affected by the number of terminals. For example, the performance degrades as reserved keywords are added. The flamegraph is below; it shows that almost all of the time is consumed in regex::pikevm in parol's lexer. This issue may be the same as rust-lang/regex#787. BurntSushi seems to be implementing a new regex engine to improve it. If my prediction is correct, improving parol's performance will require either removing regex from the lexer or waiting for the new regex engine.