-
Notifications
You must be signed in to change notification settings - Fork 676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework RegExp engine and add support for proper unicode matching #3746
Conversation
|
|
Thanks @lauriro, should be fixed now. |
one more
strange thing is, when |
Thanks, fixed as well. |
801c385
to
a8d39aa
Compare
4ea579f
to
caa0edf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The capturing group matching seems too complex to me. In a recursive implementation, it is enough to store 3 values on the stack (prev_start, prev_end, current_start), and do an update when the end is reached. I saw local array allocation and similar things in the code. Are they necessary?
*/ | ||
static JERRY_ATTR_NOINLINE const lit_utf8_byte_t * | ||
ecma_regexp_step_back (ecma_regexp_ctx_t *re_ctx_p, | ||
const lit_utf8_byte_t *str_p) /**< reference to string pointer */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this function is exactly used? Word boundary?
If there is a surrogate pair, and match starts from the position between the two characters, ecma_regexp_step_back
may go back the start of the surrogate pair instead of stopping at the middle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used in the greedy atom iterator during backtracking.
In unicode mode surrogate pairs are required to be handled as a single code point, matching can never start from the middle of a surrogate pair, and it is required to read the entire surrogate pair when backtracking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you are right, The engine overwrites the start offset for surrogate pairs. Is there a test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one now.
|
||
/* If zero iterations are allowed, then execute the end opcode which will handle further iterations, | ||
* otherwise run the 1st iteration immediately by executing group bytecode. */ | ||
if (qmin == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this likely? Can it be handled at compile time in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The likelihood would be roughly 50/50 in my opinion, it could be handled compile time, however that would increase code size, and performance impact would be negligible.
JERRY_TRACE_MSG ("fail\n"); | ||
return NULL; /* fail */ | ||
/* Save and clear all nested capturing groups, and try to iterate. */ | ||
JERRY_VLA (const lit_utf8_byte_t *, saved_captures_p, group_p->subcapture_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's required.
const uint32_t digit = (uint32_t) (*current_p++ - LIT_CHAR_0); | ||
uint32_t new_value = value * 10 + digit; | ||
|
||
if (JERRY_UNLIKELY (value > UINT32_MAX / 10) || JERRY_UNLIKELY (new_value < value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comes from the standard or an engine limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Engine limit, the value would overflow otherwise.
Technically 2 values would be enough since prev_end always equals to current_start, but only in very simple cases. When there are nested capturing groups in the current group, it's no longer enough to save only these values, and we need to save/clear all the nested captures at the start of each iteration. Take for example this regex: We currently get away with only saving the starting pointer of the captures by not clobbering the end pointer unnecessarily, so in my opinion this is as efficient as it can get, while still conforming to the standard. |
2b13455
to
ae1e96f
Compare
It looks like this is working differently from other engines, but this type of regexp has far less features so they can get away with it. If a regexp has 1000 captures inside an iterator, can it consume the stack easily? By the way, would it be enough to use a capture reset bitset? So instead of storing the previous values (two uint32_t values), just store one bit which says the value is currently undefined (even though the capture vector contains its previous value). This can be a represented by a flag, or a negative number in the capture vector. When the engine enter to a group, just save the flags, and flag all numbers in the ovector. When restore the previous iteration, reset the flags only, not the values. Even using a byte vector is a lot of save compared to 2*4 bytes. |
I don't think a reset bitset would be enough. Currently when there are no nested captures we are using less stack than the previous implementation. For every nested capture there is, we have to save one extra value per iteration, which is unfortunately a necessary evil. We could save these on the heap, but that would be horrible performance wise. If nested captures don't have quantifiers then stack usage is still reasonable, and if they do, that would be bad stack-wise regardless of whether we save captures or not. Also IMO nested captures aren't extremely common, there are usually one or two per regexp at most, so I don't consider them to be major concern. |
I think these ideas are beyond the PR's scope. The rework currently has significant stack/memory consumption improvement however the main reason was to add the proper unicode matching support for the engine so that's a double benefit. So I'd suggest to keep these ideas as the institution for opening a follow up optimization PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Nice job! I am sure I missed things because the patch is huge, and there are optimization oppportunities, but lets use this as a first step.
This change includes several bugfixes, general improvements, and support for additional features. - Added full support for web compatibility syntax defined in Annex B - Implemented parsing and matching patterns in unicode mode - Fixed capture results when iterating with nested capturing groups - Significantly reduced regexp bytecode size - Reduced stack usage during regexp execution - Improved matching performance JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai [email protected]
Just to mention for the credits: |
This change includes several bugfixes, general improvements, and support
for additional features.