Slow Regex HTML_BLOCK_ELEMENT_R because of issue with self closing tags #546

Closed
devbrains-com opened this issue Feb 26, 2024 · 5 comments · Fixed by #570

Comments


devbrains-com commented Feb 26, 2024

We found that the following regex is very slow: it takes up to 50 ms when there is a single self-closing tag on the page.

const HTML_BLOCK_ELEMENT_R = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i

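The cause seems to be the non-working check for self-closing tags, \/{0}. That quantifier matches exactly zero slashes, i.e. the empty string, so it never rejects a tag that ends in />.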
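To illustrate why \/{0} cannot work as a self-closing check, here is a minimal sketch. The two patterns are simplified opening-tag fragments cut from the regex above (an assumption made for illustration, not the library's code); they differ only in the \/{0} part.

```javascript
// Simplified opening-tag fragments (hypothetical, trimmed from the regex above).
const withZeroQuantifier = /^<([a-z][^ >/]*) ?([^>]*)\/{0}>/i;
const withoutQuantifier = /^<([a-z][^ >/]*) ?([^>]*)>/i;

const openingTags = ['<div class="a">', '<img src="x.png" />', '<input disabled/>'];

// \/{0} matches "exactly zero slashes", i.e. the empty string, so both
// patterns accept the same strings and capture the same groups.
const identical = openingTags.every(
  (tag) => String(withZeroQuantifier.exec(tag)) === String(withoutQuantifier.exec(tag))
);

console.log(identical); // true

// The trailing slash is simply swallowed by the attribute group.
console.log(withZeroQuantifier.exec('<img src="x.png" />')[2]); // src="x.png" /
```

Because the attribute group is free to consume the trailing slash, a self-closing tag with attributes still passes as an opening tag, and the rest of the regex then goes looking for a closing tag that does not exist.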

The final regex would be:

const HTML_BLOCK_ELEMENT_R = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?((?:[^>]*[^/])?)>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i
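The difference between the two attribute groups can be checked in isolation. This is a sketch under the assumption that only the opening-tag part matters here; compareOpenTag and both pattern names are hypothetical helpers, not part of the library.

```javascript
// Opening-tag fragments only (hypothetical, cut from the two regexes above).
const originalOpenTag = /^<([a-z][^ >/]*) ?([^>]*)>/i;
const proposedOpenTag = /^<([a-z][^ >/]*) ?((?:[^>]*[^/])?)>/i;

// Returns the captured attribute string for each pattern, or null on no match.
function compareOpenTag(tag) {
  const original = originalOpenTag.exec(tag);
  const proposed = proposedOpenTag.exec(tag);
  return {
    original: original ? original[2] : null,
    proposed: proposed ? proposed[2] : null,
  };
}

console.log(compareOpenTag('<div class="a">'));
// { original: 'class="a"', proposed: 'class="a"' }

console.log(compareOpenTag('<div>'));
// { original: '', proposed: '' }

console.log(compareOpenTag('<img src="x.png" />'));
// { original: 'src="x.png" /', proposed: null }
```

The proposed group requires the attribute text to end in something other than a slash, so a self-closing tag with attributes fails at the opening tag. Self-closing tags without attributes, such as <br />, are already rejected by the negative lookahead at the start of the full regex.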

Thank you very much

@quantizor
Owner

Could you check again with the latest code? There was a sorting issue in the rules that might have contributed to this problem.

I did perf-test this particular change and got inconclusive results: https://jsperf.app/joribi/1/preview

@devbrains-com
Author

Thank you very much for looking into it. It seems like the sorting fixes our main concern.

The regex's performance improvement was only visible in large examples with a lot of text after the self-closing element. I tested again, and I couldn't see any performance difference now.

devbrains-com closed this as not planned Mar 13, 2024

Goues commented Mar 15, 2024

Hey, here is a repro https://regex101.com/r/ac4mJP/1

Apparently, self-closing tags cause a runaway regex that eventually times out if there is enough content. The fix proposed here does resolve the issue. Would you care to reopen it?
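For checking this locally rather than on regex101, a rough timing harness could look like the sketch below. The input is made up for illustration (it is not the linked repro), timeRegex is a hypothetical helper, and the two regexes are copied from the first comment. No particular timing is implied: results depend on the engine and on how much content, and how many further tags, follow the self-closing element.

```javascript
// Regexes copied from the first comment in this thread.
const ORIGINAL_R = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i;
const PROPOSED_R = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?((?:[^>]*[^/])?)>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i;

// Made-up input: one self-closing tag with attributes, then plenty of text.
const input = '<img src="a.png" />\n\n' + 'Some *markdown* paragraph.\n\n'.repeat(500);

// Runs the regex several times and reports the average wall-clock time.
function timeRegex(regex, text, runs = 20) {
  const start = performance.now();
  let matched = false;
  for (let i = 0; i < runs; i++) {
    matched = regex.test(text);
  }
  return { matched, msPerRun: (performance.now() - start) / runs };
}

console.log('original', timeRegex(ORIGINAL_R, input));
console.log('proposed', timeRegex(PROPOSED_R, input));
```

Neither regex matches this input. The original only finds that out after scanning the remaining text for a closing tag, while the proposed one fails at the opening tag. Increasing the repeat count, or adding more tags to the text, is one way to look for the slowdown described above.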

quantizor reopened this Mar 15, 2024
@quantizor
Owner

@Goues if you run the adjusted regex against the unit tests it bails too early, but it is a lot faster. Working on finding a happy medium.

Worth noting that the OP regex is not current (there's no \/{0} sequence anymore). The current block HTML regex is:

/^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1\b)[\s\S])*?)<\/\1>(?!<\/\1>)\n*/i

@quantizor
Owner

OK, I found a variation that works better:

/^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?((?:[^>]*[^/])?)>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1\b)[\s\S])*?)<\/\1>(?!<\/\1>)\n*/i
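A quick sanity check of this variation, as a sketch: parseBlock is a hypothetical wrapper written for illustration, not the library's API, and the sample inputs are made up.

```javascript
// The variation above, assigned to the constant name used earlier in the thread.
const HTML_BLOCK_ELEMENT_R = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?((?:[^>]*[^/])?)>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1\b)[\s\S])*?)<\/\1>(?!<\/\1>)\n*/i;

// Hypothetical helper: returns the captured parts, or null when nothing matches.
function parseBlock(source) {
  const match = HTML_BLOCK_ELEMENT_R.exec(source);
  return match ? { tag: match[1], attrs: match[2], inner: match[3] } : null;
}

// A regular block element still matches.
console.log(parseBlock('<div class="note">\nhello\n</div>\n'));
// { tag: 'div', attrs: 'class="note"', inner: 'hello\n' }

// A self-closing tag with attributes is rejected at the opening tag.
console.log(parseBlock('<img src="a.png" />\n\ntext\n')); // null

// A self-closing tag without attributes is rejected by the leading lookahead.
console.log(parseBlock('<br />\n')); // null
```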

Thanks all, will get this into v7

quantizor added a commit that referenced this issue Apr 11, 2024
Closes #546

Thank you @devbrains-com for contributing the basis of this fix!