Fixes regex backtracking bug detecting leading/trailing whitespace #427

Rohland · 2023-03-13T12:55:36Z

This attempts to fix performance issues with large blocks of text in <code> blocks.

Attempting to convert HTML to markdown on large code blocks would trigger 100% CPU. I believe it's a regex backtracking issue with the following function:

function edgeWhitespace (string) {
  var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/);
  return {
    leading: m[1], // whole string for whitespace-only strings
    leadingAscii: m[2],
    leadingNonAscii: m[3],
    trailing: m[4], // empty for whitespace-only strings
    trailingNonAscii: m[5],
    trailingAscii: m[6]
  }
}

This PR refactors this function to avoid the non-greedy search for content in the middle.

I added a test to repro the issue, and prove the fix.
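A rough way to observe the slowdown outside the test suite is to time the original function on inputs of growing size. The sketch below is not the test added in this PR: `makeInput` and `timeMs` are hypothetical helpers, and the input shape (a long run of spaces between two non-whitespace characters) is an assumption about what stresses the non-greedy middle, not something taken from the documents that failed.

```javascript
// Original implementation from the description above, reproduced for timing only.
function edgeWhitespace (string) {
  var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/)
  return {
    leading: m[1],
    leadingAscii: m[2],
    leadingNonAscii: m[3],
    trailing: m[4],
    trailingNonAscii: m[5],
    trailingAscii: m[6]
  }
}

// Hypothetical helper: a non-whitespace character, `n` spaces, another non-whitespace character.
function makeInput (n) {
  return 'x' + ' '.repeat(n) + 'x'
}

// Hypothetical helper: wall-clock time of a single call, in milliseconds.
function timeMs (fn, input) {
  var start = Date.now()
  fn(input)
  return Date.now() - start
}

// Sizes are deliberately small; raise them gradually to watch how the time grows.
var sizes = [100, 200, 400]
sizes.forEach(function (n) {
  console.log(n + ' spaces: ' + timeMs(edgeWhitespace, makeInput(n)) + ' ms')
})
```

Passing a candidate replacement for `edgeWhitespace` to `timeMs` with the same inputs gives a quick before/after comparison.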

pavelhoral · 2023-03-13T16:36:05Z

Ah, I love regexp performance :)... The main issue is the content selector. The regexp can be sped up simply by forcing the content match to start and end with a non-whitespace character:

- var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/)
+ var m = string.match(/^(([ \t\r\n]*)(\s*))(?:(?=\S)[\s\S]*\S)?((\s*?)([ \t\r\n]*))$/)
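To see what the lookahead variant captures, here is a small sketch that wraps the patched expression in the same function shape and runs it on a sample string. The sample input is made up for illustration; `\u00a0` (no-break space) stands in for non-ASCII whitespace.

```javascript
function edgeWhitespace (string) {
  // (?=\S)[\s\S]*\S : the content must start and end with a non-whitespace
  // character, so the greedy middle runs to the end of the string and backs
  // off to the last non-whitespace character.
  var m = string.match(/^(([ \t\r\n]*)(\s*))(?:(?=\S)[\s\S]*\S)?((\s*?)([ \t\r\n]*))$/)
  return {
    leading: m[1],
    leadingAscii: m[2],
    leadingNonAscii: m[3],
    trailing: m[4],
    trailingNonAscii: m[5],
    trailingAscii: m[6]
  }
}

var result = edgeWhitespace(' \n foo bar\u00a0 \t')

console.log(result.leading === ' \n ')             // true
console.log(result.trailing === '\u00a0 \t')       // true
console.log(result.trailingNonAscii === '\u00a0')  // true
console.log(result.trailingAscii === ' \t')        // true

// Whitespace-only input: everything ends up in `leading`, `trailing` stays empty.
console.log(edgeWhitespace(' \t').leading === ' \t') // true
console.log(edgeWhitespace(' \t').trailing === '')   // true
```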

It might also be a little more readable to do the matching in separate steps (similar to the code in the PR):

function edgeWhitespace (string) {
    var leadMatch = string.match(/^(([ \t\r\n]*)(\s*))/);
    var textMatch = string.substring(leadMatch[0].length).match(/((?=\S)[\s\S]*\S)?/); 
    var tailMatch = string.substring(leadMatch[0].length + textMatch[0].length).match(/((\s*?)([ \t\r\n]*))$/);
    return {
        leading: leadMatch[1], // whole string for whitespace-only strings
        leadingAscii: leadMatch[2],
        leadingNonAscii: leadMatch[3],
        trailing: tailMatch[1], // empty for whitespace-only strings
        trailingNonAscii: tailMatch[2],
        trailingAscii: tailMatch[3]
    };
}

Both these solutions seem to have the same performance as the code proposed by the PR.

martincizek · 2023-03-19T13:17:51Z

Thank you guys for your contribution. I've created issue #429 to reference the problem.

I've also created a simplified and hopefully more readable version:

function edgeWhitespace (string) {
  var leadingMatch = string.match(/^([ \t\r\n]*)(\s*)/)
  var tailMatch = string.match(/\S(([^\S \t\r\n]*)([ \t\r\n]*))$/) || ['', '', '', '']

  return {
    leading: leadingMatch[0], // whole string for whitespace-only strings
    leadingAscii: leadingMatch[1],
    leadingNonAscii: leadingMatch[2],
    trailing: tailMatch[1], // empty for whitespace-only strings
    trailingNonAscii: tailMatch[2],
    trailingAscii: tailMatch[3]
  }
}

Remarks:

The non-greedy capture group catching the trimmed content (the biggest problem) is avoided using two regexp matchings.
The non-greedy capture group catching the trailing non-ASCII whitespace is rewritten as a greedy expression [^\S \t\r\n]*. This solution makes two regexp matchings sufficient.
I've also tried adding a whitespace-only string test and creating a substring for the latter match, but it didn't have any measurable performance effect. So not adding these micro-optimisations for now.
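The character class `[^\S \t\r\n]` in the second remark is a double negative: it matches a character that is not non-whitespace (that is, whitespace) and is also not one of the four listed ASCII whitespace characters. A quick sketch to check that reading, with sample characters chosen for illustration:

```javascript
// "Whitespace, but not space, tab, CR or LF", anchored to test a single character.
var nonAsciiWhitespace = /^[^\S \t\r\n]$/

console.log(nonAsciiWhitespace.test('\u00a0')) // true: no-break space
console.log(nonAsciiWhitespace.test('\u2003')) // true: em space
console.log(nonAsciiWhitespace.test(' '))      // false: ASCII space is excluded
console.log(nonAsciiWhitespace.test('\n'))     // false: LF is excluded
console.log(nonAsciiWhitespace.test('a'))      // false: not whitespace at all
console.log(nonAsciiWhitespace.test('\f'))     // true: form feed is whitespace outside the listed four
```

The last line shows that "non-ASCII" is a loose label here: form feed and vertical tab are ASCII characters, but they fall on the same side of the split as the Unicode spaces.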

Feedback appreciated.

martincizek · 2023-03-20T11:45:59Z

> The non-greedy capture group catching the trailing non-ASCII whitespace is rewritten as a greedy expression [^\S \t\r\n]*. This solution makes two regexp matchings sufficient.

OK, this one has a bug. Will probably use a variant of @pavelhoral's single regexp version, as it will most likely perform better on typical inputs (the function is called for each node - paragraphs, headings, list items, ...).
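The comment does not say what the bug is. One input on which the two-match version and the original expression disagree is ASCII whitespace followed by non-ASCII whitespace at the end of the string; whether this is the bug meant here is an assumption. The sketch below compares only the trailing capture of each version (`trailingOriginal` and `trailingTwoMatch` are names made up for the comparison):

```javascript
// Trailing whitespace as captured by the original single expression.
function trailingOriginal (string) {
  var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/)
  return m[4]
}

// Trailing whitespace as captured by the two-match version.
function trailingTwoMatch (string) {
  var tailMatch = string.match(/\S(([^\S \t\r\n]*)([ \t\r\n]*))$/) || ['', '', '', '']
  return tailMatch[1]
}

var mixed = 'foo \u00a0'    // ASCII space, then a no-break space
var ordered = 'foo\u00a0 '  // no-break space, then ASCII space

console.log(trailingOriginal(mixed) === ' \u00a0')  // true
console.log(trailingTwoMatch(mixed) === '')         // true: the tail is missed entirely
console.log(trailingOriginal(ordered) === trailingTwoMatch(ordered)) // true
```

The greedy rewrite only matches when all non-ASCII whitespace precedes all ASCII whitespace, whereas the lazy `\s*?` in the original can also absorb ASCII whitespace that comes first.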

Rohland · 2023-03-22T06:50:16Z

Awesome :) One thing to consider is an optimisation for whitespace-only inputs. We probably want to skip the trailing regex when the leading regex matches and the match is the same length as the input string. I don't see any value in going on to evaluate the trailing regex in that case.
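A minimal sketch of that short-circuit, assuming the two-step structure from the earlier comments. This is not the code that was released; the tail expression here is one possible choice and relies on the remaining text starting with a non-whitespace character.

```javascript
function edgeWhitespace (string) {
  var leadMatch = string.match(/^(([ \t\r\n]*)(\s*))/)

  // Whitespace-only (or empty) input: the leading match already covers the
  // whole string, so the trailing expression is never evaluated.
  if (leadMatch[0].length === string.length) {
    return {
      leading: leadMatch[1],
      leadingAscii: leadMatch[2],
      leadingNonAscii: leadMatch[3],
      trailing: '',
      trailingNonAscii: '',
      trailingAscii: ''
    }
  }

  // The rest starts with a non-whitespace character; the greedy [\s\S]*
  // backs off to the last \S, leaving only trailing whitespace to split.
  var rest = string.slice(leadMatch[0].length)
  var tailMatch = rest.match(/^[\s\S]*\S((\s*?)([ \t\r\n]*))$/)

  return {
    leading: leadMatch[1],
    leadingAscii: leadMatch[2],
    leadingNonAscii: leadMatch[3],
    trailing: tailMatch[1],
    trailingNonAscii: tailMatch[2],
    trailingAscii: tailMatch[3]
  }
}

console.log(edgeWhitespace('\u00a0 \n').trailing === '')       // true: early return
console.log(edgeWhitespace(' a\u00a0 ').trailing === '\u00a0 ') // true
```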

martincizek · 2023-03-22T11:29:50Z

Released a fix for #429. Thank you for spotting the issue and for all the thoughts on the fix.

🐛 fixes regex backtracking bug detecting leading/trailing whitespace

190322c

Rohland mentioned this pull request Mar 13, 2023

Fixes regex backtracking bug detecting leading/trailing whitespace #428

Closed

martincizek mentioned this pull request Mar 19, 2023

Large code blocks would trigger 100% CPU #429

Closed

martincizek closed this Mar 22, 2023
