Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes regex backtracking bug detecting leading/trailing whitespace #427

Closed
wants to merge 1 commit into from

Conversation

Rohland
Copy link

@Rohland Rohland commented Mar 13, 2023

This attempts to fix performance issues with large blocks of text in <code> blocks.

Attempting to convert HTML to markdown on large code blocks would trigger 100% CPU. I believe it's a regex backtracking issue with the following function:

function edgeWhitespace (string) {
  var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/);
    return {
      leading: m[1], // whole string for whitespace-only strings
      leadingAscii: m[2],
      leadingNonAscii: m[3],
      trailing: m[4], // empty for whitespace-only strings
      trailingNonAscii: m[5],
      trailingAscii: m[6]
    }
}

This PR refactors this function to avoid the non-greedy search for content in the middle.

I added a test to repro the issue, and prove the fix.

@pavelhoral
Copy link
Collaborator

Ah, I love regexp performance :)... The main issue is the content selector. So the regexp can be sped up by simply forcing it to match content to start and end with non-whitespace character:

- var m = string.match(/^(([ \t\r\n]*)(\s*))[\s\S]*?((\s*?)([ \t\r\n]*))$/)
+ var m = string.match(/^(([ \t\r\n]*)(\s*))(?:(?=\S)[\s\S]*\S)?((\s*?)([ \t\r\n]*))$/)

Also maybe a little bit more readable would be do the matching separately (similar to the code in the PR):

function edgeWhitespace (string) {
    var leadMatch = string.match(/^(([ \t\r\n]*)(\s*))/);
    var textMatch = string.substring(leadMatch[0].length).match(/((?=\S)[\s\S]*\S)?/); 
    var tailMatch = string.substring(leadMatch[0].length + textMatch[0].length).match(/((\s*?)([ \t\r\n]*))$/);
    return {
        leading: leadMatch[1], // whole string for whitespace-only strings
        leadingAscii: leadMatch[2],
        leadingNonAscii: leadMatch[3],
        trailing: tailMatch[1], // empty for whitespace-only strings
        trailingNonAscii: tailMatch[2],
        trailingAscii: tailMatch[3]
    };
}

Both these solutions seem to have the same performance as the code proposed by the PR.

@martincizek
Copy link
Collaborator

Thank you guys for your contribution. I've created issue #429 to reference the problem.

I've also created a simplified and hopefully more readable version:

function edgeWhitespace (string) {
  var leadingMatch = string.match(/^([ \t\r\n]*)(\s*)/)
  var tailMatch = string.match(/\S(([^\S \t\r\n]*)([ \t\r\n]*))$/) || ['', '', '', '']

  return {
    leading: leadingMatch[0], // whole string for whitespace-only strings
    leadingAscii: leadingMatch[1],
    leadingNonAscii: leadingMatch[2],
    trailing: tailMatch[1], // empty for whitespace-only strings
    trailingNonAscii: tailMatch[2],
    trailingAscii: tailMatch[3]
  }
}

Remarks:

  • The non-greedy capture group catching the trimmed content (the biggest problem) is avoided using two regexp matchings.
  • The non-greedy capture group catching the trailing non-ASCII whitespace is rewritten as a greedy expression [^\S \t\r\n]*. This solution makes two regexp matchings sufficient.
  • I've also tried adding a whitespace-only string test and creating a substring for the latter match, but it didn't have any measurable performance effect. So not adding these micro-optimisations for now.

Feedback appreciated.

@martincizek
Copy link
Collaborator

  • The non-greedy capture group catching the trailing non-ASCII whitespace is rewritten as a greedy expression [^\S \t\r\n]*. This solution makes two regexp matchings sufficient.

OK, this one has a bug. Will probably use a variant of @pavelhoral's single regexp version, as it will most likely perform better on typical inputs (the function is called for each node - paragraphs, headings, list items, ...).

@Rohland
Copy link
Author

Rohland commented Mar 22, 2023

Awesome :) Just something to consider is the optimisation for whitespace inputs. We probably want to avoid running the trailing regex if the leading regex matches and the match is the same length as the input string. I don't believe there is value on progressing further and evaluating the trailing regex.

@martincizek
Copy link
Collaborator

Released a fix of #429. Thank you for spotting the issue and all the thoughts on the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants