Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48686][SQL] Improve performance of ParserUtils.unescapeSQLString #47062

Closed

Conversation

JoshRosen
Copy link
Contributor

@JoshRosen JoshRosen commented Jun 21, 2024

What changes were proposed in this pull request?

This PR implements multiple performance optimizations for ParserUtils.unescapeSQLString:

  1. Don't use regex: following [SPARK-34263][SQL] Simplify the code for treating unicode/octal/escaped characters in string literals #31362, the existing code uses regexes for parsing escaped character patterns. However, in the worst case (the expected common case of "no escaping needed") it will perform four regex match attempts per input character, resulting in significant garbage creation because the matchers aren't reused.
  2. Skip the StringBuilder allocation for raw strings and for strings that don't need any unescaping.
  3. Minor: use Java StringBuilder instead of the Scala version: this removes a layer of indirection and may benefit JIT (we've seen positive results in some scenarios from this type of switch).

Why are the changes needed?

unescapeSQLString showed up as a CPU and allocation hotspot in certain testing scenarios. See this flamegraph for an illustration of the relative costs of repeated regex matching in the old code:

image

The new code is almost arbitrarily faster (e.g. can show ~arbitrary relative speedups, depending on the choice of input) for strings that don't require unescaping. For strings that do need escaping, I tested extreme cases where every character needs escaping: in these cases I see ~10-20x speedups (depending on the type of escaping). The new code should be faster in every scenario.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Correctness is covered by existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 21, 2024
@JoshRosen
Copy link
Contributor Author

JoshRosen commented Jun 21, 2024

This needs another self-review to make sure that I haven't made any off-by-one errors and to assess whether we need to add more unit tests to better specify the existing unescapeSQLString behavior, but I'm opening it early to solicit feedback.

@LuciferYang
Copy link
Contributor

LuciferYang commented Jun 24, 2024

Should we convert the String to a char[] within the method, so that we can use charArray(i) instead of String.charAt(i) to reduce the number of calls to String.isLatin1() to just once?

@JoshRosen
Copy link
Contributor Author

Should we convert the String to a char[] within the method, so that we can use charArray(i) instead of String.charAt(i) to reduce the number of calls to String.isLatin1() to just once?

I considered this, but I don't think it's necessarily an obvious win: I think that the JIT should do a pretty good job of handling this character-by-character iteration pattern (it probably inlines the call and branch predicts it nicely). Given the overall size of the speedups from this PR so far, I'd prefer to just leave it as-is (I'd need higher resolution benchmarks to measure the charArray difference, I think).

@LuciferYang
Copy link
Contributor

Should we convert the String to a char[] within the method, so that we can use charArray(i) instead of String.charAt(i) to reduce the number of calls to String.isLatin1() to just once?

I considered this, but I don't think it's necessarily an obvious win: I think that the JIT should do a pretty good job of handling this character-by-character iteration pattern (it probably inlines the call and branch predicts it nicely). Given the overall size of the speedups from this PR so far, I'd prefer to just leave it as-is (I'd need higher resolution benchmarks to measure the charArray difference, I think).

OK ~ Let's leave it as-is ~

@HyukjinKwon
Copy link
Member

Merged to master.

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

This PR implements multiple performance optimizations for `ParserUtils.unescapeSQLString`:

1. Don't use regex: following apache#31362, the existing code uses regexes for parsing escaped character patterns. However, in the worst case (the expected common case of "no escaping needed") it will perform four regex match attempts per input character, resulting in significant garbage creation because the matchers aren't reused.
2. Skip the StringBuilder allocation for raw strings and for strings that don't need any unescaping.
3. Minor: use Java StringBuilder instead of the Scala version: this removes a layer of indirection and may benefit JIT (we've seen positive results in some scenarios from this type of switch).

### Why are the changes needed?

unescapeSQLString showed up as a CPU and allocation hotspot in certain testing scenarios. See this flamegraph for an illustration of the relative costs of repeated regex matching in the old code:

![image](https://github.com/apache/spark/assets/50748/e045d9da-da0f-493c-a634-188acaeab1a9)

The new code is almost arbitrarily faster (e.g. can show ~arbitrary relative speedups, depending on the choice of input) for strings that don't require unescaping. For strings that _do_ need escaping, I tested extreme cases where _every_ character needs escaping: in these cases I see ~10-20x speedups (depending on the type of escaping). The new code should be faster in every scenario.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Correctness is covered by existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47062 from JoshRosen/unescapeSQLString-optimizations.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants