[SPARK-48686][SQL] Improve performance of ParserUtils.unescapeSQLString #47062

JoshRosen · 2024-06-21T23:03:36Z

What changes were proposed in this pull request?

This PR implements multiple performance optimizations for ParserUtils.unescapeSQLString:

1. Don't use regex: following [SPARK-34263][SQL] Simplify the code for treating unicode/octal/escaped characters in string literals #31362, the existing code uses regexes for parsing escaped character patterns. However, in the worst case (the expected common case of "no escaping needed") it will perform four regex match attempts per input character, resulting in significant garbage creation because the matchers aren't reused.
2. Skip the StringBuilder allocation for raw strings and for strings that don't need any unescaping.
3. Minor: use Java StringBuilder instead of the Scala version: this removes a layer of indirection and may benefit JIT (we've seen positive results in some scenarios from this type of switch).
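
A minimal sketch of the ideas above (hand-written scanning instead of regex, a fast path that allocates nothing, and a Java `StringBuilder`), assuming a simplified escape grammar: only `\n`, `\t`, `\uXXXX`, and pass-through of any other escaped character. It is not the code from this PR and does not reproduce the full behavior of `unescapeSQLString`, which handles more cases (for example the raw strings mentioned above). The object name `UnescapeSketch` and the signature given to `allCharsAreHex` are hypothetical.

```scala
object UnescapeSketch {
  private def isHex(c: Char): Boolean =
    (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')

  // Hypothetical signature: true if the half-open range [start, end) lies
  // inside s and contains only hex digits. `end` is exclusive.
  private def allCharsAreHex(s: String, start: Int, end: Int): Boolean = {
    if (start < 0 || end > s.length) return false
    var i = start
    while (i < end) {
      if (!isHex(s.charAt(i))) return false
      i += 1
    }
    true
  }

  def unescape(s: String): String = {
    // Fast path: without a backslash there is nothing to unescape,
    // so the input is returned as-is and no builder is allocated.
    val firstBackslash = s.indexOf('\\')
    if (firstBackslash < 0) return s

    val sb = new java.lang.StringBuilder(s.length)
    sb.append(s, 0, firstBackslash) // copy the untouched prefix in one call
    var i = firstBackslash
    while (i < s.length) {
      val c = s.charAt(i)
      if (c != '\\' || i + 1 == s.length) {
        // Ordinary character, or a trailing lone backslash (kept verbatim
        // in this sketch; an assumption, not Spark's documented behavior).
        sb.append(c)
        i += 1
      } else if (s.charAt(i + 1) == 'u' && allCharsAreHex(s, i + 2, i + 6)) {
        // \uXXXX: decode four hex digits without regex or substring.
        var code = 0
        var j = i + 2
        while (j < i + 6) {
          code = code * 16 + Character.digit(s.charAt(j), 16)
          j += 1
        }
        sb.append(code.toChar)
        i += 6
      } else {
        s.charAt(i + 1) match {
          case 'n' => sb.append('\n')
          case 't' => sb.append('\t')
          case other => sb.append(other) // e.g. \\ or \' pass through
        }
        i += 2
      }
    }
    sb.toString
  }
}
```

In this sketch the common "no escaping needed" input costs a single `indexOf` scan, and an escape-heavy input costs one pass with one builder allocation.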

Why are the changes needed?

unescapeSQLString showed up as a CPU and allocation hotspot in certain testing scenarios. See this flamegraph for an illustration of the relative costs of repeated regex matching in the old code:

https://github.com/apache/spark/assets/50748/e045d9da-da0f-493c-a634-188acaeab1a9

For strings that don't require unescaping, the new code can show nearly arbitrary relative speedups, depending on the choice of input. For strings that do need unescaping, I tested extreme cases where every character needs unescaping: in these cases I see ~10-20x speedups (depending on the type of escaping). The new code should be faster in every scenario.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Correctness is covered by existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

JoshRosen · 2024-06-21T23:04:10Z

This needs another self-review to make sure that I haven't made any off-by-one errors and to assess whether we need to add more unit tests to better specify the existing unescapeSQLString behavior, but I'm opening it early to solicit feedback.

sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkParserUtils.scala

LuciferYang · 2024-06-24T06:53:00Z

Should we convert the String to a char[] within the method, so that we can use charArray(i) instead of String.charAt(i) to reduce the number of calls to String.isLatin1() to just once?

JoshRosen · 2024-06-25T00:15:56Z

> Should we convert the String to a char[] within the method, so that we can use charArray(i) instead of String.charAt(i) to reduce the number of calls to String.isLatin1() to just once?

I considered this, but I don't think it's necessarily an obvious win: I think that the JIT should do a pretty good job of handling this character-by-character iteration pattern (it probably inlines the call and branch predicts it nicely). Given the overall size of the speedups from this PR so far, I'd prefer to just leave it as-is (I'd need higher resolution benchmarks to measure the charArray difference, I think).
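
For reference, the two iteration patterns being compared look roughly like this. It is an illustrative sketch with hypothetical helper names, not code from the PR, and it makes no claim about which variant is faster: `toCharArray` allocates and copies the characters, so any gain from skipping per-call checks in `charAt` would have to outweigh that copy.

```scala
object CharIterationSketch {
  // Variant 1: index the String directly on every iteration.
  def countBackslashesCharAt(s: String): Int = {
    var count = 0
    var i = 0
    while (i < s.length) {
      if (s.charAt(i) == '\\') count += 1
      i += 1
    }
    count
  }

  // Variant 2: copy into a char array once, then index the array.
  def countBackslashesCharArray(s: String): Int = {
    val chars = s.toCharArray // one allocation plus a full copy
    var count = 0
    var i = 0
    while (i < chars.length) {
      if (chars(i) == '\\') count += 1
      i += 1
    }
    count
  }
}
```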

LuciferYang · 2024-06-25T04:15:12Z

> Should we convert the String to a char[] within the method, so that we can use charArray(i) instead of String.charAt(i) to reduce the number of calls to String.isLatin1() to just once?
>
> I considered this, but I don't think it's necessarily an obvious win: I think that the JIT should do a pretty good job of handling this character-by-character iteration pattern (it probably inlines the call and branch predicts it nicely). Given the overall size of the speedups from this PR so far, I'd prefer to just leave it as-is (I'd need higher resolution benchmarks to measure the charArray difference, I think).

OK ~ Let's leave it as-is ~

HyukjinKwon · 2024-06-25T09:12:57Z

Merged to master.

### What changes were proposed in this pull request?

This PR implements multiple performance optimizations for `ParserUtils.unescapeSQLString`:

1. Don't use regex: following apache#31362, the existing code uses regexes for parsing escaped character patterns. However, in the worst case (the expected common case of "no escaping needed") it will perform four regex match attempts per input character, resulting in significant garbage creation because the matchers aren't reused.
2. Skip the StringBuilder allocation for raw strings and for strings that don't need any unescaping.
3. Minor: use Java StringBuilder instead of the Scala version: this removes a layer of indirection and may benefit JIT (we've seen positive results in some scenarios from this type of switch).

### Why are the changes needed?

unescapeSQLString showed up as a CPU and allocation hotspot in certain testing scenarios. See this flamegraph for an illustration of the relative costs of repeated regex matching in the old code:

![image](https://github.com/apache/spark/assets/50748/e045d9da-da0f-493c-a634-188acaeab1a9)

The new code is almost arbitrarily faster (e.g. can show ~arbitrary relative speedups, depending on the choice of input) for strings that don't require unescaping. For strings that _do_ need escaping, I tested extreme cases where _every_ character needs escaping: in these cases I see ~10-20x speedups (depending on the type of escaping). The new code should be faster in every scenario.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Correctness is covered by existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47062 from JoshRosen/unescapeSQLString-optimizations.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

JoshRosen added 5 commits June 21, 2024 13:48

New test case.

67639e3

Optimized implementation with no regex usage.

bfca224

Optimization: avoid startsWith.

9c5bcc3

Use Java StringBuilder instead of Scala version.

c895309

Add fast-path case.

d8b3f4c

github-actions bot added the SQL label Jun 21, 2024

HyukjinKwon approved these changes Jun 23, 2024

LuciferYang reviewed Jun 24, 2024

Add test capturing old impl's behavior for invalid unicode escapes.

972cf70

Fix off-by-one error in allCharsAreHex

07f0ed7

LuciferYang approved these changes Jun 25, 2024

HyukjinKwon closed this in 51f1103 Jun 25, 2024
