[SPARK-48686][SQL] Improve performance of ParserUtils.unescapeSQLString #47062
Conversation
This needs another self-review to make sure that I haven't made any off-by-one errors and to assess whether we need to add more unit tests to better specify the existing behavior.
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkParserUtils.scala
Should we convert the string to a `charArray` first?
I considered this, but I don't think it's necessarily an obvious win: the JIT should do a pretty good job of handling this character-by-character iteration pattern (it probably inlines the call and branch-predicts it nicely). Given the overall size of the speedups from this PR so far, I'd prefer to just leave it as-is (I'd need higher-resolution benchmarks to measure the `charArray` difference, I think).
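For reference, the two iteration patterns being weighed here can be sketched as follows (illustrative Java rather than the PR's Scala; the class and method names are made up for this sketch). `charAt` reads characters through the String's accessor, which the JIT typically inlines, while `toCharArray` pays an up-front copy of the string's contents in exchange for plain array indexing:

```java
// Illustrative only: the two iteration styles discussed above,
// applied to a trivial task (counting backslashes).
public class IterStyles {
    // Style 1: index into the String via charAt; no extra allocation,
    // and HotSpot usually inlines the accessor call.
    static int countBackslashesCharAt(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\\') n++;
        }
        return n;
    }

    // Style 2: copy the characters out first, then iterate the array.
    // toCharArray() allocates and copies, but the loop body is plain
    // array access afterwards.
    static int countBackslashesArray(String s) {
        int n = 0;
        for (char c : s.toCharArray()) {
            if (c == '\\') n++;
        }
        return n;
    }
}
```

Both styles produce identical results; the question raised in the review is purely about which one the JIT compiles to faster code, which is hard to settle without a dedicated benchmark.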
OK ~ Let's leave it as-is ~ |
Merged to master. |
### What changes were proposed in this pull request?

This PR implements multiple performance optimizations for `ParserUtils.unescapeSQLString`:

1. Don't use regexes: following apache#31362, the existing code uses regexes for parsing escaped character patterns. However, in the worst case (the expected common case of "no escaping needed") it performs four regex match attempts per input character, resulting in significant garbage creation because the matchers aren't reused.
2. Skip the StringBuilder allocation for raw strings and for strings that don't need any unescaping.
3. Minor: use Java's StringBuilder instead of the Scala version: this removes a layer of indirection and may benefit the JIT (we've seen positive results from this type of switch in some scenarios).

### Why are the changes needed?

`unescapeSQLString` showed up as a CPU and allocation hotspot in certain testing scenarios. See this flamegraph for an illustration of the relative costs of repeated regex matching in the old code:

![image](https://github.com/apache/spark/assets/50748/e045d9da-da0f-493c-a634-188acaeab1a9)

For strings that don't require unescaping, the new code is almost arbitrarily faster, since the relative speedup scales with the choice of input. For strings that _do_ need escaping, I tested extreme cases where _every_ character needs escaping: in these cases I see ~10-20x speedups (depending on the type of escaping). The new code should be faster in every scenario.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Correctness is covered by existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47062 from JoshRosen/unescapeSQLString-optimizations.

Authored-by: Josh Rosen &lt;[email protected]&gt;
Signed-off-by: Hyukjin Kwon &lt;[email protected]&gt;
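The first two optimizations (regex-free scanning, and skipping the builder allocation when no unescaping is needed) can be sketched roughly as below. This is a simplified illustration, not Spark's actual implementation: the escape table covers only a few common cases (the real code also handles octal and Unicode escapes), and the `UnescapeSketch.unescape` name is made up for this sketch.

```java
// Simplified sketch of a regex-free unescape with a no-allocation fast path.
public class UnescapeSketch {
    static String unescape(String s) {
        // Fast path: if there is no backslash, the string needs no work,
        // so return it as-is without allocating a StringBuilder.
        int firstEscape = s.indexOf('\\');
        if (firstEscape == -1) return s;

        // Slow path: copy the unescaped prefix, then walk the rest
        // character by character (no regex matching, no matcher garbage).
        StringBuilder sb = new StringBuilder(s.length());
        sb.append(s, 0, firstEscape);
        int i = firstEscape;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                switch (next) {
                    case 'n':  sb.append('\n'); break;
                    case 't':  sb.append('\t'); break;
                    case 'r':  sb.append('\r'); break;
                    case '\\': sb.append('\\'); break;
                    case '\'': sb.append('\''); break;
                    case '"':  sb.append('"');  break;
                    // Assumed behavior for this sketch: an unrecognized
                    // escape keeps the character and drops the backslash.
                    default:   sb.append(next); break;
                }
                i += 2;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }
}
```

Because the fast path returns the original `String` reference, inputs without escapes incur no allocation at all, which is what makes the "no escaping needed" case nearly free relative to the old per-character regex matching.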