[SPARK-48686][SQL] Improve performance of ParserUtils.unescapeSQLString
### What changes were proposed in this pull request?

This PR implements multiple performance optimizations for `ParserUtils.unescapeSQLString`:

1. Don't use regex: following apache#31362, the existing code uses regexes for parsing escaped character patterns. However, in the worst case (the expected common case of "no escaping needed") it will perform four regex match attempts per input character, resulting in significant garbage creation because the matchers aren't reused.
2. Skip the StringBuilder allocation for raw strings and for strings that don't need any unescaping.
3. Minor: use Java StringBuilder instead of the Scala version: this removes a layer of indirection and may benefit JIT (we've seen positive results in some scenarios from this type of switch).

### Why are the changes needed?

unescapeSQLString showed up as a CPU and allocation hotspot in certain testing scenarios. See this flamegraph for an illustration of the relative costs of repeated regex matching in the old code:

![image](https://github.com/apache/spark/assets/50748/e045d9da-da0f-493c-a634-188acaeab1a9)

The new code is almost arbitrarily faster (e.g. can show ~arbitrary relative speedups, depending on the choice of input) for strings that don't require unescaping. For strings that _do_ need escaping, I tested extreme cases where _every_ character needs escaping: in these cases I see ~10-20x speedups (depending on the type of escaping). The new code should be faster in every scenario.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Correctness is covered by existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47062 from JoshRosen/unescapeSQLString-optimizations.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
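
The optimizations described in this commit message (scan characters by hand instead of matching regexes, return early without allocating when nothing needs unescaping, and use `java.lang.StringBuilder` directly) can be illustrated with a minimal sketch. This is not the code from the commit: the class name `UnescapeSketch`, the helper `parseHex4`, and the reduced set of escapes (`\n`, `\t`, `\r`, `\uXXXX`, and "drop the backslash" for anything else) are assumptions made for illustration, and the sketch assumes the surrounding quotes have already been stripped from the literal.

```java
// Hypothetical illustration only; not the actual ParserUtils.unescapeSQLString.
final class UnescapeSketch {
    private UnescapeSketch() {}

    /**
     * Unescapes the body of a SQL string literal.
     * Assumption: the enclosing quotes were removed by the caller.
     */
    static String unescape(String body) {
        int firstBackslash = body.indexOf('\\');
        if (firstBackslash < 0) {
            // Fast path: nothing to unescape, so no StringBuilder and no copy.
            return body;
        }
        int n = body.length();
        StringBuilder sb = new StringBuilder(n);
        // Copy the escape-free prefix in one call.
        sb.append(body, 0, firstBackslash);
        int i = firstBackslash;
        while (i < n) {
            char c = body.charAt(i);
            if (c != '\\' || i + 1 >= n) {
                // Ordinary character, or a lone trailing backslash kept as-is.
                sb.append(c);
                i++;
                continue;
            }
            char next = body.charAt(i + 1);
            int codeUnit = (next == 'u' && i + 5 < n) ? parseHex4(body, i + 2) : -1;
            if (codeUnit >= 0) {
                // \\uXXXX: four hex digits form one UTF-16 code unit.
                sb.append((char) codeUnit);
                i += 6;
            } else {
                switch (next) {
                    case 'n': sb.append('\n'); break;
                    case 't': sb.append('\t'); break;
                    case 'r': sb.append('\r'); break;
                    default:  sb.append(next); // e.g. \' or \\ : drop the backslash
                }
                i += 2;
            }
        }
        return sb.toString();
    }

    /** Returns the value of four hex digits starting at 'start', or -1 if any is not hex. */
    private static int parseHex4(String s, int start) {
        int value = 0;
        for (int k = 0; k < 4; k++) {
            int digit = Character.digit(s.charAt(start + k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }
}
```

In this sketch, a call such as `UnescapeSketch.unescape("plain text")` returns the input object itself, which is the "skip the StringBuilder allocation" case. Inputs containing a backslash take the single-pass loop, which allocates one builder and no matcher objects.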
1 parent b49479b, commit 51f1103
Showing 2 changed files with 75 additions and 31 deletions.