-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48584][SQL] Perf improvement for unescapePathName #46938
Conversation
cc @dongjoon-hyun @cloud-fan @LuciferYang @JoshRosen thanks |
Could you provide some micro benchmark? |
What do you mean by some micro benchmark? Are sql/catalyst/benchmarks/EscapePathBenchmark-results.txt and sql/catalyst/benchmarks/EscapePathBenchmark-jdk21-results.txt not sufficient ? |
Oh, I see. |
It seems that there is a typo in the pr description:
chararaters -> characters |
Thank you @LuciferYang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
Thanks @yaooqinn
} | ||
var plaintextEndIdx = path.indexOf('%') | ||
val length = path.length | ||
if (plaintextEndIdx == -1 || plaintextEndIdx + 2 > path.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about if (plaintextEndIdx == -1 || plaintextEndIdx + 2 > path.length || path.lastIndexOf('%') == 0)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|| path.lastIndexOf('%')
I'm sorry, I'm having a little trouble understanding your suggestion. Can you please clarify?
Thank you @LuciferYang, I added a nit in the last commit and passed the test locally, so I didn't wait for the CI. Merged to master |
### What changes were proposed in this pull request? This PR follows up #46938 and improve the `unescapePathName`. ### Why are the changes needed? Improve the `unescapePathName` by cut off slow path. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? GA. ### Was this patch authored or co-authored using generative AI tooling? 'No'. Closes #46957 from beliefer/SPARK-48584_followup. Authored-by: beliefer <[email protected]> Signed-off-by: beliefer <[email protected]>
…ue (#8793) GlutenURLDecoder.java is copied from OpenJDK and it's under GPL v2 which belongs to Category X, we can't have it in Apache Releases. URLDecoder decode/encode is not fully compatible with the Hive catalog path escaping/unescaping, which Spark also follows. Besides, apache/spark#46938 has improved unescapePathName's speed at the Spark side by ~10x. So This PR also helps gluten gain perf which handles datasets w/ large partition numbers.
What changes were proposed in this pull request?
This PR improves perf for unescapePathName with algorithms briefly described as:
position > path.length-2
, we return the original identity instead of creating a new StringBuilder to append char by charplaintextStartIdx
which starts from 0 and then points to the next char after resolving%xx
, andplaintextEndIdx
which points to the next'%'
.plaintextStartIdx
moves toplaintextEndIdx + 3
if%xx
is valid, or moves toplaintextEndIdx + 1
if%xx
is invalid.Why are the changes needed?
performance improvement for hotspots
Does this PR introduce any user-facing change?
no
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
no