-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48551][SQL] Perf improvement for escapePathName #46894
Conversation
cc @cloud-fan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any unit tests for escapePathName
? Or only end-to-end tests? I wonder whether it might be worthwhile to add some basic unit test cases in a new ExternalCatalogUtilsSuite
.
} | ||
|
||
builder.toString() | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably apply similar perf. optimizations to the unescapePathName
on the next line below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will address this in a separate PR if you don't mind
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
Outdated
Show resolved
Hide resolved
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala Line 1803 in f9542d0
|
Although it's true that those tests indirectly cover these paths, by "unit test" I was referring to a dedicated test of just In addition, the more "end-to-end" tests are slow to run. I think it can still be useful to have relatively more end-to-end tests covering important scenarios (since special characters in paths could cause problems at other layers and we might want more end-to-end coverage for such potential cases), but I think we should augment those tests with dedicated unit tests for the helper functions to assert that their actual output doesn't change. This will be useful if we choose to optimize the int -> hex conversion (which I'll suggest in another comment). Edit: for example: import org.apache.spark.SparkFunSuite
class ExternalCatalogUtilsSuite extends SparkFunSuite {
test("escapePathName") {
import ExternalCatalogUtils.escapePathName
assert(escapePathName("a b") === "a b")
assert(escapePathName("a:b") === "a%3Ab")
assert(escapePathName("a%b") === "a%25b")
assert(escapePathName("a,b") === "a,b")
assert(escapePathName("a/b") === "a%2Fb")
for (c <- ExternalCatalogUtils.charToEscape.stream().toArray) {
// Check parity with old conversion technique:
assert(
escapePathName(c.toChar.toString) === "%" + f"${c}%02X",
s"wrong escaping for ${c}")
}
}
} |
I add some further optimization now the result become
I will add the uts you suggested |
if (needsEscaping(c)) { | ||
builder.append('%') | ||
builder.append(f"${c.asInstanceOf[Int]}%02X") | ||
sb.append('%').append(HEX_CHARS((c & 0xF0) >> 4)).append(HEX_CHARS(c & 0x0F)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
Outdated
Show resolved
Hide resolved
…log/ExternalCatalogUtils.scala Co-authored-by: Josh Rosen <[email protected]>
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
Outdated
Show resolved
Hide resolved
…log/ExternalCatalogUtils.scala Co-authored-by: Josh Rosen <[email protected]>
This optimized version looks good to me overall. Could you check in the new unit test suite and update the PR description to note the improved test coverage? Let's give this a bit to see if anyone else has feedback. |
sql/core/src/test/scala/org/apache/spark/sql/EscapePathBenchmark.scala
Outdated
Show resolved
Hide resolved
Gentle ping @dongjoon-hyun @cloud-fan, can you take a look at this improvement when you have some time? Thanks in advance. |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM excepts some comments
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
Show resolved
Hide resolved
...atalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtilsSuite.scala
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/EscapePathBenchmark.scala
Show resolved
Hide resolved
Merged into master for Spark 4.0. Thanks @yaooqinn and @JoshRosen |
What changes were proposed in this pull request?
This PR improves perf for escapePathName with algorithms briefly described as:
String.format
Add a fast path for storage paths or their parts that do not require escaping to avoid creating a StringBuilder to append per character.
Why are the changes needed?
performance improvement for hotspots
Does this PR introduce any user-facing change?
no
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
no