Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48551][SQL] Perf improvement for escapePathName #46894

Closed
wants to merge 12 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Jun 6, 2024

What changes were proposed in this pull request?

This PR improves perf for escapePathName with algorithms briefly described as:

  • If a path contains no special characters, we return the original identity instead of creating a new StringBuilder to append char by char
  • If a path contains special characters, we relocate the IDX of the first special character. Then initialize the StringBuilder with [0, IDX) of the original string, and do heximal padding if necessary starting from IDX.
  • An optimized char-to-hex function replaces the String.format

Add a fast path for storage paths or their parts that do not require escaping to avoid creating a StringBuilder to append per character.

Why are the changes needed?

performance improvement for hotspots

Does this PR introduce any user-facing change?

no

How was this patch tested?

  • new tests in ExternalCatalogUtilsSuite
  • Benchmark results (9x faster)

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jun 6, 2024
@yaooqinn
Copy link
Member Author

yaooqinn commented Jun 6, 2024

cc @cloud-fan

Copy link
Contributor

@JoshRosen JoshRosen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any unit tests for escapePathName? Or only end-to-end tests? I wonder whether it might be worthwhile to add some basic unit test cases in a new ExternalCatalogUtilsSuite.

}

builder.toString()
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably apply similar perf. optimizations to the unescapePathName on the next line below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will address this in a separate PR if you don't mind

@yaooqinn yaooqinn changed the title [SPARK-48548][SQL] Perf improvement for escapePathName [SPARK-48551][SQL] Perf improvement for escapePathName Jun 7, 2024
@yaooqinn
Copy link
Member Author

yaooqinn commented Jun 7, 2024

Do we have any unit tests for escapePathName?

Seq("a b", "a:b", "a%b", "a,b").foreach { specialChars =>

@JoshRosen
Copy link
Contributor

JoshRosen commented Jun 7, 2024

Do we have any unit tests for escapePathName?

Seq("a b", "a:b", "a%b", "a,b").foreach { specialChars =>

Although it's true that those tests indirectly cover these paths, by "unit test" I was referring to a dedicated test of just escapePathString itself which actually asserts over the form of the output. As it stands now, the test in DDLSuite is asserting over some end-to-end functionality to check that the path escaping logic is actually invoked, but that test would not fail if we made breaking or incompatible changes to the escaping logic itself. For example, if we switched from uppercase to lowercase hexadecimal conversions then that change wouldn't cause the DDLSuite test to fail.

In addition, the more "end-to-end" tests are slow to run.

I think it can still be useful to have relatively more end-to-end tests covering important scenarios (since special characters in paths could cause problems at other layers and we might want more end-to-end coverage for such potential cases), but I think we should augment those tests with dedicated unit tests for the helper functions to assert that their actual output doesn't change.

This will be useful if we choose to optimize the int -> hex conversion (which I'll suggest in another comment).

Edit: for example:

import org.apache.spark.SparkFunSuite

class ExternalCatalogUtilsSuite extends SparkFunSuite {
  test("escapePathName") {
    import ExternalCatalogUtils.escapePathName
    assert(escapePathName("a b") === "a b")
    assert(escapePathName("a:b") === "a%3Ab")
    assert(escapePathName("a%b") === "a%25b")
    assert(escapePathName("a,b") === "a,b")
    assert(escapePathName("a/b") === "a%2Fb")


    for (c <- ExternalCatalogUtils.charToEscape.stream().toArray) {
      // Check parity with old conversion technique:
      assert(
        escapePathName(c.toChar.toString) === "%" + f"${c}%02X",
        s"wrong escaping for ${c}")
    }
  }
}

@yaooqinn
Copy link
Member Author

yaooqinn commented Jun 7, 2024

I add some further optimization now the result become

[info] OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
[info] Apple M2 Max
[info] Escape Tests:                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Legacy                                             5557           5613          36          0.2        5557.3       1.0X
[info] Kent                                                693            712          36          1.4         692.8       8.0X
[info] Josh                                                878            889          10          1.1         878.4       6.3X

I will add the uts you suggested

if (needsEscaping(c)) {
builder.append('%')
builder.append(f"${c.asInstanceOf[Int]}%02X")
sb.append('%').append(HEX_CHARS((c & 0xF0) >> 4)).append(HEX_CHARS(c & 0x0F))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for performing the hex conversion this way, since the old string formatting was really really expensive:

image

…log/ExternalCatalogUtils.scala

Co-authored-by: Josh Rosen <[email protected]>
…log/ExternalCatalogUtils.scala

Co-authored-by: Josh Rosen <[email protected]>
@JoshRosen
Copy link
Contributor

JoshRosen commented Jun 7, 2024

This optimized version looks good to me overall. Could you check in the new unit test suite and update the PR description to note the improved test coverage?

Let's give this a bit to see if anyone else has feedback.

@yaooqinn
Copy link
Member Author

yaooqinn commented Jun 7, 2024

Gentle ping @dongjoon-hyun @cloud-fan, can you take a look at this improvement when you have some time? Thanks in advance.

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM excepts some comments

@LuciferYang
Copy link
Contributor

Merged into master for Spark 4.0. Thanks @yaooqinn and @JoshRosen

@yaooqinn yaooqinn deleted the SPARK-48551 branch June 11, 2024 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants