[SPARK-50031][SQL] Add the TryParseUrl expression #48500
Conversation
Great work Jovan, please address the comments I left and we can move this PR out of draft phase.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala (outdated, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala (outdated, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (outdated, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (outdated, resolved)
LGTM, just one more nit comment
Co-authored-by: Mihailo Milosevic <[email protected]>
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST');
 spark.apache.org
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY');
 query=1
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query');
 1
Can you show an example of the NULL result?
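A NULL-result example of the kind being requested might look like the following. This is a minimal sketch run through PySpark (it assumes Spark 4.0.0+ with the new function available; the malformed URL is purely illustrative):

```python
# Illustrative only: an unparsable URL yields NULL instead of raising INVALID_URL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    "SELECT try_parse_url('inva lid://spark.apache.org/path?query=1', 'HOST')"
).show()
# Expected: a single row containing NULL.
```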
@@ -12912,6 +12912,108 @@ def substr(
    return _invoke_function_over_columns("substr", str, pos)


@_try_remote_functions
Can we add one unit test in, e.g., test_functions.py? Then the tests will be reused in both Spark Connect and classic.
Hi, @jovanm-db.
IIUC, the PR title is misleading because the main contribution will be introducing a new built-in function `try_parse_url` at Apache Spark 4.0.0. "Improve error messages for ParseURL" looks like a side-effect to me.
The PR (and the commit) had better have one theme. I'd recommend removing the change to the common/utils/src/main/resources/error/error-conditions.json file from this PR; you can make another PR to handle it independently.
@dongjoon-hyun I do agree that the title does not reflect the right change, but opening a new PR might be overkill and could also cause confusion in development later. The main point of this PR is what the ticket suggests.
No, I disagree with your opinion, "opening a new PR might be an overkill and also cause confusion in development later", @jovanm-db. Technically,
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/urlExpressions.scala (resolved)
LGTM, please address the comments so we can merge this in
url: "ColumnOrName", partToExtract: "ColumnOrName", key: Optional["ColumnOrName"] = None | ||
) -> Column: | ||
""" | ||
URL function: Extracts a specified part from a URL. If a key is provided, |
Could you update this description to give a similar message to the other try functions?
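As a rough sketch of the kind of wording other try_ functions use (the phrasing below is illustrative, not the exact text that landed in functions.py):

```python
# Hypothetical docstring wording, modeled on the other try_* function descriptions.
TRY_PARSE_URL_DOC = """
URL function: Extracts a specified part from a URL. If a key is provided, it
returns the associated query parameter value. This is a special version of
`parse_url` that returns NULL instead of raising an error if the URL is invalid.
"""
```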
@@ -333,6 +333,14 @@ def test_rand_functions(self):
        rndn2 = df.select("key", F.randn(0)).collect()
        self.assertEqual(sorted(rndn1), sorted(rndn2))

    def test_try_parse_url(self):
Could we also add a case here that returns null? This is to make sure Python works with invalid URLs as well.
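A minimal sketch of such a null case (the method name, data, and assertion below are illustrative, not necessarily what was added to test_functions.py):

```python
# Hypothetical test method for pyspark/sql/tests/test_functions.py: an invalid
# URL should come back as None (NULL) rather than raising an exception.
from pyspark.sql import Row, functions as F

def test_try_parse_url_returns_null(self):
    df = self.spark.createDataFrame(
        [("inva lid://spark.apache.org/path?query=1", "HOST")], ["url", "part"]
    )
    actual = df.select(F.try_parse_url(df.url, df.part)).collect()
    self.assertEqual(actual, [Row(None)])
```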
@@ -169,6 +169,37 @@ object ParseUrl {
  private val REGEXSUBFIX = "=([^&]*)"
}

/**
 * Extracts a part from a URL
This comment is useless. Let's remove it.
 * Extracts a part from a URL
 */
@ExpressionDescription(
  usage = "_FUNC_(url, partToExtract[, key]) - Extracts a part from a URL.",
ParseUrl has the same description. Could you write a few words on what distinguishes the new expression from ParseUrl?
TryParseUrl expression
@jovanm-db Please update the PR's description according to the recent changes in the PR.
+1, LGTM. Merging to master.
@jovanm-db Congratulations on your first contribution to Apache Spark!
Thank you, @jovanm-db and @MaxGekk and all. Congratulations to @jovanm-db.
### What changes were proposed in this pull request?

This PR adds the `try_parse_url` expression, which sets `failOnError` to false by default.

### Why are the changes needed?

INVALID_URL contains a suggested fix for turning off ANSI mode. Now that Spark 4.0.0 has moved to ANSI mode on by default, we want to keep suggestions of this kind to a minimum. There exist implementations of `try_*` functions which provide a safe way to get the ANSI-mode-off behavior, and suggestions of that kind should be sufficient. In this case, the try expression was missing, so a new expression was added to patch up the missing implementation.

### Does this PR introduce _any_ user-facing change?

Yes, a new expression is added.

### How was this patch tested?

Tests added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48500 from jovanm-db/invalidUrl.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
### What changes were proposed in this pull request?

This PR updates the suggested fix of the `INVALID_URL` error to use the `try_parse_url` function added in [this](#48500) PR instead of turning off ANSI mode.

### Why are the changes needed?

INVALID_URL contains a suggested fix for turning off ANSI mode. Now that Spark 4.0.0 has moved to ANSI mode on by default, we want to keep suggestions of this kind to a minimum. There exist implementations of try_* functions which provide a safe way to get the ANSI-mode-off behavior, and suggestions of that kind should be sufficient. In this case, the try expression was missing, so a new expression was added to patch up the missing implementation.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

There are tests that check error messages.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48616 from jovanm-db/improvedError.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…nnect module

### What changes were proposed in this pull request?

Add tests in `PlanGenerationTestSuite` for the `try_parse_url` function introduced in [this](#48500) PR and generate the corresponding golden files.

### Why are the changes needed?

The added tests ensure the new function behaves as expected.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48637 from jovanm-db/goldenFiles.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
What changes were proposed in this pull request?

This PR adds the `try_parse_url` expression, which sets `failOnError` to false by default.

Why are the changes needed?

INVALID_URL contains a suggested fix for turning off ANSI mode. Now that Spark 4.0.0 has moved to ANSI mode on by default, we want to keep suggestions of this kind to a minimum. There exist implementations of `try_*` functions which provide a safe way to get the ANSI-mode-off behavior, and suggestions of that kind should be sufficient. In this case, the try expression was missing, so a new expression was added to patch up the missing implementation.

Does this PR introduce any user-facing change?

Yes, a new expression is added.

How was this patch tested?

Tests added.

Was this patch authored or co-authored using generative AI tooling?

No.
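For context, a rough usage sketch of the behavioral difference this PR targets (it assumes Spark 4.0.0+ and an active SparkSession; the malformed URL is illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("https://spark.apache.org/path?query=1",), ("inva lid://url",)], ["url"]
)

# try_parse_url returns NULL for the malformed URL instead of failing the query;
# under ANSI mode, parse_url on the same row would raise an INVALID_URL error.
df.select("url", F.try_parse_url("url", F.lit("HOST")).alias("host")).show(truncate=False)
```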