Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50031][SQL] Add the TryParseUrl expression #48500

Closed
wants to merge 12 commits into from

Conversation

jovanm-db
Copy link
Contributor

@jovanm-db jovanm-db commented Oct 16, 2024

What changes were proposed in this pull request?

This PR adds try_parse_url expression that sets failOnError to false by default.

Why are the changes needed?

INVALID_URL contains suggested fix for turning off ANSI mode. Now that in Spark 4.0.0 we have moved to ANSI mode on by default, we want to keep suggestions of this kind to the minimum. There exist implementations of try_* functions which provide safe way to get behavior as for ANSI mode off and suggestions of this kind should be sufficient.

In this case, try expressions were missing so new expressions were added to patch up the missing implementations.

Does this PR introduce any user-facing change?

Yes, new expression added.

How was this patch tested?

Tests added.

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Contributor

@mihailom-db mihailom-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Jovan, please address the comments I left and we can move this PR out of draft phase.

Copy link
Contributor

@mihailom-db mihailom-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one more nit comment

Co-authored-by: Mihailo Milosevic <[email protected]>
@jovanm-db jovanm-db changed the title [WIP][SPARK-49638][SQL] Improve error messages for ParseUrl #49638 [SPARK-49638][SQL] Improve error messages for ParseUrl Oct 16, 2024
Comment on lines 179 to 184
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST');
spark.apache.org
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY');
query=1
> SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query');
1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show an example of the NULL result?

@@ -12912,6 +12912,108 @@ def substr(
return _invoke_function_over_columns("substr", str, pos)


@_try_remote_functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a one unittest in, e.g., test_functions.py? then the tests will be resued in both Spark connect and classic.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @jovanm-db .

IIUC, the PR title is misleading because the main contribution will be introducing a new built-in function try_parse_url at Apache Spark 4.0.0. Improve error messages for ParseURL looks like a side-effect to me.

The PR (and the commit) had better have one theme. I'd like to recommend to remove the change of common/utils/src/main/resources/error/error-conditions.json file from this PR and you can make another PR to handle it independently.

@mihailom-db
Copy link
Contributor

@dongjoon-hyun I do agree that the title is not reflecting the right change, but opening a new PR might be an overkill and also cause confusion in development later. The main point of this PR is what ticket suggests Remove the ANSI config suggestion in INVALID_URL. I would suggest renaming the title, but keeping the changes in one PR, as these changes suggest the reason for removing and also the solution/substitute suggestion we provide for spark users. Separating them would make developer work much harder, and the second PR you suggested would be mostly tests changes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Oct 17, 2024

@dongjoon-hyun I do agree that the title is not reflecting the right change, but opening a new PR might be an overkill and also cause confusion in development later. The main point of this PR is what ticket suggests Remove the ANSI config suggestion in INVALID_URL. I would suggest renaming the title, but keeping the changes in one PR, as these changes suggest the reason for removing and also the solution/substitute suggestion we provide for spark users. Separating them would make developer work much harder, and the second PR you suggested would be mostly tests changes.

No, I disagree with your opinion, "opening a new PR might be an overkill and also cause confusion in development later", @jovanm-db . Technically, adding try_parse_url and recommending try_parse_url is not the same at all. They should have two JIRA IDs. :)

@jovanm-db jovanm-db changed the title [SPARK-49638][SQL] Improve error messages for ParseUrl [SPARK-50031][SQL] Add TryParseAlternative Oct 18, 2024
@jovanm-db jovanm-db changed the title [SPARK-50031][SQL] Add TryParseAlternative [SPARK-50031][SQL] Add TryParseUrl alternative Oct 18, 2024
Copy link
Contributor

@mihailom-db mihailom-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please address the comments so we can merge this in

url: "ColumnOrName", partToExtract: "ColumnOrName", key: Optional["ColumnOrName"] = None
) -> Column:
"""
URL function: Extracts a specified part from a URL. If a key is provided,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update this description to give the similar message as other try functions?

@@ -333,6 +333,14 @@ def test_rand_functions(self):
rndn2 = df.select("key", F.randn(0)).collect()
self.assertEqual(sorted(rndn1), sorted(rndn2))

def test_try_parse_url(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add a case here that returns null? This is to make sure python works as well with invalid urls.

@@ -169,6 +169,37 @@ object ParseUrl {
private val REGEXSUBFIX = "=([^&]*)"
}

/**
* Extracts a part from a URL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is useless. Let's remove it.

* Extracts a part from a URL
*/
@ExpressionDescription(
usage = "_FUNC_(url, partToExtract[, key]) - Extracts a part from a URL.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseUrl has the same description. Could write a few words what destiguish new expression from ParseUrl.

@MaxGekk MaxGekk changed the title [SPARK-50031][SQL] Add TryParseUrl alternative [SPARK-50031][SQL] Add the TryParseUrl expression Oct 22, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Oct 22, 2024

@jovanm-db Please, update PR's description according the recent changes in the PR.

This PR improves messages for ANSI related issues for ParseUrl.
new suggestions are added to do try_parse_url.

@MaxGekk
Copy link
Member

MaxGekk commented Oct 22, 2024

+1, LGTM. Merging to master.
Thank you, @jovanm-db and @dongjoon-hyun @HyukjinKwon @srielau @mihailom-db for review.

@MaxGekk MaxGekk closed this in abc4986 Oct 22, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Oct 22, 2024

@jovanm-db Congratulations with your first contribution to Apache Spark!

@dongjoon-hyun
Copy link
Member

Thank you, @jovanm-db and @MaxGekk and all.

Congratulations to @jovanm-db .

ericm-db pushed a commit to ericm-db/spark that referenced this pull request Oct 22, 2024
### What changes were proposed in this pull request?
This PR adds `try_parse_url` expression that sets `failOnError` to false by default.

### Why are the changes needed?
INVALID_URL contains suggested fix for turning off ANSI mode. Now that in Spark 4.0.0 we have moved to ANSI mode on by default, we want to keep suggestions of this kind to the minimum. There exist implementations of `try_*` functions which provide safe way to get behavior as for ANSI mode off and suggestions of this kind should be sufficient.

In this case, try expressions were missing so new expressions were added to patch up the missing implementations.

### Does this PR introduce _any_ user-facing change?
Yes, new expression added.

### How was this patch tested?
Tests added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48500 from jovanm-db/invalidUrl.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
MaxGekk pushed a commit that referenced this pull request Oct 23, 2024
### What changes were proposed in this pull request?
This PR updates the suggested fix of `INVALID_URL` error to use `try_parse_url` function added in [this](#48500) PR instead of turning off ANSI mode.

### Why are the changes needed?
INVALID_URL contains suggested fix for turning off ANSI mode. Now that in Spark 4.0.0 we have moved to ANSI mode on by default, we want to keep suggestions of this kind to the minimum. There exist implementations of try_* functions which provide safe way to get behavior as for ANSI mode off and suggestions of this kind should be sufficient.

In this case, try expressions were missing so new expressions were added to patch up the missing implementations.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
There are tests that check error messages.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48616 from jovanm-db/improvedError.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
MaxGekk pushed a commit that referenced this pull request Oct 25, 2024
…nnect module

### What changes were proposed in this pull request?
Add tests in `PlanGenerationTestSuite` for `try_parse_url` function introduced in [this](#48500) PR and generate the corresponding golden files.

### Why are the changes needed?
The added tests ensure the new function behaves as expected.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests added.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48637 from jovanm-db/goldenFiles.

Authored-by: Jovan Markovic <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants