[SPARK] fix removePathPattern behaviour #2350
Signed-off-by: Pawel Leszczynski <[email protected]>
One comment but lgtm.
import lombok.Value;

@Value
public class DatasetIdentifier {
Duplicate?
This got moved to the Java client some time ago.
public static final String SPARK_OPENLINEAGE_DATASET_REMOVE_PATH_PATTERN =
    "spark.openlineage.dataset.removePath.pattern";

public static List<OutputDataset> removeOutputsPathPattern(
Maybe List<? extends Dataset> and implement this generically with a passed dataset builder? Nit, do what you think is best.
I tried this approach but did not succeed. Creating an OutputDataset and creating an InputDataset differ significantly, so these have to be two separate methods. Only the iterating part, which goes through the list, is common. A common generic method would therefore still need to call a separate method per OutputDataset or InputDataset, which didn't look like an improvement to me.
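For readers following the thread, here is a rough sketch of what the generic variant under discussion could look like. It uses simplified stand-in types (InputDs, OutputDs) and hypothetical method names (renameAll, renameInputs, renameOutputs), not the real OpenLineage classes or the code in this PR. It illustrates the point made above: the loop can be shared, but each dataset type still needs its own rebuild step.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Sketch only: simplified stand-ins, not the OpenLineage client types.
class DatasetRenameSketch {

    // Hypothetical input dataset: a name plus an input-only facet.
    static final class InputDs {
        final String name;
        final String inputFacet;
        InputDs(String name, String inputFacet) {
            this.name = name;
            this.inputFacet = inputFacet;
        }
    }

    // Hypothetical output dataset: a name plus an output-only facet.
    static final class OutputDs {
        final String name;
        final String outputFacet;
        OutputDs(String name, String outputFacet) {
            this.name = name;
            this.outputFacet = outputFacet;
        }
    }

    // Shared iteration: rename each dataset, rebuilding it only when the name changed.
    static <D> List<D> renameAll(
            List<D> datasets,
            Function<D, String> nameOf,
            UnaryOperator<String> rename,
            BiFunction<D, String, D> rebuild) {
        return datasets.stream()
            .map(d -> {
                String oldName = nameOf.apply(d);
                String newName = rename.apply(oldName);
                return newName.equals(oldName) ? d : rebuild.apply(d, newName);
            })
            .collect(Collectors.toList());
    }

    // Type-specific part for inputs: copies the input-only facet.
    static List<InputDs> renameInputs(List<InputDs> inputs, UnaryOperator<String> rename) {
        return renameAll(inputs, d -> d.name, rename, (d, n) -> new InputDs(n, d.inputFacet));
    }

    // Type-specific part for outputs: copies the output-only facet.
    static List<OutputDs> renameOutputs(List<OutputDs> outputs, UnaryOperator<String> rename) {
        return renameAll(outputs, d -> d.name, rename, (d, n) -> new OutputDs(n, d.outputFacet));
    }
}
```

Even in this reduced form, the two wrappers remain and each carries its own rebuild lambda, so whether the shared helper is a net gain is a judgment call.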
Problem
The removePath pattern feature is not always applied. The method is called when constructing a DatasetIdentifier through PathUtils, which does not happen for every dataset.
Closes: #2335
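As background, a minimal sketch of what applying a remove-path pattern to a dataset name might look like. It assumes the configured regular expression marks the part to drop with a named group called remove; that group name, the helper name removePath, and the sample paths are assumptions for illustration, not code from this PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a hypothetical helper, not the OpenLineage implementation.
class RemovePathSketch {

    // Returns the name with the text matched by the named group "remove" cut out.
    // The name is returned unchanged when the pattern does not match, when the
    // pattern defines no "remove" group, or when that group did not participate.
    static String removePath(Pattern pattern, String name) {
        if (pattern == null || name == null) {
            return name;
        }
        Matcher matcher = pattern.matcher(name);
        if (!matcher.find()) {
            return name;
        }
        String removed;
        try {
            removed = matcher.group("remove");
        } catch (IllegalArgumentException e) {
            // The pattern has no group named "remove".
            return name;
        }
        if (removed == null) {
            return name;
        }
        return name.substring(0, matcher.start("remove"))
            + name.substring(matcher.end("remove"));
    }
}
```

With a pattern such as `(.*)(?<remove>/year=.*)`, a name like `/warehouse/events/year=2023/month=04` would be reduced to `/warehouse/events`. The fix described here amounts to making sure a step of this kind runs for every input and output dataset, not only for those whose identifiers are built through PathUtils.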
Solution
If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).
One-line summary:
Checklist
SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project