Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK] fix removePathPattern behaviour #2350

Merged
merged 1 commit into from
Jan 8, 2024

Conversation

pawel-big-lebowski
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski commented Jan 4, 2024

Problem

removepath pattern feature is not applied all the time. The method is called when constructing DatasetIdentifier through PathUtils which is not the case all the time.

Closes: #2335

Solution

  • Move removePattern to another place in codebase which is always run,
  • Write integration test for this.

Note: All schema changes require discussion. Please link the issue for context.

  • Your change modifies the core OpenLineage model
  • Your change modifies one or more OpenLineage facets

If you're contributing a new integration, please specify the scope of the integration and how/where it has been tested (e.g., Apache Spark integration supports S3 and GCS filesystem operations, tested with AWS EMR).

One-line summary:

Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • Your comment includes a one-liner for the changelog about the specific purpose of the change (if necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)
  • You've added a header to source files (if relevant)

SPDX-License-Identifier: Apache-2.0
Copyright 2018-2023 contributors to the OpenLineage project

@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/fix-remove-path-pattern branch 3 times, most recently from 9c6311b to ca23c7a Compare January 5, 2024 08:07
@boring-cyborg boring-cyborg bot added the area:documentation Improvements or additions to documentation label Jan 5, 2024
@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review January 5, 2024 08:11
@pawel-big-lebowski pawel-big-lebowski force-pushed the spark/fix-remove-path-pattern branch from ca23c7a to 53bfa0e Compare January 5, 2024 08:14
Copy link
Member

@mobuchowski mobuchowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment but lgtm.

import lombok.Value;

@Value
public class DatasetIdentifier {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this got moved to java client some time ago

public static final String SPARK_OPENLINEAGE_DATASET_REMOVE_PATH_PATTERN =
"spark.openlineage.dataset.removePath.pattern";

public static List<OutputDataset> removeOutputsPathPattern(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe List<? extends Dataset> and implement this generically with passed dataset builder? Nit, do what you think is best.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried this approach but did not succeed.
Creating OutputDataset and InputDataset differs significantly and these have to be two separate methods.
The iterating part, that goes through the list, is common.
So a common generic method would need to call methods per OutputDataset or InputDataset, which didn't look as an improvement to me.

@pawel-big-lebowski pawel-big-lebowski merged commit 108b3c0 into main Jan 8, 2024
22 checks passed
@pawel-big-lebowski pawel-big-lebowski deleted the spark/fix-remove-path-pattern branch January 8, 2024 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:documentation Improvements or additions to documentation area:integration/spark
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apache Spark Openlineage: removepath pattern doesn't work on input datasets
2 participants