feat(ingest): add removing partition pattern in spark lineage #6605

Merged: 11 commits, Jan 24, 2023

@ssilb4 ssilb4 commented Dec 2, 2022

On AWS EMR, when I use Spark lineage, the input table is reported as an S3 location, but that location includes the partition path, so it doesn't match the datasets produced by other ingestion sources (such as Glue ingestion).
This PR adds an option to remove a partition pattern (e.g. /partition=\d+). It changes database/table/partition=123 to database/table.
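
As a rough illustration of the behaviour described above, the sketch below strips a configured regex from a dataset path. The class and method names are hypothetical and are not taken from the PR; it only assumes that the pattern is applied with a plain regex replacement.

```java
final class PartitionPatternSketch {
    // Hypothetical helper (not the PR's code): removes every match of the
    // configured regex from the dataset path.
    static String stripPartition(String path, String removePartitionPattern) {
        return path.replaceAll(removePartitionPattern, "");
    }

    public static void main(String[] args) {
        // Pattern taken from the PR description.
        String pattern = "/partition=\\d+";
        // prints "database/table"
        System.out.println(stripPartition("database/table/partition=123", pattern));
        // A path without a partition segment is left unchanged: prints "database/table"
        System.out.println(stripPartition("database/table", pattern));
    }
}
```

Because the pattern is a regular expression, a value such as /partition=\d+ only removes segments that match exactly; other partition layouts would need their own pattern.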

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@ssilb4 ssilb4 changed the title Spark lineage remove partition feat(ingest): add removing partition pattern in spark lineage Dec 2, 2022
@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Dec 6, 2022
@laulpogan laulpogan added the ingestion PR or Issue related to the ingestion of metadata label Dec 6, 2022
@laulpogan laulpogan requested a review from treff7es December 6, 2022 19:22
github-actions bot commented Dec 7, 2022

Unit Test Results (build & test)

621 tests  ±0   617 ✔️ ±0   15m 55s ⏱️ +18s
157 suites ±0       4 💤 ±0 
157 files   ±0       0 ±0 

Results for commit f07d2d9. ± Comparison against base commit df96e89.

| spark.datahub.metadata.dataset.env | | PROD | [Supported values](https://datahubproject.io/docs/graphql/enums#fabrictype). In all other cases, will fallback to PROD |
| spark.datahub.metadata.table.hive_platform_alias | | hive | By default, datahub assigns Hive-like tables to the Hive platform. If you are using Glue as your Hive metastore, set this config flag to `glue` |
| spark.datahub.metadata.include_scheme | | true | Include scheme from the path URI (e.g. hdfs://, s3://) in the dataset URN. We recommend setting this value to false, it is set to true for backwards compatibility with previous versions |
| spark.datahub.metadata.remove_partition_pattern | | | Remove the partition pattern from the path (e.g. `/partition=\d+`). It changes `database/table/partition=123` to `database/table` |

Collaborator

Please add the default value! What is the default?

Collaborator

Also what are sample values?

Contributor Author

@ssilb4 ssilb4 Jan 9, 2023

I don't think it needs a default value. If it is null, no value is changed.
I thought the description already gave a sample value. Should I add more? (value: /partition=\d+)


private static String getRemovePartitionPattern(Config datahubConfig) {
  return datahubConfig.hasPath(REMOVE_PARTITION_PATTERN) ? datahubConfig.getString(REMOVE_PARTITION_PATTERN)
      : "";

Collaborator

Minor: Why not return null? Why empty string?

Contributor Author

I changed it.
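
For readers following this thread, the null-returning lookup being discussed might look roughly like the sketch below. It is an assumption-laden stand-in: a plain `Map` replaces the `Config` object from the diff so the example runs on its own, and the key string is a guess based on the documented Spark option name rather than the PR's actual constant.

```java
import java.util.HashMap;
import java.util.Map;

final class RemovePartitionConfigSketch {
    // Assumed key; the real constant's value is not shown in this conversation.
    static final String REMOVE_PARTITION_PATTERN = "metadata.remove_partition_pattern";

    // Stand-in for the lookup in the diff: returns null (not "") when the
    // option is absent, so callers can test for "not configured" with != null.
    static String getRemovePartitionPattern(Map<String, String> datahubConfig) {
        return datahubConfig.containsKey(REMOVE_PARTITION_PATTERN)
            ? datahubConfig.get(REMOVE_PARTITION_PATTERN)
            : null;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        // Option not set: prints "null"
        System.out.println(getRemovePartitionPattern(config));

        config.put(REMOVE_PARTITION_PATTERN, "/partition=\\d+");
        // Option set: prints "/partition=\d+"
        System.out.println(getRemovePartitionPattern(config));
    }
}
```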

} else {
return uri.getHost() + uri.getPath();
String uriPath = includeScheme ? uri.toString() : uri.getHost() + uri.getPath();
if (!removePartitionPattern.equals("")) {

Collaborator

If the previous method returned null, we could check here that the pattern is not null instead of comparing it against an empty string.

Contributor Author

I changed it.
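
A minimal sketch of how the path logic could read after the suggested null check, assuming the body of the `if` (which is cut off in the hunk above) simply applies the pattern with `replaceAll`. The class name and the sample bucket are illustrative only.

```java
import java.net.URI;

final class DatasetPathSketch {
    // Mirrors the two lines visible in the hunk, with the null check the
    // reviewer suggested. The replaceAll call is an assumption about the
    // part of the diff that is not shown.
    static String getPath(URI uri, boolean includeScheme, String removePartitionPattern) {
        String uriPath = includeScheme ? uri.toString() : uri.getHost() + uri.getPath();
        if (removePartitionPattern != null) {
            return uriPath.replaceAll(removePartitionPattern, "");
        }
        return uriPath;
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3://bucket/database/table/partition=123");
        String pattern = "/partition=\\d+";

        // prints "bucket/database/table"
        System.out.println(getPath(uri, false, pattern));
        // prints "s3://bucket/database/table"
        System.out.println(getPath(uri, true, pattern));
        // Pattern not configured: prints "bucket/database/table/partition=123"
        System.out.println(getPath(uri, false, null));
    }
}
```

Checking for null keeps "option not set" distinct from any real pattern, which matches the author's point that a missing value should leave the path untouched.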

Collaborator

@jjoyce0510 jjoyce0510 left a comment

Once the requested changes are made, we will re-review.

empty string to null
because default value is null
Collaborator

@jjoyce0510 jjoyce0510 left a comment

LGTM. Waiting on final review from @treff7es

@jjoyce0510
Collaborator

Looks like the build is failing. Going to update to the latest.

@ssilb4
Contributor Author

ssilb4 commented Jan 11, 2023

Hmm, should I fix anything else? I don't understand why the smoke test failed.

Contributor

@treff7es treff7es left a comment

lgtm

@treff7es treff7es merged commit a6a597c into datahub-project:master Jan 24, 2023
ericyomi pushed a commit to ericyomi/datahub that referenced this pull request Feb 8, 2023
5 participants