Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest): include instance in container dataPlatform when provided #6083

Merged
merged 10 commits into from
Oct 13, 2022

Conversation

hsheth2
Copy link
Collaborator

@hsheth2 hsheth2 commented Sep 28, 2022

Supersedes #4665.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Sep 28, 2022
@github-actions
Copy link

github-actions bot commented Sep 28, 2022

Unit Test Results (metadata ingestion)

       8 files  ±0         8 suites  ±0   54m 20s ⏱️ - 4m 6s
   737 tests +1     735 ✔️ +1  2 💤 ±0  0 ±0 
1 476 runs  +2  1 472 ✔️ +2  4 💤 ±0  0 ±0 

Results for commit 94f5a20. ± Comparison against base commit 6e34cd6.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Sep 28, 2022

Unit Test Results (build & test)

597 tests  ±0   593 ✔️ ±0   11m 55s ⏱️ +7s
147 suites ±0       4 💤 ±0 
147 files   ±0       0 ±0 

Results for commit 94f5a20. ± Comparison against base commit 6e34cd6.

♻️ This comment has been updated with latest results.

@hsheth2 hsheth2 marked this pull request as ready for review September 29, 2022 00:14
@sgomezvillamor
Copy link
Contributor

When I raised this topic:

Being Containers a hierarchy on top of Datasets we feel like natural that Containers are aligned with their children Datasets for the platform_instance and the environment, so they can be filtered in the search too. No need to mention the chance of collision if two Containers have the same name for different platform_instance and/or environment.

This PR addresses the missing platform instance. What's your thought about the environment? It's likely that the eg customers database will exist in eg both DEV and PROD, how can we differentiate them?

Copy link
Contributor

@treff7es treff7es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@hsheth2
Copy link
Collaborator Author

hsheth2 commented Sep 29, 2022

@sgomezvillamor to distinguish in that case, I'd recommend setting platform instance to prod.customers and dev.customers - it's not ideal since some of the table browse paths might have a bit of redundancy.

Unfortunately we can't add an env field to the container for backwards compat reasons.

@sgomezvillamor
Copy link
Contributor

sgomezvillamor commented Sep 30, 2022

From my point of view that would be even worse because filters would become totally unaligned.
In order to find both containers and datasets it would require to search for:

  • containers with platform instance eg “prod.customers”
  • and datasets with platform instance “customers” and env “prod”

Quite complex IMO and hard to understand for the users.
Filters should be transversal for datasets and containers.

What about defining “env” as optional for containers so it keeps backward compatibility? Ingestion recipes may have a flag to enable “env” field in the containers, otherwise it remains as null and backwards compatible. Alternatively it could be a datahub deployment setting. Just anyway to opt-in and enable the feature.

@hsheth2
Copy link
Collaborator Author

hsheth2 commented Sep 30, 2022

Agree that it's not great. I chatted with the team internally and wanted to explain some of the thinking here.

Our long-term intention is that env is no longer is part the urns, but instead the platform instance is how you identify it. In the current model, "a dataset has env in its urn, and a dataset can be part of a platform instance". We want to move towards a model where "a dataset/container is in a platform instance and hence part of the urn, and the env is just an attribute of the platform instance". Basically, urns should only contain things about identity, and env isn't really identity but platform instance is.

Towards this goal, our plan is to add an env attribute to platform instances soon. Basically, you'd name your platform instance "dev_customers" or whatever, and also set env=DEV. The UI would enable search / filtering on this attribute.

In the short term I realize that it's not a great user experience. A minor correction though: because we currently include the platform instance name as part of the dataset urn, searching for "prod_customers" should surface both containers and datasets. The piece that won't work is that setting the env "prod" filter won't show any containers.

@sgomezvillamor
Copy link
Contributor

Thanks for sharing the long-term plans. The reasoning about dataset identity and so moving env to the platform instance makes sense.

It's unfortunate that we break the alignment during the transition and the cost is the user experience, as you said. The use case my organization is facing now is about glue databases and tables and it's weird not having them aligned and supporting the filters in a homogeneous way.

In order to mitigate that cost, what about adding env to the platform instance as part of the work done in this PR? Wit that,

  • container would support env via platform instance
  • and datasets would support env via both identity and platform instance

With this approach, the env is duplicated in the dataset entity and so the cost is to ensure the consistency and likely more complex graphql queries. So the cost impacts developer experience, not user experience. And also, this would mean moving a step closer to the long-term vision.

Coincidentally, I also have another PR for extending the features of the Platform Instance entity. #5728 Opportunistically sharing it here 🙄 We may easily add Factory (dev, pre, pro, ...) too 😄

@hsheth2
Copy link
Collaborator Author

hsheth2 commented Oct 4, 2022

We could do that, but I worry about making this PR huge in scope. Adding an env attribute to platform instance and/or containers isn't quite enough, since we'd also need to add support for filtering using that on the backend and frontend. I'm not sure I have bandwidth for taking that whole thing on right now, but I'd be happy to hand this PR over if you're interested.

@hsheth2 hsheth2 merged commit 09616ee into datahub-project:master Oct 13, 2022
@hsheth2 hsheth2 deleted the container-platform-instance branch October 13, 2022 18:29
cccs-tom pushed a commit to CybercentreCanada/datahub that referenced this pull request Nov 18, 2022
cccs-tom pushed a commit to CybercentreCanada/datahub that referenced this pull request Nov 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants