Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ui) Add upgrade step to enable CLL impact analysis for existing data #6427

Conversation

chriscollins3456
Copy link
Collaborator

Adds a new upgrade step for restoring indices for the upstreamLineage aspects as well as the inputFields aspect. The upstreamLineage aspect has fineGrainedLineage that needs to be reindexed with our new logic for impact analysis. Same this with inputFields. After this, users will have their graph index set up properly to do impact analysis with all columns.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions
Copy link

github-actions bot commented Nov 14, 2022

Unit Test Results (build & test)

616 tests  +3   612 ✔️ +3   15m 43s ⏱️ + 3m 55s
152 suites +1       4 💤 ±0 
152 files   +1       0 ±0 

Results for commit 21434f1. ± Comparison against base commit 7ea9797.

♻️ This comment has been updated with latest results.


private int getAndRestoreUpstreamLineageIndices(int start, AuditStamp auditStamp)
throws Exception {
final AspectSpec upstreamLineageAspectSpec = _entityRegistry.getEntitySpec(Constants.DATASET_ENTITY_NAME)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this AspectSpec? Is there any other method on entityService to produce MCL that doesn't require the AspectSpec?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm all the methods I'm seeing from entityService that produce an MCL use an aspect spec. is there something specific you had in mind/know about?

final AspectSpec upstreamLineageAspectSpec = _entityRegistry.getEntitySpec(Constants.DATASET_ENTITY_NAME)
.getAspectSpec(Constants.UPSTREAM_LINEAGE_ASPECT_NAME);
SearchResult datasetResult =
_entitySearchService.search(Constants.DATASET_ENTITY_NAME, "", null, null, start, BATCH_SIZE);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a single entity type, this will only work up to 10k entities. After 10k, elastic doesn't allow us to paginate by default

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's something to note. If you want fuller ability to paginate you can try to use the "scroll" method.

I'm also wondering if there's a way to do this that doesn't require that we go to elasticsearch. There is a listLatestAspects method on the EntityService class that would allow you get the latest of a specific aspect for every URN. That may help here...

Also, is there any Field-Level Lineage in DataJob -> Dataset that we need to worry about?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(listLatestAspects will hit MySQL in a slightly more performant query)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It's used in the browse path upgrade step as well)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! moved to using listLatestAspects - gonna be faster and more efficient overall


return entityResult.getNumEntities();
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Consider adding @nullable

public RestoreColumnLineageIndices(EntityService entityService, EntitySearchService entitySearchService,
EntityRegistry entityRegistry) {
super(entityService, VERSION, UPGRADE_ID);
_entitySearchService = entitySearchService;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor -> consider adding

_entitySearchService = Objects.requireNonNull(entitySearchService, 'entitySearchService must not be null'

Basically this allows us to fail faster (in the constructor at init time) as opposed to waiting for an NPE to arise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add @nonnull and final annotations on these parameters!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally

final Urn upgradeEntityUrn = UrnUtils.getUrn(COLUMN_LINEAGE_UPGRADE_URN);
com.linkedin.upgrade.DataHubUpgradeRequest upgradeRequest = new com.linkedin.upgrade.DataHubUpgradeRequest().setVersion(version);
Map<String, EnvelopedAspect> upgradeRequestAspects = new HashMap<>();
upgradeRequestAspects.put(Constants.DATA_HUB_UPGRADE_REQUEST_ASPECT_NAME, new EnvelopedAspect().setValue(new Aspect(upgradeRequest.data())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very thorough! Nice!!

public RestoreColumnLineageIndices(EntityService entityService, EntitySearchService entitySearchService,
EntityRegistry entityRegistry) {
super(entityService, VERSION, UPGRADE_ID);
_entitySearchService = entitySearchService;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add @nonnull and final annotations on these parameters!

import java.util.HashSet;
import java.util.Map;

public class RestoreColumnLineageIndicesTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these tests!

Copy link
Contributor

@aditya-radhakrishnan aditya-radhakrishnan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@aditya-radhakrishnan aditya-radhakrishnan added the product PR or Issue related to the DataHub UI/UX label Nov 15, 2022
@chriscollins3456 chriscollins3456 merged commit e8aaf2a into datahub-project:master Nov 15, 2022
cccs-Dustin pushed a commit to CybercentreCanada/datahub that referenced this pull request Feb 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants