Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest): extract powerbi endorsements to tags #6638

Merged
merged 13 commits into from
Jan 18, 2023

Conversation

looppi
Copy link
Contributor

@looppi looppi commented Dec 5, 2022

Extract endorsements from workspace scan result to datasets, reports or dashboards.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Dec 5, 2022
@github-actions
Copy link

github-actions bot commented Dec 5, 2022

Unit Test Results (build & test)

621 tests   617 ✔️  16m 8s ⏱️
157 suites      4 💤
157 files        0

Results for commit e8d4780.

@github-actions
Copy link

github-actions bot commented Dec 5, 2022

Unit Test Results (metadata ingestion)

       8 files         8 suites   59m 11s ⏱️
   767 tests    765 ✔️ 2 💤 0
1 536 runs  1 531 ✔️ 5 💤 0

Results for commit e8d4780.

@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Dec 6, 2022
@looppi looppi changed the title DRAFT feat(ingest): extract powerbi endorsements to tags feat(ingest): extract powerbi endorsements to tags Jan 10, 2023
@@ -16,6 +16,8 @@ source:
client_secret: bar
# Enable / Disable ingestion of user information for dashboards
extract_ownership: true
# Enable / Disable ingestion of endorsements
extract_endorsements_to_tags: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can keep it true by default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per John's comment, I'll keep this disabled by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

@@ -633,57 +673,6 @@ def get_pages_by_report(
for raw_instance in response_dict["value"]
]

def get_reports(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any specific reason to delete this method?

Copy link
Contributor Author

@looppi looppi Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't called anymore as the reports were parsed from the workspace scan. No other reason, basically the logic moved to the part where the scan result is handled.

The previous implementation was gathering the reports through the REST API, the current the API implementation doesn't include endorsementDetails in the payload, only the workspace scan has that data. That's the main reason why I ended up rewriting the logic.

Highly frustrating to have some data in the workspace scan and some data in the REST API.

@@ -422,6 +443,16 @@ def chart_custom_properties(dashboard: PowerBiAPI.Dashboard) -> dict:
if owner_mcp is not None:
list_of_mcps.append(owner_mcp)

if self.__config.extract_endorsements_to_tags and dashboard.tags:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we go for the append method? where append the MCP if flag is enabled to avoid checking at multiple places

Copy link
Contributor Author

@looppi looppi Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning something like the tags in the dataclasses are populated only when extract_endorsements_to_tags==True?

Then the code would look something like this, right?

if dashboard.tags:
    list_of_mcps.append(
        self.new_mcp(
            Constant.DASHBOARD,
            dashboard_urn,
            Constant.GLOBAL_TAGS,
            self.transform_tags(dashboard.tags)
            tags,
        )
    )

@@ -139,6 +140,10 @@ class PowerBiAPIConfig(EnvBasedSourceConfigBase):
extract_lineage: bool = pydantic.Field(
default=True, description="Whether lineage should be ingested"
)
# Enable/Disable extracting endorsements to tags
extract_endorsements_to_tags: bool = pydantic.Field(
default=True, description="Whether to extract endorsements to tags"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we default this to "False"? Most people may be surprised by this behavior, since this can overwrite existing Tags defined in DataHub

@@ -139,6 +140,10 @@ class PowerBiAPIConfig(EnvBasedSourceConfigBase):
extract_lineage: bool = pydantic.Field(
default=True, description="Whether lineage should be ingested"
)
# Enable/Disable extracting endorsements to tags
extract_endorsements_to_tags: bool = pydantic.Field(
default=True, description="Whether to extract endorsements to tags"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please warn the author that enabling this will potentially overwrite Tags defined inside the DataHub application

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also mention that there's an AddTagTransformer that supports semantic addition of tags (e.g. PATCH instead of REPLACE)

dashboards: List[PowerBiAPI.Dashboard] = []
dashboard_data = self.get_dashboard_data(workspace)

for scanned_dashboard in scan_result["dashboards"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract into a separate method. This is getting too long

for scanned_dashboard in scan_result["dashboards"]:
# Iterate through response and create a list of PowerBiAPI.Dashboard
dashboard_details = next(
(x for x in dashboard_data if x["id"] == scanned_dashboard["id"]), None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ID guaranteed to be present? Should would be defensive on this check?

# Iterate through response and create a list of PowerBiAPI.Dashboard
dashboard = PowerBiAPI.Dashboard(
id=scanned_dashboard.get("id"),
isReadOnly=dashboard_details.get("isReadOnly"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isReadOnly guaranteed to be there? Always?

workspace_id=workspace.id,
workspace_name=workspace.name,
tiles=[],
users=[],
tags=self.parse_endorsement(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a safe "get"? Will endorsementDetails always be there?

@@ -877,6 +868,45 @@ def init_dashboard_tiles(workspace: PowerBiAPI.Workspace) -> None:

return None

def handle_report(report_data: dict) -> Optional[PowerBiAPI.Report]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change required?

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I'd recommend reducing the surface area of this PR. It changes too many things at once :) Any way to break it into a few smaller chunks?

Cheers
John

@looppi
Copy link
Contributor Author

looppi commented Jan 16, 2023

I've refactored the implementation a bit and brought back the original implementation to handle fetching of reports. Now the implementation collects endorsements by id in a dict to the Workspace dataclass and those dicts are used when parsing dashboards and reports. As the datasets are parsed from the scan result directly, there was no need to implement that kind of handling, rather than use the scan result data directly.

With this change, there's no need to separate [App] reports and dashboards from the scan result and it "stays true" to the original implementation.

What do you think?


By default, extracting endorsement information to tags is disabled. The feature may be useful if organization uses [endorsements](https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-endorse-content) to identify content quality.

Please note that the default implementation overwrites tags for the ingested entities, if you need to preserve existing tags, consider using a [transformer](../../../../metadata-ingestion/docs/transformer/dataset_transformer.md#simple-add-dataset-globaltags) with `semantics: PATCH` tags instead of `OVERWRITE`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great!

Going to give this a big old "LGTM". Thanks for the hard work!

@jjoyce0510 jjoyce0510 merged commit 87b3a5d into datahub-project:master Jan 18, 2023
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Jan 18, 2023
ericyomi pushed a commit to ericyomi/datahub that referenced this pull request Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants