docs: dataset usage and query history feature guide #5900

Merged
merged 4 commits into from
Oct 14, 2022

1 change: 1 addition & 0 deletions docs-website/sidebars.js
@@ -445,6 +445,7 @@ module.exports = {
"docs/domains",
"docs/how/business-glossary-guide",
"docs/tags",
"docs/features/dataset-usage-and-query-history",
{
type: "doc",
id: "docs/managed-datahub/saas-slack-setup",
75 changes: 75 additions & 0 deletions docs/features/dataset-usage-and-query-history.md
@@ -0,0 +1,75 @@
import FeatureAvailability from '@site/src/components/FeatureAvailability';

# About DataHub Dataset Usage & Query History

<FeatureAvailability/>

Dataset Usage & Query History can give dataset-level information about the top queries which referenced a dataset.

Usage data can help identify the top users who probably know the most about the dataset and top queries referencing this dataset.
You can also get an overview of the overall number of queries and distinct users.
In some sources, column level usage is also calculated, which can help identify frequently used columns.

With sources that support usage statistics, you can collect usage for Datasets, Dashboards, and Charts.

## Dataset Usage & Query History Setup, Prerequisites, and Permissions

To ingest Dataset Usage & Query History data, first check the documentation for your specific source
to confirm that the DataHub source supports it and to learn how to enable it.

You can validate this in the capabilities section of the DataHub source's documentation:
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/source-snowflake-capabilities.png"/>
</p>

There are some sources where you have to use a different usage specific source for usage ingestion. In this

Review comment (Collaborator):

This sentence is a bit confusing. Consider rephrasing to:

> Some sources require a separate, usage-specific recipe to ingest Usage and Query History metadata. In this case, it is noted in the capabilities summary, like so:

case it is noted on the capabilities summary like in the example below.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/source-redshift-capabilities.png"/>
</p>

Always check the source's usage prerequisites page, if it has one, because you may need to grant additional
permissions that are required only for usage ingestion.
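
The two setup cases described in this section can be sketched as ingestion recipes built as Python dictionaries. This is an illustrative sketch, not an official example: `make_recipe` is a hypothetical helper, connection settings and credentials are omitted, and the `include_usage_stats` option and the `redshift-usage` source type are assumptions that should be confirmed against the capabilities section of each source's documentation.

```python
from typing import Any, Dict

Recipe = Dict[str, Any]


def make_recipe(
    source_type: str,
    source_config: Dict[str, Any],
    server: str = "http://localhost:8080",  # assumed local DataHub endpoint
) -> Recipe:
    """Hypothetical helper: wrap a source config in a recipe with a REST sink."""
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


# Case 1: the main source collects usage when a flag is enabled.
# The option name is an assumption; check the source's configuration docs.
snowflake_recipe = make_recipe("snowflake", {"include_usage_stats": True})

# Case 2: usage is ingested by a separate, usage-specific source.
# The source type is an assumption; check the capabilities summary.
redshift_usage_recipe = make_recipe("redshift-usage", {})

# With the acryl-datahub package installed and connection settings added,
# a recipe dict like these can be run programmatically, for example:
#   from datahub.ingestion.run.pipeline import Pipeline
#   pipeline = Pipeline.create(snowflake_recipe)
#   pipeline.run()
#   pipeline.raise_from_status()
```

In the second case, the usage recipe typically runs alongside the regular metadata recipe for the same source, so both need the permissions described on their respective prerequisites pages.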

## Using Dataset Usage & Query History

After successful ingestion, the Query tab will be enabled on datasets with any usage.

Review comment (Collaborator):

edit: the Queries tab

Review comment (Collaborator):

After successful ingestion, will both the Queries and Stats tab be populated?

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/feature-queries-tab.png"/>
</p>

On the query tab, you can see the top 5 queries which referenced this dataset.

Review comment (Collaborator):

edit: On the Queries tab

Review comment (Collaborator):

Suggestion: can we add context about how we choose the top 5 queries? is it based on number of times executed? over what period of time?

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/feature-query-history-page.png"/>
</p>

On the Stats tab, you can see the top users who ran the most queries referencing this dataset:
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/feature-usage-stats-tab.png"/>
</p>

With the collected usage data, you can even see column-level usage statistics (on sources that support this):

Review comment (Collaborator):

> (on sources that support this):

How can people tell which sources support column-level usage?

Review comment (Collaborator):

Is this image correct? I don't see column level usage

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/feature-usage-stats-tab.png"/>
</p>

## Additional Resources

### Videos

**DataHub 101: Data Profiling and Usage Stats 101**
<p align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/d4S7RgWUg5U?start=254" title="DataHub 101: Data Profiling" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</p>

### GraphQL

- <https://datahubproject.io/docs/graphql/objects#usageaggregationmetrics>
- <https://datahubproject.io/docs/graphql/objects#userusagecounts>
- <https://datahubproject.io/docs/graphql/objects#dashboardstatssummary>
- <https://datahubproject.io/docs/graphql/objects#dashboarduserusagecounts>
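
As a rough illustration of how the objects linked above might be used, the sketch below builds a GraphQL request body for a dataset's usage statistics and ranks users from a response. It rests on assumptions: the `usageStats(range: MONTH)` field and its sub-fields should be checked against the GraphQL reference for your DataHub version, and `build_payload` and `top_users` are hypothetical helper names. No network call is made; the payload would be POSTed to your deployment's GraphQL endpoint with an access token.

```python
import json
from typing import Any, Dict, List, Tuple

# Query shape is an assumption based on the linked GraphQL objects.
USAGE_QUERY = """
query datasetUsage($urn: String!) {
  dataset(urn: $urn) {
    usageStats(range: MONTH) {
      aggregations {
        uniqueUserCount
        totalSqlQueries
        users {
          count
          user { urn }
        }
      }
    }
  }
}
"""


def build_payload(dataset_urn: str) -> str:
    """Serialize the GraphQL request body for one dataset URN."""
    return json.dumps({"query": USAGE_QUERY, "variables": {"urn": dataset_urn}})


def top_users(response: Dict[str, Any], limit: int = 5) -> List[Tuple[str, int]]:
    """Return (user urn, query count) pairs, highest count first."""
    aggregations = response["data"]["dataset"]["usageStats"]["aggregations"]
    counts = [
        ((entry.get("user") or {}).get("urn", "unknown"), entry["count"])
        for entry in aggregations.get("users") or []
    ]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:limit]
```

Sorting on the client side keeps the sketch independent of whatever order the server returns users in.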

## FAQ and Troubleshooting

*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!*