feat(ingest): automated term classification for snowflake #6376

mayurinehate · 2022-11-07T13:27:14Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions · 2022-11-07T13:50:22Z

Unit Test Results (metadata ingestion)

      8 files ±0       8 suites ±0 57m 13s ⏱️ -26s
  764 tests +3   761 ✔️ +3 3 💤 ±0 0 ❌ ±0
1 530 runs +6 1 523 ✔️ +6 7 💤 ±0 0 ❌ ±0

Results for commit 15ed4f7. ± Comparison against base commit 10a31b1.

♻️ This comment has been updated with latest results.

github-actions · 2022-11-07T13:54:22Z

Unit Test Results (build & test)

621 tests ±0 617 ✔️ ±0 15m 51s ⏱️ -18s
157 suites ±0     4 💤 ±0
157 files ±0     0 ❌ ±0

Results for commit 15ed4f7. ± Comparison against base commit 10a31b1.

♻️ This comment has been updated with latest results.

hsheth2 · 2022-11-22T22:24:27Z

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py

@@ -3,6 +3,7 @@
 from datetime import datetime
 from typing import Dict, List, Optional

+import pandas as pd


nit: this should be a typing-only dep - see mypy's docs on TYPE_CHECKING

also I believe this PR acryldata/datahub-classify#6 is going to remove the pandas dep from datahub-classify
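For reference, the typing-only import pattern mentioned above can be sketched as follows. This is a minimal illustration, not code from the PR: `column_samples` is a hypothetical helper, and the annotation is quoted so that pandas is never imported at runtime.

```python
from typing import TYPE_CHECKING, Any, Dict, List

if TYPE_CHECKING:
    # Evaluated only by type checkers such as mypy; skipped at runtime,
    # so importing this module does not require pandas to be installed.
    import pandas as pd


def column_samples(df: "pd.DataFrame") -> Dict[str, List[Any]]:
    """Hypothetical helper: convert a DataFrame-like object to {column name -> values}."""
    # The quoted annotation above is never evaluated at runtime.
    return {str(col): df[col].tolist() for col in df.columns}
```

This only helps when pandas is used for annotations alone. Code that actually calls into pandas at runtime still needs the real import.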

We use pandas to read from the Snowflake cursor and convert the result to a pandas DataFrame (see https://docs.snowflake.com/en/user-guide/python-connector-pandas.html#migrating-to-pandas-dataframes), so pandas is a required dependency for the Snowflake connector. I am not sure how converting this to a typing-only dep would be useful.

got it - in this case it's fine then, but I don't want to have a hard dep on pandas when we add classification to other sources

Makes sense. I believe that would need a refactor in classification_mixin as well. Maybe we can replace the sample_data input param with a callable that takes a column as input and returns a list-like structure of values as output?

Sure, although I suspect a dict {column name -> list[sample values]} is good enough for this

seems fair.
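A rough sketch of the two shapes discussed in this thread, assuming sample rows arrive as tuples from a DB-API style cursor. The names `SampleData`, `rows_to_sample_data`, and `as_callable` are hypothetical and do not come from classification_mixin.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical alias for the dict shape: {column name -> list[sample values]}.
SampleData = Dict[str, List[Any]]


def rows_to_sample_data(
    column_names: Sequence[str], rows: Sequence[Tuple[Any, ...]]
) -> SampleData:
    """Pivot row tuples into per-column lists without using pandas."""
    samples: SampleData = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            samples[name].append(value)
    return samples


def as_callable(samples: SampleData) -> Callable[[str], List[Any]]:
    """The callable alternative: return the values for one column on demand."""
    return lambda column: samples.get(column, [])
```

The dict is the simpler of the two, and a caller that prefers the callable form can wrap a dict as shown, which is one reason the dict may be "good enough".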

hsheth2 · 2022-11-22T22:41:19Z

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py

+    # Ideally we do not want null values in sample data for a column.
+    # However that would require separate query per column and
+    # that would be expensive, hence not done.
+    def get_sample_values_for_table(self, conn, table_name, schema_name, db_name):


does this repeat work that the profiler does?

Kind of, but not exactly. With the current implementation (with GE), the profiler can get only a limited number of sample values (up to 20).
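To illustrate the trade-off in the code comment above, here is a small sketch that fetches sample rows with a single query and groups them per column. It uses an in-memory SQLite database as a stand-in for Snowflake; the function name, the `limit` parameter, and the client-side null filtering are assumptions for illustration, not the PR's implementation.

```python
import sqlite3
from typing import Any, Dict, List


def sample_values_for_table(
    conn: sqlite3.Connection, table_name: str, limit: int = 1000
) -> Dict[str, List[Any]]:
    """Fetch up to `limit` rows in one query and group the values by column."""
    # table_name is assumed to be a trusted identifier (it is interpolated, not bound).
    cur = conn.execute(f'SELECT * FROM "{table_name}" LIMIT ?', (limit,))
    names = [desc[0] for desc in cur.description]
    rows = cur.fetchall()
    # Nulls are dropped client-side, so sparse columns end up with fewer samples.
    # Guaranteeing non-null samples per column would need one query per column.
    return {
        name: [row[i] for row in rows if row[i] is not None]
        for i, name in enumerate(names)
    }
```

With this approach a mostly-null column may yield very few usable values, which is the cost of avoiding per-column queries.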

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Nov 7, 2022

mayurinehate added 2 commits November 8, 2022 22:45

feat(classification): add support for automated glossary term detection in snowflake

df14717

add integration test for classification in snowflake

93a64ca

mayurinehate force-pushed the classification_engine branch from 50746cd to 93a64ca November 8, 2022 17:15

mayurinehate marked this pull request as ready for review November 8, 2022 17:16

mayurinehate and others added 4 commits November 15, 2022 10:25

Merge branch 'master' into classification_engine

5d83dab

fix lint

5777302

add snowflake in base dev requirements

5ed71f7

update versions to resolve dependency conflict

37b6394

mayurinehate force-pushed the classification_engine branch from 8f95490 to 37b6394 November 17, 2022 16:21

mayurinehate and others added 10 commits November 18, 2022 00:56

change to reduce scipy backtracking

53d2c40

Merge branch 'master' into classification_engine

d5fb7a9

updating dependency

c6a29fe

Merge branch 'master' into classification_engine

7127cad

fix lint

dd886a5

update version to support python 3.7

548e420

pinning dependencies for python 3.7

e759c09

Merge branch 'master' into classification_engine

8d855b3

update dependencies, update snowflake to pandas logic

3b6744b

update doc

1bd5007

hsheth2 changed the title ~~feat(classification): add support for automated glossary term detecti…~~ feat(classification): automated glossary term detection in snowflake Nov 21, 2022

mayurinehate requested a review from hsheth2 November 22, 2022 06:21

Merge branch 'master' into classification_engine

15ed4f7

hsheth2 reviewed Nov 22, 2022

hsheth2 approved these changes Nov 23, 2022

hsheth2 changed the title ~~feat(classification): automated glossary term detection in snowflake~~ feat(ingest): automated term classification for snowflake Nov 23, 2022

hsheth2 merged commit 22847a9 into datahub-project:master Nov 23, 2022

mayurinehate mentioned this pull request Nov 24, 2022

feat(ingest): refractor classification mixin, support new infotypes #6545

Merged

5 tasks

cccs-Dustin pushed a commit to CybercentreCanada/datahub that referenced this pull request Feb 1, 2023

feat(ingest): automated term classification for snowflake (datahub-project#6376)

9869338
