-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest): automated term classification for snowflake #6376
feat(ingest): automated term classification for snowflake #6376
Conversation
50746cd
to
93a64ca
Compare
8f95490
to
37b6394
Compare
@@ -3,6 +3,7 @@ | |||
from datetime import datetime | |||
from typing import Dict, List, Optional | |||
|
|||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this should be a typing-only dep - see mypy's docs on TYPE_CHECKING
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also I believe this PR acryldata/datahub-classify#6 is going to remove the pandas dep from datahub-classify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are using pandas to read from snowflake cursor and translate to pandas dataframe in snowflake- https://docs.snowflake.com/en/user-guide/python-connector-pandas.html#migrating-to-pandas-dataframes. so this is required dependency for snowflake connector. I am not sure how converting this to typing-only dep will be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it - in this case it's fine then, but I don't want to have a hard dep on pandas when we add classification to other sources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I believe that would need refractor in classification_mixin as well. Maybe, we can replace sample_data
input param with a callable function that takes column as input and returns list-like values structure as output ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, although I suspect a dict {column name -> list[sample values]} is good enough for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems fair.
# Ideally we do not want null values in sample data for a column. | ||
# However that would require separate query per column and | ||
# that would be expensive, hence not done. | ||
def get_sample_values_for_table(self, conn, table_name, schema_name, db_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this repeat work that the profiler does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of, but not exactly. With current implementation(with GE), profile can get only limited number of sample values(upto 20).
Checklist