# Notes from great_expectations workflow experiments: replace now-obsolete CLI workflow #132
### Connecting to a data source

The prior CLI workflow guided you through some prompts, offering options for the kind of data to connect to (filesystem or relational database), then options for either the execution engine (for filesystem datasources) or the database backend (for SQL datasources), and then it generated a jupyter notebook to finish the configuration. The new mode for connecting to a (postgres) datasource is described here, and it points to this other documentation page for guidance on "securely" storing connection credentials. I'll see how cumbersome it is to reference the cached datasource and data connector

```python
import great_expectations as gx

datasource_name = "where_house_source"
data_connector_name = "data_raw_inferred_data_connector_name"

context = gx.get_context()
datasource = context.get_datasource(datasource_name)
data_connector = datasource.data_connectors[data_connector_name]
```

but the new documentation seems largely focused on a batch-request-plus-validator flow

```python
from great_expectations.core.batch import BatchRequest

expectation_suite_name = "data_raw.cook_county_parcel_sales.warning"
data_asset_name = "data_raw.cook_county_parcel_sales"

batch_request = {
    "datasource_name": datasource_name,
    "data_connector_name": data_connector_name,
    "data_asset_name": data_asset_name,
    "limit": 1000,
}
validator = context.get_validator(
    batch_request=BatchRequest(**batch_request),
    expectation_suite_name=expectation_suite_name,
)
```

I'll experiment more to see if it's worth switching to the new flow or if I should maintain the block-config + notebook workflow.
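Before deciding, it helps to see what that cached data connector can actually serve. A minimal sketch, assuming the legacy block-config API's `get_available_data_asset_names()` method on the connector from the snippet above:

```python
# List the data assets the cached (block-config) data connector can serve.
asset_names = data_connector.get_available_data_asset_names()
print(f"{len(asset_names)} assets available via {data_connector_name}:")
for name in sorted(asset_names):
    print(f"  - {name}")
```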
### Fluent Datasource definition

I added an env-var to the airflow dot-env file (it was the same as …) and registered the datasource with it

```python
import great_expectations as gx
from great_expectations.exceptions import DataContextError

datasource_name = "fluent_dwh_source"

context = gx.get_context()
try:
    datasource = context.sources.add_sql(
        name=datasource_name, connection_string="${GX_DWH_DB_CONN}"
    )
except DataContextError:
    datasource = context.get_datasource(datasource_name)
```

and that fluent-style datasource can register both table assets and query assets

```python
table_asset = datasource.add_table_asset(
    name="data_raw.temp_chicago_food_inspections",
    schema_name="data_raw",
    table_name="temp_chicago_food_inspections",
)
query_asset = datasource.add_query_asset(
    name="food_inspection_results_by_zip_code",
    query="""
        SELECT count(*), results, zip
        FROM data_raw.temp_chicago_food_inspections
        GROUP BY results, zip
        ORDER BY count DESC
    """,
)
```

And after the table or query is registered, you can just reference it by name (see the sketch after this comment). Sidenote: I should add tasks to the general pipeline that register tables as assets if they don't already exist. I could even set up a task to run every time that simply does nothing if the asset is already registered

```python
table_name = "temp_chicago_food_inspections"
schema_name = "data_raw"

if f"{schema_name}.{table_name}" not in datasource.get_asset_names():
    table_asset = datasource.add_table_asset(
        name=f"{schema_name}.{table_name}",
        schema_name=schema_name,
        table_name=table_name,
    )
else:
    print(f"A DataAsset named {schema_name}.{table_name} already exists.")
```
### Setting Expectations with a …
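Continuing the sketch above: attach expectations through the validator and save the suite. The `expect_*` and `save_expectation_suite` methods are standard gx Validator API; the column names and value set are assumptions based on the food-inspections table, not values confirmed in this repo.

```python
# Attach expectations to the validator from the sketch above. Column names
# and the value set are assumptions about the food-inspections table.
validator.expect_column_values_to_not_be_null(column="zip")
validator.expect_column_values_to_be_in_set(
    column="results", value_set=["Pass", "Fail", "Pass w/ Conditions"]
)

# Persist the suite so checkpoints can reference it later.
validator.save_expectation_suite(discard_failed_expectations=False)
```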
Three days ago, the Great Expectations team published a blog post titled A fond farewell to the CLI, announcing that they've recently introduced new functionality (Fluent Datasources) that makes their CLI-to-jupyter-notebook workflow obsolete. I already built that CLI-to-notebook workflow into this system's developer workflow, but I agree that it's rather cumbersome, so I'm keen to explore the new functionality and hopefully improve the jupyter notebook workflow beyond what was generated by the gx CLI tools.

I'll largely use this issue to document experiments, useful bits, and usage notes, and I'll close it when I've merged in documentation reflecting the new workflow.