Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mock qa data #6465

Merged
merged 37 commits into from
Mar 19, 2024
Merged

Mock qa data #6465

merged 37 commits into from
Mar 19, 2024

Conversation

Thingus
Copy link
Contributor

@Thingus Thingus commented Feb 29, 2024

Closes #6467

I have:

  • Formatted any Python files with black
  • Brought the branch up to date with master
  • Added any relevant Github labels
  • Added tests for any new additions
  • Added or updated any relevant documentation
  • Added an Architectural Decision Record (ADR), if appropriate
  • Added an MPLv2 License Header if appropriate
  • Updated the Changelog

Description

Adds the option to run QA checks on mock data, on by default


Copy link

cypress bot commented Feb 29, 2024

Passing run #21870 ↗︎

0 4 0 0 Flakiness 0

Details:

Merge branch 'master' into mock_qa_data
Project: FlowAuth Commit: 720b4a9bc1
Status: Passed Duration: 00:42 💡
Started: Mar 18, 2024 4:01 PM Ended: Mar 18, 2024 4:02 PM

Review all test suite changes for PR #6465 ↗︎

Copy link

codecov bot commented Feb 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.94%. Comparing base (fb3478b) to head (720b4a9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6465      +/-   ##
==========================================
- Coverage   93.04%   92.94%   -0.10%     
==========================================
  Files         264      264              
  Lines       10303    10303              
  Branches      835      835              
==========================================
- Hits         9586     9576      -10     
- Misses        589      598       +9     
- Partials      128      129       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@greenape greenape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool.

Can we have an issue number, a changelog entry etc.

I'd also dispute that the qa data is mock - it is real qa about mock data, which is kind of different (i.e. we're actually doing the checks instead of just filling in some made up numbers), so maybe change the names to reflect that.

@@ -35,6 +35,7 @@ defaults:
FLOWAPI_FLOWDB_USER: flowapi
FLOWMACHINE_FLOWDB_PASSWORD: foo
FLOWAPI_FLOWDB_PASSWORD: foo
TEST_QA_CHECK: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I kind of feel like this should default to on really.

@@ -76,3 +76,6 @@ else
echo "Must set SYNTHETIC_DATA_GENERATOR environment variable to 'sql' or 'python'."
exit 1
fi
if [ "$TEST_QA_CHECK" ]; then
pipenv run python ./run_qa_checks.py --dates ${DISASTER_END:-"2015-01-01"} --event-types calls mds sms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want to just run it for all the dates in the database really?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And all event types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have internal sources of truth for those? I'd guess it's etl.available_dates for dates, but I'm not sure what it would be for the various flavours of events.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

etl.etl_records or available_tables

from itertools import product
from pathlib import Path
from typing import List
from jinja2 import Environment, PackageLoader, Template
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs adding to the Pipfile and a relock.



# NOTE: given this gets run _after_ all the ingestion, 'dates' may be an irrelevence
# unless final_table is normally one of the dated partitions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final table will almost always be a dated partition table.

for date, type, template in product(qa_scn.dates, qa_scn.tables, templates)
)

with engine.connect() as conn:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc you need an explicit commit now, or to use engine.begin().

db_port = os.getenv("POSTGRES_PORT", "9000")
db_password = os.getenv(
"POSTGRES_PASSWORD", "flowflow"
) # In here for dev, don't think we'll need it in prod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean for locally running this script I assume?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use os.environ I think, because you don't esp want it diverging from development_environment.

flowdb/testdata/bin/run_qa_checks.py Show resolved Hide resolved

env = Environment(loader=PackageLoader("flowetl", "qa_checks/qa_checks"))

# @James; if something like this already exists, I'll grab
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in the operator, so I think I'd stick to what you have here.

flowdb/Pipfile Outdated
@@ -8,6 +8,7 @@ psutil = "*"

[dev-packages]
black = {extras = ["jupyter"],version = "==24.2.0"}
jinja2 = "*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should go in flowdb/testdata/test_data rather than the root one, isn't needed in the base image, only the test/synth ones.

@@ -76,3 +76,6 @@ else
echo "Must set SYNTHETIC_DATA_GENERATOR environment variable to 'sql' or 'python'."
exit 1
fi
if [ "$TEST_QA_CHECK" ]; then
pipenv run python ./run_qa_checks.py --dates ${DISASTER_END:-"2015-01-01"} --event-types calls mds sms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

etl.etl_records or available_tables

db_port = os.getenv("POSTGRES_PORT", "9000")
db_password = os.getenv(
"POSTGRES_PASSWORD", "flowflow"
) # In here for dev, don't think we'll need it in prod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use os.environ I think, because you don't esp want it diverging from development_environment.

Thingus added 4 commits March 12, 2024 09:07
POSTGRES_DB and POSTGRES_PORT aren't in development_environment; so
hardcoded.
Not sure if this is a good idea for POSTGRES_PORT.
@Thingus Thingus added enhancement New feature or request FlowDB Issues related to FlowDB FlowETL Needs Review PR that is ready for review by a human labels Mar 13, 2024
@Thingus Thingus marked this pull request as ready for review March 13, 2024 16:05
@Thingus Thingus requested a review from greenape March 13, 2024 16:05
@@ -7,6 +7,7 @@ name = "pypi"
tohu = "*"
"geoalchemy2" = "*"
sqlalchemy = "*"
jinja2 = "*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll need to add to testdata/synthetic_data/Pipfile as well

def get_available_tables(engine: Engine):
with engine.begin() as conn:
tables = conn.execute(
"SELECT table_name FROM available_tables WHERE has_subscribers"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also just get this from the available dates one now I think of it, but this is good too :)

import argparse


env = Environment(loader=PackageLoader("flowetl", "qa_checks/qa_checks"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we need to copy these into the container as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sticking them into the dockerignore should grab it, right?

Copy link
Member

@greenape greenape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy now 👌

@Thingus Thingus merged commit ba9948f into master Mar 19, 2024
39 of 40 checks passed
@Thingus Thingus deleted the mock_qa_data branch March 19, 2024 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request FlowDB Issues related to FlowDB FlowETL Needs Review PR that is ready for review by a human
Projects
None yet
Development

Successfully merging this pull request may close these issues.

QA checks on test data
2 participants