Mock qa data #6465

Merged
merged 37 commits into from
Mar 19, 2024
Commits
0ca11ad
Initial automatic running of all ingestion tests
Thingus Feb 29, 2024
537296c
Tidying up qa_check script
Thingus Feb 29, 2024
fd2fef9
Moving qa check script to testdata/bin
Thingus Feb 29, 2024
2e248a6
Argparse in run_qa_checks
Thingus Feb 29, 2024
210fc5f
Adding to ingestion scripts
Thingus Feb 29, 2024
b1b3c46
Adding QA check to circleci to check build
Thingus Mar 1, 2024
ffbccc5
Update flowdb/testdata/bin/run_qa_checks.py
Thingus Mar 11, 2024
207895c
Merge branch 'master' into mock_qa_data
Thingus Mar 11, 2024
774b414
Comments from review
Thingus Mar 11, 2024
2b9e843
Relocking pipfile
Thingus Mar 11, 2024
26a53e2
CHANGELOG.md
Thingus Mar 11, 2024
8a81141
Shifting jinja to test pipenv + relocking
Thingus Mar 12, 2024
7d549fd
Comments from PR (wip)
Thingus Mar 12, 2024
cef5b4d
Get default events tables and dates from flowdb
Thingus Mar 13, 2024
afc5d7d
Merge branch 'master' into mock_qa_data
Thingus Mar 13, 2024
a23b9f0
Moving run_qa_checks to bin + adding to dockerignoreignore
Thingus Mar 14, 2024
289ce8b
Merge branch 'mock_qa_data' of https://github.com/Flowminder/FlowKit …
Thingus Mar 14, 2024
bf1d1e2
Update flowdb/testdata/bin/run_qa_checks.py
Thingus Mar 14, 2024
91b78ae
Jinja2 and relocking for synthdata
Thingus Mar 14, 2024
2cf85a5
Merge branch 'mock_qa_data' of https://github.com/Flowminder/FlowKit …
Thingus Mar 14, 2024
46eb93e
Adding flowetl qa templates to flowdb test data containers
Thingus Mar 14, 2024
470335d
Maybe its those commas causing trouble
Thingus Mar 14, 2024
ae01aaf
Now using pop and pushdir in test_data scripts
Thingus Mar 14, 2024
6f074b5
Some container messing
Thingus Mar 15, 2024
8349130
pushd not pushdir
Thingus Mar 15, 2024
9f4b8fa
Pipenv messing
Thingus Mar 15, 2024
09cac3c
Ah, the pipfile is in the root
Thingus Mar 15, 2024
c10739a
Lets try an explicity path
Thingus Mar 15, 2024
57ca678
run_qa_checks now runs on local container
Thingus Mar 15, 2024
0b265e1
More prints + popenv run
Thingus Mar 18, 2024
4e837a9
cd instead of push/popd
Thingus Mar 18, 2024
8bd4223
Minor fixes
Thingus Mar 18, 2024
2b7c409
Explcit copy of run_qa_checks.py for synth data
greenape Mar 18, 2024
9cf2611
Adding run_qa_checks.py to synth data image
Thingus Mar 18, 2024
15d5928
Merge branch 'mock_qa_data' of https://github.com/Flowminder/FlowKit …
Thingus Mar 18, 2024
a346ddd
Removing extra COPY command
Thingus Mar 18, 2024
720b4a9
Merge branch 'master' into mock_qa_data
Thingus Mar 18, 2024
1 change: 1 addition & 0 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -35,6 +35,7 @@ defaults:
FLOWAPI_FLOWDB_USER: flowapi
FLOWMACHINE_FLOWDB_PASSWORD: foo
FLOWAPI_FLOWDB_PASSWORD: foo
TEST_QA_CHECK: 1
Review comment (Member):
Hm. I kind of feel like this should default to on really.

- &wait_for_flowdb
name: Wait for flowdb to start
command: |
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
## [Unreleased]

### Added
- Test and synthetic data generators now perform QA checks on the generated data. [#6467](https://github.com/Flowminder/FlowKit/issues/6467)

### Changed

1 change: 1 addition & 0 deletions development_environment
@@ -122,6 +122,7 @@ SUBSCRIBERS_SEED=12345
CALLS_SEED=22222
CELLS_SEED=99999
OUTPUT_ROOT_DIR=/docker-entrypoint-initdb.d
SKIP_TEST_QA_CHECK=False

# Integration tests

17 changes: 4 additions & 13 deletions flowdb/Pipfile.lock

Some generated files are not rendered by default.

5 changes: 5 additions & 0 deletions flowdb/testdata/bin/9900_ingest_synthetic_data.sh
@@ -76,3 +76,8 @@ else
    echo "Must set SYNTHETIC_DATA_GENERATOR environment variable to 'sql' or 'python'."
    exit 1
fi
# ${VAR,,} lowercases the value, so "True", "TRUE" and "true" all skip the checks
if [ "${SKIP_TEST_QA_CHECK,,}" != "true" ]; then
    cd /docker-entrypoint-initdb.d
    echo "Running QA checks on test data"
    pipenv run python run_qa_checks.py qa_checks
fi
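
The ingestion script calls `run_qa_checks.py` with only the template directory. As a rough sketch of what that invocation means for the argument parser: the `build_parser` helper and the example dates below are assumptions made for illustration, and the date converter uses `datetime.strptime(...).date()`. When `--dates` and `--event_types` are omitted they come back as `None`, which the script treats as "look them up in flowdb".

```python
import argparse
from datetime import date, datetime


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper that mirrors the arguments of run_qa_checks.py.
    parser = argparse.ArgumentParser(
        description="Runs all flowetl checks for ingested data"
    )
    parser.add_argument("template_path", help="Path to the QA templates")
    parser.add_argument(
        "--dates",
        # `datetime` is the class, so strptime is called on it directly.
        type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
        nargs="*",
    )
    parser.add_argument("--event_types", nargs="*")
    return parser


parser = build_parser()

# What the ingestion scripts run: only the template directory is given.
defaults = parser.parse_args(["qa_checks"])
print(defaults.template_path, defaults.dates, defaults.event_types)

# A run restricted to two dates and one event type (values are made up).
explicit = parser.parse_args(
    ["qa_checks", "--dates", "2016-01-01", "2016-01-02", "--event_types", "calls"]
)
print(explicit.dates, explicit.event_types)
```

With `nargs="*"`, several values follow a single `--dates` flag; repeating the flag would replace the earlier values rather than extend them.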
7 changes: 7 additions & 0 deletions flowdb/testdata/bin/9900_ingest_test_data.sh
@@ -31,3 +31,10 @@ if [ $count != 0 ]; then
echo "$DIR is empty."
fi
fi

# ${VAR,,} lowercases the variable on expansion
if [ "${SKIP_TEST_QA_CHECK,,}" != "true" ]; then
    echo "Running qa checks in /docker-entrypoint-initdb.d/qa_checks"
    cd /docker-entrypoint-initdb.d
    pipenv run python run_qa_checks.py qa_checks
fi
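
To make the skip logic concrete, here is a small Python sketch of the same test that the shell condition performs. The `should_run_qa_checks` helper is hypothetical and exists only for this illustration; it is not part of the PR.

```python
import os
from typing import Mapping


def should_run_qa_checks(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: True unless SKIP_TEST_QA_CHECK is 'true' in any letter case."""
    # Mirrors `[ "${SKIP_TEST_QA_CHECK,,}" != "true" ]`: an unset variable
    # expands to the empty string, which is not "true", so the checks run.
    return env.get("SKIP_TEST_QA_CHECK", "").lower() != "true"


for value in (None, "False", "true", "TRUE", "1"):
    env = {} if value is None else {"SKIP_TEST_QA_CHECK": value}
    print(repr(value), should_run_qa_checks(env))
```

Only the literal word "true" skips the checks. Leaving the variable unset, or setting it to `False` as in `development_environment`, runs them, and so does a value such as `1`.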
138 changes: 138 additions & 0 deletions flowdb/testdata/bin/run_qa_checks.py
@@ -0,0 +1,138 @@
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

from dataclasses import asdict, dataclass
from datetime import date, datetime
from itertools import product
from pathlib import Path
from typing import List
from jinja2 import Environment, FileSystemLoader, Template
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine
import os
import argparse


update_template_string = """
INSERT INTO etl.post_etl_queries
(cdr_date, cdr_type, type_of_query_or_check, outcome, optional_comment_or_description, timestamp)
VALUES(
'{{cdr_date}}',
'{{cdr_type}}',
'{{type_of_query_or_check}}',
({{outcome_query}}),
'{{optional_comment_or_description}}',
'{{timestamp}}'
)

"""


@dataclass
class QaTemplate:
    display_name: str
    template: Template
    event_type: str


@dataclass
class QaRow:
    cdr_date: date
    cdr_type: str
    type_of_query_or_check: str
    outcome_query: str
    optional_comment_or_description: str
    timestamp: datetime


@dataclass
class MockQaScenario:
    dates: List[date]
    tables: List[str]


def render_qa_check(template: Template, date: date, cdr_type: str) -> str:
    """Render one QA check template against the events table for a date and CDR type."""
    return template.render(
        final_table=f"events.{cdr_type}_{date.strftime('%Y%m%d')}",
        cdr_type=cdr_type,
        ds=date.strftime("%Y-%m-%d"),
    )


def get_available_tables(engine: Engine) -> List[str]:
    """Return the names of the event tables that flowdb reports as having subscribers."""
    with engine.begin() as conn:
        tables = conn.execute(
            text("SELECT table_name FROM available_tables WHERE has_subscribers")
        )
        return [t[0] for t in tables.all()]


def get_available_dates(engine: Engine) -> List[date]:
    """Return every date listed in etl.available_dates."""
    with engine.begin() as conn:
        dates = conn.execute(text("SELECT cdr_date FROM etl.available_dates"))
        return [d[0] for d in dates.all()]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Runs all flowetl checks for ingested data"
    )
    parser.add_argument("template_path", help="Path to the QA templates")
    parser.add_argument(
        "--dates",
        # `datetime` is the class imported above, so strptime is called on it directly
        type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
        help="Dates (YYYY-MM-DD) to run ingestion checks on. Accepts multiple values.",
        nargs="*",
    )
    parser.add_argument(
        "--event_types", help="Event tables to run qa checks on.", nargs="*"
    )
    args = parser.parse_args()
    env = Environment(loader=FileSystemLoader(args.template_path))
    print(f"Loaded {len(env.list_templates())} templates")
    update_template = env.from_string(update_template_string)
    db_user = os.environ["POSTGRES_USER"]
    conn_str = f"postgresql://{db_user}@/flowdb"
    engine = create_engine(conn_str)
    print(f"Connecting to flowdb on {conn_str}.")

    # Fall back to what flowdb reports when no dates or event types are given
    dates = get_available_dates(engine) if not args.dates else args.dates

    event_types = (
        get_available_tables(engine) if not args.event_types else args.event_types
    )

    qa_scn = MockQaScenario(dates=dates, tables=event_types)

    # A template's parent directory names the event type it applies to;
    # templates at the top level apply to every event type ("any").
    templates = (
        QaTemplate(
            Path(t).name,
            env.get_template(t),
            # Kept as a str: a Path never compares equal to the cdr_type strings
            str(Path(t).parent) if Path(t).parent != Path(".") else "any",
        )
        for t in env.list_templates(extensions=["sql"])
    )

    qa_rows = (
        QaRow(
            date,
            cdr_type,
            template.display_name,
            render_qa_check(template.template, date, cdr_type),
            "Made from mock data",
            datetime.now(),
        )
        for date, cdr_type, template in product(qa_scn.dates, qa_scn.tables, templates)
        if template.event_type in [cdr_type, "any"]
    )

    with engine.begin() as conn:
        for row in qa_rows:
            print(
                f"Running {row.type_of_query_or_check} for cdr type {row.cdr_type} date {row.cdr_date}"
            )
            conn.execute(text(update_template.render(**asdict(row))))

        out = conn.execute(text("SELECT * FROM etl.post_etl_queries LIMIT 10"))
        print(out.fetchall())
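
The script pairs every date and event table with every template, then keeps only the combinations where the template's directory matches the event type or the template sits at the top level. A standard-library sketch of that selection follows; `TemplateInfo`, `event_type_for`, `plan_checks` and the template names are hypothetical stand-ins, and it assumes template names use `/` as the separator. One detail worth noting is that a pathlib path never compares equal to a plain string, so the event type is kept as a `str` here.

```python
from dataclasses import dataclass
from datetime import date
from itertools import product
from pathlib import PurePosixPath
from typing import List, Sequence, Tuple


@dataclass
class TemplateInfo:
    # Hypothetical stand-in for QaTemplate, without the Jinja template itself.
    display_name: str
    event_type: str


def event_type_for(template_name: str) -> str:
    # The parent directory names the event type; top-level templates apply to all.
    parent = PurePosixPath(template_name).parent
    return str(parent) if parent != PurePosixPath(".") else "any"


def plan_checks(
    dates: Sequence[date], tables: Sequence[str], template_names: Sequence[str]
) -> List[Tuple[date, str, str]]:
    templates = [
        TemplateInfo(PurePosixPath(t).name, event_type_for(t)) for t in template_names
    ]
    # Cartesian product of dates, tables and templates, filtered by event type.
    return [
        (d, cdr_type, tpl.display_name)
        for d, cdr_type, tpl in product(dates, tables, templates)
        if tpl.event_type in (cdr_type, "any")
    ]


plan = plan_checks(
    dates=[date(2016, 1, 1)],
    tables=["calls", "sms"],
    template_names=["count_added_rows.sql", "calls/count_duration.sql"],
)
for entry in plan:
    print(entry)
```

In this made-up scenario the top-level template runs for both `calls` and `sms`, while the template under `calls/` runs only for `calls`, giving three planned checks.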
2 changes: 1 addition & 1 deletion flowdb/testdata/synthetic_data/Pipfile
@@ -11,7 +11,7 @@ structlog = "*"
tohu = "==0.6.7"
numpy = "<=1.26.2" # Tohu uses float division where it should be using int division, and hence passes a float where numpy expects an int (https://github.com/maxalbert/tohu/blob/3adf0c58b13ef1e1d716d7d613484d2adc58fb60/tohu/v6/primitive_generators.py#L335)
# This used to work, but doesn't as of numpy 1.26.0 (although I haven't managed to track down the relevant change or find a corresponding issue or changelog entry)

jinja2 = "*"
[dev-packages]
black = {extras = ["jupyter"],version = "==24.2.0"}
