Update variant nowcast target data to reflect the latest standards and decisions #296
Comments
tree_as_of is cancelled!
Moving this out of "ready for review" status pending finalization of Hubverse target data formats.
Picking this back up now that the RFC for time series target data formats has been approved. Will update the ticket's "definition of done" to reflect the outcome of today's variant nowcast hub meeting.
I read through the updated main comment just now and confirmed that it looks right to me.
Resolves #296. This changeset adds sequence_as_of and tree_as_of columns to the oracle output target data files.
Background
In the weeks since we began generating target data for the variant nowcast hub, we've identified two changes to make.
oracle output
Because the oracle output files are not partitioned by sequence_as_of, they will be overwritten each time the hub's post-submission jobs are run (if the round_id in question is still in the 14-week interim data window).
For normal hub operations, this is fine. However, if the target data jobs are run out of order (for example, a manual backfill run with an older nowcast_date), that would break the implicit assumption that oracle.parquet will always reflect the most recent nowcast_date.
We decided that partitioning oracle outputs by sequence_date makes them too onerous to access. So we can't guarantee that the file won't be overwritten, but we can add sequence_as_of information to the oracle output files as a breadcrumb.
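The breadcrumb idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in get_target_data.py: the rows are plain dicts standing in for the oracle output table, the function names stamp_as_of and would_overwrite_newer are hypothetical, and the column is called as_of on the assumption that it carries the sequence_as_of date.

```python
from datetime import date


def stamp_as_of(rows, sequence_as_of):
    """Return copies of the oracle rows with an as_of breadcrumb column added.

    rows is a list of dicts standing in for the oracle output table
    (an assumption for illustration; the real job writes parquet).
    """
    return [{**row, "as_of": sequence_as_of.isoformat()} for row in rows]


def would_overwrite_newer(existing_rows, incoming_as_of):
    """Return True if the existing rows were built from newer sequence data
    than the run that is about to replace them (an out-of-order run)."""
    existing = [
        date.fromisoformat(row["as_of"]) for row in existing_rows if "as_of" in row
    ]
    return bool(existing) and max(existing) > incoming_as_of
```

A backfill job could call would_overwrite_newer on the rows already on disk before writing, and log a warning (or skip the write) when it returns True. The breadcrumb does not prevent the overwrite; it only makes an out-of-order file detectable after the fact.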
time series
The RFC for Hubverse time series target data formats was approved on 2025-02-28 🎉.
Based on the new time series guidance, we made decisions about the following columns:
- as_of: sequence_as_of will be renamed to as_of, where as_of represents the sequence_as_of date
- nowcast_date
- tree_as_of: will be removed from the time series data (there is no immediate use case, and we're able to get the tree_as_of date from the created_at field in the modeled_clades json, for example)
Definition of done
- oracle.parquet files generated by get_target_data.py will contain an as_of column that represents sequence_as_of
- sequence_as_of will be renamed to as_of
- tree_as_of will be removed
- target-data README will contain a data dictionary for both time series and oracle output target data formats
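As a rough illustration of the column changes listed above, the sketch below converts rows from the old time series layout to the new one. It assumes rows are plain dicts; the function name to_new_time_series_schema and the sample column values are hypothetical, and the real job presumably works on a dataframe rather than on dicts.

```python
def to_new_time_series_schema(rows):
    """Rename sequence_as_of to as_of and drop tree_as_of from each row.

    Returns new dicts; the input rows are left unchanged.
    """
    converted = []
    for row in rows:
        # Copy every column except the two that change.
        new_row = {
            key: value
            for key, value in row.items()
            if key not in ("sequence_as_of", "tree_as_of")
        }
        # Carry the sequence_as_of value over under its new name.
        if "sequence_as_of" in row:
            new_row["as_of"] = row["sequence_as_of"]
        converted.append(new_row)
    return converted
```

For example, a row with target_date, clade, observation, sequence_as_of, and tree_as_of columns comes back with target_date, clade, observation, and as_of.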