Update variant nowcast target data to reflect the latest standards and decisions #296

Open
5 tasks
bsweger opened this issue Jan 28, 2025 · 4 comments

@bsweger
Collaborator

bsweger commented Jan 28, 2025

Background

In the weeks since we began generating target data for the variant nowcast hub, we've identified two changes to make.

oracle output

Because the oracle output files are not partitioned by sequence_as_of, they will be overwritten each time the hub's post-submission jobs are run (if the round_id in question is still within the 14-week interim data window).

For normal hub operations, this is fine. However, if the target data jobs are run out of order (for example, a manual backfill run with an older nowcast_date), that would break the implicit assumption that oracle.parquet will always reflect the most recent nowcast_date.

We decided that partitioning oracle outputs by sequence_as_of would make them too onerous to access. So we can't guarantee that the file won't be overwritten, but we can add sequence_as_of information to the oracle output files as a breadcrumb.
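
As a rough illustration of the breadcrumb idea, the sketch below stamps each oracle output row with an as_of value and checks whether existing data already reflects newer sequence data than an incoming run. This is not the hub's get_target_data.py code: the helper names (add_as_of, is_stale) and the row fields (location, target_date, clade, oracle_value) are hypothetical, and rows are modeled as plain dictionaries rather than parquet data.

```python
from datetime import date


def add_as_of(rows: list[dict], sequence_as_of: date) -> list[dict]:
    """Return copies of the rows with an as_of column holding the sequence_as_of date."""
    stamp = sequence_as_of.isoformat()
    return [{**row, "as_of": stamp} for row in rows]


def is_stale(existing_rows: list[dict], incoming_as_of: date) -> bool:
    """True if the existing rows reflect a later sequence_as_of than the incoming run."""
    existing = [date.fromisoformat(row["as_of"]) for row in existing_rows]
    return bool(existing) and max(existing) > incoming_as_of


# Hypothetical oracle output rows; the field names are illustrative only.
oracle_rows = [
    {"location": "MA", "target_date": "2025-01-20", "clade": "24A", "oracle_value": 3},
    {"location": "MA", "target_date": "2025-01-20", "clade": "24B", "oracle_value": 1},
]

current = add_as_of(oracle_rows, date(2025, 2, 26))

# A manual backfill with older sequence data can now be detected before overwriting.
backfill_is_older = is_stale(current, date(2025, 1, 15))
```

The breadcrumb does not prevent an overwrite; it only makes it visible which sequence_as_of date a given file reflects.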

time series

The RFC for Hubverse time series target data formats was approved on 2025-02-28 🎉.

Based on the new time series guidance, we decided to:

  • Partition time series target data on as_of, where as_of represents the sequence_as_of date
  • Use a secondary partition of nowcast_date
  • Explicitly include the above partition fields as columns in the time series parquet files
  • Remove tree_as_of from the time series data (there is no immediate use case, and we can still get the tree_as_of date elsewhere, for example from the created_at field in the modeled_clades json)
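
The layout these decisions imply can be sketched as follows. This assumes Hive-style key=value partition directories, which the issue does not specify; the root directory, the file name, the helper names, and any row fields other than as_of, nowcast_date, sequence_as_of, and tree_as_of are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def time_series_path(root: str, as_of: date, nowcast_date: date,
                     filename: str = "timeseries.parquet") -> PurePosixPath:
    """Build a path partitioned by as_of first, then by nowcast_date.

    Hive-style key=value directory names are an assumption of this sketch.
    """
    return (
        PurePosixPath(root)
        / f"as_of={as_of.isoformat()}"
        / f"nowcast_date={nowcast_date.isoformat()}"
        / filename
    )


def to_time_series_row(row: dict, nowcast_date: date) -> dict:
    """Rename sequence_as_of to as_of, drop tree_as_of, keep both partition fields as columns."""
    out = {k: v for k, v in row.items() if k not in ("sequence_as_of", "tree_as_of")}
    out["as_of"] = row["sequence_as_of"]
    out["nowcast_date"] = nowcast_date.isoformat()
    return out


# Hypothetical input row; location, clade, and observation are illustrative fields.
old_row = {
    "location": "MA",
    "clade": "24A",
    "observation": 3,
    "sequence_as_of": "2025-02-26",
    "tree_as_of": "2025-02-24",
}

new_row = to_time_series_row(old_row, date(2024, 11, 27))
path = time_series_path("target-data/time-series", date(2025, 2, 26), date(2024, 11, 27))
```

Keeping as_of and nowcast_date as explicit columns means a file remains self-describing even when it is read outside its partition directory.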

Definition of done

  • The oracle.parquet files generated by get_target_data.py will contain an as_of column that represents sequence_as_of
  • The time series parquet files are partitioned by as_of (aka sequence_as_of) and then by nowcast_date
  • The current time series parquet column sequence_as_of will be renamed to as_of
  • The current time series parquet column tree_as_of will be removed
  • The hub's target-data README will contain a data dictionary for both time series and oracle output target data formats
@bsweger bsweger added this to Lab Work Jan 28, 2025
@bsweger bsweger converted this from a draft issue Jan 28, 2025
@bsweger bsweger self-assigned this Jan 28, 2025
@bsweger bsweger added this to the Variant Nowcast milestone Jan 28, 2025
bsweger added a commit that referenced this issue Jan 28, 2025
Resolves #296

This changeset adds sequence_as_of and tree_as_of columns to the
oracle output target data files.
@bsweger
Collaborator Author

bsweger commented Jan 29, 2025

I also added a tree_as_of column as part of this work, even though we didn't explicitly have that as a requirement. It aligns with our approach for the time series outputs and provides clarity for people accessing this file outside of the hub context.

tree_as_of is cancelled!

@bsweger bsweger moved this from In Progress to Ready for Review in Lab Work Jan 29, 2025
@bsweger bsweger moved this from Ready for Review to In Progress in Lab Work Feb 10, 2025
@bsweger
Collaborator Author

bsweger commented Feb 10, 2025

Moving this out of "ready for review" status pending finalization of Hubverse target data formats.

@bsweger
Collaborator Author

bsweger commented Feb 28, 2025

Picking this back up now that the RFC for time series target data formats has been approved.

Will update the ticket's "definition of done" to reflect the outcome of today's variant nowcast hub meeting.

@bsweger bsweger changed the title from "Add a sequence_as_of column to oracle output target data" to "Update variant nowcast target data to reflect the latest standards and decisions" Feb 28, 2025
@elray1
Collaborator

elray1 commented Feb 28, 2025

I read through the updated main comment just now and confirmed that it looks right to me.

bsweger added a commit that referenced this issue Feb 28, 2025
Resolves #296

This changeset adds sequence_as_of and tree_as_of columns to the
oracle output target data files.
Projects
Status: In Progress