-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
frequencies: Fix pivot logic to always include the end date #1121
Merged
Merged
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
ef59e5b
Correct and expand tests for pivots logic
huddlej bc71412
Fix pivot logic to always include the end date
huddlej 05f284d
Fix test logic that depended on rounding errors
huddlej beca8c4
Update expected frequencies for functional tests
huddlej dcd256f
Use isodate instead of python-dateutil
victorlin df62699
Simplify date conversions with numeric_date
huddlej 0591500
Update change log to reflect major change
huddlej File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,92 +1,92 @@ | ||
{ | ||
"BRA/2016/FC_6706": { | ||
"frequencies": [ | ||
0.022579, | ||
0.011312, | ||
0.206983, | ||
0.102282 | ||
0.022368, | ||
0.011759, | ||
0.207309, | ||
0.102551 | ||
] | ||
}, | ||
"COL/FLR_00008/2015": { | ||
"frequencies": [ | ||
0.243453, | ||
0.208443, | ||
0.008645, | ||
0.010659 | ||
0.244555, | ||
0.204597, | ||
0.008456, | ||
0.010737 | ||
] | ||
}, | ||
"Colombia/2016/ZC204Se": { | ||
"frequencies": [ | ||
0.126056, | ||
0.231597, | ||
0.014167, | ||
0.017099 | ||
0.12489, | ||
0.234574, | ||
0.013671, | ||
0.017245 | ||
] | ||
}, | ||
"DOM/2016/BB_0183": { | ||
"frequencies": [ | ||
0.017888, | ||
0.009342, | ||
0.183671, | ||
0.148759 | ||
0.017736, | ||
0.009645, | ||
0.185763, | ||
0.148494 | ||
] | ||
}, | ||
"EcEs062_16": { | ||
"frequencies": [ | ||
0.018981, | ||
0.009763, | ||
0.190994, | ||
0.134398 | ||
0.018817, | ||
0.010092, | ||
0.192704, | ||
0.134289 | ||
] | ||
}, | ||
"HND/2016/HU_ME59": { | ||
"frequencies": [ | ||
0.009484, | ||
0.006254, | ||
0.090552, | ||
0.457824 | ||
0.009425, | ||
0.006444, | ||
0.093589, | ||
0.456546 | ||
] | ||
}, | ||
"PAN/CDC_259359_V1_V3/2015": { | ||
"frequencies": [ | ||
0.224832, | ||
0.214575, | ||
0.008963, | ||
0.011176 | ||
0.225577, | ||
0.211235, | ||
0.008761, | ||
0.011258 | ||
] | ||
}, | ||
"PRVABC59": { | ||
"frequencies": [ | ||
0.243453, | ||
0.208443, | ||
0.008645, | ||
0.010659 | ||
0.244555, | ||
0.204597, | ||
0.008456, | ||
0.010737 | ||
] | ||
}, | ||
"VEN/UF_1/2016": { | ||
"frequencies": [ | ||
0.030662, | ||
0.016483, | ||
0.206983, | ||
0.070196 | ||
0.030336, | ||
0.017436, | ||
0.204461, | ||
0.07079 | ||
] | ||
}, | ||
"ZKC2/2016": { | ||
"frequencies": [ | ||
0.062612, | ||
0.083787, | ||
0.080395, | ||
0.036947 | ||
0.06174, | ||
0.089622, | ||
0.076833, | ||
0.037353 | ||
] | ||
}, | ||
"generated_by": { | ||
"program": "augur", | ||
"version": "7.0.2" | ||
"version": "19.2.0" | ||
}, | ||
"pivots": [ | ||
2015.75, | ||
2016.0, | ||
2016.25, | ||
2016.5 | ||
2015.7521, | ||
2016.0041, | ||
2016.2527, | ||
2016.5014 | ||
] | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above,
pivot_end
is defined as a floating point (not changed in this PR). but doesn't this defeat the entire purpose of doing month/week increments to maintain human readable dates? And if the last data point is on Jan 2nd and the interval is 3 months, thepivot_end
is 3 months into the year.What if we instead did something like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would put the end point to the next full month or week after the last data point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The floating point representation of
pivot_start
andpivot_end
is just an intermediate state as far as date calculations. The user can provide a human-readable (ISO-8601) date from the command line, it gets turned into a floating point value for historical reasons, but then it gets converted back to ISO-8601 before performing date math.pivot_end
andpivot_start
could be defined as human-readable (ISO-8601) dates from the start and only lead to floating point values at the end of the function. The floating point values returned byget_pivots
can be converted back to the expected ISO-8601 dates withfloat_to_datestring
. The benefit of the ISO-8601 representation is that we can perform standard date math like:This is obviously only a problem for months since they don't have the same number of days. Week intervals are consistent between standard date math and floating point math.
Ah, that's right and not great. With a single observation from that date, you'd get:
We usually run
augur frequencies
with an end date of the current day, though. In the current PR, if you had a single observation from Jan 2 and an end date of today, you'd get:But in the current implementation in
master
, you'd get:and throw out your single observation. For a monthly interval, your code example above would produce:
Since the pivots slice through a specific time in the KDE distributions instead of summing over a time interval, is there a benefit to moving the last pivot any later than we have data instead of using the latest observation date as the end date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I didn't realize TreeTime already had
datetime_from_numeric
anddatestring_from_numeric
. We could just use the latter instead of Augur'sfloat_to_datestring
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, John. and you are right, the
end_date
solves this problem in most cases. But it also means that our monthly pivots are not aligned with the beginning or end of a month. If that's ok, we might just do month=30days and week=7days.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pivots for KDE frequencies should never extend more than the narrow width beyond the last time window were there is more or less representative data. They would be too heavily influenced by fluctuations in that case. This is less of an issue for the diffusion frequencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That alignment to the beginning of a month was the original goal of the pandas-based implementation, but it is a pretty arbitrary goal. Even representing intervals by month feels arbitrary when "month" doesn't have a fixed value. The simpler approach of using weeks is appealing without needing to cast "month" to a fixed number of days. I would be happy to switch flu pivots to every 4 weeks, for instance.
This also makes a lot of sense. Estimating KDE frequencies at a date beyond the most recent data is an error in the opposite direction of the original bug addressed in this PR. By that logic, we should change the following logic from:
to always use a min/max date based on the earliest and latest sample, instead of expanding the range of pivots:
Our tree can't plot any dates beyond the most recent tip, so it makes sense that frequencies shouldn't plot beyond the most recent tip either. The behavior of that code I propose there doesn't account for our most common use case where we do provide an "end date" that is usually today even if the most recent data are never collected from today. Providing an "end date" only makes sense in the context of retrospective work like the flu forecasting project, where we want to estimate frequencies in well-defined windows across existing data.
Providing an earlier "start date" makes sense, though, since it allows the pivots to include all available data. The code above could produce pivots that don't include the earliest observation, depending on the pivot interval and end date. So, we probably want something like this that keeps the earlier start date than min observation but uses the max observation as the end date.
The existing changes in this PR will ensure that the end date is always the last pivot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is more work we could do in this area, but I think we should leave treatment of observed data (without start and end dates) to another PR. This PR fixes a bug that immediately affects our flu work, so I'd like to merge and release it soon. I documented this conversation as a new issue.