-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow build-specific --narrow-bandwidth param in frequencies #1130
Conversation
There was a bug in the current behavior where the rule `frequencies` was calling `config["frequencies"]["narrow_bandwidth"]`. This resulted in --narrow-bandwidth to be fixed to whatever was specified in parameters.yaml. However, we were relying on build-specific settings to differentiate behavior in the all-time vs 6m vs 2m vs 1m builds, ala: ``` frequencies global_1m: narrow_bandwidth: 0.019 ``` This commit fixes this issue and allows overriding of narrow_bandwidth in parameters.yaml by build-specific settings. It also provides a genuine default (equal to augur default) so that parameters.yaml doesn't have to always specify narrow_bandwidth.
1363c19
to
3d8bed9
Compare
This removes `recent_days_to_censor` from GISAID/21L and open builds to match removal from GISAID builds.
8eca4ee
to
f8081fe
Compare
This follows recommendation in issue #1131 to use the pattern frequencies: default: min_date: "6M" narrow_bandwidth: 0.038 This use of "default" extends just to min_date, max_date and narrow_bandwidth. This behavior is now documented in parameters.yaml as well as workflow-config-file.rst. The specification of frequencies parameters in builds.yaml now follows this pattern.
f8081fe
to
6458357
Compare
Okay. I've updated this PR to follow "strategy 2" from #1131 and use I've also re-triggered trial builds with https://github.com/nextstrain/ncov/actions/runs/11062321363. @victorlin: Could you review this when you have a moment? I think it should be ready to be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good – one small thing and a couple non-blocking comments. I can make these changes if it helps.
|
||
|
||
# settings that can be over-ridden across all builds, but not for specific builds | ||
recent_days_to_censor: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value is now defined in two places: the config file (where this comment is added) and in the workflow:
ncov/workflow/snakemake_rules/common.smk
Line 203 in 6458357
recent_days_to_censor = config.get("frequencies", {}).get("recent_days_to_censor", 0) |
Since all workflow invocations¹ source from defaults/parameters.yaml
, the default value defined in common.smk
never gets used. I suggest removing it:
recent_days_to_censor = config["frequencies"]["recent_days_to_censor"]
¹ under expected usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I liked having workflow default in addition to the parameters.yaml default. As you point out this redundancy also exists for frequencies.narrow_bandwidth
.
But perhaps I understand your concern, where by having multiple layers of "defaults" it might add confusion. It would be nice if people could look just at parameters.yaml
to understand defaults rather than digging into the workflow.
I think the original push for adding workflow redundancy here was that parameters.yaml
(still) doesn't include recent_days_to_censor
and if we start looking for it in the workflow and people are stilling using old parameters.yaml
files then things will break.
I think we need a broader pattern to adhere to here (kind of like the conversion in issue #1131). I'm not immediately sure what's best for this specific PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we start looking for it in the workflow and people are stilling using old
parameters.yaml
files then things will break.
I'm not familiar with external usage of ncov so I appreciate the thoughts here.
Since parameters.yaml
is under version control, shouldn't it be sufficient to have workflow changes + parameters.yaml
changes (adding recent_days_to_censor
) in the same PR? If someone pulls the workflow changes, they'll also pull the default recent_days_to_censor
in parameters.yaml
. I can think of a few possibilities where breakage might happen:
-
User has made custom changes to
parameters.yaml
.git pull
will try to auto-merge and and flag any merge conflicts. -
User has explicitly removed this line from Snakefile:
Line 42 in 82ca8d4
configfile: "defaults/parameters.yaml" -
User is using a profile that does not source from
parameters.yaml
, i.e. missing this line:ncov/nextstrain_profiles/nextstrain-open/config.yaml
Lines 1 to 2 in 82ca8d4
configfile: - defaults/parameters.yaml This seems the most likely for external usage, and I believe it's one of the reasons why we now recommend
--configfile
over--profile
nowadays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a broader pattern to adhere to here
The broader pattern seems to be having a default in parameters.yaml
+ direct access in Snakemake. This is how it's done for other config:
ncov/workflow/snakemake_rules/main_workflow.smk
Lines 1165 to 1168 in 8399290
pivot_interval = config["frequencies"]["pivot_interval"], | |
pivot_interval_units = config["frequencies"]["pivot_interval_units"], | |
narrow_bandwidth = config["frequencies"]["narrow_bandwidth"], | |
proportion_wide = config["frequencies"]["proportion_wide"] |
Lines 143 to 152 in 8399290
# Number of weeks between pivots | |
pivot_interval: 1 | |
# Measure pivots in weeks | |
pivot_interval_units: "weeks" | |
# KDE bandwidths in proportion of a year to use per strain. | |
# using 15 day bandwidth | |
narrow_bandwidth: 0.041 | |
proportion_wide: 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thoughts. I agree with your logic. Though I would like to push the fixes to a separate PR.
# else return augur frequencies default value | ||
else: | ||
return 0.0833 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(non-blocking)
# else return augur frequencies default value | |
else: | |
return 0.0833 |
This shouldn't be necessary since defaults/parameters.yaml
already defines the default value and should always be used¹.
Non-blocking because _get_min_date_for_frequencies
already has similar redundancy.
¹ under expected usage
max_date: "2022-01-01" | ||
narrow_bandwidth: 0.076 | ||
|
||
Each named traits configuration (``default`` or build-named) supports specification of ``min_date``, ``max_date`` and ``narrow_bandwidth``. Other parameters can only be specified across all builds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(non-blocking)
Would it be worth allowing build-specific recent_days_to_censor
, pivot_interval
, etc. as well? If done right, this could reduce the duplication of functions _get_min_date_for_frequencies
/_get_max_date_for_frequencies
/_get_narrow_bandwidth_for_wildcards
.
Obviously beyond the scope this PR – I could take this or at least make an issue to start it off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally. Ideally, every entry in parameters.yaml
would be able to be over-ridden with build-specific entries in builds.yaml
. As you say, the current Snakemake workflow strategy here however would require padding out individual functions for each parameter. It would be wonderful to have a generic strategy that makes for automatic over-rides.
I think maybe more important than making this work for ncov
is thinking through how to do this right across pathogens. Everyone requests the ncov
style parameter-overrides (this came up for mpox
most obviously). cc @joverlee521
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's definitely been a lot of discussions around build configs outside of ncov
. I'll take some time to wrangle previous discussions and write up a summary outside of this PR (probably in pathogen-repo-guide).
Description of proposed changes
This is a bug fix PR. There was a bug in the current behavior where the rule
frequencies
was callingconfig["frequencies"]["narrow_bandwidth"]
. This resulted in--narrow-bandwidth
to be fixed to whatever was specified inparameters.yaml
. However, we were relying on build-specific settings to differentiate behavior in theall-time
vs6m
vs2m
vs1m
builds, ala:This PR fixes this issue and allows overriding of
narrow_bandwidth
inparameters.yaml
by build-specific settings. It also provides a genuine default (equal to augur default) so thatparameters.yaml
doesn't have to always specifynarrow_bandwidth
.Additionally, this PR removes the build-specific setting for
recent_days_to_censor
. This setting wasn't actually being applied in a build-specific fashion and so we had been always using the workflow default of0
. I decided to stick with this.Testing
I've tested locally and via trial build. Trail builds are visible at:
etc...