-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HWE Test Implementation #76
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great, +1. The testing is especially thorough. Just a couple of minor changes then I think this can be merged.
(The numba coverage questions can be addressed in #77.)
Additionally, both covariate and trait arrays will be rechunked to have blocks | ||
along the sample (row) dimension but not the column dimension (i.e. | ||
they must be tall and skinny). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this warning removed for a reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was duplicated.
hardy_weinberg_p_value_vec_jit = njit(hardy_weinberg_p_value_vec, fastmath=True) | ||
|
||
|
||
def hardy_weinberg_test( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function should be added to the top-level __init__.py
so it's a part of the public API.
Also, +1 on the name (even though it's not a verb!). I think using the abbreviation hwe
internally is fine too (as you have done).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n = len(obs_hets) | ||
p = np.empty(n, dtype=np.float64) | ||
for i in range(n): | ||
p[i] = hardy_weinberg_p_value_jit(obs_hets[i], obs_hom1[i], obs_hom2[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: we're using Dask for parallelization here (across blocks) which is fine, but we may want to consider numba's parallel=True
in the future for this loop as another dispatch option.
# Otherwise compute genotype counts from calls | ||
else: | ||
# TODO: Use API genotype counting function instead, e.g. | ||
# https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you open an issue to track this please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgkit/tests/test_hwe.py
Outdated
|
||
def test_hwep__raise_on_negative(): | ||
args = [[-1, 0, 0], [0, -1, 0], [0, 0, -1]] | ||
for arg in args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: might be slighter neater to parameterise the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgkit/stats/hwe.py
Outdated
# TODO: Use API genotype counting function instead, e.g. | ||
# https://github.com/pystatgen/sgkit/issues/29#issuecomment-656691069 | ||
mask = ds["call_genotype_mask"].any(dim="ploidy") | ||
gtc = xr.where(mask, -1, ds["call_genotype"].sum(dim="ploidy")) # type: ignore[no-untyped-call] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit/question: we are being a bit inconsistent on case for these variables. E.g. in count_alleles
the variables are G
, CTS
, etc. Also, in the summary stats PR I used G
for genotypes, but n_het
for heterogeneous counts.
Are there some rules we can use? E.g. use lowercase except when we are replicating a paper which uses X
, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pystatgen/sgkit@466859e#diff-2ec013d1f086d9558934c38ac90411d1R173
I like the capital convention so array names don't conflict with scalars when there are a lot of both. When there aren't a lot of either, switching to lower case name sounds good and is what I would prefer too. I don't know how to make a threshold for that clear, so I've been trying to err on the side of sticking with the capital letter convention for consistency (thanks for flagging this one).
I filed https://github.com/pystatgen/sgkit/issues/117 to make a record of this.
This PR contains a new implementation for https://github.com/pystatgen/sgkit/issues/28.
Summary of changes:
sgkit/stats/hwe.py
moduleThere may be better ways to organize jit-compiled function tests, see https://github.com/pystatgen/sgkit/issues/77.