-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset "schema" v0.3 #276
Conversation
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏 |
130af83
to
722e50f
Compare
sgkit/stats/aggregation.py
Outdated
"""Compute quality control variant statistics from genotype calls. | ||
|
||
Parameters | ||
---------- | ||
ds | ||
Genotype call dataset such as from | ||
`sgkit.create_genotype_call_dataset`. | ||
call_genotype | ||
Input variable name holding call_genotype. | ||
As defined by `sgkit.variables.call_genotype`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably decide on how we want to do this linking, and what kind of information goes into method doc vs variable docs. I would actually suggest we do this in a separate PR, to keep review slim. But if people feel strong to do it here, fine as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI added links between methods and variables in the documentation.
sgkit/variables.py
Outdated
"""TODO""" | ||
genotype_counts = ArrayLikeSpec("genotype_counts", ndim=2, kind="i") | ||
""" | ||
Genotype counts. Must correspond to an (`N`, 3) array where `N` is equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, the 1st sentence appears on the front page of the variable, the rest appears on the variable specific page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment at the top of these definitions saying this?
sgkit/variables.py
Outdated
ndim: Union[None, int, Set[int]] = None | ||
|
||
|
||
call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should decide if we want to list this in a single namespace, group these, and or what order do we want them documented. Also would be good to have bidirectional links between variables and methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to making it possible to see what methods use or produce a variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is to list variables alphabetically, since they don't map neatly to methods (so aren't easily grouped by method; also you can see which variables a method uses or produces by looking at the doc for the method). (BTW, I think we should probably break out methods by type: e.g. basic stats/GWAS/popgen - but that is a separate issue.)
Everything so far is in a top-level namespace, so we could continue that here (i.e. eliminate the variables
prefix), but I'm not sure which way to go on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @tomwhite here
call_genotype_probability_mask = ArrayLikeSpec( | ||
"call_genotype_probability_mask", kind="b", ndim=3 | ||
) | ||
"""TODO""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some TODO's still where the doc was missing at the origin.
6f836d0
to
2e469f7
Compare
sgkit/model.py
Outdated
@@ -83,7 +79,9 @@ def create_genotype_call_dataset( | |||
check_array_like(variant_id, kind={"U", "O"}, ndim=1) | |||
data_vars["variant_id"] = ([DIM_VARIANT], variant_id) | |||
attrs: Dict[Hashable, Any] = {"contigs": variant_contig_names} | |||
return xr.Dataset(data_vars=data_vars, attrs=attrs) | |||
return variables.validate( | |||
xr.Dataset(data_vars=data_vars, attrs=attrs), *data_vars.keys() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to have a validate
overload that checks against all sgkit.variables by default (to avoid *data_vars.keys
)?
sgkit/stats/hwe.py
Outdated
@@ -140,6 +144,12 @@ def hardy_weinberg_test( | |||
where `N` is equal to the number of variants and the 3 columns contain | |||
heterozygous, homozygous reference, and homozygous alternate counts | |||
(in that order) across all samples for a variant. | |||
call_genotype | |||
Input variable name holding call_genotype. | |||
As defined by :data:`sgkit.variables.call_genotype`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This isn't a complete sentence which will be a little odd given how common "As defined by X." is. I would suggest "Defined by X." instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than a couple minor things. Thanks for cleaning up all those signatures! Much more consistent now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great - thanks @ravwojdyla. I have a few suggestions (mostly minor), but this should be ready to merge soon.
One thing worth mentioning is that users writing their own methods do not have to use the validation if they don't want to (in fact, it's not really a part of the public API so it's not easy to use). I think that's fine - we want to encourage users to take advantage of xarray APIs as needed - and this work is to make sgkit's methods more internally consistent and well-documented.
sgkit/variables.py
Outdated
"""TODO""" | ||
genotype_counts = ArrayLikeSpec("genotype_counts", ndim=2, kind="i") | ||
""" | ||
Genotype counts. Must correspond to an (`N`, 3) array where `N` is equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment at the top of these definitions saying this?
sgkit/stats/aggregation.py
Outdated
@@ -113,17 +119,24 @@ def count_call_alleles(ds: Dataset, merge: bool = True) -> Dataset: | |||
) | |||
} | |||
) | |||
return conditional_merge_datasets(ds, new_ds, merge) | |||
return variables.validate( | |||
conditional_merge_datasets(ds, new_ds, merge), "call_allele_count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should output variable names reference the spec too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but the question still remains where the dataset is created. A typo there won't cause validation to fail will it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite a typo there would definitely cause a validation error (which is intentional/good). Whether we should reference variable or use strings is a good question, but more from a "dev practices" standpoint. I guess I lean towards referencing it, BUT it would mean that you need to for example say: variables.pc_relate_phi.default_name
instead of "pc_relate_phi"
. We could shorten it to variables.pc_relate_phi
, if we provide some magic methods (hash/str).
I don't really see much value of using the reference from the user perspective, since the validation works anyway. One thing that comes to me mind tho, is that maybe if we reference in all the methods, maybe there is some plugin that could generate the links for usage, so that we could link variables to methods 🤔
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would probably lean towards referencing the variable too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for posterity, I've commented on the wrong comment section, copying over here:
@tomwhite so I just remembered something, this would be going against some early comments in #17, what do you think? also ping @eric-czech
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using sgkit.variables for creating the output datasets sounds good. The closest thing to an edge case I can think of right now is a function that generates a mask or mean imputes one of several possible input variables. Perhaps validation could eventually be more dynamic for those cases. Otherwise, I can't imagine where using a constant would be awkward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite @eric-czech ok, this PR is now using sgkit.variables
pretty much everywhere we need a variable name (and this PR is ready for review). I did not like the overly verbose variables.<varname>.default_name
, and making Spec
hashable and acting like a string introduces complexity and will likely lead to issues in xr (see here), so I've opted for the solution you can see in 6271e89.
sgkit/variables.py
Outdated
ndim: Union[None, int, Set[int]] = None | ||
|
||
|
||
call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is to list variables alphabetically, since they don't map neatly to methods (so aren't easily grouped by method; also you can see which variables a method uses or produces by looking at the doc for the method). (BTW, I think we should probably break out methods by type: e.g. basic stats/GWAS/popgen - but that is a separate issue.)
Everything so far is in a top-level namespace, so we could continue that here (i.e. eliminate the variables
prefix), but I'm not sure which way to go on that.
sgkit/variables.py
Outdated
""" | ||
variant_hwe_p_value = ArrayLikeSpec("variant_hwe_p_value", kind="f") | ||
"""P values from HWE test for each variant as float in [0, 1].""" | ||
variant_beta = ArrayLikeSpec("variant_beta") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should call this variant_hwe_beta
? (Similarly for a few others.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this and the other "naming" comments below + all the TODOs in the docs, I suggest we open a separate issue and follow up in separate PR(s). One of the nice side effects of having all those variables in the same place is that we can see the inconsistencies etc. is that fine with you @tomwhite ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fine by me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite so I just remembered something, this would be going against some early comments in https://github.com/pystatgen/sgkit/issues/17, what do you think? also ping @eric-czech
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ravwojdyla do you mean the comment about variables.pc_relate_phi.default_name
vs "pc_relate_phi"
? OK, happy to leave as a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite right, wrong comment section, sorry, this should have been in https://github.com/pystatgen/sgkit/pull/276#discussion_r495782207
Let's see what @eric-czech thinks. Other than that, afaiu @tomwhite this is ready for another review.
sgkit/variables.py
Outdated
"""Sample PCs (PCxS).""" | ||
pc_relate_phi = ArrayLikeSpec("pc_relate_phi", ndim=2, kind="f") | ||
"""PC Relate kinship coefficient matrix.""" | ||
base_prediction = ArrayLikeSpec("base_prediction", ndim=4, kind="f") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regenie_base_prediction
?
sgkit/variables.py
Outdated
homozygous reference, and homozygous alternate counts (in that order) | ||
across all samples for a variant. | ||
""" | ||
call_allele_count = ArrayLikeSpec("call_allele_count", ndim=3, kind="u") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make genotype_counts
/call_allele_count
consistent. Perhaps in a follow-up PR though.
sgkit/stats/aggregation.py
Outdated
@@ -51,16 +52,20 @@ def count_alleles(g: ArrayLike, _: ArrayLike, out: ArrayLike) -> None: | |||
out[a] += 1 | |||
|
|||
|
|||
def count_call_alleles(ds: Dataset, merge: bool = True) -> Dataset: | |||
def count_call_alleles( | |||
ds: Dataset, *, call_genotype: str = "call_genotype", merge: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice change here to make variable name arguments (and merge
) keyword only.
8a9d4b5
to
9768178
Compare
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏 |
9768178
to
8ca9f8b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
sgkit/variables.py
Outdated
ndim: Union[None, int, Set[int]] = None | ||
|
||
|
||
call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @tomwhite here
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏 |
6271e89
to
46184e5
Compare
FYI rebased to remove the conflict. |
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏 |
See https://github.com/pystatgen/sgkit/pull/124 for more context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I've marked this auto-merge @ravwojdyla, should merge once you've cleared up the conflicts. |
46184e5
to
5a459e5
Compare
Re: #43
Resurrect: #124 (please read it for context)
Depends on #274