Dataset "schema" v0.3 #276

ravwojdyla · 2020-09-23T14:16:50Z

Re: #43
Resurrect: #124 (please read it for context)

~~Depends on #274~~

mergify · 2020-09-24T07:30:46Z

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

ravwojdyla · 2020-09-24T14:41:57Z

sgkit/stats/aggregation.py

    """Compute quality control variant statistics from genotype calls.

    Parameters
    ----------
    ds
        Genotype call dataset such as from
        `sgkit.create_genotype_call_dataset`.
+    call_genotype
+        Input variable name holding call_genotype.
+        As defined by `sgkit.variables.call_genotype`.


We should probably decide how we want to do this linking, and what kind of information goes into the method docs vs the variable docs. I would actually suggest we do this in a separate PR, to keep the review slim. But if people feel strongly about doing it here, that's fine as well.

FYI added links between methods and variables in the documentation.

ravwojdyla · 2020-09-24T14:44:46Z

sgkit/variables.py

+"""TODO"""
+genotype_counts = ArrayLikeSpec("genotype_counts", ndim=2, kind="i")
+"""
+Genotype counts. Must correspond to an (`N`, 3) array where `N` is equal


FYI, the 1st sentence appears on the front page of the variable, the rest appears on the variable specific page.

Can you add a comment at the top of these definitions saying this?

ravwojdyla · 2020-09-24T14:44:59Z

sgkit/variables.py

+    ndim: Union[None, int, Set[int]] = None
+
+
+call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3)


We should decide whether we want to list these in a single namespace or group them, and in what order we want them documented. It would also be good to have bidirectional links between variables and methods.

+1 to making it possible to see what methods use or produce a variable.

My preference is to list variables alphabetically, since they don't map neatly to methods (so aren't easily grouped by method; also you can see which variables a method uses or produces by looking at the doc for the method). (BTW, I think we should probably break out methods by type: e.g. basic stats/GWAS/popgen - but that is a separate issue.)

Everything so far is in a top-level namespace, so we could continue that here (i.e. eliminate the variables prefix), but I'm not sure which way to go on that.

I agree with @tomwhite here

ravwojdyla · 2020-09-24T14:46:05Z

sgkit/variables.py

+call_genotype_probability_mask = ArrayLikeSpec(
+    "call_genotype_probability_mask", kind="b", ndim=3
+)
+"""TODO"""


There are still some TODOs where the doc was missing in the original code.

eric-czech · 2020-09-25T16:58:52Z

sgkit/model.py

@@ -83,7 +79,9 @@ def create_genotype_call_dataset(
        check_array_like(variant_id, kind={"U", "O"}, ndim=1)
        data_vars["variant_id"] = ([DIM_VARIANT], variant_id)
    attrs: Dict[Hashable, Any] = {"contigs": variant_contig_names}
-    return xr.Dataset(data_vars=data_vars, attrs=attrs)
+    return variables.validate(
+        xr.Dataset(data_vars=data_vars, attrs=attrs), *data_vars.keys()


Any reason not to have a validate overload that checks against all sgkit.variables by default (to avoid *data_vars.keys)?
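
A minimal sketch of what such a default could look like, assuming a registry of specs keyed by name. `Spec`, `FakeArray`, `REGISTRY`, and this `validate` signature are hypothetical stand-ins for illustration, not sgkit's actual API, and a plain dict stands in for an `xr.Dataset`.

```python
from dataclasses import dataclass
from typing import Dict, Set, Union


@dataclass(frozen=True)
class FakeArray:
    """Stand-in for an array: only the dtype kind and ndim matter here."""

    kind: str
    ndim: int


@dataclass(frozen=True)
class Spec:
    """Hypothetical, simplified analogue of ArrayLikeSpec."""

    default_name: str
    kind: Union[None, str, Set[str]] = None
    ndim: Union[None, int, Set[int]] = None


# Assumed registry of every known variable spec, keyed by name.
REGISTRY: Dict[str, Spec] = {
    s.default_name: s
    for s in (
        Spec("call_genotype", kind="i", ndim=3),
        Spec("variant_id", kind={"U", "O"}, ndim=1),
    )
}


def _matches(value, allowed) -> bool:
    """None means unconstrained; a set means any member is acceptable."""
    if allowed is None:
        return True
    if isinstance(allowed, set):
        return value in allowed
    return value == allowed


def validate(ds: Dict[str, FakeArray], *names: str) -> Dict[str, FakeArray]:
    """Check the named variables; with no names, check every known variable in ds."""
    to_check = names or tuple(n for n in ds if n in REGISTRY)
    for name in to_check:
        if name not in REGISTRY:
            raise ValueError(f"No spec registered for {name!r}")
        if name not in ds:
            raise ValueError(f"{name!r} not found in dataset")
        spec, arr = REGISTRY[name], ds[name]
        if not _matches(arr.kind, spec.kind) or not _matches(arr.ndim, spec.ndim):
            raise ValueError(f"{name!r} does not match its spec")
    return ds
```

With this shape, `validate(ds)` checks whatever known variables are present and skips user-defined ones, while `validate(ds, "call_genotype")` keeps the explicit form.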

eric-czech · 2020-09-25T17:00:47Z

sgkit/stats/hwe.py

@@ -140,6 +144,12 @@ def hardy_weinberg_test(
        where `N` is equal to the number of variants and the 3 columns contain
        heterozygous, homozygous reference, and homozygous alternate counts
        (in that order) across all samples for a variant.
+    call_genotype
+        Input variable name holding call_genotype.
+        As defined by :data:`sgkit.variables.call_genotype`.


nit: This isn't a complete sentence, which will read a little oddly given how common "As defined by X." is. I would suggest "Defined by X." instead.

eric-czech

LGTM other than a couple minor things. Thanks for cleaning up all those signatures! Much more consistent now.

tomwhite

This looks great - thanks @ravwojdyla. I have a few suggestions (mostly minor), but this should be ready to merge soon.

One thing worth mentioning is that users writing their own methods do not have to use the validation if they don't want to (in fact, it's not really a part of the public API so it's not easy to use). I think that's fine - we want to encourage users to take advantage of xarray APIs as needed - and this work is to make sgkit's methods more internally consistent and well-documented.

tomwhite · 2020-09-28T08:44:27Z

sgkit/variables.py

+"""TODO"""
+genotype_counts = ArrayLikeSpec("genotype_counts", ndim=2, kind="i")
+"""
+Genotype counts. Must correspond to an (`N`, 3) array where `N` is equal


Can you add a comment at the top of these definitions saying this?

tomwhite · 2020-09-28T08:47:26Z

sgkit/stats/aggregation.py

@@ -113,17 +119,24 @@ def count_call_alleles(ds: Dataset, merge: bool = True) -> Dataset:
            )
        }
    )
-    return conditional_merge_datasets(ds, new_ds, merge)
+    return variables.validate(
+        conditional_merge_datasets(ds, new_ds, merge), "call_allele_count"


Should output variable names reference the spec too?

@tomwhite this is outdated via 47eef29

OK, but the question still remains where the dataset is created. A typo there won't cause validation to fail, will it?

@tomwhite a typo there would definitely cause a validation error (which is intentional/good). Whether we should reference variable or use strings is a good question, but more from a "dev practices" standpoint. I guess I lean towards referencing it, BUT it would mean that you need to for example say: variables.pc_relate_phi.default_name instead of "pc_relate_phi". We could shorten it to variables.pc_relate_phi, if we provide some magic methods (hash/str).

I don't really see much value in using the reference from the user's perspective, since the validation works anyway. One thing that comes to my mind though: if we reference the variables in all the methods, maybe there is some plugin that could generate the usage links, so that we could link variables to methods 🤔

wdyt?

Yes, I would probably lean towards referencing the variable too.

for posterity, I've commented on the wrong comment section, copying over here:

@tomwhite so I just remembered something, this would be going against some early comments in #17, what do you think? also ping @eric-czech

I think using sgkit.variables for creating the output datasets sounds good. The closest thing to an edge case I can think of right now is a function that generates a mask or mean imputes one of several possible input variables. Perhaps validation could eventually be more dynamic for those cases. Otherwise, I can't imagine where using a constant would be awkward.

@tomwhite @eric-czech ok, this PR is now using sgkit.variables pretty much everywhere we need a variable name (and this PR is ready for review). I did not like the overly verbose variables.<varname>.default_name, and making Spec hashable and acting like a string introduces complexity and will likely lead to issues in xr (see here), so I've opted for the solution you can see in 6271e89.
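
To make the trade-off discussed in this thread concrete, here is a small sketch contrasting a string literal with a reference to the spec. Everything in it (`Spec`, the `variables` namespace, `validate_names`) is a hypothetical stand-in, and it does not reproduce the approach taken in 6271e89.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class Spec:
    """Hypothetical spec carrying only its default variable name."""

    default_name: str


# Hypothetical namespace standing in for sgkit.variables.
variables = SimpleNamespace(pc_relate_phi=Spec("pc_relate_phi"))
KNOWN = {s.default_name for s in vars(variables).values()}


def validate_names(*names: str) -> None:
    """Raise if any name has no registered spec."""
    unknown = sorted(set(names) - KNOWN)
    if unknown:
        raise ValueError(f"Unknown variable name(s): {unknown}")


# Option 1: string literal. A typo is caught only when validation runs.
name_from_string = "pc_relate_phi"
validate_names(name_from_string)

# Option 2: reference the spec. A typo fails earlier, at attribute lookup,
# at the cost of the more verbose `.default_name` access.
name_from_spec = variables.pc_relate_phi.default_name
validate_names(name_from_spec)
```

Both options are caught by validation in the end; the reference form mainly moves the failure earlier and makes usages easier to find with tooling.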

tomwhite · 2020-09-28T08:53:02Z

sgkit/variables.py

+    ndim: Union[None, int, Set[int]] = None
+
+
+call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3)


My preference is to list variables alphabetically, since they don't map neatly to methods (so aren't easily grouped by method; also you can see which variables a method uses or produces by looking at the doc for the method). (BTW, I think we should probably break out methods by type: e.g. basic stats/GWAS/popgen - but that is a separate issue.)

Everything so far is in a top-level namespace, so we could continue that here (i.e. eliminate the variables prefix), but I'm not sure which way to go on that.

tomwhite · 2020-09-28T08:54:45Z

sgkit/variables.py

+"""
+variant_hwe_p_value = ArrayLikeSpec("variant_hwe_p_value", kind="f")
+"""P values from HWE test for each variant as float in [0, 1]."""
+variant_beta = ArrayLikeSpec("variant_beta")


I wonder if we should call this variant_hwe_beta? (Similarly for a few others.)

For this and the other "naming" comments below, plus all the TODOs in the docs, I suggest we open a separate issue and follow up in separate PR(s). One of the nice side effects of having all those variables in the same place is that we can see the inconsistencies etc. Is that fine with you, @tomwhite?

Yes, fine by me

@tomwhite so I just remembered something, this would be going against some early comments in https://github.com/pystatgen/sgkit/issues/17, what do you think? also ping @eric-czech

@ravwojdyla do you mean the comment about variables.pc_relate_phi.default_name vs "pc_relate_phi"? OK, happy to leave as a string.

@tomwhite right, wrong comment section, sorry, this should have been in https://github.com/pystatgen/sgkit/pull/276#discussion_r495782207

Let's see what @eric-czech thinks. Other than that, afaiu @tomwhite this is ready for another review.

tomwhite · 2020-09-28T08:55:09Z

sgkit/variables.py

+"""Sample PCs (PCxS)."""
+pc_relate_phi = ArrayLikeSpec("pc_relate_phi", ndim=2, kind="f")
+"""PC Relate kinship coefficient matrix."""
+base_prediction = ArrayLikeSpec("base_prediction", ndim=4, kind="f")


regenie_base_prediction?

tomwhite · 2020-09-28T08:56:14Z

sgkit/variables.py

+homozygous reference, and homozygous alternate counts (in that order)
+across all samples for a variant.
+"""
+call_allele_count = ArrayLikeSpec("call_allele_count", ndim=3, kind="u")


We should make genotype_counts/call_allele_count consistent. Perhaps in a follow-up PR though.

tomwhite · 2020-09-28T09:01:41Z

sgkit/stats/aggregation.py

@@ -51,16 +52,20 @@ def count_alleles(g: ArrayLike, _: ArrayLike, out: ArrayLike) -> None:
            out[a] += 1


-def count_call_alleles(ds: Dataset, merge: bool = True) -> Dataset:
+def count_call_alleles(
+    ds: Dataset, *, call_genotype: str = "call_genotype", merge: bool = True


Nice change here to make variable name arguments (and merge) keyword only.
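
For readers unfamiliar with the pattern, a bare `*` in the signature makes every parameter after it keyword-only. The toy function below mirrors the shape of the new signature but is a stand-in, not the real `count_call_alleles`; a dict of lists stands in for the dataset and the output name `allele_count` is made up.

```python
from collections import Counter
from typing import Dict, List


def toy_count_call_alleles(
    ds: Dict[str, List[int]], *, call_genotype: str = "call_genotype", merge: bool = True
) -> Dict[str, object]:
    """Toy stand-in: count allele values in the variable named by call_genotype."""
    new = {"allele_count": dict(Counter(ds[call_genotype]))}
    # merge=True returns the inputs plus the new variable; False returns only the new one.
    return {**ds, **new} if merge else new


ds = {"gt": [0, 1, 1]}
result = toy_count_call_alleles(ds, call_genotype="gt", merge=False)

# Passing the variable name positionally is rejected:
#   toy_count_call_alleles(ds, "gt")  -> TypeError
```

This keeps call sites readable and lets new variable-name arguments be added later without breaking positional callers.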

mergify · 2020-09-28T13:24:30Z

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

jeromekelleher

LGTM!

jeromekelleher · 2020-09-28T15:34:52Z

sgkit/variables.py

+    ndim: Union[None, int, Set[int]] = None
+
+
+call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3)


I agree with @tomwhite here

mergify · 2020-09-30T09:24:40Z

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

ravwojdyla · 2020-09-30T10:13:18Z

FYI rebased to remove the conflict.

mergify · 2020-10-01T15:09:29Z

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

See https://github.com/pystatgen/sgkit/pull/124 for more context

eric-czech

Looks good!

jeromekelleher · 2020-10-01T16:38:14Z

I've marked this auto-merge @ravwojdyla, should merge once you've cleared up the conflicts.

mergify bot added the conflict PR conflict label Sep 24, 2020

ravwojdyla force-pushed the rav/schema_v0_3 branch from 130af83 to 722e50f Compare September 24, 2020 14:39

mergify bot removed the conflict PR conflict label Sep 24, 2020

ravwojdyla commented Sep 24, 2020

ravwojdyla force-pushed the rav/schema_v0_3 branch 2 times, most recently from 6f836d0 to 2e469f7 Compare September 25, 2020 14:16

ravwojdyla requested review from tomwhite, eric-czech and jeromekelleher September 25, 2020 14:21

eric-czech reviewed Sep 25, 2020

eric-czech requested changes Sep 25, 2020

tomwhite requested changes Sep 28, 2020

ravwojdyla requested review from tomwhite and eric-czech September 28, 2020 10:14

ravwojdyla force-pushed the rav/schema_v0_3 branch from 8a9d4b5 to 9768178 Compare September 28, 2020 10:21

mergify bot added the conflict PR conflict label Sep 28, 2020

ravwojdyla force-pushed the rav/schema_v0_3 branch from 9768178 to 8ca9f8b Compare September 28, 2020 14:03

mergify bot removed the conflict PR conflict label Sep 28, 2020

jeromekelleher approved these changes Sep 28, 2020

eric-czech mentioned this pull request Sep 29, 2020

Cohort grouping for popgen #260

Merged

mergify bot added the conflict PR conflict label Sep 30, 2020

ravwojdyla force-pushed the rav/schema_v0_3 branch from 6271e89 to 46184e5 Compare September 30, 2020 10:12

mergify bot removed the conflict PR conflict label Sep 30, 2020

tomwhite approved these changes Oct 1, 2020

mergify bot added the conflict PR conflict label Oct 1, 2020

Add SgkitVariables

22fd605

See https://github.com/pystatgen/sgkit/pull/124 for more context

ravwojdyla mentioned this pull request Oct 1, 2020

Link variables to methods (and back) #293

Open

ravwojdyla added 7 commits October 1, 2020 18:14

Use SgkitVariables in the computation functions

4b499b1

Add sgkit.variables to the doc

c18902d

Add links between methods and variables

070f828

Add an option to validate whole dataset

6bb078f

Change wording, fix pc_relate signature

a4a3b77

Sort variables names + add top level comment

663a4b9

Reference variables instead of using strings

51b4587

eric-czech approved these changes Oct 1, 2020

jeromekelleher added the auto-merge Auto merge label for mergify test flight label Oct 1, 2020

ravwojdyla added 2 commits October 2, 2020 02:49

Update variable changes after recent popgen changes

11e55cb

Use Hashable instead of string for var names

5a459e5

ravwojdyla force-pushed the rav/schema_v0_3 branch from 46184e5 to 5a459e5 Compare October 2, 2020 00:51

mergify bot removed the conflict PR conflict label Oct 2, 2020

mergify bot merged commit a510e12 into sgkit-dev:master Oct 2, 2020

ravwojdyla mentioned this pull request Oct 2, 2020

Review sgkit variables #295

Closed
