
Add PC Relate #228

Merged: 2 commits merged into sgkit-dev:master from rav/pc_relate on Sep 14, 2020

Conversation

@ravwojdyla (Collaborator) commented Sep 2, 2020

Need to open a new PR to move to the forked repo. Original: #133.

Changes:

  • add pc_relate method
  • add tests for pc_relate
  • add validation data generator for pc_relate
  • add missing_pct to simulate_genotype_call_dataset, to support missing data

@ravwojdyla (Collaborator, Author) commented Sep 2, 2020

@eric-czech no implementation/logic changes since #133, just added more tests and docs. Could you please take a look at the tests and let me know if anything more is needed from a test-logic perspective? (I still need to update the dataset for validation, see below.)

I added validation code to produce a validation dataset, BUT I have an issue. I would really like to use a "production-ish" dataset for validation, and I noticed that the zarr zip store produced from sgkit_plink is actually almost twice the size of the original uncompressed PLINK data (this almost smells like a bug, but I'm not familiar with the sgkit_plink codebase, so maybe it is expected). Do you have recommendations for keeping the test data as small as possible?

(sgkit-dev) ➜  pc_relate git:(rav/pc_relate) ✗ du -sh hapmap_YRI_r23a_filtered.*
 37M    hapmap_YRI_r23a_filtered.bed
 69M    hapmap_YRI_r23a_filtered.bim
4.0K    hapmap_YRI_r23a_filtered.fam
210M    hapmap_YRI_r23a_filtered.zarr.zip
 37M    hapmap_YRI_r23a_filtered.zip

hapmap_YRI_r23a_filtered.zarr.zip only includes vars: ['call_genotype', 'call_genotype_mask'].

This should not be merged until we figure out a way to make the test data smaller, or maybe store it somewhere else, or generate it as part of the test.

@eric-czech (Collaborator) commented Sep 3, 2020

> Could you please take a look at the tests and let me know if anything more is needed from a test-logic perspective?

Ok I'll take another look. Can you zip up test_pc_relate/zarr_data to get the file count down some more?

> I noticed that the zarr zip store produced from sgkit_plink is actually almost twice the size of the original uncompressed

I'm not surprised; something like https://github.com/pystatgen/sgkit/issues/80 was necessary in the past for me to produce zarrs smaller than uncompressed PLINK, given how wasteful our representation is (we currently use 8 times as many bytes as PLINK does for calls).

I would be surprised, though, if you couldn't set Blosc/zstd compressors and get down to the same size with a higher clevel. I have no sense of how well the zip compression works by default.
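For illustration, something along these lines should work (a rough sketch, not tested; ds here stands for the xarray dataset produced by sgkit_plink, and clevel=7 is just a value to experiment with):

import zarr
from numcodecs import Blosc

# Use zstd through Blosc with a higher compression level than the default
compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.BITSHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}

# Write straight into a zip store so the test data stays a single file
with zarr.ZipStore("hapmap_YRI_r23a_filtered.zarr.zip", mode="w") as store:
    ds.to_zarr(store, encoding=encoding)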

> this almost smells like a bug, but I'm not familiar with the sgkit_plink codebase

How so? I can't see how they'd be connected unless the data written has a ton more entropy somehow (and is completely wrong) or the data types are getting promoted.


Raises
------
ValueError
Collaborator:

These should be broken out like https://github.com/pystatgen/sgkit/pull/214#discussion_r478502734 (or tested on sphinx with a different bullet -- I'm not sure what would make it happy).
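For example, roughly something like this (the conditions below are just illustrative, not the actual checks in the PR):

def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
    """...

    Raises
    ------
    ValueError
        If `maf` is not in the range (0.0, 1.0).
    ValueError
        If the dataset does not contain diploid genotype calls.
    """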

Collaborator:

@ravwojdyla just wanted to check: have you tried running the Sphinx docs for this PR?

a locus are IBD. Several of the most common family relationships and their
corresponding kinship coefficient:

+--------------------------------------------------+---------------------+
Collaborator:

👍

eric-czech previously approved these changes Sep 4, 2020

@eric-czech (Collaborator) left a comment

Looks great @ravwojdyla!

Except for the minor docstring change for exceptions raised and saving the test data in a ZipStore, I think it's good to go.

mergify bot dismissed eric-czech’s stale review September 5, 2020 12:10

Pull request has been modified.

@ravwojdyla (Collaborator, Author) commented Sep 5, 2020

I wonder what you think about this solution @eric-czech, @tomwhite, @jeromekelleher, @alimanfoo, @hammer. To properly validate PC Relate, apart from unit tests, I want to use production public data, but:

  • I don't want to check it into the git repo (since it's ~100s of MB)
  • I also don't want to store the data/results in external systems (GCS/S3 etc, which likely would go stale)
  • I would like to "track" changes in the upstream reference R implementation

So here I suggest adding a "Validation suite" GitHub workflow that runs weekly or on manual trigger. This allows me to:

  • fetch production HapMap data from upstream
  • run the up-to-date R reference implementation of PC Relate
  • compare our implementation against up-to-date results

without polluting the git repo.

Some points:

  • I believe this would be a good signal for external people that we care about quality, and that they can expect similar/same results as the tools they already use (!)
  • the validation codebase could serve as great documentation and examples for people who are getting started and maybe know the R/Hail implementations, but would like to see how it could be done with sgkit
  • the workflow implementation is super simple
  • to fully run the validation (which includes building the docker image, building R packages, installing up-to-date sgkit-plink, downloading HapMap data, running the R PC Relate and our PC Relate) we need around 40 minutes, but since this is not a unit test and it runs weekly, we don't need to worry about it much
  • right now the validation workflow only includes PC Relate, but we could obviously include more steps; we could even use this pattern for benchmarks (Add benchmark suite #68). An exciting validation could be to run a full GWAS (or other high-level use cases) using Hail and sgkit and compare them on a weekly basis
  • the validation codebase requires only bash and docker, and runs the same on GitHub Actions and locally: ./validation/gwas/method/pc_relate/run.sh (which builds the validation docker image and runs itself inside that image)
  • the validation test is a regular pytest (see the sketch after this list); I used to work on a project where we had a nice integration between pytest and docker to run specific pytests inside specific docker images, and if we like this pattern, I could reimplement it
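As a rough sketch (the file names, the reference CSV from the R GENESIS run, and the tolerance below are just placeholders, not the actual code):

import numpy as np
import pandas as pd
import xarray as xr
import zarr

from sgkit import pc_relate


def test_pc_relate_matches_genesis_reference() -> None:
    # Placeholder inputs: the HapMap dataset converted to a zarr zip store
    # (already containing the required "sample_pcs" variable) and the kinship
    # matrix written out by the R GENESIS reference implementation.
    store = zarr.ZipStore("hapmap_YRI_r23a_filtered.zarr.zip", mode="r")
    ds = xr.open_zarr(store)
    expected = pd.read_csv("genesis_pc_relate_phi.csv", index_col=0).values

    phi = pc_relate(ds)["pc_relate_phi"].compute().values

    # Tolerance is a placeholder; sample ordering must match the reference.
    np.testing.assert_allclose(phi, expected, atol=1e-4)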

Downsides:

  • there are a lot of moving parts (upstream data, R packages, etc.) involved in the process (as mentioned above), so in the future there might be failures in the PC Relate validation that are not related to changes in our repo, for example a change in an R package. This is problematic, BUT there is a bright side to it: it prevents the validation codebase from going stale (which is great!)

I've added 7775e58 to show you how the validation run looks, see the outputs here (this commit would obviously not be merged in, I just wanted to validate the execution and show it to you).

@tomwhite (Collaborator) commented Sep 7, 2020

This is great @ravwojdyla!

Validating the code in sgkit is probably the most important job this script does at the moment, since sgkit's code is changing rapidly (not PC relate specifically - more the other parts). So I think even having that by itself would be valuable. Tracking the R implementation may become onerous, and may not necessarily be something we want to do longer term - but we could easily freeze the version of the package we validate against.

> I also don't want to store the data/results in external systems (GCS/S3 etc, which likely would go stale)

The input data should be fixed, shouldn't it? So going stale isn't a concern. I think it might be preferable to host a copy that we control, for various reasons (reliability, not overloading the upstream source, etc).

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

@tomwhite

> but we could easily freeze the version of the package we validate against.

Yep, I have not done that because I actually would like to get an error if the upstream gets updated. If we agree that we prefer to freeze versions, that's fine too (we could also freeze everything but the GENESIS package), though then it would not fail on actual changes in the upstream implementation (without periodic manual updates of the versions). We could also keep the current state, and if this becomes problematic in the future, we can freeze all versions?

>> I also don't want to store the data/results in external systems (GCS/S3 etc, which likely would go stale)

> The input data should be fixed, shouldn't it? So going stale isn't a concern. I think it might be preferable to host a copy that we control, for various reasons (reliability, not overloading the upstream source, etc).

Yes, the staleness I'm referring to is in the codebase. In the past I have seen two ways of doing this:

  • you freeze data in, say, GCS (or similar) at known paths, and sometimes update (overwrite) that data - which makes the new codebase work, but the old one might fail (this is usually a short-term solution)
  • you essentially reinvent some parts of external resource tracking within git, which can be lightweight but nevertheless still requires work - essentially why LFS exists

Normally, if this were not an open source project, I would also lean towards hosting our own data, but since this is the open source realm, I have a slight preference for using the vanilla upstream as another signal that we test on actual public datasets etc.

wdyt?

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

And by the way, I would also like to point out an interesting use of git bisect (already mentioned in #26) in the context of the weekly validation codebase. Say we have this history:

  • on Sunday the validation runs - it's all green
  • during the following week we have a bunch of changes to different parts of sgkit (all unit tests are green)
  • next Sunday, the validation fails due to a combination of, or just one of, the changes introduced throughout the week

Question: how do we know exactly which change introduced the regression/bug? git bisect (with the linear git history that we keep since #26) plus the validation suite comes to the rescue:

  • you mark the current master as bad by running validation suite and getting return code != 0
  • you mark previous Sunday as good by running validation suite and getting return code == 0
  • you let git automatically binary search for the commit that introduced the bug

You can read more about git bisect here. We could document this flow, when we get the validation flow in.

@hammer (Contributor) commented Sep 7, 2020

I agree with Tom: we don't want to be in the business of tracking upstream code and data. It would be nice for us to just validate against a particular release or hash of an upstream tool and some data fixture produced by that tool.

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

@hammer @tomwhite ok, if we want point-in-time validation, we can freeze all the versions and push the input data to a public GCS bucket (or do you prefer LFS?). My preference is GCS (there might be some need to configure access costs).

@hammer (Contributor) commented Sep 7, 2020

I also prefer GCS.

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

Also, going forward, if we like the Validation Suite approach, I will split this PR into PC Relate + unit tests (this PR) and a separate PR with the validation suite + PC Relate validation (everything from this PR related to the Validation Suite).

@ravwojdyla (Collaborator, Author):

As mentioned above, I am splitting this PR.

ravwojdyla mentioned this pull request Sep 8, 2020

@eric-czech (Collaborator) left a comment

I pulled the code down to experiment with it a bit and it helped me find a thing or two -- nothing major except for the PCA applications in the tests. I think that extra test I proposed would be a good addition too, in order to challenge the implementation a bit more without being slow / overly complicated.

@@ -0,0 +1,144 @@
from typing import Tuple

Collaborator:

Add to top level api init?

Collaborator Author:

added

----------
ds : `xr.Dataset`
Dataset containing (S = num samples, V = num variants, D = ploidy, PC = num PC):
* genotype calls: "call_genotype" (SxVxD)
Collaborator:

I'm not sure how this will render in sphinx, but it would probably be a good idea to use the same convention as in the results section where the variables are listed the way parameters are (but with more indentation). That seems to render at least somewhat nicely for https://pystatgen.github.io/sgkit/generated/sgkit.gwas_linear_regression.html#sgkit-gwas-linear-regression.
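Roughly something like this (the descriptions below are paraphrased placeholders, not the actual docstring text):

    Parameters
    ----------
    ds : xr.Dataset
        Dataset containing (S = num samples, V = num variants, D = ploidy, PC = num PC):

        call_genotype
            Genotype calls, shape (SxVxD).
        call_genotype_mask
            Genotype call missingness mask, shape (SxVxD).
        sample_pcs
            Sample principal component scores, shape (PCxS).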

Collaborator Author:

fixed

assert g.call_genotype.shape == (1000, 10, 2)
assert g.call_genotype_mask.shape == (1000, 10, 2)

# Sprinkle some tests data, this is a bit verbose, but in tests verbosity is not bad:
Collaborator:

nit: "Test individual cases" perhaps since "sprinkle some tests data" doesn't make sense (grammar-wise)? Not sure you need the verbosity disclaimer, I certainly agree it's not bad.

Collaborator Author:

Fixed, used "Test individual cases".

def test_impute_genotype_call_with_variant_mean() -> None:
g = simulate_genotype_call_dataset(1000, 10, missing_pct=0.1)
call_g, call_g_mask = _collapse_ploidy(g)
# Sprinkle some tests data
Collaborator:

nit: Like above, not quite sure what that's getting at

Collaborator Author:

fixed

n_samples = 100
g = simulate_genotype_call_dataset(1000, n_samples)
call_g, _ = _collapse_ploidy(g)
pcs = PCA(call_g, ncomp=2).loadings
@eric-czech (Collaborator) commented Sep 12, 2020

I took a look at the statsmodels code/docs and this would give the eigenvectors from a decomposition on input data where the samples are features rather than observations. It happens to have the same dimensions, but it is not the same as the projection of the samples into the PC space implied by a decomposition on the transposed inputs (i.e. samples x variants rather than variants x samples like call_g). It's confusing because sometimes people refer to those projections as "loadings" as well, but "scores" is more common, and the pcs should be scores rather than loadings -- I haven't checked GENESIS but I took a look at Hail to be certain. It also doesn't really make sense to use scaled eigenvectors from the transposed problem for anything unless those eigenvectors come from left-singular vectors instead of right-singular vectors (they'll be right-singular when coming from PCA).

I would suggest using either one of these to generate scores instead:

# `x` would have shape (samples, variants) in these cases
import numpy as np
from sklearn.decomposition import PCA

def pc_scores_numpy(x, k):
    # Center each variant (column), then take the top-k sample scores U * S
    x = x - x.mean(axis=0, keepdims=True)
    u, s, v = np.linalg.svd(x, full_matrices=False)
    r = u * s[np.newaxis]
    return r[:, :k]

def pc_scores_sklearn(x, k):
    # Equivalent scores via scikit-learn's PCA (exact/full SVD solver)
    return PCA(n_components=k, svd_solver='full').fit_transform(x)

# Then in the test:
# pcs = pc_scores(call_g.T.values, 2)
# ds["sample_pcs"] = (("components", "samples"), pcs.T)

This was all fresh since I was doing a lot of work with SVD this week, but I wanted to double check the above as well as the usual "loadings" definitions so I verified in this gist: https://gist.github.com/eric-czech/52a04007f4cfb8f05b312079ca770add

Collaborator:

Oh also I found that the statsmodels.PCA was super slow even with small data, so those definitely speed things up a lot.

@ravwojdyla (Collaborator, Author) commented Sep 12, 2020

Oh dang, I remember I meant to validate that, thanks for the catch! Also, I did not want to add sklearn as another dependency, but since it's a dev/test dependency, that was a premature optimisation on my part. Fixed, used sklearn (and added it to requirements-dev.txt).

data_np = phi.pc_relate_phi.data.compute() # to be able to use fancy indexing below
upper_phi = data_np[np.triu_indices_from(data_np, 1)]
assert (upper_phi > -0.5).all() and (upper_phi < 0.5).all()

@eric-czech (Collaborator) commented Sep 12, 2020

I think a useful test here would be to try to at least pick out some parent/child relationships since they're fairly easy to simulate. I was tinkering with this a bit and it snowballed into this:

# Create a dataset that is 2/3 founders and 1/3 progeny
rs = np.random.RandomState(1)
ds = simulate_genotype_call_dataset(1000, 300, seed=1)
ds['sample_type'] = xr.DataArray(np.repeat(['mother', 'father', 'child'], 100), dims='samples')
sample_groups = ds.groupby('sample_type').groups

def simulate_new_generation(ds):
    # Generate progeny genotypes as a combination of randomly
    # selected haplotypes from each parent
    idx = sample_groups['mother'] + sample_groups['father']
    gt = ds.call_genotype.isel(samples=idx).values
    idx = rs.randint(0, 2, size=gt.shape[:2])
    # Collapse to haplotype across ploidy dim using indexer
    # * shape = (samples, variants)
    ht = gt[np.ix_(*map(range, gt.shape[:2])) + (idx,)].T
    gt_child = np.stack([ht[sample_groups[t]] for t in ['mother', 'father']]).T
    ds['call_genotype'].values = np.concatenate((gt, gt_child), axis=1)
    return ds

# Redefine the progeny genotypes
ds = simulate_new_generation(ds)

# Infer kinship
call_g, _ = _collapse_ploidy(ds)
pcs = pc_scores(call_g.T.values, 1)
ds["sample_pcs"] = (("components", "samples"), pcs.T)
ds["pc_relate_phi"] = pc_relate(ds)["pc_relate_phi"].compute()

# Check that all coefficients are in expected ranges
cts = (
    ds["pc_relate_phi"].to_series().reset_index()
    .pipe(lambda df: df.loc[df.sample_x >= df.sample_y]['pc_relate_phi'])
    .pipe(
        pd.cut, 
        bins=[p for phi in [0, .25, .5] for p in [phi-.1, phi+.1]],
        labels = ['unrelated', 'unclassified', 'parent/child', 'unclassified', 'self'],
        ordered=False
    )
    .value_counts()
)
print(cts)
# unrelated       44650
# self              300
# parent/child      200
# unclassified        0
assert cts['parent/child'] == len(sample_groups['child']) * 2
assert cts['self'] == ds.dims['samples']
assert cts['unclassified'] == 0

That produces kinship estimates with pretty well segregated distributions, which you can tighten up a bit by adding more variants to the simulation:

[Screenshot: distributions of the kinship estimates]

Collaborator Author:

Thanks Eric!!! I initially did not want to implement this kind of test since all those cases were covered by #242 (on real data). I added your test anyway.

mergify bot dismissed eric-czech’s stale review September 12, 2020 19:24

Pull request has been modified.

@ravwojdyla (Collaborator, Author) commented Sep 12, 2020

Weird, test_read_vcfzarr is failing 🤔:

ValueError: Could not convert tuple of form (dims, data[, attrs, encoding]): (['variants'], False) to Variable.

Let's see if it's a persistent error or a flaky test:

@ravwojdyla (Collaborator, Author) commented Sep 12, 2020

Ah, interesting: dask 2.26.0 was released yesterday and it breaks our test_read_vcfzarr. The same test works fine on 2.25.0.

eric-czech previously approved these changes Sep 12, 2020
mergify bot (Contributor) commented Sep 14, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict (PR conflict) label Sep 14, 2020
mergify bot dismissed eric-czech’s stale review September 14, 2020 09:30

Pull request has been modified.

mergify bot removed the conflict (PR conflict) label Sep 14, 2020
tomwhite previously approved these changes Sep 14, 2020


return call_g, call_g_mask


def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
Collaborator:

We should add the merge=True parameter, but that can be done separately.

Collaborator Author:

added

DIM_ALLELE,
DIM_PLOIDY,
DIM_SAMPLE,
Collaborator:

Nit: why did this change?

Collaborator Author:

Must have been isort doing something weird; reverted to the original.

mergify bot (Contributor) commented Sep 14, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict (PR conflict) label Sep 14, 2020
mergify bot dismissed tomwhite’s stale review September 14, 2020 10:20

Pull request has been modified.

mergify bot removed the conflict (PR conflict) label Sep 14, 2020
@ravwojdyla (Collaborator, Author):

@tomwhite I also added pc_relate to the API docs and fixed some rst issues.

mergify bot (Contributor) commented Sep 14, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict (PR conflict) label Sep 14, 2020
mergify bot removed the conflict (PR conflict) label Sep 14, 2020
@ravwojdyla (Collaborator, Author):

@tomwhite is this good to merge?

@tomwhite (Collaborator):

From my point of view, yes. Did @eric-czech approve this one?

@eric-czech (Collaborator):

> Did @eric-czech approve this one?

I did but it looks like mergify dismissed it. Re-approving now.

tomwhite added the auto-merge (Auto merge label for mergify test flight) label Sep 14, 2020
mergify bot merged commit bec1616 into sgkit-dev:master Sep 14, 2020
ravwojdyla deleted the rav/pc_relate branch September 14, 2020 11:49