
Add PC Relate #228

Merged: 2 commits merged into sgkit-dev:master from rav/pc_relate on Sep 14, 2020

Conversation

@ravwojdyla (Collaborator) commented Sep 2, 2020

Need to open a new PR to move to the forked repo. Original: #133.

Changes:

  • add pc_relate method
  • add tests for pc_relate
  • add validation data generator for pc_relate
  • add missing_pct to simulate_genotype_call_dataset, to support missing data

@ravwojdyla (Collaborator, Author) commented Sep 2, 2020

@eric-czech no implementation/logic changes since #133, just added more tests and docs. Could you please take a look at the tests and let me know if anything more is needed from a test-logic perspective? (I still need to update the dataset for validation, see below.)

I added validation code to produce a validation dataset, BUT I have an issue. I would really like to use a "production-ish" dataset for validation, and I noticed that the zarr zip store produced from sgkit_plink is actually almost twice the size of the original uncompressed PLINK data (this almost smells like a bug, but I'm not familiar with the sgkit_plink codebase, so maybe it is expected). Do you have recommendations for keeping the test data as small as possible?

(sgkit-dev) ➜  pc_relate git:(rav/pc_relate) ✗ du -sh hapmap_YRI_r23a_filtered.*
 37M    hapmap_YRI_r23a_filtered.bed
 69M    hapmap_YRI_r23a_filtered.bim
4.0K    hapmap_YRI_r23a_filtered.fam
210M    hapmap_YRI_r23a_filtered.zarr.zip
 37M    hapmap_YRI_r23a_filtered.zip

hapmap_YRI_r23a_filtered.zarr.zip only includes vars: ['call_genotype', 'call_genotype_mask'].

This should not be merged until we figure out a way to make the test data smaller, or maybe store it somewhere else, or generate it as part of the test.

@eric-czech (Collaborator) commented Sep 3, 2020

> Could you please take a look at the tests and let me know if anything more is needed from a test-logic perspective?

Ok I'll take another look. Can you zip up test_pc_relate/zarr_data to get the file count down some more?

> I noticed that the zarr zip store produced from sgkit_plink is actually almost twice the size of the original uncompressed

I'm not surprised; something like https://github.com/pystatgen/sgkit/issues/80 was necessary in the past for me to produce zarrs smaller than uncompressed PLINK, given how wasteful our representation is (we currently use 8 times as many bytes as PLINK does for calls).

I would be surprised, though, if you couldn't set Blosc/zstd compressors and get down to the same size with a higher clevel. I have no sense of how well the zip compression works by default.
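For illustration, something along these lines should work (a rough sketch, not tested; ds here stands for the xarray dataset produced by sgkit_plink, and clevel=7 is just a value to experiment with):

import zarr
from numcodecs import Blosc

# Use zstd through Blosc with a higher compression level than the default
compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.BITSHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}

# Write straight into a zip store so the test data stays a single file
with zarr.ZipStore("hapmap_YRI_r23a_filtered.zarr.zip", mode="w") as store:
    ds.to_zarr(store, encoding=encoding)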

> this almost smells like a bug, but I'm not familiar with the sgkit_plink codebase

How so? I can't see how they'd be connected unless the data written has a ton more entropy somehow (and is completely wrong) or the data types are getting promoted.


Raises
------
ValueError
Collaborator:

These should be broken out like https://github.com/pystatgen/sgkit/pull/214#discussion_r478502734 (or tested on sphinx with a different bullet -- I'm not sure what would make it happy).
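For example, roughly something like this (the conditions below are just illustrative, not the actual checks in the PR):

def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
    """...

    Raises
    ------
    ValueError
        If `maf` is not in the range (0.0, 1.0).
    ValueError
        If the dataset does not contain diploid genotype calls.
    """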

Collaborator:

@ravwojdyla just wanted to check: have you tried running the Sphinx docs for this PR?

a locus are IBD. Several of the most common family relationships and their
corresponding kinship coefficient:

+--------------------------------------------------+---------------------+
Collaborator:

👍

eric-czech previously approved these changes Sep 4, 2020

@eric-czech (Collaborator) left a comment

Looks great @ravwojdyla!

Except for the minor docstring change for exceptions raised and saving the test data in a ZipStore, I think it's good to go.

mergify bot dismissed eric-czech’s stale review September 5, 2020 12:10

Pull request has been modified.

@ravwojdyla (Collaborator, Author) commented Sep 5, 2020

I wonder what you think about this solution @eric-czech, @tomwhite, @jeromekelleher, @alimanfoo, @hammer. To properly validate PC Relate, apart from unit tests, I want to use production public data, but:

  • I don't want to check it into the git repo (since it's ~100s of MB)
  • I also don't want to store the data/results in external systems (GCS/S3 etc, which likely would go stale)
  • I would like to "track" changes in the upstream reference R implementation

So here I suggest adding a "Validation suite" GitHub workflow that runs weekly or on manual trigger. This allows me to:

  • fetch production HapMap data from upstream
  • run the up-to-date R reference implementation of PC Relate
  • compare our implementation against up-to-date results

without polluting the git repo.

Some points:

  • I believe this would be a good signal for external people that we care about quality, and that they can expect similar/same results as the tools they already use (!)
  • the validation codebase could serve as great documentation and examples for people who are getting started and maybe know the R/Hail implementations, but would like to see how it could be done with sgkit
  • the workflow implementation is super simple
  • to fully run the validation (which includes building the docker image, building R packages, installing up-to-date sgkit-plink, downloading HapMap data, running the R PC Relate and our PC Relate) we need around 40 minutes, but since this is not a unit test and it runs weekly, we don't need to worry about it much
  • right now the validation workflow only includes PC Relate, but we could obviously include more steps; we could even use this pattern for benchmarks (Add benchmark suite #68). An exciting validation could be to run a full GWAS (or other high-level use cases) using Hail and sgkit and compare them on a weekly basis
  • the validation codebase requires only bash and docker, and runs the same on GitHub Actions and locally: ./validation/gwas/method/pc_relate/run.sh (which builds the validation docker image and runs itself inside that image)
  • the validation test is a regular pytest (see the sketch after this list); I used to work on a project where we had a nice integration between pytest and docker to run specific pytests inside specific docker images, and if we like this pattern, I could reimplement it
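As a rough sketch (the file names, the reference CSV from the R GENESIS run, and the tolerance below are just placeholders, not the actual code):

import numpy as np
import pandas as pd
import xarray as xr
import zarr

from sgkit import pc_relate


def test_pc_relate_matches_genesis_reference() -> None:
    # Placeholder inputs: the HapMap dataset converted to a zarr zip store
    # (already containing the required "sample_pcs" variable) and the kinship
    # matrix written out by the R GENESIS reference implementation.
    store = zarr.ZipStore("hapmap_YRI_r23a_filtered.zarr.zip", mode="r")
    ds = xr.open_zarr(store)
    expected = pd.read_csv("genesis_pc_relate_phi.csv", index_col=0).values

    phi = pc_relate(ds)["pc_relate_phi"].compute().values

    # Tolerance is a placeholder; sample ordering must match the reference.
    np.testing.assert_allclose(phi, expected, atol=1e-4)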

Downsides:

  • there are a lot of moving parts (upstream data, R packages, etc.) involved in the process (as mentioned above), so in the future there might be failures in the PC Relate validation that are not related to changes in our repo, for example a change in an R package. This is problematic, BUT there is a bright side to it: it prevents the validation codebase from going stale (which is great!)

I've added 7775e58 to show you how the validation run looks, see the outputs here (this commit would obviously not be merged in, I just wanted to validate the execution and show it to you).

@tomwhite (Collaborator) commented Sep 7, 2020

This is great @ravwojdyla!

Validating the code in sgkit is probably the most important job this script does at the moment, since sgkit's code is changing rapidly (not PC relate specifically - more the other parts). So I think even having that by itself would be valuable. Tracking the R implementation may become onerous, and may not necessarily be something we want to do longer term - but we could easily freeze the version of the package we validate against.

> I also don't want to store the data/results in external systems (GCS/S3 etc, which likely would go stale)

The input data should be fixed, shouldn't it? So going stale isn't a concern. I think it might be preferable to host a copy that we control, for various reasons (reliability, not overloading the upstream source, etc).

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

@tomwhite

> but we could easily freeze the version of the package we validate against.

Yep, I have not done that because I actually would like to get an error if the upstream gets updated. If we agree that we prefer to freeze versions, that's fine too (we could also freeze everything but the GENESIS package), though then it would not fail on actual changes in the upstream implementation (without periodic manual updates of the versions). We could also keep the current state, and if this becomes problematic in the future, we can freeze all versions?

>> I also don't want to store the data/results in external systems (GCS/S3 etc, which likely would go stale)

> The input data should be fixed, shouldn't it? So going stale isn't a concern. I think it might be preferable to host a copy that we control, for various reasons (reliability, not overloading the upstream source, etc).

Yes, the staleness I'm referring to is in the codebase. In the past I have seen two ways of doing this:

  • you freeze data in, say, GCS (or similar) at known paths, and sometimes update (overwrite) that data - which makes the new codebase work, but the old one might fail (this is usually a short-term solution)
  • you essentially reinvent some parts of external resource tracking within git, which can be lightweight but nevertheless still requires work - essentially why LFS exists

Normally, if this were not an open source project, I would also lean towards hosting our own data, but since this is the open source realm, I have a slight preference for using the vanilla upstream as another signal that we test on actual public datasets etc.

wdyt?

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

And by the way, I would also like to point out an interesting use of git bisect (already mentioned in #26) in the context of the weekly validation codebase. Say we have this history:

  • on Sunday the validation runs - it's all green
  • during the following week we have a bunch of changes to different parts of sgkit (all unit tests are green)
  • next Sunday, the validation fails due to a combination of, or just one of, the changes introduced throughout the week

Question: how do we know exactly which change introduced the regression/bug? git bisect (with the linear git history that we keep since #26) plus the validation suite comes to the rescue:

  • you mark the current master as bad by running validation suite and getting return code != 0
  • you mark previous Sunday as good by running validation suite and getting return code == 0
  • you let git automatically binary search for the commit that introduced the bug

You can read more about git bisect here. We could document this flow, when we get the validation flow in.

@hammer (Contributor) commented Sep 7, 2020

I agree with Tom: we don't want to be in the business of tracking upstream code and data. It would be nice for us to just validate against a particular release or hash of an upstream tool and some data fixture produced by that tool.

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

@hammer @tomwhite ok, if we want point-in-time validation, we can freeze all the versions and push the input data to a public GCS bucket (or do you prefer LFS?). My preference is GCS (there might be some need to configure access costs).

@hammer (Contributor) commented Sep 7, 2020

I also prefer GCS.

@ravwojdyla (Collaborator, Author) commented Sep 7, 2020

Also, going forward, if we like the Validation Suite approach, I will split this PR into PC Relate + unit tests (this PR) and a separate PR with the validation suite + PC Relate validation (everything from this PR related to the Validation Suite).

@ravwojdyla (Collaborator, Author):

As mentioned above, I am splitting this PR.

ravwojdyla mentioned this pull request Sep 8, 2020

@eric-czech (Collaborator) left a comment

I pulled the code down to experiment with it a bit and it helped me find a thing or two -- nothing major except for the PCA applications in the tests. I think that extra test I proposed would be a good addition too, in order to challenge the implementation a bit more without being slow / overly complicated.

@@ -0,0 +1,144 @@
from typing import Tuple

Collaborator:

Add to top level api init?

Collaborator Author:

added

----------
ds : `xr.Dataset`
Dataset containing (S = num samples, V = num variants, D = ploidy, PC = num PC):
* genotype calls: "call_genotype" (SxVxD)
Collaborator:

I'm not sure how this will render in sphinx, but it would probably be a good idea to use the same convention as in the results section where the variables are listed the way parameters are (but with more indentation). That seems to render at least somewhat nicely for https://pystatgen.github.io/sgkit/generated/sgkit.gwas_linear_regression.html#sgkit-gwas-linear-regression.
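Roughly something like this (the descriptions below are paraphrased placeholders, not the actual docstring text):

    Parameters
    ----------
    ds : xr.Dataset
        Dataset containing (S = num samples, V = num variants, D = ploidy, PC = num PC):

        call_genotype
            Genotype calls, shape (SxVxD).
        call_genotype_mask
            Genotype call missingness mask, shape (SxVxD).
        sample_pcs
            Sample principal component scores, shape (PCxS).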

Collaborator Author:

fixed

assert g.call_genotype.shape == (1000, 10, 2)
assert g.call_genotype_mask.shape == (1000, 10, 2)

# Sprinkle some tests data, this is a bit verbose, but in tests verbosity is not bad:
Collaborator:

nit: "Test individual cases" perhaps since "sprinkle some tests data" doesn't make sense (grammar-wise)? Not sure you need the verbosity disclaimer, I certainly agree it's not bad.

Collaborator Author:

Fixed, used "Test individual cases".

def test_impute_genotype_call_with_variant_mean() -> None:
g = simulate_genotype_call_dataset(1000, 10, missing_pct=0.1)
call_g, call_g_mask = _collapse_ploidy(g)
# Sprinkle some tests data
Collaborator:

nit: Like above, not quite sure what that's getting at

Collaborator Author:

fixed

n_samples = 100
g = simulate_genotype_call_dataset(1000, n_samples)
call_g, _ = _collapse_ploidy(g)
pcs = PCA(call_g, ncomp=2).loadings
@eric-czech (Collaborator) commented Sep 12, 2020

I took a look at the statsmodels code/docs and this would give the eigenvectors from a decomposition on input data where the samples are features rather than observations. It happens to have the same dimensions, but it is not the same as the projection of the samples into the PC space implied by a decomposition on the transposed inputs (i.e. samples x variants rather than variants x samples like call_g). It's confusing because sometimes people refer to those projections as "loadings" as well, but "scores" is more common, and the pcs should be scores rather than loadings -- I haven't checked GENESIS but I took a look at Hail to be certain. It also doesn't really make sense to use scaled eigenvectors from the transposed problem for anything unless those eigenvectors come from left-singular vectors instead of right-singular vectors (they'll be right-singular when coming from PCA).

I would suggest using either one of these to generate scores instead:

# `x` would have shape (samples, variants) in these cases
import numpy as np
from sklearn.decomposition import PCA

def pc_scores_numpy(x, k):
    # Center each variant (column), then take the top-k sample scores U * S
    x = x - x.mean(axis=0, keepdims=True)
    u, s, v = np.linalg.svd(x, full_matrices=False)
    r = u * s[np.newaxis]
    return r[:, :k]

def pc_scores_sklearn(x, k):
    # Equivalent scores via scikit-learn's PCA (exact/full SVD solver)
    return PCA(n_components=k, svd_solver='full').fit_transform(x)

# Then in the test:
# pcs = pc_scores(call_g.T.values, 2)
# ds["sample_pcs"] = (("components", "samples"), pcs.T)

This was all fresh since I was doing a lot of work with SVD this week, but I wanted to double check the above as well as the usual "loadings" definitions so I verified in this gist: https://gist.github.com/eric-czech/52a04007f4cfb8f05b312079ca770add

Collaborator:

Oh also I found that the statsmodels.PCA was super slow even with small data, so those definitely speed things up a lot.

@ravwojdyla (Collaborator, Author) commented Sep 12, 2020

Oh dang, I remember I meant to validate that, thanks for the catch! Also, I did not want to add sklearn as another dependency, but since it's a dev/test dependency, that was a premature optimisation on my part. Fixed, used sklearn (and added it to requirements-dev.txt).

data_np = phi.pc_relate_phi.data.compute() # to be able to use fancy indexing below
upper_phi = data_np[np.triu_indices_from(data_np, 1)]
assert (upper_phi > -0.5).all() and (upper_phi < 0.5).all()

@eric-czech (Collaborator) commented Sep 12, 2020

I think a useful test here would be to try to at least pick out some parent/child relationships since they're fairly easy to simulate. I was tinkering with this a bit and it snowballed into this:

# Create a dataset that is 2/3 founders and 1/3 progeny
rs = np.random.RandomState(1)
ds = simulate_genotype_call_dataset(1000, 300, seed=1)
ds['sample_type'] = xr.DataArray(np.repeat(['mother', 'father', 'child'], 100), dims='samples')
sample_groups = ds.groupby('sample_type').groups

def simulate_new_generation(ds):
    # Generate progeny genotypes as a combination of randomly
    # selected haplotypes from each parent
    idx = sample_groups['mother'] + sample_groups['father']
    gt = ds.call_genotype.isel(samples=idx).values
    idx = rs.randint(0, 2, size=gt.shape[:2])
    # Collapse to haplotype across ploidy dim using indexer
    # * shape = (samples, variants)
    ht = gt[np.ix_(*map(range, gt.shape[:2])) + (idx,)].T
    gt_child = np.stack([ht[sample_groups[t]] for t in ['mother', 'father']]).T
    ds['call_genotype'].values = np.concatenate((gt, gt_child), axis=1)
    return ds

# Redefine the progeny genotypes
ds = simulate_new_generation(ds)

# Infer kinship
call_g, _ = _collapse_ploidy(ds)
pcs = pc_scores(call_g.T.values, 1)
ds["sample_pcs"] = (("components", "samples"), pcs.T)
ds["pc_relate_phi"] = pc_relate(ds)["pc_relate_phi"].compute()

# Check that all coefficients are in expected ranges
cts = (
    ds["pc_relate_phi"].to_series().reset_index()
    .pipe(lambda df: df.loc[df.sample_x >= df.sample_y]['pc_relate_phi'])
    .pipe(
        pd.cut, 
        bins=[p for phi in [0, .25, .5] for p in [phi-.1, phi+.1]],
        labels = ['unrelated', 'unclassified', 'parent/child', 'unclassified', 'self'],
        ordered=False
    )
    .value_counts()
)
print(cts)
# unrelated       44650
# self              300
# parent/child      200
# unclassified        0
assert cts['parent/child'] == len(sample_groups['child']) * 2
assert cts['self'] == ds.dims['samples']
assert cts['unclassified'] == 0

That produces kinship estimates with pretty well segregated distributions, which you can tighten up a bit by adding more variants to the simulation:

[Screenshot: distributions of the kinship estimates]

Collaborator Author:

Thanks Eric!!! I initially did not want to implement this kind of test since all those cases were covered by #242 (on real data). I added your test anyway.

mergify bot dismissed eric-czech’s stale review September 12, 2020 19:24

Pull request has been modified.

@ravwojdyla (Collaborator, Author) commented Sep 12, 2020

Weird, test_read_vcfzarr is failing 🤔:

ValueError: Could not convert tuple of form (dims, data[, attrs, encoding]): (['variants'], False) to Variable.

Let's see if it's a persistent error or a flaky test:

@ravwojdyla (Collaborator, Author) commented Sep 12, 2020

Ah, interesting: dask 2.26.0 was released yesterday and it breaks our test_read_vcfzarr. The same test works fine on 2.25.0.

eric-czech previously approved these changes Sep 12, 2020
mergify bot (Contributor) commented Sep 14, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict (PR conflict) label Sep 14, 2020
mergify bot dismissed eric-czech’s stale review September 14, 2020 09:30

Pull request has been modified.

mergify bot removed the conflict (PR conflict) label Sep 14, 2020
tomwhite previously approved these changes Sep 14, 2020


return call_g, call_g_mask


def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
Collaborator:

We should add the merge=True parameter, but that can be done separately.

Collaborator Author:

added

DIM_ALLELE,
DIM_PLOIDY,
DIM_SAMPLE,
Collaborator:

Nit: why did this change?

Collaborator Author:

Must have been isort doing something weird; reverted to the original.

mergify bot (Contributor) commented Sep 14, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict (PR conflict) label Sep 14, 2020
mergify bot dismissed tomwhite’s stale review September 14, 2020 10:20

Pull request has been modified.

mergify bot removed the conflict (PR conflict) label Sep 14, 2020
@ravwojdyla (Collaborator, Author):

@tomwhite I also added pc_relate to the API docs and fixed some rst issues.

mergify bot (Contributor) commented Sep 14, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict (PR conflict) label Sep 14, 2020
mergify bot removed the conflict (PR conflict) label Sep 14, 2020
@ravwojdyla (Collaborator, Author):

@tomwhite is this good to merge?

@tomwhite (Collaborator):

From my point of view, yes. Did @eric-czech approve this one?

@eric-czech (Collaborator):

> Did @eric-czech approve this one?

I did but it looks like mergify dismissed it. Re-approving now.

tomwhite added the auto-merge (Auto merge label for mergify test flight) label Sep 14, 2020
mergify bot merged commit bec1616 into sgkit-dev:master Sep 14, 2020
ravwojdyla deleted the rav/pc_relate branch September 14, 2020 11:49