Add PC Relate #228
Conversation
@eric-czech no implementation/logic changes since #133, just added more tests and docs. Could you please take a look at the tests and let me know if anything more is needed from a test-logic perspective? (I still need to update the dataset for validation, see below.)

I added validation code to produce the validation dataset, but I have an issue: I would really like to use a "production-ish" dataset for validation, and I noticed that the zarr zip store produced from sgkit_plink is actually almost twice the size of the original uncompressed PLINK data (this almost smells like a bug, but I'm not familiar with the sgkit_plink codebase, so maybe it is expected). Do you have recommendations to keep the size for tests as small as possible?

```
(sgkit-dev) ➜ pc_relate git:(rav/pc_relate) ✗ du -sh hapmap_YRI_r23a_filtered.*
37M hapmap_YRI_r23a_filtered.bed
69M hapmap_YRI_r23a_filtered.bim
4.0K hapmap_YRI_r23a_filtered.fam
210M hapmap_YRI_r23a_filtered.zarr.zip
37M hapmap_YRI_r23a_filtered.zip
```

This should not be merged until we figure out a way to make the test data smaller, or maybe store it somewhere else, or generate it as part of the test.
force-pushed from b2bc816 to 0f42bab
Ok I'll take another look. Can you zip up
I'm not surprised, something like https://github.com/pystatgen/sgkit/issues/80 was necessary in the past for me to produce zarrs smaller than uncompressed PLINK, given how wasteful our representation is (we currently use 8 times as many bytes as PLINK does for calls). I would be surprised, though, if you couldn't set Blosc/zstd compressors with a higher clevel and get down to the same size. I have no sense of how well the zip compression works by default.
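For example, a minimal sketch of forcing a zstd codec when writing (the `read_plink` call and codec settings below are assumptions to experiment with, not something I've tested here):

```python
import numcodecs
import zarr
from sgkit_plink import read_plink  # assuming this is how the dataset was produced

ds = read_plink(path="hapmap_YRI_r23a_filtered")

# One int8 per allele is 8x PLINK's 2 bits per diploid genotype, so a
# byte-shuffling zstd codec at a high clevel has a lot of slack to recover.
compressor = numcodecs.Blosc(cname="zstd", clevel=9, shuffle=numcodecs.Blosc.BITSHUFFLE)
encoding = {var: {"compressor": compressor} for var in ds.data_vars}

with zarr.ZipStore("hapmap_YRI_r23a_filtered.zarr.zip", mode="w") as store:
    ds.to_zarr(store, encoding=encoding)
```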
How so? I can't see how they'd be connected unless the data written has a ton more entropy somehow (and is completely wrong) or the data types are getting promoted.
```
Raises
------
ValueError
```
These should be broken out like https://github.com/pystatgen/sgkit/pull/214#discussion_r478502734 (or tested on sphinx with a different bullet -- I'm not sure what would make it happy).
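i.e., something along these lines (a sketch only; the specific conditions listed are illustrative, not the actual ones in this PR):

```python
import xarray as xr

def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
    """...

    Raises
    ------
    ValueError
        If `maf` is not in the interval (0.0, 1.0).
    ValueError
        If a variable required by the method is missing from `ds`.
    """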
@ravwojdyla just wanted to check if you have tried running Sphinx doc for this PR?
```
a locus are IBD. Several of the most common family relationships and their
corresponding kinship coefficient:

+--------------------------------------------------+---------------------+
```
👍
Looks great @ravwojdyla!
Except for the minor docstring change for exceptions raised and saving the test data in a ZipStore, I think it's good to go.
force-pushed from 0f42bab to 7775e58
Pull request has been modified.
I wonder what you think about this solution @eric-czech, @tomwhite, @jeromekelleher, @alimanfoo, @hammer. To properly validate PC Relate, apart from unit tests, I want to use production public data, but:
So here I suggest adding a "Validation suite" GitHub workflow that runs weekly or on manual trigger; this allows me to:
without polluting the git repo. Some points:
Downsides:
I've added 7775e58 to show you how the validation run looks, see the outputs here (this commit would obviously not be merged in, I just wanted to validate the execution and show it to you).
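A rough sketch of what the core of the validation check could look like (the reference paths, variable names, and tolerance here are hypothetical, just to illustrate the shape of the comparison):

```python
import numpy as np
import xarray as xr

from sgkit.stats.pc_relate import pc_relate

# Hypothetical reference kinship matrix produced ahead of time by the
# R GENESIS implementation on the same public dataset.
expected = xr.open_zarr("gs://sgkit-validation/pc_relate/genesis_phi.zarr")

# Assumes the dataset already contains the variables pc_relate needs
# (call_genotype, sample_pcs, ...).
ds = xr.open_zarr("gs://sgkit-validation/pc_relate/hapmap_YRI_r23a_filtered.zarr")
actual = pc_relate(ds)["pc_relate_phi"].compute()

# Hypothetical tolerance; in practice it would be calibrated against GENESIS.
np.testing.assert_allclose(actual.values, expected["phi"].values, atol=1e-4)
```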
This is great @ravwojdyla! Validating the code in sgkit is probably the most important job this script does at the moment, since sgkit's code is changing rapidly (not PC relate specifically - more the other parts). So I think even having that by itself would be valuable. Tracking the R implementation may become onerous, and may not necessarily be something we want to do longer term - but we could easily freeze the version of the package we validate against.
The input data should be fixed, shouldn't it? So going stale isn't a concern. I think it might be preferable to host a copy that we control, for various reasons (reliability, not overloading the upstream source, etc).
Yep, I have not done that because I would actually like to get an error and fail if the upstream gets updated. If we agree that we prefer to freeze versions, that's fine too (we could also freeze everything but the
Yes, the staleness I'm referring to is the codebase; in the past I have seen two ways of doing this:
Normally, if this were not an open source project, I would also lean towards hosting our own data, but since this is the open source realm, I have a slight preference for using the vanilla upstream as another signal that we test on actual public datasets etc. wdyt?
And btw, I would also like to point out an interesting use of git bisect:
Question: how do we know which change exactly introduced the regression/bug?
You can read more about git bisect here. We could document this flow when we get the validation flow in.
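For example, the flow could be automated with `git bisect run` and a small predicate script (a sketch; the script name and test path are just assumptions):

```python
# bisect_check.py -- predicate for: git bisect run python bisect_check.py
# Exit code 0 marks the current checkout as good, non-zero as bad.
import subprocess
import sys

# Re-run the (hypothetical) failing validation/test against this checkout
result = subprocess.run(
    [sys.executable, "-m", "pytest", "sgkit/tests/test_pc_relate.py", "-q"]
)
sys.exit(result.returncode)
```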
I agree with Tom: we don't want to be in the business of tracking upstream code and data. It would be nice for us to just validate against a particular release or hash of an upstream tool and some data fixture produced by that tool.
I also prefer GCS.
Also, going forward, if we like the Validation Suite approach, I will split this PR into PC Relate + unit tests (this PR) and a separate PR with the validation suite + PC Relate validation (everything from this PR related to the Validation Suite).
force-pushed from 7775e58 to 070ce72
As mentioned above I am splitting this PR.
I pulled the code down to experiment with it a bit and it helped me find a thing or two -- nothing major except for the PCA applications in the tests. I think that extra test I proposed would be a good addition too, in order to challenge the implementation a bit more without being slow / overly complicated.
```diff
@@ -0,0 +1,144 @@
from typing import Tuple
```
Add to top level api init?
added
sgkit/stats/pc_relate.py
Outdated
```
----------
ds : `xr.Dataset`
    Dataset containing (S = num samples, V = num variants, D = ploidy, PC = num PC):
    * genotype calls: "call_genotype" (SxVxD)
```
I'm not sure how this will render in sphinx, but it would probably be a good idea to use the same convention as in the results section where the variables are listed the way parameters are (but with more indentation). That seems to render at least somewhat nicely for https://pystatgen.github.io/sgkit/generated/sgkit.gwas_linear_regression.html#sgkit-gwas-linear-regression.
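Something like this, for example (a sketch only; the mask and PC variables are inferred from the tests, not copied from the docstring):

```python
import xarray as xr

def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
    """Compute PC-Relate.

    Parameters
    ----------
    ds : xr.Dataset
        Dataset containing (S = num samples, V = num variants, D = ploidy,
        PC = num PC):

        call_genotype : (SxVxD) array
            Genotype call matrix.
        call_genotype_mask : (SxVxD) array
            Mask for missing genotype calls.
        sample_pcs : (PCxS) array
            Sample PC scores.
    """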
fixed
sgkit/tests/test_pc_relate.py
Outdated
```python
assert g.call_genotype.shape == (1000, 10, 2)
assert g.call_genotype_mask.shape == (1000, 10, 2)

# Sprinkle some tests data, this is a bit verbose, but in tests verbosity is not bad:
```
nit: "Test individual cases" perhaps since "sprinkle some tests data" doesn't make sense (grammar-wise)? Not sure you need the verbosity disclaimer, I certainly agree it's not bad.
fixed, used "Test individual cases"
sgkit/tests/test_pc_relate.py
Outdated
```python
def test_impute_genotype_call_with_variant_mean() -> None:
    g = simulate_genotype_call_dataset(1000, 10, missing_pct=0.1)
    call_g, call_g_mask = _collapse_ploidy(g)
    # Sprinkle some tests data
```
nit: Like above, not quite sure what that's getting at
fixed
sgkit/tests/test_pc_relate.py
Outdated
```python
    n_samples = 100
    g = simulate_genotype_call_dataset(1000, n_samples)
    call_g, _ = _collapse_ploidy(g)
    pcs = PCA(call_g, ncomp=2).loadings
```
I took a look at the statsmodels code/docs, and this would give the eigenvectors from a decomposition on input data where the samples are features rather than observations. It happens to have the same dimensions, but it is not the same as the projection of the samples into the PC space implied by a decomposition on the transposed inputs (i.e. samples x variants rather than variants x samples like `call_g`). It's confusing because sometimes people refer to those projections as "loadings" as well, but "scores" is more common. And the `pcs` should be scores rather than loadings -- I haven't checked GENESIS but I took a look at Hail to be certain. It also doesn't really make sense to use scaled eigenvectors from the transposed problem for anything unless those eigenvectors come from left-singular vectors instead of right-singular vectors (they'll be right-singular when coming from `PCA`).
I would suggest using either one of these to generate scores instead:
```python
import numpy as np
from sklearn.decomposition import PCA

# `x` would have shape (samples, variants) in these cases
def pc_scores_numpy(x, k):
    # Center columns, then scale the left-singular vectors by the
    # singular values to get the sample scores
    x = x - x.mean(axis=0, keepdims=True)
    u, s, v = np.linalg.svd(x, full_matrices=False)
    r = u * s[np.newaxis]
    return r[:, :k]

def pc_scores_sklearn(x, k):
    return PCA(n_components=k, svd_solver='full').fit_transform(x)

# Then in the test:
# pcs = pc_scores(call_g.T.values, 2)
# ds["sample_pcs"] = (("components", "samples"), pcs.T)
```
This was all fresh since I was doing a lot of work with SVD this week, but I wanted to double check the above as well as the usual "loadings" definitions so I verified in this gist: https://gist.github.com/eric-czech/52a04007f4cfb8f05b312079ca770add
Oh also I found that the statsmodels.PCA was super slow even with small data, so those definitely speed things up a lot.
Oh dang, I remember I meant to validate that, thanks for the catch! Also, I did not want to add sklearn as another dependency, but I guess since it's a dev/test dep, that was a premature optimisation on my part. Fixed, used `sklearn` (and added it to `requirements-dev.txt`).
```python
data_np = phi.pc_relate_phi.data.compute()  # to be able to use fancy indexing below
upper_phi = data_np[np.triu_indices_from(data_np, 1)]
assert (upper_phi > -0.5).all() and (upper_phi < 0.5).all()
```
I think a useful test here would be to try to at least pick out some parent/child relationships since they're fairly easy to simulate. I was tinkering with this a bit and it snowballed into this:
```python
import numpy as np
import pandas as pd
import xarray as xr

# Create a dataset that is 2/3 founders and 1/3 progeny
rs = np.random.RandomState(1)
ds = simulate_genotype_call_dataset(1000, 300, seed=1)
ds['sample_type'] = xr.DataArray(np.repeat(['mother', 'father', 'child'], 100), dims='samples')
sample_groups = ds.groupby('sample_type').groups

def simulate_new_generation(ds):
    # Generate progeny genotypes as a combination of randomly
    # selected haplotypes from each parent
    idx = sample_groups['mother'] + sample_groups['father']
    gt = ds.call_genotype.isel(samples=idx).values
    idx = rs.randint(0, 2, size=gt.shape[:2])
    # Collapse to haplotype across ploidy dim using indexer
    # * shape = (samples, variants)
    ht = gt[np.ix_(*map(range, gt.shape[:2])) + (idx,)].T
    gt_child = np.stack([ht[sample_groups[t]] for t in ['mother', 'father']]).T
    ds['call_genotype'].values = np.concatenate((gt, gt_child), axis=1)
    return ds

# Redefine the progeny genotypes
ds = simulate_new_generation(ds)

# Infer kinship
call_g, _ = _collapse_ploidy(ds)
pcs = pc_scores(call_g.T.values, 1)
ds["sample_pcs"] = (("components", "samples"), pcs.T)
ds["pc_relate_phi"] = pc_relate(ds)["pc_relate_phi"].compute()

# Check that all coefficients are in expected ranges
cts = (
    ds["pc_relate_phi"].to_series().reset_index()
    .pipe(lambda df: df.loc[df.sample_x >= df.sample_y]['pc_relate_phi'])
    .pipe(
        pd.cut,
        bins=[p for phi in [0, .25, .5] for p in [phi - .1, phi + .1]],
        labels=['unrelated', 'unclassified', 'parent/child', 'unclassified', 'self'],
        ordered=False,
    )
    .value_counts()
)
print(cts)
# unrelated       44650
# self              300
# parent/child      200
# unclassified        0

assert cts['parent/child'] == len(sample_groups['child']) * 2
assert cts['self'] == ds.dims['samples']
assert cts['unclassified'] == 0
```
That produces kinship estimates with pretty well segregated distributions, which you can tighten up a bit by adding more variants to the simulation.
Thanks Eric!!! I initially didn't want to implement this kind of test since all those cases were covered by #242 (on real data). I added your test anyway.
force-pushed from 070ce72 to 8ff5684
Pull request has been modified.
Weird.
Let's see if it's a persistent error or a flaky test.
force-pushed from 8ff5684 to 4423feb
Ah interesting
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏
force-pushed from a4caa48 to b2a52fd
Pull request has been modified.
sgkit/stats/pc_relate.py
Outdated
```python
    return call_g, call_g_mask


def pc_relate(ds: xr.Dataset, maf: float = 0.01) -> xr.Dataset:
```
We should add the `merge=True` parameter, but that can be done separately.
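i.e., something along these lines (a sketch of the convention only; `_maybe_merge` is an illustrative helper, not the actual sgkit utility):

```python
import numpy as np
import xarray as xr

def _maybe_merge(ds: xr.Dataset, new_ds: xr.Dataset, merge: bool = True) -> xr.Dataset:
    # merge=True: return the input dataset with the result variables merged in,
    # so callers can keep threading a single dataset through a pipeline;
    # merge=False: return only the newly computed variables.
    return ds.merge(new_ds) if merge else new_ds

# Usage sketch: pc_relate would end with `return _maybe_merge(ds, result, merge)`
ds = xr.Dataset({"call_rate": ("samples", np.ones(3))})
result = xr.Dataset({"pc_relate_phi": (("sample_x", "sample_y"), np.zeros((3, 3)))})
print(_maybe_merge(ds, result))
```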
added
sgkit/__init__.py
Outdated
```python
DIM_ALLELE,
DIM_PLOIDY,
DIM_SAMPLE,
```
Nit: why did this change?
must have been isort doing something weird, reverted to the original
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏
force-pushed from b2a52fd to 6b75dcc
Pull request has been modified.
@tomwhite also added
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏
force-pushed from 6b75dcc to 5dca075
@tomwhite is this good to merge?
From my point of view, yes. Did @eric-czech approve this one?
I did but it looks like mergify dismissed it. Re-approving now. |
Need to open a new PR to move to forked repo. Original #133.
Changes:
- add `pc_relate` method
- tests for `pc_relate`
- docs for `pc_relate`
- added `missing_pct` to `simulate_genotype_call_dataset`, to support missing data