[WIP] count_allele_calls #114
Conversation
Thanks for picking this up @timothymillar! I've been hesitant to wade into numba for things like counting (put a few thoughts at https://github.com/pystatgen/sgkit/issues/49#issuecomment-674561391) but I agree it's probably the right choice here. It's a shame that there doesn't seem to be a good way to do this with Dask/Xarray, but if we do lose the freedom in array backends it would be nice to still support CPU/GPU easily. Have you thought at all about using a gufunc for this, which would make it easier to compile to either? I think the whole module could collapse down to something like this in that case, where using the sentinel missing value also simplifies things vs using the separate mask variable:

```python
@numba.guvectorize(['void(numba.int8[:], numba.int32[:], numba.int32[:])'], '(n),(k)->(k)')
def count_alleles(x, _, out):
    out[:] = 0
    for v in x:
        if v >= 0:
            out[v] += 1

def count_call_alleles(ds) -> DataArray:
    G = da.asarray(ds.call_genotype)
    # This array is only necessary to tell dask/numba what the
    # dimensions and dtype are for the output array
    O = da.empty(G.shape[:2] + (ds.dims['alleles'],), dtype='int32')
    O = O.rechunk(G.chunks[:2] + (-1,))
    return xr.DataArray(
        count_alleles(G, O),
        dims=('variants', 'samples', 'alleles'),
        name='call_allele_count'
    )

def count_variant_alleles(ds) -> DataArray:
    return (
        count_call_alleles(ds)
        .sum(dim='samples')
        .rename('variant_allele_count')
    )
```

I'd be a proponent of simply removing the mask variable.
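As an editor's usage sketch (assuming the `count_alleles` gufunc above is defined): the `(n),(k)->(k)` signature makes numba loop over all leading dimensions, so a `(variants, samples, ploidy)` genotype array maps directly to `(variants, samples, alleles)` counts, with the dummy `(k,)` argument fixing the number of alleles:

```python
import numpy as np

g = np.array([[[0, 1], [1, 1]]], dtype=np.int8)  # 1 variant, 2 samples, diploid
k = np.empty(2, dtype=np.int32)                  # dummy (k,) arg fixing the allele count
counts = count_alleles(g, k)                     # shape (1, 2, 2)
print(counts)  # [[[1 1]   sample 0 carries one copy of each allele
               #   [0 2]]] sample 1 is homozygous for allele 1
```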
Good idea, I'll give it a go
I was a little uncertain of the application of the mask, as there seem to be two options when it comes to a user masking out additional values (which I assume is an intended feature at some point). Essentially, should functions working on a Dataset trust the mask or the sentinel values?
The mask is only a convenience on top of the sentinel values.
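For context, a hedged sketch of the two representations under discussion (the variable names follow sgkit's conventions; exact semantics in the library may differ). Missing calls are encoded with the sentinel -1 in `call_genotype`, and the boolean mask is derivable from it:

```python
import numpy as np

# Sentinel representation: -1 marks a missing allele call.
call_genotype = np.array([[[0, 1], [-1, 0]]], dtype=np.int8)

# Mask representation: a boolean convenience layer over the sentinels.
call_genotype_mask = call_genotype < 0  # True where the call is missing
```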
We haven't talked about that yet (feel free to file an issue), but afaik Dask still doesn't support assignment, so I see that process as: [...]
@eric-czech The gufunc version is working nicely but I'm having some issues satisfying the CI with the use of `numba`. A couple of notes about the implementation: [...]
Nice @timothymillar! This looks great.
Perfect, makes sense.
Me neither. I started using the Dask style like that but then switched to referencing our [...]
Good call!
Hmm, I do not. Does it go away if you do [...]?
```python
n_alleles = ds.dims["alleles"]
G = da.asarray(ds["call_genotype"])
shape = (G.chunks[0], G.chunks[1], n_alleles)
N = da.empty(n_alleles, dtype=np.uint8)
```
Good idea using this instead! For the subsequent map_blocks call it might be a good idea to ensure this has only a single chunk so that the blocks broadcast. I can't imagine why anyone would ever configure the default chunk size to be so small that `n_alleles` items would result in multiple chunks, but guarding against it is probably a good idea.
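A minimal sketch of such a guard, reusing the names from the diff above (the dummy template is never filled, so forcing it to one chunk is cheap):

```python
N = da.empty(n_alleles, dtype=np.uint8).rechunk(-1)  # guarantee a single chunk
```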
> I can't imagine why anyone would ever configure the default chunk size to be so small that `n_alleles` items would result in multiple chunks
Me neither, but this edge case should be handled correctly by this PR; see below.
Ah I see. Thoughts on a `da.asarray(ds["call_genotype"].chunk(chunks=dict(ploidy=-1)))` then? I think ploidy chunks of size one could actually be pretty common when concatenating haplotype arrays.
I could also see the argument for leaving it up to the user to interpret those error messages and rechunk themselves so that they have to think through the performance implications, but I had already started being defensive about that in the GWAS regression methods. My rationale was that it's better to optimize for a less frustrating user experience than for making performance concerns prominent (and we could always add warnings for that later).
Ohh nm, I see why it doesn't matter now (and the test for it). Ignore that.
```python
shape = (G.chunks[0], G.chunks[1], n_alleles)
N = da.empty(n_alleles, dtype=np.uint8)
return xr.DataArray(
    da.map_blocks(count_alleles, G, N, chunks=shape, drop_axis=2, new_axis=2),
```
Out of curiosity, what keeps this from working as `count_alleles(G, N)` instead? Do the block shapes need to be identical?
`count_alleles(G, N)` results in an error if G is chunked in the ploidy dimension. See comment below for an example.
👍
Here's a minimal example explaining the use of `drop_axis` and `new_axis`.

Setup:

```python
import numpy as np
import dask.array as da
from sgkit.stats.aggregation import count_alleles

n_alleles = 2
genotypes = np.array(
    [[[ 0,  0],
      [ 0,  1],
      [ 1,  0]],
     [[-1,  0],
      [ 0, -1],
      [-1, -1]]], dtype=np.int8)
N = da.empty(n_alleles, dtype=np.uint8)
G = da.asarray(genotypes).rechunk((1, 1, 1))  # unlikely chunking
```

(All of the below options work correctly if G is not chunked in dimension 2 (ploidy).)

Option 1: calling the gufunc directly

```python
count_alleles(G, N).compute()
```

This results in an error. I didn't actually explore the use of [...]

Option 2: naive use of `map_blocks`

```python
shape = (G.chunks[0], G.chunks[1], n_alleles)
da.map_blocks(count_alleles, G, N, chunks=shape).compute()
```

This results in incorrect array dimensions.

Option 3: using `drop_axis` and `new_axis`

```python
shape = (G.chunks[0], G.chunks[1], n_alleles)
da.map_blocks(count_alleles, G, N, chunks=shape, drop_axis=2, new_axis=2).compute()
```

Result: the expected counts (worked out below).
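For reference, working the gufunc logic by hand over the setup array gives the counts option 3 should produce:

```python
# variant 0: [0, 0] -> [2, 0];  [0, 1] -> [1, 1];  [1, 0] -> [1, 1]
# variant 1: [-1, 0] -> [1, 0]; [0, -1] -> [1, 0]; [-1, -1] -> [0, 0]
expected = [[[2, 0], [1, 1], [1, 1]],
            [[1, 0], [1, 0], [0, 0]]]
```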
One of those seems to have fixed it, thanks for the help!
sgkit/stats/aggregation.py
```python
    nopython=True,
)
def count_alleles(g: ArrayLike, _: ArrayLike, out: ArrayLike) -> None:
    """Generaliszed U-function for computing per sample allele counts.
```
nit: spelling
Thanks, not my strong point
sgkit/stats/aggregation.py
```python
>>> import sgkit as sg
>>> from sgkit.testing import simulate_genotype_call_dataset
>>> ds = simulate_genotype_call_dataset(n_variant=4, n_sample=2, seed=1)
>>> ds['call_genotype'].to_series().unstack().astype(str).apply('/'.join, axis=1).unstack()  # doctest: +NORMALIZE_WHITESPACE
```
This is a good place for @tomwhite's code in https://github.com/pystatgen/sgkit/pull/58 now fyi.
This is good to go as far as I'm concerned. @tomwhite / @ravwojdyla / @jeromekelleher could one of you take a look as well? Two+ approvals seems appropriate for this one.
Just for reference, the following is a straightforward implementation using only Dask/Xarray operations:

```python
def count_call_alleles(ds):
    G = ds["call_genotype"]
    n_variant, n_allele = G.shape[0], ds.dims["alleles"]
    G = G.expand_dims(dim="alleles", axis=-1)
    I = da.arange(n_allele, dtype='int8')
    A = G == I  # one-hot encoding of alleles
    AC = A.sum(axis=-2)
    return AC
```
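A quick note on why this works: broadcasting `G == I` compares every allele call against every allele index, producing a one-hot encoding, and the -1 sentinel matches no index, so missing calls contribute nothing to the sum. A tiny NumPy illustration of the same trick:

```python
import numpy as np

g = np.array([0, 1, -1], dtype=np.int8)[:, None]  # ploidy as a column vector
i = np.arange(2, dtype=np.int8)                   # allele indices [0, 1]
onehot = g == i            # [[True, False], [False, True], [False, False]]
print(onehot.sum(axis=0))  # allele counts [1 1]; the -1 call is ignored
```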
LGTM too, although I don't understand the details of how this is interacting with numba.
@tomwhite, @ravwojdyla I think we should have one more vote here.
+1 this looks great @timothymillar.
Clever! I bet that's still much faster than [...]. Thanks again for picking this up @timothymillar, nicely done. I'll set it to merge.
@Mergifyio refresh |
Command [...]
See issue #85.

This implements an additional jitted function `count_call_alleles_ndarray_jit` for ndarrays only, rather than doing it all in dask. This approach seems to be in line with the goals outlined here, but I'm happy to replicate the approach of the original `count_alleles` function if that is preferred. Likewise, I can re-write `count_alleles` using this approach, which should improve performance (mainly on chunked arrays due to `njit(..., nogil=True)`).

I haven't added `numba` to `requirements.txt` or `setup.py` because that is done in #76.

I guess add numba now for CI and fix conflict later.
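For illustration only, a minimal sketch of what a jitted ndarray kernel like `count_call_alleles_ndarray_jit` might look like; the actual implementation in this PR may differ. `nogil=True` is what lets Dask's threaded scheduler run the kernel on several chunks in parallel:

```python
import numba
import numpy as np

@numba.njit(nogil=True)
def count_call_alleles_ndarray_jit(g, n_allele):
    # g: (variants, samples, ploidy) int8 array; -1 marks a missing call
    n_variant, n_sample, ploidy = g.shape
    out = np.zeros((n_variant, n_sample, n_allele), dtype=np.uint8)
    for i in range(n_variant):
        for j in range(n_sample):
            for k in range(ploidy):
                a = g[i, j, k]
                if a >= 0:          # skip sentinel values
                    out[i, j, a] += 1
    return out
```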