-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bottleneck in terms of time consuming of HWE benchmark #736
Comments
One issue here is that I believe the following line leads to two passes over the raw genotype data when the eager computation is forced in You should try this instead: ds = (
ds
.assign(**ds[['variant_allele_frequency', 'variant_hwe_p_value']].compute())
.pipe(lambda ds: ds.sel(variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6))))
) Here's a reproducible example to motivate it: import xarray as xr
import numpy as np
from dask.diagnostics import Profiler
ds = (
xr.Dataset(data_vars=dict(
x=(('a',), np.arange(1000)),
y=(('a',), np.arange(1000))
))
.chunk(100)
.assign(z1=lambda ds: ds['x'] + ds['y'])
.assign(z2=lambda ds: ds['x'] * ds['y'])
)
with Profiler() as prof:
ds2 = (
ds.assign(**ds[['z1', 'z2']].compute())
.pipe(lambda ds: ds.sel(a=(ds['z1'] > 0) & (ds['z2'] > 0)))
)
f'Number of tasks = {len(prof.results)}'
# Number of tasks = 20
with Profiler() as prof:
ds3 = ds.sel(a=(ds['z1'] > 0) & (ds['z2'] > 0))
f'Number of tasks = {len(prof.results)}'
# Number of tasks = 50
xr.testing.assert_equal(ds2, ds3) The key difference there is that calling compute explicitly leads to (I think in this case) one pass over the array chunks in Also, for the benchmark I would suggest doing something like |
Thanks @eric-czech ! Your explanation and code samples make much sense to me. Follow your suggestion, I tried to have a looked at dask graph to have more sense. However, I cannot see any task graph in |
A good news is that this method has reduced |
How long, approximately, does it take now start to finish for sgkit vs hail (where both are writing .mt or .zarr files respectively)? |
First, I just noticed recently, Hail can not write p_value/AF to .mt, features like p_value and AF are 'Expression' type/class in their implemention, with export() function. The output is .tsv, which is human readable text file. My experiment: And 1kg is 70x larger than chr21 in their original .vcf format. |
What I'm currently doing is to make them both doing eager loading/computation, instead of building the task graph and waiting the final triggering of compution in final output stage. So in this way, ideally I can compare the breakdown of their data loading, execuation and output writing time. |
Ok great! Making progress then.
I think we should avoid this -- I like the idea but there's no good way to break that out for a fair comparison in both dask and spark since the amount of work they do in each of those stages will vary so much. It would be sufficient to just measure the execution time for the whole operation, including I/O. |
Yes, forcing everything to be eager seems like a bad measure of how things would work in practise, since this is something we'd discourage. |
Closing old issue |
HWE benchmark is doing the similar things to GWAS tutorial. Because a bottleneck in sgkit dataset final selection, I can only run the benchmark on chromosome 21(201MB in original vcf) instead of chr1-22(14GB in original vcf).
Sgkit is doing:
Hail is doing:
PLINK has flags:
--hardy --hwe 1e-6 --maf 0.01
The three lines of sgkit take (0.067585, 1.688614, 375.0212) seconds respectively, when using 16 core,
sel()
takes a lot and Hail'sfilter_rows()
only take 0.010368 seconds.Sgkit's runtime is dominating by
xarray.Dataset.sel()
function, and much more than Hail and PLINK take. Do you have an idea, what could be an equivalent but more efficient syntax to do the same selection here?I research a bit about xarray, sel(), it seems to have been inefficient for years with an open issue (
sel()
callisel()
inside).variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6))
is indexing with boolean arrays, and might not be the preferred way bysel()
.The text was updated successfully, but these errors were encountered: