GenomicsDBImport should issue a warning when a large number of intervals is used #5066
Comments
@kgururaj As I start to think about upgrading exome joint calling to use GenomicsDBImport, the 100-interval threshold seems like it might be problematic. I've been working with WGS data, so I don't have much intuition for benchmarking with missing data. Is there any performance downside to running over larger intervals that include missing data? For example, if we want to scatter the exome 50 ways, each subset of the exome interval list will have ~4000 intervals, but the GVCFs won't have data outside those intervals. Does it make sense to pass to GenomicsDBImport a single interval encompassing all of those?
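To make the "single interval encompassing all of those" idea concrete, here is a minimal sketch in Python. It is not part of GATK: the function name `spanning_intervals`, the tuple layout, and the example coordinates are all illustrative assumptions. It only shows how a shard's interval list could be collapsed to one spanning interval per contig, and says nothing about whether doing so is advisable for performance.

```python
from collections import OrderedDict


def spanning_intervals(intervals):
    """Collapse (contig, start, end) tuples into one spanning interval per contig.

    Hypothetical helper for illustration only. Returns strings in the
    "contig:start-end" form, keeping contigs in first-seen order.
    """
    spans = OrderedDict()
    for contig, start, end in intervals:
        if contig in spans:
            lo, hi = spans[contig]
            # Widen the existing span so it covers this interval too.
            spans[contig] = (min(lo, start), max(hi, end))
        else:
            spans[contig] = (start, end)
    return [f"{contig}:{lo}-{hi}" for contig, (lo, hi) in spans.items()]


# Made-up coordinates standing in for one shard of an exome interval list.
exome_shard = [
    ("chr1", 65509, 65625),
    ("chr1", 69037, 70008),
    ("chr2", 41608, 41627),
]
print(spanning_intervals(exome_shard))
# -> ['chr1:65509-70008', 'chr2:41608-41627']
```

Note that the spanning interval covers all the gaps between targets, so any GVCF records that do fall in those gaps would also be imported.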
One admittedly completely degenerate case I tried: I imported a single GVCF that contained a single site that spanned a single locus, with a single sample, and specified 1000 small intervals, none of which overlapped the variant. The import takes a few minutes, but running SelectVariants on that workspace, with no intervals specified, takes about 30 minutes on my laptop to return the empty VCF. If I do the same thing but with a single interval at import time, the query takes a couple of seconds.
Yep, same issue - opening and closing TileDB arrays which contain no data incurs the overhead of directory scans without any tangible benefit.
* Some speedup by eliminating a ls operation
When a large number of intervals is specified at import time, a large number of arrays are created, which can exhaust the available open file handles. In addition, my informal tests indicate that querying a workspace created from an import that used a large number of intervals is pretty slow. @kgururaj suggests we might want to issue a warning at a threshold of 100 intervals. See discussion in #4997.
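As a rough illustration of the proposed check (not the actual GATK implementation, which is Java), a threshold warning could look like the Python sketch below. The names `INTERVAL_WARNING_THRESHOLD` and `warn_if_too_many_intervals` are hypothetical, and treating the threshold as "strictly more than 100" is an assumption; only the figure of 100 comes from the discussion.

```python
import warnings

# Threshold suggested in the discussion. Whether the check should be ">" or
# ">=" is not specified there; strict ">" is an assumption made here.
INTERVAL_WARNING_THRESHOLD = 100


def warn_if_too_many_intervals(intervals, threshold=INTERVAL_WARNING_THRESHOLD):
    """Emit a warning when the interval count exceeds the threshold.

    Hypothetical helper. Returns True if a warning was issued, else False.
    """
    count = len(intervals)
    if count > threshold:
        warnings.warn(
            f"{count} intervals were specified (threshold: {threshold}). "
            "Each interval creates its own array, which can exhaust open file "
            "handles and slow down later queries on the workspace."
        )
        return True
    return False


# Example: 1000 small placeholder intervals, as in the degenerate test case.
many = [f"chr1:{i * 10 + 1}-{i * 10 + 5}" for i in range(1000)]
warn_if_too_many_intervals(many)
```

The check only counts intervals; it does not inspect whether they contain data.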