
GenomicsDBImport should issue a warning when a large number of intervals is used #5066

Closed
cmnbroad opened this issue Jul 27, 2018 · 4 comments

@cmnbroad
Collaborator

cmnbroad commented Jul 27, 2018

When a large number of intervals is specified at import time, a large number of arrays are created, which can lead to exhausting the available open file handles. In addition, my informal tests indicate that querying a workspace created from an import that used a large number of intervals is pretty slow. @kgururaj suggests we might want to issue a warning at a threshold of 100 intervals. See the discussion in #4997.
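As an editorial aside, the suggested advisory is simple to state. Here is a minimal, hypothetical Java sketch of the check (this is not the actual GenomicsDBImport code; the class, method, and logging mechanism are placeholders, and only the 100-interval threshold comes from this discussion):

```java
import java.util.List;

// Minimal sketch of the suggested advisory, not the actual GATK implementation.
// Only the 100-interval threshold value comes from this thread.
public final class IntervalCountAdvisory {

    private static final int INTERVAL_COUNT_WARNING_THRESHOLD = 100;

    public static void warnIfTooManyIntervals(final List<String> intervals) {
        if (intervals.size() > INTERVAL_COUNT_WARNING_THRESHOLD) {
            // Stand-in for the tool's logging mechanism.
            System.err.printf(
                "WARNING: %d intervals were specified. GenomicsDBImport creates one array per "
                + "interval, which can exhaust open file handles and slow down later queries; "
                + "consider merging adjacent intervals into fewer, larger ones.%n",
                intervals.size());
        }
    }
}
```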

@ldgauthier
Contributor

@kgururaj As I start to think about upgrading exome joint calling to use GenomicsDBImport, the 100-interval threshold seems like it might be problematic. I've been working with WGS data, so I don't have much intuition for benchmarking with missing data. Is there any performance downside to running over larger intervals that include missing data? For example, if we want to scatter the exome 50 ways, each subset of the exome interval list will have ~4000 intervals, but the GVCFs won't have data outside those intervals. Does it make sense to pass GenomicsDBImport a single interval encompassing all of those?
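As an editorial illustration of the "single encompassing interval" idea proposed here, the collapse could look something like the following hypothetical Java sketch (not GATK code; the Interval record and method names are stand-ins for whatever interval representation is in use): per shard, reduce the interval list to one spanning interval per contig.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch, not GATK code: collapse a shard's many small intervals into one
// spanning interval per contig before passing them to GenomicsDBImport.
public final class SpanningIntervals {

    record Interval(String contig, int start, int end) {}

    static List<Interval> collapsePerContig(final List<Interval> intervals) {
        // Track the minimum start and maximum end seen on each contig.
        final Map<String, int[]> spans = new LinkedHashMap<>();
        for (final Interval i : intervals) {
            spans.merge(i.contig(), new int[]{i.start(), i.end()},
                    (a, b) -> new int[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
        final List<Interval> collapsed = new ArrayList<>();
        spans.forEach((contig, span) -> collapsed.add(new Interval(contig, span[0], span[1])));
        return collapsed;
    }
}
```

Whether the empty space inside such a spanning interval carries any cost is exactly the question answered in the next comment.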

@kgururaj
Collaborator

kgururaj commented Aug 6, 2018

  • Large number of open file handles: this was an issue in TileDB that was fixed as part of the restructuring @nalinigans did to support HDFS/S3/GCS ("Allow for hdfs and gcs URI's to be passed to GenomicsDB", #5017). I was too lazy to fix this again. If it's going to take some time for PR #5017 to be merged, I can submit a separate fix for this, which would address any crashes/termination issues.
  • Performance of a single import process with a large number of intervals
    • Restating the obvious, but this is a single process (and by default, a single thread) with many intervals to import. As you increase the number of samples, this will become a performance pain point.
    • More important than the number of intervals is the amount of data imported per interval. Each interval import involves opening the VCF files (loading index structures while creating FeatureReader objects), writing to TileDB/GenomicsDB, and closing the VCF file handles (destroying FeatureReader objects); this per-interval cost structure is sketched after this list. If the amount of data written for each interval is sufficiently large, the cost of opening/closing the VCF files (creating/destroying FeatureReaders) is small relative to the total time taken.
    • In the test cases Chris and I were trying, the amount of data written per interval was small (or zero in many cases), so the time spent opening/closing the VCF files (and loading/destroying the index) dominated the total time.
  • For a single import process (single thread), creating one large interval is better (or no worse) than passing several small intervals: TileDB/GenomicsDB has zero overhead for regions with no data (for example, WES gVCFs). Having larger intervals will likely avoid the issues described above, so an advisory message would be beneficial.
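To make the cost argument in the second bullet concrete, here is a hypothetical Java sketch of the per-interval loop (not the actual GenomicsDBImport/GenomicsDB code; all type and method names are placeholders). The open/close work is repeated for every interval and every input gVCF regardless of how much data the interval contains, which is why it dominates when intervals are small or empty.

```java
import java.util.List;

// Illustrative sketch of the per-interval cost structure; Reader, GvcfInput, Interval,
// and writeToArray are hypothetical stand-ins, not the real importer types.
public final class PerIntervalImportSketch {

    interface Reader extends AutoCloseable {
        Iterable<String> query(String contig, int start, int end);
        @Override void close();
    }

    interface GvcfInput {
        Reader open(); // fixed cost: open the file and load its index on every call
    }

    record Interval(String contig, int start, int end) {}

    static void importAll(final List<Interval> intervals, final List<GvcfInput> gvcfs) {
        for (final Interval interval : intervals) {
            for (final GvcfInput gvcf : gvcfs) {
                try (final Reader reader = gvcf.open()) {              // fixed per-interval cost
                    for (final String record :
                            reader.query(interval.contig(), interval.start(), interval.end())) {
                        writeToArray(interval, record);                // variable cost: scales with data
                    }
                }                                                      // fixed cost: reader closed here
            }
        }
    }

    private static void writeToArray(final Interval interval, final String record) {
        // Stand-in for the TileDB/GenomicsDB write path.
    }
}
```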

@cmnbroad
Collaborator Author

cmnbroad commented Aug 6, 2018

One admittedly completely degenerate case I tried: I imported a single gvcf containing a single site spanning a single locus, from a single sample, and specified 1000 small intervals, none of which overlapped the variant. The import takes a few minutes, but running SelectVariants on that workspace, with no intervals specified, takes about 30 minutes on my laptop to return the empty vcf.

If I do the same thing but with a single interval at import time, the query takes a couple of seconds.

@kgururaj
Collaborator

kgururaj commented Aug 7, 2018

Yep, same issue - opening and closing TileDB arrays which contain no data incurs the overhead of directory scans without any tangible benefit.
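As a rough editorial illustration of why this hurts at query time, consider the following toy Java sketch (this is not how TileDB/GenomicsDB is implemented; the one-directory-per-array workspace layout is an assumption for illustration only): with one array per import interval, a query that opens every array pays one directory scan per interval whether or not the array contains any data.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy illustration, not TileDB/GenomicsDB code: count the directory scans a query would
// pay if it opened one array directory per import interval inside a workspace.
public final class ArrayScanCost {

    static long countArrayScans(final Path workspace) throws IOException {
        long scans = 0;
        try (final DirectoryStream<Path> arrays = Files.newDirectoryStream(workspace)) {
            for (final Path array : arrays) {
                if (Files.isDirectory(array)) {
                    // Opening an array means listing its contents; for an empty array nothing
                    // useful is found, but the scan cost is still paid.
                    try (final DirectoryStream<Path> contents = Files.newDirectoryStream(array)) {
                        for (final Path entry : contents) {
                            // No-op: only the traversal cost matters for this illustration.
                        }
                    }
                    scans++;
                }
            }
        }
        return scans;
    }
}
```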

@cmnbroad cmnbroad self-assigned this Aug 13, 2018
kgururaj added a commit that referenced this issue Aug 14, 2018
* Some speedup by eliminating a ls operation
droazen pushed a commit that referenced this issue Aug 14, 2018
* Some speedup by eliminating a ls operation