
GenomicsDBImport should issue a warning when a large number of intervals is used #5066

Closed
cmnbroad opened this issue Jul 27, 2018 · 4 comments

@cmnbroad
Collaborator

cmnbroad commented Jul 27, 2018

When a large number of intervals is specified at import time, a large number of arrays are created, which can lead to exhausting the available open file handles. In addition, my informal tests indicate that querying a workspace created from an import that used a large number of intervals is pretty slow. @kgururaj suggests we might want to issue a warning at a threshold of 100 intervals. See the discussion in #4997.
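As an editorial aside, the suggested advisory is simple to state. Here is a minimal, hypothetical Java sketch of the check (this is not the actual GenomicsDBImport code; the class, method, and logging mechanism are placeholders, and only the 100-interval threshold comes from this discussion):

```java
import java.util.List;

// Minimal sketch of the suggested advisory, not the actual GATK implementation.
// Only the 100-interval threshold value comes from this thread.
public final class IntervalCountAdvisory {

    private static final int INTERVAL_COUNT_WARNING_THRESHOLD = 100;

    public static void warnIfTooManyIntervals(final List<String> intervals) {
        if (intervals.size() > INTERVAL_COUNT_WARNING_THRESHOLD) {
            // Stand-in for the tool's logging mechanism.
            System.err.printf(
                "WARNING: %d intervals were specified. GenomicsDBImport creates one array per "
                + "interval, which can exhaust open file handles and slow down later queries; "
                + "consider merging adjacent intervals into fewer, larger ones.%n",
                intervals.size());
        }
    }
}
```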

@ldgauthier
Contributor

@kgururaj As I start to think about upgrading exome joint calling to use GenomicsDBImport, the 100-interval threshold seems like it might be problematic. I've been working with WGS data, so I don't have much intuition for benchmarking with missing data. Is there any performance downside to running over larger intervals that include missing data? For example, if we want to scatter the exome 50 ways, each subset of the exome interval list will have ~4000 intervals, but the GVCFs won't have data outside those intervals. Does it make sense to pass GenomicsDBImport a single interval encompassing all of those?
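As an editorial illustration of the "single encompassing interval" idea proposed here, the collapse could look something like the following hypothetical Java sketch (not GATK code; the Interval record and method names are stand-ins for whatever interval representation is in use): per shard, reduce the interval list to one spanning interval per contig.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch, not GATK code: collapse a shard's many small intervals into one
// spanning interval per contig before passing them to GenomicsDBImport.
public final class SpanningIntervals {

    record Interval(String contig, int start, int end) {}

    static List<Interval> collapsePerContig(final List<Interval> intervals) {
        // Track the minimum start and maximum end seen on each contig.
        final Map<String, int[]> spans = new LinkedHashMap<>();
        for (final Interval i : intervals) {
            spans.merge(i.contig(), new int[]{i.start(), i.end()},
                    (a, b) -> new int[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        }
        final List<Interval> collapsed = new ArrayList<>();
        spans.forEach((contig, span) -> collapsed.add(new Interval(contig, span[0], span[1])));
        return collapsed;
    }
}
```

Whether the empty space inside such a spanning interval carries any cost is exactly the question answered in the next comment.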

@kgururaj
Collaborator

kgururaj commented Aug 6, 2018

  • Large number of open file handles: this was an issue in TileDB that was fixed as part of the restructuring @nalinigans did to support HDFS/S3/GCS ("Allow for hdfs and gcs URI's to be passed to GenomicsDB", #5017). I was too lazy to fix this again. If it's going to take some time for PR #5017 to be merged, I can submit a separate fix for this, which would address any crashes/termination issues.
  • Performance of a single import process with a large number of intervals
    • Restating the obvious, but this is a single process (and by default, a single thread) with many intervals to import. As you increase the number of samples, this will become a performance pain point.
    • More important than the number of intervals is the amount of data imported per interval. Each interval import involves opening the VCF files (loading index structures while creating FeatureReader objects), writing to TileDB/GenomicsDB, and closing the VCF file handles (destroying FeatureReader objects); this per-interval cost structure is sketched after this list. If the amount of data written for each interval is sufficiently large, the cost of opening/closing the VCF files (creating/destroying FeatureReaders) is small relative to the total time taken.
    • In the test cases Chris and I were trying, the amount of data written per interval was small (or zero in many cases), so the time spent opening/closing the VCF files (and loading/destroying the index) dominated the total time.
  • For a single import process (single thread), creating one large interval is better (or no worse) than passing several small intervals: TileDB/GenomicsDB has zero overhead for regions with no data (for example, WES gVCFs). Having larger intervals will likely avoid the issues described above, so an advisory message would be beneficial.
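To make the cost argument in the second bullet concrete, here is a hypothetical Java sketch of the per-interval loop (not the actual GenomicsDBImport/GenomicsDB code; all type and method names are placeholders). The open/close work is repeated for every interval and every input gVCF regardless of how much data the interval contains, which is why it dominates when intervals are small or empty.

```java
import java.util.List;

// Illustrative sketch of the per-interval cost structure; Reader, GvcfInput, Interval,
// and writeToArray are hypothetical stand-ins, not the real importer types.
public final class PerIntervalImportSketch {

    interface Reader extends AutoCloseable {
        Iterable<String> query(String contig, int start, int end);
        @Override void close();
    }

    interface GvcfInput {
        Reader open(); // fixed cost: open the file and load its index on every call
    }

    record Interval(String contig, int start, int end) {}

    static void importAll(final List<Interval> intervals, final List<GvcfInput> gvcfs) {
        for (final Interval interval : intervals) {
            for (final GvcfInput gvcf : gvcfs) {
                try (final Reader reader = gvcf.open()) {              // fixed per-interval cost
                    for (final String record :
                            reader.query(interval.contig(), interval.start(), interval.end())) {
                        writeToArray(interval, record);                // variable cost: scales with data
                    }
                }                                                      // fixed cost: reader closed here
            }
        }
    }

    private static void writeToArray(final Interval interval, final String record) {
        // Stand-in for the TileDB/GenomicsDB write path.
    }
}
```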

@cmnbroad
Collaborator Author

cmnbroad commented Aug 6, 2018

One admittedly completely degenerate case I tried: I imported a single gvcf containing a single site spanning a single locus, from a single sample, and specified 1000 small intervals, none of which overlapped the variant. The import takes a few minutes, but running SelectVariants on that workspace, with no intervals specified, takes about 30 minutes on my laptop to return the empty vcf.

If I do the same thing but with a single interval at import time, the query takes a couple of seconds.

@kgururaj
Collaborator

kgururaj commented Aug 7, 2018

Yep, same issue - opening and closing TileDB arrays which contain no data incurs the overhead of directory scans without any tangible benefit.
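As a rough editorial illustration of why this hurts at query time, consider the following toy Java sketch (this is not how TileDB/GenomicsDB is implemented; the one-directory-per-array workspace layout is an assumption for illustration only): with one array per import interval, a query that opens every array pays one directory scan per interval whether or not the array contains any data.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy illustration, not TileDB/GenomicsDB code: count the directory scans a query would
// pay if it opened one array directory per import interval inside a workspace.
public final class ArrayScanCost {

    static long countArrayScans(final Path workspace) throws IOException {
        long scans = 0;
        try (final DirectoryStream<Path> arrays = Files.newDirectoryStream(workspace)) {
            for (final Path array : arrays) {
                if (Files.isDirectory(array)) {
                    // Opening an array means listing its contents; for an empty array nothing
                    // useful is found, but the scan cost is still paid.
                    try (final DirectoryStream<Path> contents = Files.newDirectoryStream(array)) {
                        for (final Path entry : contents) {
                            // No-op: only the traversal cost matters for this illustration.
                        }
                    }
                    scans++;
                }
            }
        }
        return scans;
    }
}
```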

@cmnbroad cmnbroad self-assigned this Aug 13, 2018
kgururaj added a commit that referenced this issue Aug 14, 2018
* Some speedup by eliminating a ls operation
droazen pushed a commit that referenced this issue Aug 14, 2018
* Some speedup by eliminating a ls operation