Feature request: gridss SV caller #1829
Comments
GPL-3, should be fine.
Daniel is across the street; integration of GRIDSS is on my todo list. It can take split read and other information from existing files, which gets around the micro-pipeline issue. I think.
Paging @d-cameron.
GRIDSS does have a lot of stages: in an attempt to make components reusable so tool authors don't keep re-implementing the same wheels, it uses a modular design. At a high level, it's sorted BAM in, VCF out, just like many other callers. There are a few options for pipeline integration:
The first option is the simplest and probably the best starting point. The other complication is the call out to https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library through a JNI wrapper for the AnnotateInexactHomology stage. I've included a pre-compiled version in the GRIDSS jar, but if it is being packaged with something that has a build script, it's probably worth including the compilation of the JNI SSW binary for the target system in the build script itself. Java 1.8 shouldn't be problematic as it's needed by htsjdk/picard/GATK anyway. GRIDSS can be built from source using maven, or you can just download the jar with dependencies already packaged inside (hence the download is 38Mb). Aside regarding https://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#tuning-cores: you almost never want to give a Java process 32-48Gb of memory, as the inability to use compressed OOPs for heaps over 32Gb effectively means it has less memory available when given 35Gb than when given 31Gb. The break-even point depends on the program, but it is somewhere around 40-48Gb.
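A quick way to see the compressed-OOPs cutoff Daniel describes is to ask the JVM directly; these are standard HotSpot flags, and the 31g/35g heap sizes are just illustrative:

```bash
# Compressed ordinary object pointers are still enabled at this heap size
java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

# Above roughly 32Gb the JVM silently disables them, so object pointers
# double in size and usable memory effectively shrinks
java -Xmx35g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
```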
Daniel, thanks for all the information! We'll explore this a bit on our side. Mostly trying to figure out if there's a way to re-use existing data, e.g., samblaster's discordant/split read output. Re-running the complete alignment for 100x tumor/normal pairs would be a bit expensive.
GRIDSS needs bwa to identify split reads from the soft-clipped reads (both for input reads and the break-end contigs). On typical (human WGS) data, it's only about 1-2% of the reads that GRIDSS feeds to the aligner; it's definitely not the full 100x! The simplest approach is to just give the sorted BAM file to GRIDSS and let GRIDSS handle the rest. As long as bwa is on the path and the bwa index is in the expected location, that should be all that is required for initial integration.
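For context, "the expected location" presumably means the standard bwa index files sitting next to the reference FASTA; a minimal sketch, with the reference filename as a placeholder:

```bash
# Build the bwa index once; it writes GRCh37.fa.amb/.ann/.bwt/.pac/.sa
# next to the FASTA, which is where bwa (as invoked by GRIDSS) will look for it
bwa index GRCh37.fa

# bwa itself must be resolvable on PATH for GRIDSS's realignment call-out
which bwa
```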
Calling for single sample, paired somatic and pooled cases with GRIDSS (https://github.com/PapenfussLab/gridss). Uses the top level gridss.CallVariants target for doing all preparation from aligned input files. It works on smaller samples but uses a lot of time and resources on large WGS runs; ideally we'll re-use previous components from bcbio preparation in the final integration. #1829
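For readers unfamiliar with that entry point, the invocation the commit wires up looks roughly like the sketch below; the Picard-style argument names are recalled from the GRIDSS documentation rather than taken from this thread, and the file names are placeholders:

```bash
# One end-to-end run: metrics collection, SV read extraction, assembly and
# variant calling are all driven from the single gridss.CallVariants target
java -Xmx31g -cp gridss-jar-with-dependencies.jar gridss.CallVariants \
    REFERENCE_SEQUENCE=GRCh37.fa \
    INPUT=normal.bam INPUT=tumor.bam \
    OUTPUT=gridss-sv.vcf \
    ASSEMBLY=gridss-assembly.bam \
    WORKING_DIR=gridss-work \
    TMP_DIR=gridss-work
```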
Daniel, Oliver and Sven-Eric;
For a large WGS sample (185Gb input BAM) this ends up being too resource intensive to work practically. I ended up with a long runtime doing Picard extraction and sorting, and filled up the last ~600Gb of disk space on my AWS test run. It would be cool to explore entry points where we could avoid some of the preparation process. We can give GRIDSS split and discordant reads prepared via samblaster (https://github.com/GregoryFaust/samblaster). Daniel, would that be a possible entry point to avoid some of the overhead?
It also sounds like the disk space issue might be due to name and then coordinate sorting during ComputeSamTags. Is that something we could add as a streaming process during alignment to avoid needing all this extra disk IO and temporary file generation?
Thanks for all the help so far. I'm happy to have this in and looking forward to optimizing it.
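For concreteness, the samblaster outputs Brad mentions are generated during alignment roughly as below; thread counts and file names are illustrative:

```bash
# samblaster sits in the alignment stream (read-ID grouped output from bwa mem),
# marks duplicates, and writes discordant pairs and split reads to side files
bwa mem -t 16 GRCh37.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samblaster -d discordants.sam -s splitters.sam \
    | samtools sort -@ 8 -o sample.bam -
```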
GRIDSS intermediate files should only take around 10-25% of the size of the input file, so I've had a look into it and the high disk usage is due to a combination of:
The most obvious solution is to drop the low compression JVM argument and update GRIDSS to delete the temporary files for each step as soon as the subsequent step completes. Combined, these should drop the GRIDSS disk usage to under 100Gb for your input BAM, which is much more reasonable.
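My reading of "the low compression JVM argument" is the htsjdk compression system property used to speed up intermediate BAM writing; whether bcbio passes exactly this property is an assumption on my part, but the property itself is standard htsjdk:

```bash
# Low compression: faster writes but much larger intermediate BAMs
java -Dsamjdk.compression_level=1 -cp gridss-jar-with-dependencies.jar gridss.CallVariants ...

# Omitting the property falls back to htsjdk's default compression level,
# shrinking the temporary files at some CPU cost
java -cp gridss-jar-with-dependencies.jar gridss.CallVariants ...
```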
There are a few issues with using samblaster for read extraction:
Although GRIDSS performs split read identification from clipped reads, there's no reason that it couldn't be done earlier, and samblaster's recommendation of using YAHA is probably a better option than the GRIDSS default of calling bwa. If decent split read identification was performed upstream (and it outputs split read alignments with the SA tag), then there's no need for GRIDSS to do it and this step can be skipped. Although bwa in theory does this split read alignment, in practice I've found that feeding the soft clips back to bwa again recovers additional split reads. One big potential gain is to perform some of the GRIDSS pre-processing before the reads are coordinate sorted. This can indeed be streamed, as GRIDSS doesn't actually need name-sorted BAM files, merely that the R1 and R2 SAM records are grouped together. The ComputeSamTags pre-processing operations require information from the mate. Namely:
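As a quick, illustrative way to check whether an upstream tool has already written SA-tagged split read alignments into a BAM (in which case the GRIDSS split read identification step could in principle be skipped):

```bash
# Count records carrying an SA (supplementary alignment) tag; a non-trivial
# count suggests split read identification already happened upstream
samtools view input.bam | grep -c 'SA:Z:'
```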
To strip out the GRIDSS pre-processing stages completely, we would need to:
If you also skipped running gridss.AnnotateReferenceCoverage then GRIDSS could run purely off the extracted BAM. Up to 2/3 of the execution time of GRIDSS is currently spent just reading the input file (extract metrics; extract SV reads; annotate reference allele coverage).
Thanks for all the detailed information, Daniel. I am not sure we have the resources to make large-scale changes at this time. The additional computational time is manageable, in particular since we would run GRIDSS in parallel with other CNV and SV callers, which also take 12+ hours on a typical 100x tumor/normal pair. Storage, however, is going to be problematic, especially if streaming isn't ideal due to the issues you raised. Brad, would you like me to run GRIDSS locally to get a better sense of overall resource requirements?
Do not use less compressed BAMs for temporary files. Thanks to @d-cameron #1829
Daniel;
It also sounded from your comment like there were some intermediate disk space fixes you were looking at inside GRIDSS. Is that right? If so, I'll wait for those, update the GRIDSS conda package and then retry with the compression fix in place. Thanks again.
Correct. There's no need to keep the intermediate files from 5 different steps and only delete them all at the end, when peak disk usage can be reduced by deleting earlier.
Yes - between compression and deletion timing, peak additional disk usage should be less than the size of the input file. My guess is a peak around 50%, but I'll need to rerun some samples to get some numbers for you.
Brilliant, thanks Daniel. Let us know when things are in place within GRIDSS and I'll update the bioconda package and do more testing with the bcbio runs. Much appreciated.
… interest (instead of all unmapped reads) bcbio/bcbio-nextgen#1829
I've released GRIDSS 1.3.4, which includes the disk usage optimisations. Sorry for the delay.
Daniel;
That is somewhat strange, as the sorting just calls the same classes that Picard SortSam uses. There are a couple of options I can think of:
With samtools 1.4 now properly multi-threaded, it might be worth adding the capability for external sorting to GRIDSS in the same manner that external alignment is performed. That still might not solve this disk usage issue, though.
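For reference, the kind of external, multi-threaded sort being suggested; samtools 1.4+, with thread count, per-thread memory and paths purely illustrative:

```bash
# Coordinate-sort on 8 threads with 2Gb per sort thread, spilling temporary
# chunks under /scratch instead of the GRIDSS working directory
samtools sort -@ 8 -m 2G -T /scratch/sorttmp -o sorted.bam namegrouped.bam
```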
I've just run GRIDSS 1.3.4 on ERA172924 (Illumina PCR-free 50x WGS NA12878) and got the following disk usage. Input file: 116G. Peak disk usage occurred during the merging of the split reads identified by the external aligner with the extracted reads (51G in WORKING_DIR), and again in the immediately following step that sorts back to coordinate sort order (30G in WORKING_DIR, 22G in TMP_DIR). For GiaB HG002 bwa-aligned against hg38, a 159G BAM file results in 57G final WORKING_DIR usage. Extrapolating from the peak-to-final ratio of 3.4, I would estimate 194G peak usage. How large is the SV subset bam file (WORKING_DIR/input.bam.gridss.working/gridss.tmp.extracted.input.bam) on your data set?
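To answer that last question on a given run, a simple check along these lines would do; the paths follow the WORKING_DIR/TMP_DIR layout quoted above:

```bash
# Size of the SV read subset GRIDSS extracted from the input BAM
ls -lh WORKING_DIR/input.bam.gridss.working/gridss.tmp.extracted.input.bam

# Overall intermediate disk usage, for comparison with the numbers above
du -sh WORKING_DIR TMP_DIR
```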
GRIDSS 1.4.0 is now released. There's a bug fix for some reads being unnecessarily included in the intermediate files, as well as some run-time performance improvements. Hopefully the updated version will behave better on your sample data.
@chapmanb, is this still of interest in the light of all the CNV caller work you are undertaking at the moment, or would you prefer me to close this issue? |
Sven-Eric -- we are still planning to work on GRIDSS, but after the CNV updates finish. Thanks for checking on this; I'd love to leave it open as we try to improve the GRIDSS integration. |
Sven-Eric, @pdiakumis in my group has been comparing GRIDSS calls for a number of WGS tumor/normal samples with MANTA (and Hartwig's BreakPointInspector). Working with Daniel we seem to have a good set of filters; we might also look at some of Daniel's recent improvements. Happy to keep the issue open.
FYI: I've added an example pipeline script to GRIDSS (https://github.com/PapenfussLab/gridss/blob/master/example/example_pipeline.sh) to show how one can avoid creating most of the intermediate files and avoid the additional sorting operations by hooking the GRIDSS pre-processing step further upstream in the pipeline.
Thanks, closing this along with #2500. Please reopen if GRIDSS is something that would be super useful to improve, but there hasn't been any action on its development for a year so I'll close it for now. Thank you! |
FYI: GRIDSS now has a driver shell script that does most of the optimisations outlined above. Compared to the bcbio version, the latest GRIDSS release has lower peak temp disk usage, lower peak memory, improved multithreaded CPU utilisation, and lower overall CPU usage.
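As an illustration of what the driver shell script looks like in use, a sketch in this general shape; the option names are recalled from the GRIDSS documentation rather than taken from this thread, so check the installed release's --help before relying on them:

```bash
# One wrapper call handles pre-processing, assembly and variant calling,
# managing its own intermediates under the working directory
gridss.sh --reference GRCh37.fa \
    --output gridss-sv.vcf \
    --assembly gridss-assembly.bam \
    --threads 8 \
    --jar gridss-jar-with-dependencies.jar \
    normal.bam tumor.bam
```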
Thanks for all of the information @d-cameron!
Hi Daniel,
Also related to #2500 (comment). cc'ing @AndrewPattison to this discussion, who's been working with GRIDSS and comparing stuff a lot lately. I'll be back in Melbourne in a couple of weeks to start work on a feature branch.
All the best,
Hi Peter @pdiakumis! I see there is a lot of discussion of GRIDSS.
Sergey
We are waiting for the hg38 resource bundle to be ready, but are likely to run GRIDSS outside of bcbio to be able to make use of post-processed/filtered SNVs.
So it seems as if the gridss SV caller is doing better than manta on some SV classes. Perhaps it would be worth adding it? It's based on Java, but seems to be deployed as a micro-pipeline. Haven't checked the license yet.