
Feature request: gridss SV caller #1829

Closed

schelhorn opened this issue Feb 23, 2017 · 30 comments

Comments

@schelhorn

It seems the gridss SV caller is doing better than manta on some SV classes. Perhaps it would be worth adding? It's based on Java, but appears to be deployed as a micro-pipeline. I haven't checked the license yet.

@schelhorn
Author

GPL-3, should be fine.

@ohofmann
Collaborator

ohofmann commented Feb 23, 2017

Daniel is across the street; integration of GRIDSS is on my todo list. It can take split read and other information from existing files, which gets around the micro-pipeline issue, I think.

@ohofmann
Collaborator

Paging @d-cameron .

@d-cameron

d-cameron commented Feb 24, 2017

GRIDSS does have a lot of stages and, in an attempt to make components reusable so that tool authors don't keep reinventing the same wheels, it uses a modular design. At a high level it's sorted BAM in, VCF out, just like many other callers. There are a few options for pipeline integration:

  • it can be treated as a single command-line tool (see the sketch after this list). The only dependency complication in this approach is the requirement for bwa mem to be on the PATH (and the bwa index to be in the same location as the reference fasta). Everything else can be treated as internal to the caller, including the paths of intermediate files generated by GRIDSS. It uses htsjdk and Picard tools for parsing so, to the user, the GRIDSS program(s) look just like additional Picard tools.

  • the GRIDSS workflow can be incorporated into the calling pipeline. This approach allows a more flexible pipeline with the potential for better performance, both in computation time and in results. For example, the SoftClipsToSplitReads step converts soft clipped reads into split reads with the SA tag set (i.e. matching bwa) by invoking an external aligner, and the entire stage is unnecessary if you use an aligner that aggressively splits reads and reports using the standard SA tag. GRIDSS doesn't care which aligner is used, so there might be a better choice than the default of calling bwa. Similarly, ComputeSamTags could be incorporated into an upstream FASTQ->BAM alignment stage, removing the need for GRIDSS to perform a name sort just to populate the tags it requires (such as R2 and Q2). Even something like running CollectGridssMetrics upstream would shave a quarter of the runtime off GRIDSS, since the execution time is dominated by the 3 passes over the input BAM (extract metrics, extract SV reads, then a third pass for reference coverage annotations). Technically, the entire SV read extraction process is optional and the back end of the GRIDSS pipeline (assembly & variant calling) could work straight off the input BAM. This approach also allows better CPU utilisation, as the calling workflow would know how many cores are actually used in which stage.

The first option is the simplest and probably the best starting point.
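
For the first option, here is a rough sketch of what a single-command invocation could look like. The parameter names are assumed from the Picard-style convention described above and the gridss.CallVariants entry point referenced later in this thread; paths, heap size and exact argument names are placeholders and may differ between GRIDSS versions.

    # Assumes bwa is on the PATH and the bwa index sits next to hg19.fa;
    # everything else (intermediate files, working directories) is managed by GRIDSS.
    java -ea -Xmx31g -cp gridss.jar gridss.CallVariants \
        REFERENCE_SEQUENCE=hg19.fa \
        INPUT=normal.bam INPUT=tumour.bam \
        OUTPUT=gridss.sv.vcf \
        ASSEMBLY=gridss.assembly.bam \
        WORKING_DIR=gridss_working \
        TMP_DIR=gridss_tmp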

The other complication is the call out to https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library through a JNI wrapper for the AnnotateInexactHomology stage. I've included a pre-compiled version in the GRIDSS jar, but if it is being packaged with something that has a build script, it's probably worth including the compilation of the JNI SSW binary for the target system in the build script itself.

Java 1.8 shouldn't be problematic as it's needed by htsjdk/picard/GATK anyway.

GRIDSS can be built from source using Maven, or you can just download the jar with the dependencies already packaged inside (hence the 38Mb download).

Aside regarding https://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#tuning-cores: you almost never want to give a Java process 32-48Gb of memory, as the inability to use compressed OOPs for heaps over 32Gb effectively means it has less memory available when given 35Gb than when given 31Gb. The break-even point depends on the program, but it is somewhere around 40-48Gb.
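
As an illustration of the compressed OOPs point (not from the original comment), the standard JVM diagnostic flags show where the cut-off falls on a given JVM:

    # UseCompressedOops is normally enabled up to roughly a 32Gb heap and silently
    # disabled above it, which is why a 31Gb heap can be more usable than a 35Gb one.
    java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # typically true
    java -Xmx35g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # typically false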

@ohofmann
Collaborator

Daniel, thanks for all the information! We'll explore this a bit on our side. Mostly trying to figure out if there's a way to re-use existing data, e.g., samblaster's discordant / split read output. Re-running the complete alignment for 100x tumor/normal pairs would be a bit expensive.

@d-cameron

d-cameron commented Feb 25, 2017

Re-running the complete alignment for 100x tumor/normal pairs would be a bit expensive.

GRIDSS needs bwa to identify split reads from the soft clipped reads (both for the input reads and the break-end contigs). On typical (human WGS) data, GRIDSS feeds only about 1-2% of the reads to the aligner. It's definitely not the full 100x!

The simplest approach is to just give the sorted BAM file to GRIDSS and let GRIDSS handle the rest. As long as bwa is on the PATH and the bwa index is in the expected location, that should be all that is required for initial integration.
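
For completeness, a minimal sketch of preparing that expected layout (the reference file name is a placeholder):

    # bwa index writes the .amb/.ann/.bwt/.pac/.sa files alongside the reference
    # FASTA, which is the location GRIDSS expects; bwa itself must be on the PATH.
    bwa index hg19.fa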

chapmanb added a commit that referenced this issue Mar 2, 2017
Calling for single sample, paired somatic and pooled cases with
GRIDSS (https://github.com/PapenfussLab/gridss). Uses the top level
gridss.CallVariants target for doing all preparation from aligned
input files. It works on smaller samples but uses a lot of time and
resources on large WGS runs; ideally we'll re-use previous components
from bcbio preparation in the final integration. #1829
@chapmanb
Member

chapmanb commented Mar 2, 2017

Daniel, Oliver and Sven-Eric;
Thanks so much for the discussion and pointers. I put together an initial GRIDSS integration in bcbio using the top level gridss.CallVariants targets. This lets GRIDSS do all the input file preparation and works well for smaller test samples.

For a large WGS sample (185Gb input BAM) this ends up being too resource intensive to work practically. I ended up with a long runtime doing Picard extraction and sorting and filled up the last ~600Gb of disk space on my AWS test run. It would be cool to explore entry points where we could avoid some of the preparation process. We can give GRIDSS split and discordant reads prepared via samblaster (https://github.com/GregoryFaust/samblaster). Daniel, would that be a possible entry point to avoid some of the overhead? It also sounds like the disk space issue might be due to name and then coordinate sorting during ComputeSamTags. Is that something we could add as a streaming process during alignment to avoid needing all this extra disk IO and temporary file generation?

Thanks for all the help so far. I'm happy to have this in and looking forward to optimizing it.

@d-cameron

GRIDSS intermediate files should only take around 10-25% of the size of the input file, so I've had a look into it and the high disk usage is due to a combination of:

  • the -Dsamjdk.compression_level=1 parameter - useful for large high-speed disk arrays, but not the best idea for general use. Dropping this parameter will improve the BAM gzip compression.
  • Deleting per-input intermediate files only after all pre-processing is done. There are about 5 steps and they all write their own intermediate file. They only contain about 10% of the reads, but combined with fast/low compression and keeping them all around until the end, this results in high peak disk usage
  • My institute getting a new HPC cluster with 500Tb of storage so I haven't been keeping an eye on disk usage in recent GRIDSS versions

The most obvious solution is to drop the low compression JVM argument and update GRIDSS to delete the temporary files for each step as soon as the subsequent step completes. Combined, these should drop the GRIDSS disk usage to under 100Gb for your input BAM which is much more reasonable.
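
For context, samjdk.compression_level is a standard htsjdk JVM system property, and dropping it from the java invocation restores the default (higher) BAM compression at some CPU cost. A hedged sketch, with the rest of the command trimmed:

    # Fast but poorly compressed intermediates (the setting discussed above):
    java -Dsamjdk.compression_level=1 -cp gridss.jar gridss.CallVariants ...
    # Default compression (smaller intermediates): simply omit the property.
    java -cp gridss.jar gridss.CallVariants ...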

@d-cameron

d-cameron commented Mar 2, 2017

There are a few issues with using samblaster for read extraction:

  • it uses the proper pair SAM flag to identify discordant read pairs. I've had issues with aligners setting the proper pair flag based purely on alignment contig and orientation, regardless of inferred fragment size (which is why GRIDSS calls Picard's CollectInsertSizeMetrics and uses its output to determine the expected fragment size distribution).

  • it outputs unmapped/clipped reads to a fastq, not a bam

  • I'm not sure whether it supports extracting the entire read pair when only one of the reads is mapped

  • does it support sending split and discordant reads to the same file?

Although GRIDSS performs split read identification from clipped reads, there's no reason that it couldn't be done earlier, and samblaster's recommendation of using YAHA is probably a better option than the GRIDSS default of calling bwa. If decent split read identification is performed upstream (and outputs split read alignments with the SA tag), then there's no need for GRIDSS to do it and this step can be skipped. Although bwa in theory does this split read alignment, in practice I've found that feeding the soft clips back to bwa again recovers additional split reads.

One big potential gain is to perform some of the GRIDSS pre-processing before the reads are coordinate sorted. This can indeed be streamed, as GRIDSS doesn't actually need name-sorted bam files, merely that the R1 and R2 sam records are grouped together. The ComputeSamTags pre-processing operations require information from the mate, namely:

  • Softening hard clips for supp records (as per bwa output) pulls the info from the soft clipped primary record
  • R2, Q2, MC, MQ come from the mate.
    • samblaster --addMateTags will work for MC and MQ (see the sketch after this list)
    • Writing R2 and Q2 for all reads would double the size of the BAM file. If the insert size distribution was known (or the proper pair flag could be trusted), then these could be written only for discordant read pairs to save space.
    • Might be problematic to stream, but a compromise of using an insert size distribution based on the first X reads could work.
  • GRIDSS also adjusts the details of the mate if they don't actually match what the mate read says
    • Mostly this ends up stripping the mate info from reads when some upstream step removed the mate read from the bam. Pretty much the same as Picard's FixMateInformation but unpairs reads instead of crashing on them.
  • For each read, GRIDSS adds the supplementary flag to all SA alignments except one. This can be skipped if we know what the pipeline does (i.e. not bwa mem -M)
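
As a hedged sketch of the streaming idea (not the exact bcbio or GRIDSS command line; file names and thread counts are placeholders), mate tags and the split/discordant subsets can be produced on the fly during alignment, before any coordinate sort:

    # samblaster --addMateTags populates MC and MQ; -d/-s capture the discordant
    # and split-read subsets as SAM. R2/Q2 and the other GRIDSS-specific
    # adjustments described above would still need a separate step.
    bwa mem -t 16 hg19.fa sample_R1.fastq.gz sample_R2.fastq.gz \
      | samblaster --addMateTags -d sample.disc.sam -s sample.split.sam \
      | samtools sort -@ 16 -o sample.bam -
    samtools index sample.bam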

@d-cameron

d-cameron commented Mar 2, 2017

To strip out the GRIDSS pre-processing stages completely, we would need to:

  • Pre-calculate metrics (can be done at any stage in the pipeline).
    • If you're calling Picard's CollectMultipleMetrics then this can be replaced with gridss.CollectGridssMetrics, as it's a subclass with extra GRIDSS-specific metrics added (see the sketch at the end of this comment).
  • Extract soft clipped, split, indel-containing reads, discordant read pairs, and read pairs with only one read mapped into a coordinate sorted bam file
  • Have the input bam in the correct format
    • R2, Q2, MC, MQ, NM, SA tags set
    • sam flags (including the supplementary flag on SA records) and mate fields set correctly
  • do a better job of split read identification than just running bwa (optional, but highly desirable)

If you also skipped running gridss.AnnotateReferenceCoverage then GRIDSS could run purely off the extracted BAM. Up to 2/3 of the execution time of GRIDSS is currently spent just reading the input file (extract metrics; extract SV reads; annotate reference allele coverage).
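
A hedged sketch of the metrics pre-calculation point (argument names assumed from the Picard convention described earlier in the thread; check the GRIDSS documentation for the exact set):

    # Drop-in replacement for a Picard CollectMultipleMetrics call, so the metrics
    # pass over the input BAM happens once, upstream of GRIDSS.
    java -Xmx4g -cp gridss.jar gridss.CollectGridssMetrics \
        I=sample.bam \
        O=sample.bam.gridss_metrics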

@ohofmann
Collaborator

ohofmann commented Mar 2, 2017

Thanks for all the detailed information, Daniel. I am not sure we have the resources to make large scale changes at this time. The additional computational time is manageable, in particular since we would run GRIDSS in parallel with other CNV and SV callers which also take 12+ hours on a typical 100x tumor/normal pair. Storage, however, is going to be problematic, especially if streaming isn't ideal due to the issues you raised.

Brad, would you like me to run GRIDSS locally to get a better sense of overall resource requirements?

chapmanb added a commit that referenced this issue Mar 3, 2017
Do not use less compressed BAMs for temporary files. Thanks to @d-cameron #1829
@chapmanb
Member

chapmanb commented Mar 3, 2017

Daniel;
Thanks so much for the detailed feedback; this is really helpful. Apologies for missing the impact of the -Dsamjdk.compression_level=1 flag, I should have suspected this one. I took it from the GRIDSS example script but didn't realize it applied to all of the intermediates written to disk. Echoing Oliver's comments, it sounds like we should retest with this in place and then investigate other options for moving the pre-processing upstream. We could definitely massage the samblaster split/discordant outputs to be more in line with GRIDSS's needs, but we'll have a hard time changing over full-scale to the approach you're using, since it sounds like the clipping changes will affect other steps like variant calling. Am I reading everything right?

It also sounded from your comment like there were some disk space intermediate fixes you were looking at inside of GRIDSS. Is that right? If so, I'll wait for those, update the GRIDSS conda package and then retry with the compression fix in place. Thanks again.

@d-cameron

It also sounded from your comment like there were some disk space intermediate fixes you were looking at inside of GRIDSS. Is that right?

Correct. There's no need to keep the intermediate files for 5 different steps and only delete them all at the end, when the peak disk usage can be reduced by deleting them earlier.

@d-cameron

d-cameron commented Mar 4, 2017

it sounds like we should retest with this in place and then investigate other options for moving the pre-processing upstream.... Am I reading everything right?

Yes - between compression and deletion timing, peak additional disk usage should be less than the size of the input file. My guess is a peak around 50% but I'll need to rerun some samples to get some numbers for you.

@chapmanb
Member

chapmanb commented Mar 5, 2017

Brilliant, thanks Daniel. Let us know when things are in place within GRIDSS and I'll update the bioconda package and do more testing with the bcbio runs. Much appreciated.

d-cameron added a commit to PapenfussLab/gridss that referenced this issue Mar 23, 2017
@d-cameron

I've released GRIDSS 1.3.4, which includes the disk usage optimisations. Sorry for the delay.

@chapmanb
Member

Daniel;
Thanks much for this. I tried a run with 1.3.4 and am still getting pretty high disk usage (600Gb for a 185Gb input), filling up the space during the BAM sorting step. I have dropped -Dsamjdk.compression_level=1; is there anything else I should be tweaking with this latest version to help reduce it? Thanks again for the help.

@d-cameron

d-cameron commented Mar 29, 2017

That is somewhat strange as the sorting just calls the same classes that Picard SortSam uses. There are a couple of options I can think of:

  • Is snappy compression working in your environment? There should be a snappy-1.0.33-libsnappyjava.so file (as well as a libsswjni.so) in the directory GRIDSS was executed from (see the checks at the end of this comment). If it's missing, that would explain the high disk usage, as picard/htsjdk uses snappy to compress the sorting intermediate files on disk.
  • Is it dying at the first sorting step (name sorting), or the second (back to coordinate)? Which input.bam.gridss.working/gridss.tmp.*.bam files exist and how large are they? If they're comparable in size to the input file then GRIDSS is probably extracting most of the reads.
  • Disk usage of the first-stage sort could be reduced by combining the sort and extract into a single operation. Unfortunately, that only works for the first sort as it can use the input bam directly.
  • This can happen when the library fragment size is less than the read length and adapter trimming is not performed. In this sort of data, practically every single read is soft clipped so GRIDSS extracts them all. GRIDSS has logic to ignore soft-clipped adapters but that's currently downstream of the read extraction step.
  • Is the test data public?

With samtools 1.4 now properly multi-threaded, it might be worth adding the capability for external sorting to GRIDSS in the same manner external alignment is performed. That still might not solve this disk usage issue though.
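
Two quick checks for the snappy and intermediate-file questions above (library and file names as given in this comment; run from the GRIDSS launch directory):

    # Was snappy (and the SSW JNI library) unpacked into the launch directory?
    ls -l snappy-*-libsnappyjava.so libsswjni.so
    # Which GRIDSS intermediates exist, and how large are they?
    ls -lh input.bam.gridss.working/gridss.tmp.*.bam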

@d-cameron

I've just run GRIDSS 1.3.4 on ERA172924 (Illumina PCR-free 50x WGS NA12878) and got the following disk usage (du with 1 second polling interval):

Input file: 116G
work (final usage): 15G
work (peak usage): 51G
tmp (final usage): 0G
tmp (peak usage): 22G
total (peak usage): 51G

Peak disk usage occurred during the merging of the split reads identified by the external aligner with the extracted reads (51G in WORKING_DIR), and again in the immediately following step that sorts back to coordinate sort order (30G in WORKING_DIR, 22G in TMP_DIR).

For GiaB HG002 bwa aligned against hg38, a 159G BAM file results in 57G final WORKING_DIR usage. Extrapolating from the final/peak ratio of 3.4, I would estimate 194G peak usage.

How large is the SV subset bam file (WORKING_DIR/input.bam.gridss.working/gridss.tmp.extracted.input.bam) on your data set?
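
One way to answer that (WORKING_DIR stands for whatever working directory the run used; the path is otherwise as written above):

    # Total intermediate usage, then the size of the extracted SV subset BAM.
    du -sh WORKING_DIR/input.bam.gridss.working/
    du -h  WORKING_DIR/input.bam.gridss.working/gridss.tmp.extracted.input.bam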

@d-cameron

GRIDSS 1.4.0 is now released. There's a bug fix for some reads being unnecessarily included in the intermediate files, as well as some run-time performance improvements. Hopefully the updated version behaves better on your sample data.

@schelhorn
Author

@chapmanb, is this still of interest in the light of all the CNV caller work you are undertaking at the moment, or would you prefer me to close this issue?

@chapmanb
Member

chapmanb commented Nov 2, 2017

Sven-Eric -- we are still planning to work on GRIDSS, but after the CNV updates finish. Thanks for checking on this; I'd love to leave it open as we try to improve the GRIDSS integration.

@ohofmann
Collaborator

ohofmann commented Nov 3, 2017

Sven-Eric, @pdiakumis in my group has been comparing GRIDSS calls for a number of WGS tumor/normal samples with MANTA (and Hartwig's BreakPointInspector). Working with Daniel we seem to have a good set of filters; we might also look at some of Daniel's recent improvements. Happy to keep the issue open.

@d-cameron

FYI: I've added an example pipeline script to GRIDSS (https://github.com/PapenfussLab/gridss/blob/master/example/example_pipeline.sh) to show how one can avoid creating most of the intermediate files and avoid the additional sorting operations by hooking the GRIDSS pre-processing step further upstream in the pipeline.

@roryk
Collaborator

roryk commented Aug 10, 2019

Thanks, closing this along with #2500. Please reopen if GRIDSS is something that would be super useful to improve, but there hasn't been any action on its development for a year so I'll close it for now. Thank you!

@roryk roryk closed this as completed Aug 10, 2019
@d-cameron

d-cameron commented Aug 15, 2019

FYI: GRIDSS now has a driver shell script that does most of the optimisations outlined above. Compared to the bcbio version, the latest GRIDSS release has lower peak temp disk usage, lower peak memory, improved multithread CPU utilisation, and lower overall CPU usage.

@roryk roryk reopened this Aug 15, 2019
@roryk
Collaborator

roryk commented Aug 15, 2019

Thanks for all of the information @d-cameron!

@pdiakumis
Contributor

Hi Daniel,

Also related to #2500 (comment).
Thanks for all your great work with GRIDSS!
We've got a bit of work to do with the upstream processes in our group at the moment, but GRIDSS, LINX et al. are definitely in our plans. An older version of PURPLE is already supported in bcbio (#2532), so a potential starting point for us would be to bump those modules to work with the updated PURPLE pre-processing steps and feed it GRIDSS SVs. We'd of course have to look at adapting the GRIDSS bcbio module too, which hasn't been touched in a year.

cc'ing @AndrewPattison to this discussion who's been working with GRIDSS and comparing stuff a lot lately. I'll be back in Melbourne in a couple weeks to start work on a feature branch.

All the best,
Peter

@naumenko-sa naumenko-sa mentioned this issue May 29, 2020
@naumenko-sa
Contributor

Hi Peter @pdiakumis!

I see there is a lot of discussion of GRIDSS.
We are currently making an inventory of all active issues.
Do you know what the status of the GRIDSS module in bcbio is?
Is anybody working to improve and document it?

Sergey

@ohofmann
Collaborator

We are waiting for the hg38 resource bundle to be ready, but are likely to run Gridss outside of bcbio to be able to make use of post-processed / filtered SNVs.
