bamCoverage --extendReads crashes when no default fragment size is provided #17

Closed
cnluzon opened this issue May 21, 2020 · 7 comments · Fixed by #19

cnluzon commented May 21, 2020

I have found that bamCoverage sometimes crashes on the scale_bigwig step.

It seems to be happening when --extendReads is provided with no default fragment size.

I don't think it comes from the BAM file manipulation during duplicate marking, because I have tried running the same deepTools command on the equivalent mapped BAM file (straight out of bowtie) and it still crashes.

The error I get is this one from deepTools:

AssertionError: fragment start greater than fragmentend for read NB501365:478:H2HFNBGXF:2:22301:2673:4193_CATGTA

However, if I go and look at the BAM file:

NB501365:478:H2HFNBGXF:2:12207:13587:20077_TGACTA	163	chr7	125003593	42	34M	=	125003631	74	TTCTACTGTCTGAAAGGATCATAATGACAAAAAT	/AA/A/E6EEEEEEEEEEEEE/EEEEEEEEEEEE	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:34	YS:i:0	YT:Z:CP
NB501365:478:H2HFNBGXF:2:12207:13587:20077_TGACTA	83	chr7	125003631	42	36M	=	125003593	-74	GCTACCAATAATATGGGAAAATATTATTTGTTAGAA	E/EEE<EA6/EEEA/EE/</EE/EEEEEEAAE//EE	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:36	YS:i:0	YT:Z:CP

The corresponding lines are perfectly valid, even though they have one peculiarity: the first mate (the second line in the file, since it maps to the minus strand according to its flag value of 83) starts at a higher coordinate than the second mate (125003593).

However, I don't think this is the real problem, because it seems like a fairly common situation and some of the scaled bw files are produced in full. I also found a case where the paired mate is not mapped and the run also crashes:

NB501365:478:H2HFNBGXF:2:23111:15285:18671_AAACTT       137     chr17   72329   1       34M     =       72329   0       ACTGAGAAAGTTCAGAGCCTGGAGAGAAGGCTGG    6AAAAEEEEEEEEEEEEEAEEEEEEEAAE/EEEE      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:34       YT:Z:UP
NB501365:478:H2HFNBGXF:2:23111:15285:18671_AAACTT       69      chr17   72329   0       *       =       72329   0       ACGACGCTCTTCCGATCTGAAGGGGCACATCTACAGC AAEEEEEEEAEEEEAEEAEEEEEEEE/EEEEE/EEEA   YT:Z:UP

I have also tried isolating the reads that cause this error and looking at them in IGV; they load fine and make sense when I inspect them.

The same bamCoverage run does not crash if I do provide a fragment size value:

bamCoverage -p 20 --binSize 1 --extendReads 150 --scaleFactor 69.91912350319042 --bam mapped/bamfile.bam -o scaled/scaled.bw

However, I am not sure what effect this has on the final bigwig file, so I am not sure this counts as a fix.

Another thing: if I run several times on the same BAM file with -p > 1, it crashes on different read IDs each time. I guess this is due to the parallelization of the process, but it suggests the problem is frequent enough that different offending reads keep being drawn.

The TLEN field is also a bit suspicious in those cases but I need to look into it in more detail.

cnluzon commented May 22, 2020

The source of the problem seems to be that, in the cases where our data crashed, the number of mapped, properly paired reads is very low. deepTools estimates the fragment length from properly paired reads, and it tends to add a lot of zeros to the fragment-length distribution when it doesn't find proper pairs.

The fragment length is estimated as the median of these values, so it ends up being 0. There are cases where this estimate is added to or subtracted from fragmentStart to compute the fragment, and afterwards an assertion that fragmentStart < fragmentEnd fails and the run crashes.
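
To make the failure mode concrete, here is a toy Python sketch, not the actual deepTools code: the coordinates and counts are invented, but it mimics how zero-padding drives the median fragment length to 0 and then trips the fragmentStart < fragmentEnd check for a read extended on the minus strand.

```python
# Toy illustration of the failure mode described above (not deepTools code).
import statistics

# A handful of properly paired reads contribute real fragment lengths...
observed_fragment_lengths = [180, 195, 210]
# ...while every read without a proper pair contributes a 0 to the sample.
zero_padding = [0] * 50

fragment_length = int(statistics.median(observed_fragment_lengths + zero_padding))
print("estimated fragment length:", fragment_length)  # -> 0

# Extending a minus-strand read by a zero-length fragment makes the fragment
# start coincide with (or pass) its end, so the assertion below fails with
# the same kind of message as in the error above.
read_end = 125_003_667          # invented coordinate
fragment_start = read_end - fragment_length
fragment_end = read_end
assert fragment_start < fragment_end, "fragment start greater than fragment end"
```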

So I think a possible solution is to take the estimated fragment size from Picard, which seems to be more robust (at least it always estimated something in these cases), and pass it as the parameter to the --extendReads option. Otherwise, we stick to 150 as an estimate.

This fragment length estimate is only used when reads are not paired or not properly paired, so I'd say the risk of adding a lot of imprecision to the result is manageable. In all other cases the fragment is built by filling the gap between the read ends, so you get the true fragment.
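
As a rough sketch of the proposed fix (the file paths, the metrics filename and the 150 bp fallback are placeholders, not the pipeline's actual configuration), something along these lines could read MEDIAN_INSERT_SIZE from the Picard CollectInsertSizeMetrics output and pass it to --extendReads:

```python
# Sketch only: fall back to Picard's median insert size for --extendReads,
# and to a fixed default when Picard could not estimate anything either.
import subprocess

DEFAULT_FRAGMENT_SIZE = 150  # placeholder fallback


def fragment_size_from_picard(metrics_path, default=DEFAULT_FRAGMENT_SIZE):
    """Parse MEDIAN_INSERT_SIZE from a Picard CollectInsertSizeMetrics file."""
    with open(metrics_path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    for i, line in enumerate(lines):
        header = line.split("\t")
        if "MEDIAN_INSERT_SIZE" in header and i + 1 < len(lines):
            values = dict(zip(header, lines[i + 1].split("\t")))
            try:
                size = float(values.get("MEDIAN_INSERT_SIZE", ""))
            except ValueError:
                break
            if size > 0:
                return int(size)
            break
    return default


fragment_size = fragment_size_from_picard("qc/insert_size_metrics.txt")  # hypothetical path
subprocess.run(
    ["bamCoverage", "-p", "20", "--binSize", "1",
     "--extendReads", str(fragment_size),
     "--scaleFactor", "69.91912350319042",
     "--bam", "mapped/bamfile.bam", "-o", "scaled/scaled.bw"],
    check=True,
)
```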

Additionally, it might be good to warn about the situation where very few reads were mapped, but I am not sure this is necessary, because that will show up everywhere in the results anyway.

simonelsasser commented May 22, 2020 via email

cnluzon commented May 22, 2020

It is a bit strange, @simonelsasser, I hadn't noticed it before. I was running a fresh mapping of Banu's human data yesterday and a bunch of BAM files crashed. It seems to happen with several barcodes. But I think that rather than a paired-read problem it's few reads in general, which in turn means few properly paired reads, the ones deepTools needs.

I think the pipeline should ideally be robust to this and just report the values.

I'm running this with the currently proposed fix (take Picard's estimate) to see whether it solves the problem on the full-size data. The test dataset ran successfully with it.

simonelsasser commented May 22, 2020 via email

cnluzon commented May 22, 2020

True, @simonelsasser, I accidentally mapped a mouse sample to hg38 😅, hence the low number of reads. However, I think it's a good finding and the pipeline should be robust to that, so I'd stick with the Picard fragment estimation for the scaled bigwigs.

If a mistake like this is made, I think it's better that the pipeline runs and outputs the actual values; the mistake should be obvious from the mapping summary. At least more obvious than the pipeline crashing because of some obscure problem in the BAM alignment.

There have also been cases where one of the samples had pretty low performance without any mistake in the mapping, and there is no reason for the pipeline to crash because of that.

marcelm commented May 25, 2020

@cnluzon Nice, great debugging! I wonder if this should be reported upstream (to the bamCoverage developers). My opinion would be that bamCoverage itself should not crash, but instead give a helpful error message ("Too few properly paired reads, could not estimate fragment size").

In any case, even when the insert size is taken from Picard’s reported metrics, it would make sense to have some sanity checks that stop the pipeline if some quality metrics are much too low. See how I did something similar in IgDiscover.
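
Something like the following is what I have in mind; it is only a generic sketch (the threshold, the path and the use of pysam's index statistics are my assumptions, not how IgDiscover does it):

```python
# Generic sanity-check sketch: stop the pipeline with a clear message when
# the fraction of mapped reads is suspiciously low, instead of letting a
# downstream tool fail with an obscure error.
import sys

import pysam

MIN_MAPPED_FRACTION = 0.10  # placeholder threshold


def check_mapped_fraction(bam_path, threshold=MIN_MAPPED_FRACTION):
    with pysam.AlignmentFile(bam_path) as bam:  # needs an index (.bai)
        mapped, unmapped = bam.mapped, bam.unmapped
    total = mapped + unmapped
    fraction = mapped / total if total else 0.0
    if fraction < threshold:
        sys.exit(
            f"Sanity check failed for {bam_path}: only {fraction:.1%} of reads are "
            f"mapped (threshold {threshold:.0%}). Was the sample mapped against "
            f"the intended reference?"
        )


check_mapped_fraction("mapped/bamfile.bam")
```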

cnluzon commented May 25, 2020

Thanks! I totally agree with this; I don't think it is very nice behavior in bamCoverage either, and the error is quite obscure. I have not opened an issue there yet because I was looking for a small reproducible example of this.

I also agree with adding sanity-check rules like that, as much QC on the output as we can get. It's true that FastQC covers many things, but the percentage of mapped reads is probably also a good indicator, and it is not collected there. A very high number of reads in the unknown barcode category during demultiplexing could perhaps also be checked for.
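
For the demultiplexing side, the check could be as simple as this sketch (the counts and the threshold are made up; the real numbers would come from the demultiplexing report):

```python
# Illustrative check: flag libraries where too many reads fall into the
# "unknown" barcode bin after demultiplexing.
import sys

MAX_UNKNOWN_FRACTION = 0.30  # placeholder threshold


def check_unknown_barcodes(barcode_counts, threshold=MAX_UNKNOWN_FRACTION):
    """barcode_counts maps each barcode (or 'unknown') to its read count."""
    total = sum(barcode_counts.values())
    fraction = barcode_counts.get("unknown", 0) / total if total else 0.0
    if fraction > threshold:
        sys.exit(f"Sanity check failed: {fraction:.1%} of reads have an unknown barcode.")


check_unknown_barcodes({"AAACTT": 120_000, "CATGTA": 95_000, "unknown": 30_000})
```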
