bamCoverage --extendReads crashes when no default fragment size is provided #17

Closed
cnluzon opened this issue May 21, 2020 · 7 comments · Fixed by #19

cnluzon commented May 21, 2020

I have found that bamCoverage sometimes crashes on the scale_bigwig step.

It seems to be happening when --extendReads is provided with no default fragment size.

I don't think it comes from the BAM file manipulation during duplicate marking, because I have tried running the same deepTools command on the equivalent mapped BAM file (straight out of bowtie) and it still crashes.

The error I get is this one from deepTools:

AssertionError: fragment start greater than fragmentend for read NB501365:478:H2HFNBGXF:2:22301:2673:4193_CATGTA

However, if I go and look at the BAM file:

NB501365:478:H2HFNBGXF:2:12207:13587:20077_TGACTA	163	chr7	125003593	42	34M	=	125003631	74	TTCTACTGTCTGAAAGGATCATAATGACAAAAAT	/AA/A/E6EEEEEEEEEEEEE/EEEEEEEEEEEE	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:34	YS:i:0	YT:Z:CP
NB501365:478:H2HFNBGXF:2:12207:13587:20077_TGACTA	83	chr7	125003631	42	36M	=	125003593	-74	GCTACCAATAATATGGGAAAATATTATTTGTTAGAA	E/EEE<EA6/EEEA/EE/</EE/EEEEEEAAE//EE	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:36	YS:i:0	YT:Z:CP

The corresponding lines are perfectly valid, even though they have one peculiarity: the first mate (the second line in the file, since it maps to the minus strand according to its flag value of 83) starts at a higher coordinate than the second mate (125003593).

However, I don't think this is the real problem, because it seems like a fairly common situation and some of the scaled bw files are produced in full. I also found a case where the paired mate is not mapped and the run also crashes:

NB501365:478:H2HFNBGXF:2:23111:15285:18671_AAACTT       137     chr17   72329   1       34M     =       72329   0       ACTGAGAAAGTTCAGAGCCTGGAGAGAAGGCTGG    6AAAAEEEEEEEEEEEEEAEEEEEEEAAE/EEEE      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:34       YT:Z:UP
NB501365:478:H2HFNBGXF:2:23111:15285:18671_AAACTT       69      chr17   72329   0       *       =       72329   0       ACGACGCTCTTCCGATCTGAAGGGGCACATCTACAGC AAEEEEEEEAEEEEAEEAEEEEEEEE/EEEEE/EEEA   YT:Z:UP

I have also tried isolating the reads that cause this error and looking at them in IGV; they load fine and make sense when I inspect them.

The same bamCoverage run does not crash if I do provide a fragment size value:

bamCoverage -p 20 --binSize 1 --extendReads 150 --scaleFactor 69.91912350319042 --bam mapped/bamfile.bam -o scaled/scaled.bw

However, I am not sure what effect this has on the final bigwig file, so I am not sure this counts as a fix.

Another thing: if I run several times on the same BAM file with -p > 1, it crashes on different read IDs each time. I guess this is due to the parallelization of the process, but it suggests the problem is frequent enough that different offending reads keep being drawn.

The TLEN field is also a bit suspicious in those cases but I need to look into it in more detail.

cnluzon commented May 22, 2020

The source of the problem seems to be that, in the cases where our data crashed, the number of mapped, properly paired reads is very low. deepTools estimates the fragment length from properly paired reads, and it tends to add a lot of zeros to the fragment-length distribution when it doesn't find proper pairs.

The fragment length is estimated as the median of these values, so it ends up being 0. There are cases where this estimate is added to or subtracted from fragmentStart to compute the fragment, and afterwards an assertion that fragmentStart < fragmentEnd fails and the run crashes.
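
To make the failure mode concrete, here is a toy Python sketch, not the actual deepTools code: the coordinates and counts are invented, but it mimics how zero-padding drives the median fragment length to 0 and then trips the fragmentStart < fragmentEnd check for a read extended on the minus strand.

```python
# Toy illustration of the failure mode described above (not deepTools code).
import statistics

# A handful of properly paired reads contribute real fragment lengths...
observed_fragment_lengths = [180, 195, 210]
# ...while every read without a proper pair contributes a 0 to the sample.
zero_padding = [0] * 50

fragment_length = int(statistics.median(observed_fragment_lengths + zero_padding))
print("estimated fragment length:", fragment_length)  # -> 0

# Extending a minus-strand read by a zero-length fragment makes the fragment
# start coincide with (or pass) its end, so the assertion below fails with
# the same kind of message as in the error above.
read_end = 125_003_667          # invented coordinate
fragment_start = read_end - fragment_length
fragment_end = read_end
assert fragment_start < fragment_end, "fragment start greater than fragment end"
```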

So I think a possible solution is to take the estimated fragment size from Picard, which seems to be more robust (at least it always estimated something in these cases), and pass it as the parameter to the --extendReads option. Otherwise, we stick to 150 as an estimate.

This fragment length estimate is only used when reads are not paired or not properly paired, so I'd say the risk of adding a lot of imprecision to the result is manageable. In all other cases the fragment is built by filling the gap between the read ends, so you get the true fragment.
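
As a rough sketch of the proposed fix (the file paths, the metrics filename and the 150 bp fallback are placeholders, not the pipeline's actual configuration), something along these lines could read MEDIAN_INSERT_SIZE from the Picard CollectInsertSizeMetrics output and pass it to --extendReads:

```python
# Sketch only: fall back to Picard's median insert size for --extendReads,
# and to a fixed default when Picard could not estimate anything either.
import subprocess

DEFAULT_FRAGMENT_SIZE = 150  # placeholder fallback


def fragment_size_from_picard(metrics_path, default=DEFAULT_FRAGMENT_SIZE):
    """Parse MEDIAN_INSERT_SIZE from a Picard CollectInsertSizeMetrics file."""
    with open(metrics_path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    for i, line in enumerate(lines):
        header = line.split("\t")
        if "MEDIAN_INSERT_SIZE" in header and i + 1 < len(lines):
            values = dict(zip(header, lines[i + 1].split("\t")))
            try:
                size = float(values.get("MEDIAN_INSERT_SIZE", ""))
            except ValueError:
                break
            if size > 0:
                return int(size)
            break
    return default


fragment_size = fragment_size_from_picard("qc/insert_size_metrics.txt")  # hypothetical path
subprocess.run(
    ["bamCoverage", "-p", "20", "--binSize", "1",
     "--extendReads", str(fragment_size),
     "--scaleFactor", "69.91912350319042",
     "--bam", "mapped/bamfile.bam", "-o", "scaled/scaled.bw"],
    check=True,
)
```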

Additionally, it might be good to warn about the situation where very few reads were mapped, but I am not sure this is necessary, because that will show up everywhere in the results anyway.

simonelsasser commented May 22, 2020 via email

cnluzon commented May 22, 2020

It is a bit strange, @simonelsasser, I hadn't noticed it before. I was running a fresh mapping of Banu's human data yesterday and a bunch of BAM files crashed. It seems to happen with several barcodes. But I think that rather than a paired-read problem it's few reads in general, which in turn means few properly paired reads, the ones deepTools needs.

I think the pipeline should ideally be robust to this and just report the values.

I'm running this with the currently proposed fix (take Picard's estimate) to see whether it solves the problem on the full-size data. The test dataset ran successfully with it.

simonelsasser commented May 22, 2020 via email

cnluzon commented May 22, 2020

True, @simonelsasser, I accidentally mapped a mouse sample to hg38 😅, hence the low number of reads. However, I think it's a good finding and the pipeline should be robust to that, so I'd stick with the Picard fragment estimation for the scaled bigwigs.

If a mistake like this is made, I think it's better that the pipeline runs and outputs the actual values; the mistake should be obvious from the mapping summary. At least more obvious than the pipeline crashing because of some obscure problem in the BAM alignment.

There have also been cases where one of the samples had pretty low performance without any mistake in the mapping, and there is no reason for the pipeline to crash because of that.

marcelm commented May 25, 2020

@cnluzon Nice, great debugging! I wonder if this should be reported upstream (to the bamCoverage developers). My opinion would be that bamCoverage itself should not crash, but instead give a helpful error message ("Too few properly paired reads, could not estimate fragment size").

In any case, even when the insert size is taken from Picard’s reported metrics, it would make sense to have some sanity checks that stop the pipeline if some quality metrics are much too low. See how I did something similar in IgDiscover.
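
Something like the following is what I have in mind; it is only a generic sketch (the threshold, the path and the use of pysam's index statistics are my assumptions, not how IgDiscover does it):

```python
# Generic sanity-check sketch: stop the pipeline with a clear message when
# the fraction of mapped reads is suspiciously low, instead of letting a
# downstream tool fail with an obscure error.
import sys

import pysam

MIN_MAPPED_FRACTION = 0.10  # placeholder threshold


def check_mapped_fraction(bam_path, threshold=MIN_MAPPED_FRACTION):
    with pysam.AlignmentFile(bam_path) as bam:  # needs an index (.bai)
        mapped, unmapped = bam.mapped, bam.unmapped
    total = mapped + unmapped
    fraction = mapped / total if total else 0.0
    if fraction < threshold:
        sys.exit(
            f"Sanity check failed for {bam_path}: only {fraction:.1%} of reads are "
            f"mapped (threshold {threshold:.0%}). Was the sample mapped against "
            f"the intended reference?"
        )


check_mapped_fraction("mapped/bamfile.bam")
```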

cnluzon commented May 25, 2020

Thanks! I totally agree with this; I don't think it is very nice behavior in bamCoverage either, and the error is quite obscure. I have not opened an issue there yet because I was looking for a small reproducible example of this.

I also agree with adding sanity-check rules like that, as much QC on the output as we can get. It's true that FastQC covers many things, but the percentage of mapped reads is probably also a good indicator, and it is not collected there. A very high number of reads in the unknown barcode category during demultiplexing could perhaps also be checked for.
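
For the demultiplexing side, the check could be as simple as this sketch (the counts and the threshold are made up; the real numbers would come from the demultiplexing report):

```python
# Illustrative check: flag libraries where too many reads fall into the
# "unknown" barcode bin after demultiplexing.
import sys

MAX_UNKNOWN_FRACTION = 0.30  # placeholder threshold


def check_unknown_barcodes(barcode_counts, threshold=MAX_UNKNOWN_FRACTION):
    """barcode_counts maps each barcode (or 'unknown') to its read count."""
    total = sum(barcode_counts.values())
    fraction = barcode_counts.get("unknown", 0) / total if total else 0.0
    if fraction > threshold:
        sys.exit(f"Sanity check failed: {fraction:.1%} of reads have an unknown barcode.")


check_unknown_barcodes({"AAACTT": 120_000, "CATGTA": 95_000, "unknown": 30_000})
```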
