bamCoverage --extendReads crashes when no default fragment size is provided #17
Comments
The source of the problem seems to be that in the cases where our data crashed, the amount of mapped, properly-paired reads is very low. deepTools estimates the fragment length from properly-paired reads, and tends to add a lot of zeros to the fragment-length distribution when it doesn't find proper pairs. The fragment length is estimated as the median of these, so it ends up being 0. There are cases where this estimate is added to or subtracted from `fragmentStart` to compute the fragment, and after that it asserts `fragmentStart < fragmentEnd` and crashes.

So I think a possible solution is taking the estimated fragment size from Picard, which seems to be more robust (at least, it always estimated something in these cases), and putting it as the parameter for the `--extendReads` option. Otherwise we stick to 150 as an estimate.

This fragment-length estimate is only used when reads are *not paired or not properly paired*, so I'd say the risk of adding a lot of imprecision to the result is manageable. In all the other cases fragments are made by filling the gap between read ends, so you would get the *true* fragment.

Perhaps additionally it would be good to warn about the situation where very few reads were mapped, but I am not sure this is necessary, because that will show everywhere in the results anyway.
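To make the failure mode concrete, here is a minimal sketch of the extension logic described above (a simplified model for illustration, not deepTools' actual code): with a fragment-length estimate of 0, extending an unpaired read gives `fragmentStart == fragmentEnd`, so the assertion fails.

```python
def extend_read(read_start, read_end, is_reverse, fragment_length):
    """Simplified model of extending an unpaired / not-properly-paired read
    to an estimated fragment length (illustration only, not deepTools code)."""
    if is_reverse:
        # extend upstream from the 3' end of a minus-strand read
        fragment_start = read_end - fragment_length
        fragment_end = read_end
    else:
        # extend downstream from the 5' end of a plus-strand read
        fragment_start = read_start
        fragment_end = read_start + fragment_length
    assert fragment_start < fragment_end, "fragmentStart < fragmentEnd violated"
    return fragment_start, fragment_end

extend_read(1000, 1050, is_reverse=True, fragment_length=150)  # fine
extend_read(1000, 1050, is_reverse=True, fragment_length=0)    # AssertionError
```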
In what kind of data do we have so few paired reads? Seems strange.
It is a bit strange @simonelsasser, I hadn't noticed it before. I was running a fresh mapping of Banu's human data yesterday and a bunch of BAM files crashed. It seems to happen with several barcodes. But I think rather than a paired-reads problem it's few reads in general, which in turn means few paired reads, the ones deepTools needs. I think the pipeline ideally should be robust to this and just report the values.

I'm running this with the currently proposed fix (take Picard's defaults) to see if it solves the real-size problem. The test dataset ran successfully with it.
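For reference, the fix being tested boils down to reading the median insert size from Picard's CollectInsertSizeMetrics output and falling back to 150 bp when the table is missing or empty. A minimal sketch of such a parser (the function name, error handling, and file name are mine; it assumes Picard's standard metrics-file layout):

```python
def picard_median_insert_size(metrics_file, default=150):
    """Return MEDIAN_INSERT_SIZE from a Picard CollectInsertSizeMetrics file,
    or `default` if the metrics table is missing or empty (e.g. when almost
    no properly paired reads were found)."""
    with open(metrics_file) as fh:
        lines = [line.rstrip("\n") for line in fh]
    for i, line in enumerate(lines):
        if line.startswith("MEDIAN_INSERT_SIZE"):  # metrics header row
            header = line.split("\t")
            try:
                values = lines[i + 1].split("\t")
                median = float(values[header.index("MEDIAN_INSERT_SIZE")])
            except (IndexError, ValueError):
                break
            if median > 0:
                return int(median)
            break
    return default

# The returned value would then be passed to bamCoverage as
# --extendReads <fragment_size> in the scale_bigwig rule.
fragment_size = picard_median_insert_size("sample.insert_size_metrics.txt")
```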
Maybe input samples? Or are there some mouse samples in it, which obviously may not map well to human.
True @simonelsasser, I accidentally mapped a mouse sample to hg38 😅, hence the low number of reads. However, I think it's a good finding and the pipeline should be robust to that, so I'd stick with Picard fragment estimation for the scaled bigwig. If a mistake like this is made, I think it's better that the pipeline runs and outputs the actual values. Looking at the mapping summary it should be obvious, at least more obvious than the pipeline crashing because of some obscure problem in the BAM alignment. There have also been cases where we had pretty low performance in one of the samples without any mistake in the mapping; no reason for the pipeline to crash because of that.
@cnluzon Nice, great debugging! I wonder if this should be reported upstream (to the bamCoverage developers); my opinion would be that it should. In any case, even when the insert size is taken from Picard's reported metrics, it would make sense to have some sanity checks that stop the pipeline if some quality metrics are much too low. See how I did something similar in IgDiscover.
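A check along these lines could live as a small rule or script in the pipeline. A minimal sketch (the 5% threshold and file name are placeholders, not values from this thread), assuming pysam is available and the BAM is indexed:

```python
import sys
import pysam

def check_mapping_rate(bam_path, min_mapped_fraction=0.05):
    """Stop early with a readable message when almost nothing mapped.
    The counts come from the BAM index, so the file must be indexed."""
    with pysam.AlignmentFile(bam_path) as bam:
        mapped, unmapped = bam.mapped, bam.unmapped
    total = mapped + unmapped
    if total == 0 or mapped / total < min_mapped_fraction:
        sys.exit(f"Sanity check failed for {bam_path}: "
                 f"only {mapped} of {total} reads mapped")

check_mapping_rate("sample.dedup.bam")
```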
Thanks! I totally agree with this; I do not think it is very nice behavior in `bamCoverage`. I also agree with adding sanity-check rules like that: as much QC as we can get on the output. It's true that for many things there is FastQC, but % mapped reads is probably also a good indicator that is not collected there. Maybe also very high barcode amounts in the …
I have found `bamCoverage` crashes sometimes on the `scale_bigwig` step. It seems to be happening when `--extendReads` is provided with no default fragment size.

I think it does not come from BAM file manipulation in the duplicate marking, because I have tried to run the same deepTools command with the `mapped` equivalent BAM file (directly coming from bowtie) and it still crashes. The error I get is this one from deepTools:
However, if I go and look at the BAM file, the corresponding lines are perfectly valid, even though they have some peculiarities: the first mate's coordinate (second in the line, because it is on the minus strand, according to the 83 flag field) is indeed less than the second mate's (125003593). However, I don't think this is the real problem, because it seems like a fairly usual situation and some of the scaled bigwig files are fully produced. I also found a read whose paired mate is not mapped that also crashes.

I have also tried to isolate the reads that cause this error and look at them in IGV; they also load fine and make sense when I look at them.
The same `bamCoverage` run does not crash if I do provide a fragment size value. However, I am not sure what effect this has on the final bigwig file, so I am not sure whether this is a fix.

Another thing is that if I run several times with the same BAM file, it crashes on different read IDs when `-p` > 1. I guess this is due to the parallelization of the process, but it suggests that this is a frequent enough problem that it keeps drawing different reads that make it crash. The TLEN field is also a bit suspicious in those cases, but I need to look into it in more detail.
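For the TLEN question, a quick way to look at what a fragment-length estimate would be based on is to collect template lengths of properly paired reads (a sketch assuming pysam; the file name and read cap are placeholders):

```python
import statistics
import pysam

def tlen_summary(bam_path, max_reads=100_000):
    """Median of positive TLEN values over properly paired reads,
    i.e. roughly the distribution a fragment-length estimate draws from."""
    tlens = []
    with pysam.AlignmentFile(bam_path) as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_proper_pair and read.template_length > 0:
                tlens.append(read.template_length)
                if len(tlens) >= max_reads:
                    break
    if not tlens:
        return None  # no properly paired reads at all
    return {"n_pairs": len(tlens), "median_tlen": statistics.median(tlens)}

print(tlen_summary("sample.mapped.bam"))
```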