Threads used affects results #450
Comments
That definitely should not be happening. Can you attach the full log files
for each run?
On Tue, Feb 9, 2021 at 3:08 AM Keiran Raine wrote:
Hi,
This may be a known issue or specific to the version I'm running, but I
couldn't see any existing issues relating to this. Docker image downloaded from
dockerhub - version 2.9.4.
I've been running some tests to evaluate the most efficient way to deploy
gridss on our system (different numbers of threads, etc.).
I'm finding that the calls are varying considerably depending on the
number of threads used:
- output-2 = 2 threads
- output-4 = 4 threads
- output-6 = 6 threads
- output-8 = 8 threads
$ comm -23 <(zcat output-2/test.vcf.gz | cut -f 1-5 | sort) <(zcat output-4/test.vcf.gz | cut -f 1-5 | sort) | wc -l
46039
$ comm -23 <(zcat output-2/test.vcf.gz | cut -f 1-5 | sort) <(zcat output-6/test.vcf.gz | cut -f 1-5 | sort) | wc -l
720
$ comm -23 <(zcat output-2/test.vcf.gz | cut -f 1-5 | sort) <(zcat output-8/test.vcf.gz | cut -f 1-5 | sort) | wc -l
0
Along with the difference in calls I noted the following:
1. The VCFs aren't equally sorted. Extending the sort to REF/ALT ensures
stable outputs when multiple events occur at the same position (see the sort
sketch after this list).
2. Although 2 threads vs 8 threads looks to match, there are differences in
the genotype column.
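For reference, a minimal sketch of such a stabilised comparison: sort the VCF body on CHROM, POS, REF and ALT so that records at the same position always appear in the same order before diffing. The helper name stable_body is made up for illustration; the file names are the ones from the runs above.
# sort the full records (genotype columns included) on CHROM, POS, REF, ALT
stable_body() { zcat "$1" | grep -v '^#' | sort -k1,1 -k2,2n -k4,4 -k5,5; }
diff <(stable_body output-2/test.vcf.gz) <(stable_body output-8/test.vcf.gz) | head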
Executed via:
export CPUS=8 # this value changed for each thread count
rm -rf output-${CPUS} output-${CPUS}-tmp
mkdir -p output-${CPUS} output-${CPUS}-tmp
gridss.sh \
  --reference $PWD/ref/genome.fa \
  --blacklist $PWD/ref/blacklist.bed \
  --threads $CPUS \
  --labels test \
  --assembly $PWD/output-${CPUS}/test.assembly.bam \
  --output $PWD/output-${CPUS}/test.vcf.gz \
  --workingdir output-${CPUS}-tmp \
  $PWD/inputs/INPUT.bam
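For completeness, a hedged sketch of driving the four runs from a single loop instead of editing CPUS by hand; the loop wrapper is not part of the original commands, and the gridss.sh call is exactly the one above.
for CPUS in 2 4 6 8; do
  rm -rf output-${CPUS} output-${CPUS}-tmp
  mkdir -p output-${CPUS} output-${CPUS}-tmp
  # ... gridss.sh invocation as above, with --threads $CPUS and the
  # output/assembly/workingdir paths keyed on ${CPUS} ...
done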
Input data is mapped with […]. Attached log: 2-cpu-gridss.full.20210205_152922.node-11-6-4.58859.log
Doing regression testing to see if this is a symptom of #363, or a more widespread issue.
I am able to confirm that the first point of divergence is the .sv.bam files.
Are you using a version of bwa that is not stable w.r.t. the number of threads used? See https://www.biostars.org/p/90390/. GRIDSS calls bwa during preprocessing to identify split reads (e.g. bwa does not report all split reads & bowtie2 does not do split read alignment at all). Differences in the alignment results returned will have a downstream impact on the GRIDSS SV calls.
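A hedged way to test that hypothesis outside GRIDSS: realign the same reads with two different thread counts and checksum the alignment fields. The reads.fq and reference paths below are placeholders, not files from this thread.
# compare alignment fields (QNAME..TLEN, which includes MAPQ) across thread counts
bwa mem -t 2 ref/genome.fa reads.fq | grep -v '^@' | cut -f 1-9 | sort | md5sum
bwa mem -t 8 ref/genome.fa reads.fq | grep -v '^@' | cut -f 1-9 | sort | md5sum
# differing checksums would point at bwa itself rather than at GRIDSS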
We are using the image you have placed on dockerhub: docker pull gridss/gridss:2.9.4. The input is the same file for each execution; the only variable is the number of threads passed to GRIDSS.
I should mention we are moving to 2.10.2 (or greater); we are not fixed on the 2.9.x branch.
Just to clarify: are the .sv.bam differences in the content of the SAM records themselves, or just md5 differences caused by the bwa program group header differing due to different paths / thread counts?
Ok, the root cause is that bwa 0.7.17-r1188 mapq scores are not stable w.r.t. the number of threads used to run bwa.
It is the content of the SAM records themselves (2 cpu vs 4 cpu).
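For reference, a hedged sketch of a header-independent comparison that separates those two cases; the .sv.bam paths below are guesses based on the --workingdir layout used above, not taken from the attachment.
# samtools view omits the header, so the bwa @PG line cannot affect the checksum
samtools view output-2-tmp/INPUT.bam.gridss.working/INPUT.bam.sv.bam | sort | md5sum
samtools view output-4-tmp/INPUT.bam.gridss.working/INPUT.bam.sv.bam | sort | md5sum
# matching checksums => only the header differs; mismatching => the records themselves differ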
Ok, the root cause is that bwa is not stable w.r.t. thread count due to its dynamic batch size.
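For context, the usual bwa-level mitigation for this behaviour is the -K option, which pins the number of input bases processed per batch so the output no longer depends on how work is split across threads. This is only an illustration of the underlying bwa behaviour (reads.fq and the reference path are placeholders); GRIDSS invokes bwa internally, so the actual change is the one on the dev branch mentioned below.
# -K fixes the per-batch input size (in bases), making bwa mem output deterministic w.r.t. -t
bwa mem -K 100000000 -t $CPUS ref/genome.fa reads.fq > aln.sam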
Just checking I understand. The […] Is there a timeline for a release/hotfix with the VCF-order fix and this change to be pushed out, or the possibility of a docker image being made available for testing? Thanks
Fix is on the dev branch, but you still need to use […]. ETA for next release is end of next week.
Can we get an updated ETA on a versioned release please?
I have a release candidate for v2.11.0 already prepared. Release notes have been written and I'm currently doing regression testing. Should be a few days if all goes as expected, or within a week if I pick up any regression bugs.