fastANI skips reference genomes #58

Closed · AlmogAngel opened this issue Jan 5, 2020 · 21 comments
AlmogAngel commented Jan 5, 2020

Hi, thank you for this wonderful tool.

I've been using fastANI for a while now, and I've noticed a bug that I find hard to explain or demonstrate, but I will do my best :)

When I run fastANI with multiple threads on long reference and query lists, it seems to skip some reference genomes (they do not appear in the result file at all).

For example, when I compare one genome against my full reference database (~117K genomes) with -t 1, I get back 2946 hits with ANI >= 95 (same species).

However, when I take multiple genomes (~3000) to compare against my reference, the same genome from the previous example gets only 2780 hits with ANI >= 95, and I couldn't find the remaining 166 anywhere in the results.

To validate that they indeed have an ANI value, I ran the same genome again against the 166 missing hits (-t 1) and got back the expected ANI results (~98 ANI).
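(For concreteness, that validation run might have looked something like this; query_genome.fna and missing_166.txt are hypothetical file names.)

    # Re-run the single query against only the 166 missing references,
    # single-threaded, to confirm they do have ANI values >= 95.
    fastANI -q query_genome.fna --rl missing_166.txt -t 1 -o recheck.txt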

In addition, I tried splitting my reference dataset into files of 5K genomes each, but the problem remains.

I will be glad to provide more information if needed, thanks!

AlmogAngel reopened this Jan 5, 2020
cjain7 (Member) commented Jan 5, 2020

Is this occurring with FastANI v1.3?

AlmogAngel (Author) commented Jan 6, 2020

> Is this occurring with FastANI v1.3?

No, I'm still working with v1.1 because my primary results are based on that version.

I am running the same analysis with v1.3 to check whether it solves the problem.

cjain7 (Member) commented Jan 6, 2020

Cool, I fixed some bugs in 1.3; hopefully that will resolve it.

HAugustijn commented:

I encounter the same problem using version 1.3. Every time I run fastANI, I get a different amount of output.

AlmogAngel (Author) commented Jan 7, 2020

> Cool, I fixed some bugs in 1.3; hopefully that will resolve it.

Unfortunately, v1.3 did not solve the bug.

Results summary:
version 1.1:
1 genome vs. 117K genomes with 1 thread -> 2946 hits (>= 95 ANI)
~3000 queries (including the same genome) vs. 117K (split into 5K genomes chunks) with 10 threads per run -> 2780 hits (>= 95 ANI)

version 1.3:
1 genome vs. 117K genomes with 1 thread -> 2962 hits (>= 95 ANI)
~3000 queries (including the same genome) vs. 117K (split into 5K genomes chunks) with 10 threads per run -> 2944 hits (>= 95 ANI)

I also ran the same genome vs. the remaining 18 missing hits, and they all got ANI 97-99. So far, the most reliable results come from using v1.3 with 1 thread.

cjain7 (Member) commented Jan 7, 2020

Thanks for sharing the details. I'll try to figure out the cause this week. Meanwhile, please feel free to share the genomes (a small set if possible) with which I can reproduce this. I also have some data from #57.

cjain7 (Member) commented Jan 7, 2020

One follow-up question: are you using a different chunk size parameter for the two runs (1 vs. all and 3000 vs. all)?

AlmogAngel (Author) commented:

> Thanks for sharing the details. I'll try to figure out the cause this week. Meanwhile, please feel free to share the genomes (a small set if possible) with which I can reproduce this. I also have some data from #57.

I am afraid you cannot reproduce the problem with a small set of genomes; it only occurs with a large number of genomes in the query/reference.

> One follow-up question: are you using a different chunk size parameter for the two runs (1 vs. all and 3000 vs. all)?

What do you mean by the chunk size parameter?
To chunk my reference, I run the bash command split -l 5000 on the reference list file, then run all the chunks in parallel and merge the results once they finish.
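(For readers following along, a minimal sketch of that chunked workflow; the list file names ref_list.txt and query_list.txt are hypothetical.)

    # Split the reference list into 5000-line chunks, run fastANI on each
    # chunk in parallel (10 threads per run), then merge the outputs.
    split -l 5000 ref_list.txt ref_chunk_
    for chunk in ref_chunk_*; do
        fastANI --ql query_list.txt --rl "$chunk" -t 10 -o "out_${chunk}.txt" &
    done
    wait
    cat out_ref_chunk_*.txt > combined_results.txt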

cjain7 (Member) commented Jan 7, 2020

Got it!
I had misinterpreted the meaning of 5K genome chunks earlier.

cjain7 (Member) commented Jan 12, 2020

I'm still trying to work out why this could be happening. For now, my guess is that there is an integer overflow somewhere in the code: I may be using an int type to store a value that exceeds 2 billion with large genome counts.

In your run with 117K genomes, can you give me some more statistics to narrow down the problem:

  1. What is the total count of contigs across the 117K genomes? For example, genome 1 may have 5 contigs, genome 2 may have 30 contigs, and so on; I'm looking for the total sum (i.e., 5 + 30 + ...). One way to compute this is sketched after this list.

  2. Can you also share what happens if you run with 2 or 4 threads? You mentioned that you get the best output with a single thread. I want to know how much more unreliable the results become with more threads; I'm wondering if there is a bug in the multi-threading.

  3. Please send the complete output logs associated with your experiments (1 genome vs. 117K genomes with 1 thread, and 1 genome vs. 117K genomes with >1 threads). If the ANI output files are not too big, I'd also like those. Please feel free to dump all of these in a single folder and send it to me by email if that's more convenient.

Please make sure you're using v1.3 for the above. Thanks again for your help!
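(A minimal sketch for question 1, assuming ref_list.txt holds one FASTA path per line:)

    # Sum the '>' header lines across every genome in the list.
    total=0
    while read -r genome; do
        n=$(grep -c '^>' "$genome")
        total=$((total + n))
    done < ref_list.txt
    echo "Total contigs: $total"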

AlmogAngel (Author) commented:

> In your run with 117K genomes, can you give me some more statistics to narrow down the problem: [...]

I got the results; sending them to you via email.

cindy-tu commented:

Hello,

Thank you for the wonderful tool!

I don't know if it helps, but I just want to report that I am experiencing the same/similar thing with 4 threads, and to provide more data for your reference.

I ran 2 trials using the exact same command with the exact same input: 1x500 comparisons repeated for 500 queries, with the results concatenated. I am using v1.3. Unfortunately, I no longer have the log files.

fastANI -q query.fasta --rl ref_files.txt --fragLen 1500 -o query.out -t 4
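(The per-query loop described above might have looked something like this; query_files.txt and the output naming are hypothetical.)

    # Run each of the 500 queries against the same 500 references,
    # then concatenate the per-query outputs into one trial file.
    while read -r query; do
        fastANI -q "$query" --rl ref_files.txt --fragLen 1500 -t 4 -o "$(basename "$query").out"
    done < query_files.txt
    cat ./*.out > trial1_results.txt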

Both the 1st and 2nd trials gave 89,694 rows of results (the discrepancy was much greater with version 1.1). Below are the differences between the two files.

Here are rows found in trial 1 but not in trial 2

header: query seq, ref seq, ANI value, orthologous fragment count, total fragment count

cpxv-jagkre08_1          vacv-lister_btn          97.6994  65   144
cpxv-marlei07_1          mpxv-pch                 95.7457  126  141
cpxv-marlei07_1          vacv-lister              96.6591  122  141
**myxv-coomandook_13     myxv-coomandook_13       100      101  108**
myxv-coomandook_13       myxv-cornwall            99.8764  100  108
myxv-olympic_dam_2015    myxv-tol08_18            99.8012  101  108
myxv-perthshire_1818     myxv-port_augusta_2014   99.798   100  107
vacv-agr_mva_572pre      vacv-agr_mva_572pre      100      32   110
vacv-agr_mva_572pre      vacv-dryvax-dpp10        99.1093  56   110
**vacv-l-ivp_14          varv-ben68               96.2336  117  129**
vacv-lc16m8              cmlv-151v                97.1137  96   126
vacv-lc16m8              cpxv-k780                97.9983  116  126
vacv-lc16m8              cpxv-kos_2015            97.9747  49   126
vacv-lister              varv-gin69               96.291   117  126
varv-bgd74_nur           varv-eth72_16            99.879   123  124
varv-ner69               vacv-cva                 96.6914  82   124
varv-ner69               vacv-dryvax-dpp21        96.5086  94   124
varv-tza65               vacv-dryvax-dpp11        96.3506  115  123
varv-tza65               vacv-ihdw1               96.2555  117  123
varv-tza65               varv-pak_1969            99.8979  122  123
varv-zaf65_102           varv-jpn46_yam           99.8268  38   124

Here are rows found in trial 2 but not in trial 1

**cpxv-humber07_1        vacv-bra_serro_2         97.4106  119  134**
cpxv-jagkre08_1          vacv-lister_btn          97.7175  123  144
cpxv-marlei07_1          mpxv-pch                 96.3041  102  141
cpxv-marlei07_1          vacv-lister              96.56    64   141
**mpxv-sle               vacv-tp5                 96.9057  116  132**
myxv-coomandook_13       myxv-cornwall            99.9796  101  108
myxv-olympic_dam_2015    myxv-tol08_18            99.7989  50   108
myxv-perthshire_1818     myxv-port_augusta_2014   99.6614  32   107
vacv-agr_mva_572pre      vacv-agr_mva_572pre      100      108  110
vacv-agr_mva_572pre      vacv-dryvax-dpp10        99.1229  108  110
vacv-lc16m8              cmlv-151v                96.7615  119  126
vacv-lc16m8              cpxv-k780                97.6604  118  126
vacv-lc16m8              cpxv-kos_2015            98.0925  122  126
vacv-lister              varv-gin69               95.4958  32   126
varv-bgd74_nur           varv-eth72_16            99.8782  64   124
varv-ner69               vacv-cva                 96.3883  114  124
varv-ner69               vacv-dryvax-dpp21        96.2411  119  124
varv-tza65               vacv-dryvax-dpp11        96.4069  114  123
varv-tza65               vacv-ihdw1               96.3677  67   123
varv-tza65               varv-pak_1969            99.9018  117  123
varv-zaf65_102           varv-jpn46_yam           99.8853  124  124

Rows marked with ** are comparisons found in one file but not the other.

Additionally, while the ANI values may not differ much for some pairs, there are larger differences in the orthologous fragment counts. E.g., for vacv-lister -> varv-gin69, the fragment count dropped from 117 in trial 1 to 32 in trial 2.
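(For reference, row differences like those listed above can be extracted with standard tools; trial1.out and trial2.out are hypothetical names for the two concatenated result files.)

    # Rows unique to either trial:
    sort trial1.out trial2.out | uniq -u
    # Direction-specific differences:
    comm -23 <(sort trial1.out) <(sort trial2.out)   # in trial 1 only
    comm -13 <(sort trial1.out) <(sort trial2.out)   # in trial 2 only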

cjain7 (Member) commented Jan 19, 2020

> Unfortunately, v1.3 did not solve the bug. [...] So far, the most reliable results come from using v1.3 with 1 thread.

@AlmogAngel, I've made a fix in the code today. I am curious to know whether that change resolves this issue. Would it be possible for you to download the latest code from the master branch and run the above experiment again?
Thank you!

AlmogAngel (Author) commented:

> @AlmogAngel, I've made a fix in the code today. [...] Would it be possible for you to download the latest code from the master branch and run the above experiment again?

Unfortunately, the new code performed worse.

Here are my analysis results:

  1. 1 genome vs. 117K with different numbers of threads (v1.3); a driver sketch for this sweep follows the list:

     # of threads   # of hits
     1              5546
     2              5546
     4              5546
     7              5546
     10             5546
     15             5546
     17             5546
     20             5519
     90             4401

  2. I tried 20 threads again, since I suspected it to be the cutoff at which the bug appears. However, this time I got 5546 hits instead of 5519.

  3. I did the same run with 90 threads, this time with the new code you just uploaded to the master branch, and got 3639 hits.
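(A minimal driver for the sweep in point 1 might look like this; the file names are hypothetical, and hits are counted as output rows with ANI >= 95 in column 3 of fastANI's output.)

    for t in 1 2 4 7 10 15 17 20 90; do
        fastANI -q query_genome.fna --rl ref_list.txt -t "$t" -o "sweep_t${t}.txt"
        awk -v t="$t" '$3 >= 95 { n++ } END { print t, n }' "sweep_t${t}.txt"
    done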

cjain7 (Member) commented Jan 21, 2020

I see.

What happens when you keep the number of threads fixed and run:
1 genome vs. 117K genomes
3000 queries (including the same genome) vs. 117K?

bkille commented Apr 20, 2020

I too am experiencing this behavior, and I've noticed something odd as well. I am doing all-pairs ANI calculations using a file of 47 genomes. When running with 1 thread, I get the expected results every time. If I run with 60 threads, I am at most missing 1 line from the final output file; same if I run with 2 threads. However, if I run with 30 threads, I start missing 100 or more lines from the output most times.

cjain7 (Member) commented Apr 20, 2020

@bkille, would it be possible to share the data and your scripts?

bkille commented Apr 28, 2020

@cjain7, sorry for the delayed response.

mers49.zip
Running fastANI --rl mers_files.txt --ql mers_files.txt -o fastani_out.txt with different thread counts.

(PS: is there a way to tell fastANI to do pairwise calculations from one input list? It would save me half the time.)

cjain7 (Member) commented May 29, 2020

@bkille

I tested fastANI on the genomes you shared above on two different clusters I have access to. I varied the thread count over 1, 2, 4, 8, 16, 32, 64, and 128. In each case, I'm getting consistent output.

$ wc -l */output.txt
   2304 t_128/output.txt
   2304 t_16/output.txt
   2304 t_1/output.txt
   2304 t_2/output.txt
   2304 t_32/output.txt
   2304 t_4/output.txt
   2304 t_64/output.txt
   2304 t_8/output.txt
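(A driver consistent with the directory layout above might have looked like this; a reconstruction, not the exact script used.)

    for t in 1 2 4 8 16 32 64 128; do
        mkdir -p "t_${t}"
        fastANI --ql mers_files.txt --rl mers_files.txt -t "$t" -o "t_${t}/output.txt"
    done
    wc -l t_*/output.txt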

Maybe you can try running on a different computer at your end; I'm not sure what's going on. A similar issue was reported in #37.

cjain7 (Member) commented May 29, 2020

I also checked random pairs of genomes in each output file (corresponding to the different thread counts); I'm seeing consistent output values at my end.

$ grep -E "Wadi-Ad-Dawasir_1_2013.*Al-Hasa_17_2013.fna" t_*/output.txt

t_128/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10
t_16/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10
t_1/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10
t_2/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10
t_32/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10
t_4/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10
t_64/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10
t_8/output.txt:/home/jainc2/debug_fastani/MrTomRod/mers49/Wadi-Ad-Dawasir_1_2013.fna	/home/jainc2/debug_fastani/MrTomRod/mers49/Al-Hasa_17_2013.fna	99.7004	10	10

PS: I'm using the latest version of FastANI, cloned directly from the master branch.

$ ./FastANI/fastANI -v
version 1.3

cjain7 (Member) commented Jul 11, 2020

See #67; this issue should be fixed by the code change 902ce0a.
(cc @cindy-tu , @AlmogAngel, @HAugustijn)

cjain7 closed this as completed Jul 11, 2020