fastANI skips reference genomes #58
Comments
Is this occurring with FastANI v1.3? |
No, I'm still working with v1.1 because my primary results are based on this version. I am running the same analysis with v1.3 to check whether it solves the problem. |
Cool, I fixed some bugs in 1.3, so hopefully this resolves it. |
I encounter the same problem while using version 1.3. Every time I run fastANI, I get a different number of output rows. |
Unfortunately, v1.3 did not solve the bug. Results summary: version 1.3: I also ran the same genome against the 18 remaining missing genomes above, and they all got ANI 97-99. So far, the most reliable results seem to come from using v1.3 with 1 thread. |
Thanks for sharing the details. I'll try to figure out the cause this week. Meanwhile, please feel free to share the genomes (a small set if possible) with which I can reproduce this. |
I have one follow-up question here: are you using a different chunk size parameter for the two runs (1 vs. all and 3000 vs. all)? |
I am afraid that you cannot reproduce the problem with a small set of genomes; it occurs when using a large number of genomes in the query/reference lists.
What do you mean by the chunk size parameter? |
Got it! |
I'm still trying to reason why this could be happening... for now I'm guessing there may be integer overflow happening in the code.. I may be using In your run with 117K genomes, can you give me some more statistics to narrow down the problem:
Please make sure you're using v1.3 for the above. Thanks again for your help! |
I got the results, sending you via email. |
Hello,
Thank you for the wonderful tool! I don't know if it helps, but I want to report that I am experiencing the same or a similar problem with 4 threads, and to provide more data for your reference.
I ran 2 trials using the exact same command with the exact same input. Each trial was a 1x500 comparison repeated for 500 queries, with the results concatenated. I am using v1.3. Unfortunately, I don't have the log files anymore.
fastANI -q query.fasta --rl ref_files.txt --fragLen 1500 -o query.out -t 4
Both the 1st and 2nd trials gave 89,694 rows of results (the discrepancy was much greater with version 1.1). Below are the differences between the two files.
Here are rows found in trial 1 but not in trial 2.
Header: query seq, ref seq, ANI value, orthologous fragment count, total fragment count
Here are rows found in trial 2 but not in trial 1
Rows with ** are additional comparisons found in one file but not the other. Additionally, while the ANI values may not differ much for some pairs, there are greater differences in the orthologous fragment count. E.g. for vacv-lister -> varv-gin69, the fragment count dropped from 117 to 32 in trial 2. |
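The trial-to-trial comparison described above can be sketched as a small script. This is only an illustration, not part of fastANI: it assumes the output is tab-separated with the five columns listed in the header above, and the function names `load_fastani` and `diff_runs` are made up for this example.

```python
import csv


def load_fastani(path):
    """Map (query, reference) -> (ANI, orthologous fragments, total fragments)."""
    rows = {}
    with open(path, newline="") as handle:
        for rec in csv.reader(handle, delimiter="\t"):
            if len(rec) < 5:
                continue  # skip blank or malformed lines
            query, ref, ani, matched, total = rec[:5]
            rows[(query, ref)] = (float(ani), int(matched), int(total))
    return rows


def diff_runs(path_a, path_b):
    """Return pairs found only in A, pairs found only in B, and pairs whose values differ."""
    a, b = load_fastani(path_a), load_fastani(path_b)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    changed = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_a, only_b, changed
```

Calling `diff_runs("trial1.out", "trial2.out")` (hypothetical file names) separates comparisons that are missing from one run from comparisons that are present in both but report a different ANI or fragment count.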
@AlmogAngel, I've made a fix in the code today.. I am curious to know if that change resolves this issue or not. Would it be possible for you to download the latest code from master branch and run the above experiment again? |
Unfortunately, the new code performed worse. Here are my analysis results:
|
I see, what happens when you keep #threads fixed, and do |
I too am experiencing this behavior. I've noticed something odd as well. I am doing all pairs ANI calculations using a file of 47 genomes. When running with 1 thread, I get the expected results every time. If I run with 60 threads, I am at most missing 1 line from the final output file. Same if I run with 2 threads. However if I run with 30 threads, then I start missing 100 or more lines from the output most times. |
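A thread-count sweep like the one described above could be checked with a sketch along these lines. It assumes the all-pairs run passes the same genome list to both `--ql` and `--rl`, that the 1-thread output is the trusted baseline, and that the output is tab-separated with the query and reference in the first two columns. The helper names are hypothetical.

```python
def fastani_cmd(genome_list, out_path, threads):
    """Build an all-pairs fastANI command (same list as query and reference)."""
    return ["fastANI", "--ql", genome_list, "--rl", genome_list,
            "-o", out_path, "-t", str(threads)]


def pair_set(path):
    """Collect the (query, reference) pairs present in one output file."""
    pairs = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.add((fields[0], fields[1]))
    return pairs


def missing_by_threads(outputs, baseline_threads=1):
    """outputs maps thread count -> output path; count pairs absent vs. the baseline run."""
    baseline = pair_set(outputs[baseline_threads])
    return {t: len(baseline - pair_set(p)) for t, p in sorted(outputs.items())}
```

Each command from `fastani_cmd` can be passed to `subprocess.run`, and `missing_by_threads({1: "t1.out", 30: "t30.out"})` (hypothetical paths) then reports how many baseline pairs each run dropped.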
@bkille , possible to share the data and your scripts? |
@cjain7, sorry for the delayed response. mers49.zip (PS is there a way to tell fastANI to do pairwise calculations from one input list? It would save me half the time) |
I tested fastANI on the genomes you shared above on two different clusters I have access to. I varied the thread count over 1, 2, 4, 8, 16, 32, 64 and 128. In each case, I'm getting consistent output.
Maybe you could try running on a different computer at your end... I'm not sure what's going on. A similar issue was reported in #37 |
I also checked random pairs of genomes in each output file (corresponding to different thread counts), I'm seeing consistent output values at my end.
PS: I'm using the latest version of FastANI directly cloned from master branch.
|
See #67, this issue should be fixed by the code change 902ce0a |
Hi, thank you for this wonderful tool.
I've been using fastANI for a long period now and I've noticed a bug which I find hard to explain or demonstrate, but I will do my best :) :
When I run fastANI with multiple threads on long reference and query lists, it seems to skip some reference genomes (they do not appear in the result file at all).
For example, when I use one genome vs. all my reference database (~117K genomes) with -t 1 I get back 2946 hits with ANI >= 95 (same species).
However, when I take multiple genomes (~3000) to compare with my reference, the same genome from the previous example gets only 2780 hits with ANI >=95 and I couldn't find the remaining 166 anywhere in the results.
To validate that they indeed have an ANI value, I ran the same genome again with the 166 missing hits (-t 1) and I got back the appropriate ANI results (~98 ANI).
In addition, I tried splitting my reference dataset into files of 5K genomes each, but the problem remains.
I will be glad to provide more information if needed, thanks!
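The check described above (find the hits that the single-query run reports but the multi-query run lacks, then re-run only those) could be scripted roughly as follows. This is a sketch under assumptions: tab-separated output with query, reference and ANI in the first three columns, and hypothetical helper names and file paths.

```python
def hits_at_or_above(path, query, min_ani=95.0):
    """References that `query` hits with ANI >= min_ani in one output file."""
    refs = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[0] == query and float(fields[2]) >= min_ani:
                refs.add(fields[1])
    return refs


def write_missing_refs(single_out, batch_out, query, list_path, min_ani=95.0):
    """Write references hit in the single-query run but not in the batch run.

    A reference counts as missing if it is absent from the batch output
    or reported there with an ANI below min_ani.
    """
    missing = sorted(hits_at_or_above(single_out, query, min_ani)
                     - hits_at_or_above(batch_out, query, min_ani))
    with open(list_path, "w") as handle:
        handle.writelines(ref + "\n" for ref in missing)
    return missing
```

The file written to `list_path` has one reference path per line, so it can be passed to `--rl` for a `-t 1` re-run of the same query.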