ftp errors when using meta.retrieval #12
Comments
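Hi Nicole,
Thank you so much for contacting me and making me aware of the problem. I really try my best to provide a useful and fully functional tool, and only thanks to feedback like yours can I improve biomartr.
I am also in contact with Akshaya Ramesh (@ARamesh123), who pointed out a similar problem to me and who really helped me a lot to troubleshoot and find the bug.
To give you a short answer to your question concerning the re-download of genomes which couldn't be downloaded: if you install the developer version of biomartr, this re-download functionality is now included and downloads don't start all over again. As soon as I have fixed this issue of downloading thousands of bacterial genomes, and after passing the on-boarding process to rOpenSci, I will submit the new biomartr version to CRAN.
You can download the developer version by typing:
# install the current version of biomartr on your system
source("http://bioconductor.org/biocLite.R")
biocLite("HajkD/biomartr")
Following are the problems Akshaya found:
I have no problem while downloading smaller datasets, e.g. all bacterial GenBank sequences from the subgroup Cyanobacteria.
I still get timeouts on larger processes. Sometimes it just keeps running without any error message and stalls, and sometimes I get an error message saying:
Error in open.connection(con, open = mode) : Timeout was reached
Calls: <Anonymous> ... <Anonymous> -> curl_connection -> open -> open.connection
Execution halted
The interesting thing is that I ran the command for different downloads (different subgroups) on different machines and the timeout was reached at the SAME time, so there was some miscommunication with the ftp server at one moment, leading to a timeout at the same time.
The meta.retrieval function works and re-starts the download where it originally dropped off, but I was still not able to download all files for larger subgroups, e.g. Bacteroidetes.
When assessing the problems Akshaya pointed out to me, it seems that there could be a problem on the server side when running hundreds or thousands of access queries to NCBI. The fact that the timeout is always reached at the same time suggests that some kind of query counter per IP address is implemented on the NCBI server side. I will screen the NCBI guidelines more closely and maybe write an email to the NCBI server maintainers.
What I can do from my side is to try to stop the download whenever I don't receive server feedback anymore, and then the user (after a while) can try to re-run the meta.retrieval function and start downloading from where they left off, just as you and Akshaya proposed.
I will start working on that now and will come back to you as soon as I have found a good solution.
I apologize for the inconvenience the server timeout issue might have caused when using biomartr, but I hope that the package will be useful once this issue is solved.
Many thanks and best wishes,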
Hi Hajk,
thanks for your fast reply. I do remember that NCBI doesn’t like a lot of single blast searches so I’m guessing you are right that they restrict the number of connections somehow. Maybe there is something like their bulk web access where you send a list of IDs and retrieve all of them in one go?
It seems to work if I break the bigger datasets down into chunks of 100 and use a for loop just like in your meta.retrieval function.
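The chunking workaround could look roughly like the following base R sketch. The organism names are made up, and `download_one()` is a hypothetical placeholder for whatever download call is used per organism; it is not a biomartr function.

```r
# Hypothetical stand-in for a long list of organism names.
organisms <- paste0("organism_", seq_len(250))

# Split the vector into consecutive chunks of at most `chunk_size` entries.
chunk_size <- 100
chunks <- split(organisms, ceiling(seq_along(organisms) / chunk_size))

for (i in seq_along(chunks)) {
  message("Chunk ", i, ": ", length(chunks[[i]]), " organisms")
  # for (org in chunks[[i]]) download_one(org)  # hypothetical download call
  # Sys.sleep(60)  # optional pause between chunks; the duration is a guess
}
```

Whether a pause between chunks actually helps depends on how the server limits connections, which is not known here.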
I was also thinking that sorting the bacteria into folders by phylum would be really useful for phylogenetic studies. I talked to a collaborator yesterday and we’d like to do an analysis which gives us a result per phylum. I know this is not a trivial thing to do. There is the taxonomy database at NCBI, and it should be possible to retrieve the full lineage of an organism using its ID, but as far as I can remember the lineages are not very standardised. For example, you’d want the same number of entries for each lineage, but there are cases where somebody has inserted a subgroup of something and then it doesn’t work anymore. So I’m not sure if this would be doable.
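One way around lineages of unequal length is to look up the phylum by its rank label rather than by its position in the lineage. The sketch below assumes each lineage is available as a data frame with `rank` and `name` columns; the three lineages are illustrative, hand-written data, not NCBI output, and `get_phylum()` is a hypothetical helper.

```r
# Illustrative lineages of different lengths (hand-written, not fetched from NCBI).
lineages <- list(
  "Escherichia coli" = data.frame(
    rank = c("superkingdom", "phylum", "class", "order"),
    name = c("Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"),
    stringsAsFactors = FALSE
  ),
  "Nostoc punctiforme" = data.frame(
    rank = c("superkingdom", "no rank", "phylum", "order"),
    name = c("Bacteria", "Terrabacteria group", "Cyanobacteria", "Nostocales"),
    stringsAsFactors = FALSE
  ),
  "Unplaced isolate" = data.frame(
    rank = c("superkingdom", "no rank"),
    name = c("Bacteria", "unclassified Bacteria"),
    stringsAsFactors = FALSE
  )
)

# Pick the entry labelled "phylum", wherever it sits in the lineage.
get_phylum <- function(lineage) {
  hit <- lineage$name[lineage$rank == "phylum"]
  if (length(hit) == 0) NA_character_ else hit[1]
}

phyla <- vapply(lineages, get_phylum, character(1))
phyla[is.na(phyla)] <- "unclassified"

# Organism names grouped by phylum, e.g. to decide on a target folder.
by_phylum <- split(names(phyla), phyla)
```

Organisms without a phylum entry end up in an "unclassified" group instead of breaking the grouping.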
I’ll definitely recommend your package to others!
Cheers,
Nicole
Dr. Nicole Gruenheit
Research Associate
Faculty of Biology, Medicine and Health
Michael Smith Building
Oxford Road, Manchester, M13 9PT
The University of Manchester
http://thethompsonlab.wordpress.com/
On 20 Apr 2017, at 16:06, Hajk-Georg Drost wrote:
Hi Nicole,
Thank you so much for contacting me and making me aware of the problem. I really try my best to provide a useful and fully functional tool and thus only thanks to feedback like yours I can improve biomartr.
I am also in contact with Akshaya Ramesh (@ARamesh123, https://github.com/ARamesh123), who pointed out a similar problem (#6) to me and who really helped me a lot to troubleshoot and find the bug.
To give you a short answer to your question concerning the re-download of genomes which couldn't be downloaded: if you install the developer version of biomartr, this re-download functionality is now included and downloads don't start all over again. As soon as I have fixed this issue of downloading thousands of bacterial genomes, and after passing the on-boarding process (ropensci/software-review#93) to rOpenSci, I will submit the new biomartr version to CRAN.
You can download the developer version by typing:
# install the current version of biomartr on your system
source("http://bioconductor.org/biocLite.R")
biocLite("HajkD/biomartr")
Following are the problems Akshaya found:
I have no problem while downloading smaller datasets, e.g. all bacterial GenBank sequences from the subgroup Cyanobacteria.
I still get timeouts on larger processes. Sometimes it just keeps running without any error message and stalls, and sometimes I get an error message saying:
Error in open.connection(con, open = mode) : Timeout was reached
Calls: <Anonymous> ... <Anonymous> -> curl_connection -> open -> open.connection
Execution halted
The interesting thing is that I ran the command for different downloads (different subgroups) on different machines and the timeout was reached at the SAME time, so there was some miscommunication with the ftp server at one moment, leading to a timeout at the same time.
The meta.retrieval function works and re-starts the download where it originally dropped off, but I was still not able to download all files for larger subgroups, e.g. Bacteroidetes.
When assessing the problems Akshaya pointed out to me, it seems that there could be a problem on the server side when running hundreds or thousands of access queries to NCBI. The fact that the timeout is always reached at the same time suggests that some kind of query counter per IP address is implemented on the NCBI server side. I will screen the NCBI guidelines more closely and maybe write an email to the NCBI server maintainers.
What I can do from my side is to try to stop the download whenever I don't receive server feedback anymore, and then the user (after a while) can try to re-run the meta.retrieval function and start downloading from where they left off, just as you and Akshaya proposed.
I will start working on that now and will come back to you as soon as I have found a good solution.
I apologize for the inconvenience the server timeout issue might have caused when using biomartr, but I hope that the package will be useful once this issue is solved.
Many thanks and best wishes,
Hajk
Hi Nicole,
Thank you so much for your fast response :)
I just found a bit of documentation that corresponds to your great suggestion. I also see that, for constructing the queries to do so, NCBI requires the common names of organisms. Anyway, I will see what I can do :) Maybe I can also get in contact with some NCBI people to see if they can help me out with that one.
Concerning your request to include the NCBI Taxonomy for retrieving genomes classified by phylum, I think this is a great idea! Since most of my scientific projects have an evolutionary context anyway, I can put this functionality extension on my TODO list.
As a future outlook, I plan to write useful interfaces between my packages so that they can be combined.
For example, multiple sequence alignments for a set of genomes can be performed by simply running:
biomartr::meta.retrieval() %>% orthologr::multi_aln() %>% ...
Or pairwise orthology inference via BLAST reciprocal best hit can then be performed by running:
biomartr::meta.retrieval() %>% orthologr::map.generator() %>% ...
Or phylogeny inference via:
biomartr::meta.retrieval() %>% phylr::tree_infer() %>% ...
Unfortunately, this is still work in progress, but along the way I am always happy to receive input for potential improvements or functionality extensions.
Thank you so much for your help and feedback, I truly appreciate it :) I will keep you posted about the new functionalities.
Best wishes,
Please have a look at our new software GenEra, which may be able to help here: https://github.com/josuebarrera/GenEra
Cheers,
Hi,
First, I really like this package, thanks for putting it together!
I tried to use meta.retrieval but, if the list of genomes is quite long (e.g. all Alphaproteobacteria), it never finishes because there are several ftp errors. One is that the length of the downloaded file is 0, but when I try to download that file via ftp on the command line, everything is fine. Another error is that the ftp site suddenly stops responding.
The problem is that I'd have to restart the download of all those genomes again. Is it possible to catch those errors from get.Genome and download all the genomes that work, then go back to the ones that didn't and retry those once? If they still don't work, an error message listing all the failed ones, including the ftp sites for those, could enable the user to download them manually.
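A minimal sketch of that catch-and-retry idea in base R, not biomartr's actual implementation: `download_with_retry()` is a hypothetical helper, and `download_fun` is an assumed stand-in for the real download step (by default `utils::download.file`). A zero-length file is treated as a failure as well.

```r
# Try every URL once, retry the failures once, and report what still fails.
download_with_retry <- function(urls, dest_dir,
                                download_fun = utils::download.file) {
  attempt <- function(url) {
    dest <- file.path(dest_dir, basename(url))
    ok <- tryCatch({
      download_fun(url, dest)
      # An empty file counts as a failed download.
      file.exists(dest) && file.size(dest) > 0
    }, error = function(e) FALSE)
    # Remove leftovers so a later run does not mistake them for finished files.
    if (!ok && file.exists(dest)) unlink(dest)
    ok
  }

  # First pass over everything.
  failed <- urls[!vapply(urls, attempt, logical(1))]
  # Second pass over the failures only.
  if (length(failed) > 0) {
    failed <- failed[!vapply(failed, attempt, logical(1))]
  }
  if (length(failed) > 0) {
    warning("Could not download:\n", paste(failed, collapse = "\n"))
  }
  invisible(failed)
}
```

The returned vector holds the URLs that failed twice, so they can be listed for manual download.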
Also, let's say there was some problem with the internet connection and the process was killed halfway through, would it be possible to implement a check that first reads all the files in the folder and then marks the genomes that are already there (basically deletes them from the FinalGenomes vector)?
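The folder check could be sketched like this in base R. `remaining_genomes()` is a hypothetical helper, and the assumed file naming (organism name with underscores plus a fixed extension) is a guess that may not match how the package names its files.

```r
# Return the genomes whose expected file is missing or empty in `dest_dir`.
# Assumes files are named "<Organism_name><ext>"; the real naming may differ.
remaining_genomes <- function(genomes, dest_dir, ext = ".fasta.gz") {
  expected <- file.path(dest_dir, paste0(gsub(" ", "_", genomes), ext))
  sizes <- file.size(expected)       # NA for files that do not exist
  done <- !is.na(sizes) & sizes > 0  # present and non-empty
  genomes[!done]
}
```

Applied to the vector of organisms before the download loop starts, e.g. `FinalGenomes <- remaining_genomes(FinalGenomes, "genomes")` with a hypothetical folder name, an interrupted run would only fetch what is still missing.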
Cheers,
Nicole