Unable to create "slb" and "wd" database #102

Open
brunobrr opened this issue Feb 10, 2022 · 1 comment

@brunobrr

Hi,

I tried to download "slb" and "wd" databases using different versions (2022, 2021, 2020, 2019) but all returned an error. For example:

td_create(provider = "slb", version = 2019, overwrite = TRUE)

"could not find 2019_dwc_slb, 2019_common_slb 
  checking for older versions.
2019_dwc_slb not available2019_common_slb not available"

By inspecting the number of records in each database, I noticed that the latest versions of "ncbi" and "col" have fewer records than older versions. I expected the latest version to have more records than the older ones.

taxadb::taxa_tbl("ncbi", version = 2022) %>% summarise(n())    #2950147
taxadb::taxa_tbl("ncbi", version = 2021) %>% summarise(n())    #3461657

taxadb::taxa_tbl("col", version = 2022) %>% summarise(n())    #807599 
taxadb::taxa_tbl("col", version = 2021) %>% summarise(n())    #3615220

Finally, I noticed that databases are sometimes created with fewer records than expected, probably because of problems with my internet connection. For example, "ncbi" (v. 2022) had 32831 records instead of 2950147. I recognize that this is not a real issue, but it might be useful to check that the database has the expected number of records before performing queries. Just an idea.
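
The record-count check suggested above could look roughly like the following. This is a minimal sketch under stated assumptions: `check_record_count` and its `tolerance` argument are hypothetical and not part of taxadb, and the expected count has to be supplied by the user, since the thread does not describe any published record counts.

```r
# Hypothetical helper, NOT a taxadb function: compare an observed row count
# with the count the user expects, and warn if too many rows are missing.
check_record_count <- function(observed, expected, tolerance = 0) {
  stopifnot(is.numeric(observed), is.numeric(expected), expected > 0)
  # Fraction of the expected rows that are missing (negative if there are extra rows)
  shortfall <- (expected - observed) / expected
  ok <- shortfall <= tolerance
  if (!ok) {
    warning(sprintf(
      "Found %s records but expected about %s; the local database may be incomplete.",
      format(observed, big.mark = ","),
      format(expected, big.mark = ",")
    ))
  }
  invisible(ok)
}

# Possible usage (requires taxadb and dplyr, and downloads data):
# n_obs <- dplyr::pull(dplyr::count(taxadb::taxa_tbl("ncbi", version = 2022)), n)
# check_record_count(n_obs, expected = 2950147)
```

The helper only takes numbers, so it works with any way of counting rows; a non-zero `tolerance` allows for small differences between releases.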

@cboettig
Member

Thanks for the report, very helpful.

  • The wikidata and slb databases haven't been ported to the new system. We don't actually have a good mechanism to assemble and update wikidata names, so that one will probably be deprecated. slb is just part of my backlog, sorry.

  • Thanks for checking the NCBI / COL numbers; it looks like that could actually be an upstream bug. Note that taxadb checks the sha-256 hash of the downloaded file, so if it had been a network issue on your end, it would have thrown an error.

More precisely, it looks like the 2022 version of NCBI has only the species names tables. Names that resolve only to a higher taxon rank are not listed in the scientificName column (though they are still available from the dedicated rank columns):

> taxadb::taxa_tbl("ncbi") %>% count(taxonRank)
# Source:   lazy query [?? x 2]
# Database: duckdb_connection
  taxonRank       n
  <chr>       <dbl>
1 species   2950147

So I think we need to fix the 2022 tables for NCBI and COL.
