Automated merging of duplicates #74
Comments
Wrote some code, but my requests keep bouncing back from the API with a "403 Forbidden" response. I reached out to Kansas for help.
Feedback from Kansas:
And:
Tip from @jlegind: too many requests in a row to the GBIF API might overload it. Adding a few print statements, for example, would already introduce enough of a delay.
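A minimal sketch of that kind of throttling, assuming the lookups are plain HTTP GETs made with Python's `requests` (the delay length is arbitrary, and an explicit sleep is used instead of print statements):

```python
import time
import requests

def throttled_get(url, params=None, delay_seconds=0.5):
    """Perform a GET against the GBIF API and then pause, so consecutive
    lookups don't hit the API back-to-back."""
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    time.sleep(delay_seconds)  # deliberate delay between requests
    return response.json()
```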
Got the authorship lookup to GBIF working. I have also added a new task that I just remembered from personal communication with @PipBrewer.
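For reference, an authorship lookup against GBIF along these lines takes two calls: match the name to a usage key, then fetch the full usage record. A sketch (the actual script may differ, and it assumes the matched usage record carries a populated `authorship` field):

```python
import requests

def gbif_authorship(scientific_name):
    """Look up the author string for a scientific name via the GBIF API."""
    match = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": scientific_name},
        timeout=30,
    ).json()
    usage_key = match.get("usageKey")
    if usage_key is None:
        return None  # no confident match found
    usage = requests.get(
        f"https://api.gbif.org/v1/species/{usage_key}", timeout=30
    ).json()
    return usage.get("authorship") or None
```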
Runs on the test database were highly successful. The script is ready to be unleashed onto the live database.
Rather than removing duplicates, the taxon spine in the local app database is actually going to be recreated. Eventually, a solution involving direct synchronization via the Specify7 API would be best.
Running the new code against the test database that was created from a direct mysql-dump of the live database looks promising. It seems that no references are broken after all, so the merge call to the API does sort things out neatly. The problem apparently lies with the crappy Specify6 Backup & Restore tool. I've discussed with @Sosannah that we're going to do our own backups the direct way going forward. I will write up a technical manual for this.
The merging runs well but is really slow and taxing on database performance. I will try to start it up from home after office hours to reduce interference with regular Specify work. Hopefully it will be done over the weekend. There is also a follow-up task that is a separate, bigger issue and needs its own ticket.
Added the option of iterating over a pre-collected list of taxon ids that are known duplicates, to speed up the process. It works well.
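In rough terms, the option amounts to something like this (the file format, one taxon id per line, is illustrative, not necessarily what the actual script uses):

```python
def load_known_duplicate_ids(path):
    """Read a pre-collected list of taxon ids known to be duplicates,
    one id per line, so the expensive duplicate-detection step can be skipped."""
    with open(path) as handle:
        return [int(line.strip()) for line in handle if line.strip()]
```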
The run has finished, merging all obvious duplicates and producing an export file of about 30 ambiguous cases. A lot of these are "cf." taxa and were handled accordingly using a SQL script. However, inspecting the database directly, there appear to be at least 3000 name duplicates left, mainly distinguished by parent taxon. There are also 400+ taxa with problematic characters in their name. More work needs to be done to eliminate these.
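The remaining cases can be surfaced with queries roughly along these lines; the table and column names (taxon, TaxonID, Name, RankID, ParentID) follow the Specify schema as I understand it and should be double-checked before running anything:

```python
# Duplicated names of the same rank that now only differ by parent taxon.
REMAINING_DUPLICATES_SQL = """
SELECT Name, RankID, COUNT(*) AS copies, COUNT(DISTINCT ParentID) AS parents
FROM taxon
GROUP BY Name, RankID
HAVING COUNT(*) > 1 AND COUNT(DISTINCT ParentID) > 1
"""

# Taxon names containing characters outside the expected set.
PROBLEMATIC_NAMES_SQL = """
SELECT TaxonID, Name
FROM taxon
WHERE Name REGEXP '[^A-Za-z .()-]'
"""
```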
Cleaned up cf. cases, so we just have the following:
The last set of cases will be thrown to the script one more time.
All taxa that could be auto-merged have been merged, plus some additional cleaning of cases that fell a bit outside of the different algorithms. The results are the following:
This should be closed, and any follow-up regarding the above cases should become separate tickets.
Issue
For documentation on the solution, please see the comments. Problematic taxon names have been separated out into #73
Many duplicates still remain in Specify. It's possible to automate this with a small piece of code making direct API calls. Using some clever SQL, we can pull out the keys (taxonids) of any duplicated names of the same rank in the same taxon tree. These keys can then be fed to the API by the code, enabling automated merging of duplicates.
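A rough sketch of that flow, assuming Python with `mysql-connector-python` and `requests`; the Specify column names and, in particular, the Specify7 merge endpoint path and payload are placeholders that need to be checked against the actual API before use:

```python
import mysql.connector
import requests

# Keys of duplicated names sharing rank and taxon tree, concatenated per group.
DUPLICATE_GROUPS_SQL = """
SELECT GROUP_CONCAT(TaxonID ORDER BY TaxonID) AS ids
FROM taxon
GROUP BY Name, RankID, TaxonTreeDefID
HAVING COUNT(*) > 1
"""

def find_duplicate_groups(connection):
    """Yield lists of taxon ids that share name, rank and taxon tree."""
    cursor = connection.cursor()
    cursor.execute(DUPLICATE_GROUPS_SQL)
    for (ids,) in cursor.fetchall():
        yield [int(taxon_id) for taxon_id in ids.split(",")]

def merge_group(session, base_url, ids):
    """Merge every duplicate in a group into the first id via the API.
    The endpoint below is a placeholder for the real Specify7 merge call."""
    target, *duplicates = ids
    for duplicate in duplicates:
        response = session.post(
            f"{base_url}/api/specify_tree/taxon/{duplicate}/merge/",
            data={"target": target},
        )
        response.raise_for_status()

if __name__ == "__main__":
    connection = mysql.connector.connect(
        host="localhost", database="specify", user="user", password="password"
    )
    with requests.Session() as session:  # assumes an already authenticated session
        for group in find_duplicate_groups(connection):
            merge_group(session, "http://localhost:8000", group)
```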
It should recognise records as duplicates even if one has an author and the other doesn't. After merging, the author should be retained.
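One way to honour that when picking which record survives a merge (the record structure here is illustrative):

```python
def choose_merge_target(records):
    """Given duplicate taxon records (dicts with at least 'id' and 'author'),
    prefer the one that has an author so authorship is retained after the merge."""
    with_author = [r for r in records if (r.get("author") or "").strip()]
    target = with_author[0] if with_author else records[0]
    duplicates = [r for r in records if r is not target]
    return target, duplicates
```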
Requirements:
- [ ] Process to remove duplicates from taxonomic spine stored in local app database