Automated merging of duplicates #74

FedorSteeman · 2022-08-17T11:55:34Z

Issue

For documentation on the solution please see comments. Problematic taxon names have been separated out into #73

Many duplicates still remain in Specify. Merging them can be automated with a small piece of code that makes direct API calls. Using some clever SQL, we can pull out the keys (taxon IDs) of any duplicated names of the same rank in the same taxon tree. These keys can then be fed to the API by code, enabling automated merging of duplicates.

The code should recognise records as duplicates even if one has an author and the other doesn't. After merging, the author should be retained.
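A minimal sketch of how the duplicate keys could be pulled out, not the script that was actually used. It assumes a simplified `taxon` table with the columns `TaxonID`, `Name`, `Author`, `RankID`, `ParentID` and `TaxonTreeDefID`, and it assumes that genus has rank id 180. It runs against SQLite so that it can be tried without a Specify database. The helper names `find_duplicate_groups` and `choose_target` are made up for illustration.

```python
import sqlite3

# Assumed, simplified stand-in for the Specify taxon table.
SCHEMA = """
CREATE TABLE taxon (
    TaxonID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    Author TEXT,
    RankID INTEGER NOT NULL,
    ParentID INTEGER,
    TaxonTreeDefID INTEGER NOT NULL
);
"""

# Author is deliberately left out of the grouping, so a record with an author
# and one without still end up in the same group.
DUPLICATES_SQL = """
SELECT Name, RankID, ParentID, GROUP_CONCAT(TaxonID) AS ids
FROM taxon
WHERE TaxonTreeDefID = ? AND RankID >= ?
GROUP BY Name, RankID, ParentID
HAVING COUNT(*) > 1
ORDER BY Name;
"""


def find_duplicate_groups(conn, tree_id, min_rank_id=180):
    """Return lists of taxon ids sharing name, rank and parent (genus and below)."""
    rows = conn.execute(DUPLICATES_SQL, (tree_id, min_rank_id)).fetchall()
    return [sorted(int(i) for i in ids.split(",")) for _n, _r, _p, ids in rows]


def choose_target(conn, taxon_ids):
    """Pick the record to keep: one with an author if available, else the lowest id.

    Returns None when the group holds several different authors, so the case
    can be tagged for later investigation instead of being merged.
    """
    placeholders = ",".join("?" * len(taxon_ids))
    rows = conn.execute(
        f"SELECT TaxonID, Author FROM taxon "
        f"WHERE TaxonID IN ({placeholders}) ORDER BY TaxonID",
        list(taxon_ids),
    ).fetchall()
    authors = {a.strip() for _id, a in rows if a and a.strip()}
    if len(authors) > 1:
        return None
    for taxon_id, author in rows:
        if author and author.strip():
            return taxon_id
    return rows[0][0]
```

Every id in a group other than the chosen target would then be merged into that target, which keeps the author on the surviving record.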

Requirements:

- Code for calling the merge function via the Specify7 API
- Executable function that scans for and merges duplicates at genus level & below in given taxon tree
- Smart merging of duplicates by taking into account author & parent taxon
- Tagging indecisive cases for later investigation
- Review of indecisive cases and adding to code if decided
- Script for interfacing with GBIF for checking authorship
- Code for handling "cfr" and "aff" taxa
- ~~Process to remove duplicates from taxonomic spine stored in local app database~~

FedorSteeman · 2022-08-17T14:22:48Z

Wrote some code, but my requests keep on bouncing back from the API with a "403 Forbidden" response. I reached out to Kansas for help.

FedorSteeman · 2022-08-25T06:30:55Z

Feedback from Kansas:

Our junior programmer believes the 500 error is because you are sending the requests body as JSON rather than form data. Does this sound like it could be the issue to you?
Thank you!
Grant Fitzsimmons

And:

More information from Max:
Some API endpoints accept form data payload rather than JSON object payload.
This endpoint is one of them.
Documentation on how to send form data using the python "requests" library: https://stackoverflow.com/q...
We have a GitHub ticket for fixing this inconsistency and providing more user-friendly errors: specify/specify7#2023
Let us know if that helps or if you're still having issues.
Theresa
Specify Collections Consortium
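To illustrate the form data point, here is a hedged sketch that builds (but does not send) a merge request with a form-encoded body, using only the standard library. The endpoint path, the `target` field name and the CSRF and Referer headers are assumptions about the Specify7 API, not details confirmed in this thread, and `build_merge_request` is a hypothetical helper. With the `requests` library the same distinction is made by passing `data={...}` instead of `json={...}`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_merge_request(base_url, taxon_id, target_id, csrf_token):
    """Build a POST that merges taxon_id into target_id using form data."""
    # Assumed endpoint path; check it against your Specify7 instance.
    url = f"{base_url}/api/specify_tree/taxon/{taxon_id}/merge/"
    # Form-encoded body ("target=42"), not a JSON object ('{"target": 42}').
    body = urlencode({"target": target_id}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # Assumption: the session's CSRF token and a Referer are expected.
            "X-CSRFToken": csrf_token,
            "Referer": base_url,
        },
    )
```

Sending the request would additionally need an authenticated session cookie, which is left out here.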

FedorSteeman · 2022-11-14T09:29:34Z

Tip from @jlegind: Too many requests in a row might overload the GBIF API. Even a small delay between calls, such as the time taken by a few print statements, is enough to avoid this.
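An explicit delay is more predictable than relying on print statements. The sketch below is one possible way to space out calls. The `Throttle` class and the half-second interval in the usage note are illustrative choices, not values recommended by GBIF.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Typical use would be `throttle = Throttle(0.5)` once, then `throttle.wait()` immediately before each GBIF request.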

FedorSteeman · 2022-11-24T17:21:09Z

Got the authorship lookup to GBIF working, but added a new task that I just remembered from personal communication with @PipBrewer :

Code for handling "cfr" and "aff" taxa amending the corresponding determinations accordingly
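For reference, a rough sketch of what an authorship lookup against the GBIF species match service can look like. It is not the code from the commits on this issue. It assumes that the author string is whatever follows `canonicalName` inside `scientificName` in the response, and it only trusts exact matches. `fetch_match` and `extract_author` are hypothetical names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def fetch_match(name, timeout=10):
    """Query the GBIF species match service for a single name (network call)."""
    query = urlencode({"name": name, "strict": "true"})
    with urlopen(f"{GBIF_MATCH_URL}?{query}", timeout=timeout) as resp:
        return json.load(resp)


def extract_author(match):
    """Return the author part of a match response, or None if it is unclear."""
    if match.get("matchType") != "EXACT":
        return None  # fuzzy or higher-rank matches are left for manual review
    canonical = match.get("canonicalName", "")
    scientific = match.get("scientificName", "")
    if not canonical or not scientific.startswith(canonical):
        # e.g. infraspecific names, where rank markers break the simple prefix rule
        return None
    author = scientific[len(canonical):].strip()
    return author or None
```

Keeping the parsing in `extract_author` separate from the network call makes it possible to check the logic against saved responses without hitting the API.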

FedorSteeman · 2022-11-29T05:50:41Z

Runs on test-database were highly successful. Script ready to be unleashed onto the live database.

PipBrewer · 2022-11-30T12:19:30Z

Rather than removing duplicates, the taxon spine in the local app db is actually going to be recreated. Eventually, a solution involving direct synchronization via the Specify7 API would be best.

FedorSteeman · 2022-12-08T11:48:45Z

Running the new code against the test database that was successfully created from a direct mysql-dump of the live database looks promising. It seems that no references are broken after all, so the merge call to the API does sort out things neatly. The problem apparently lies with the crappy Specify6 Backup & Restore tool.

I've discussed with @Sosannah that we're gonna do our own backups in the direct way going forward. I will write up a technical manual for this.

FedorSteeman · 2022-12-09T10:10:42Z

The merging runs well but is really slow and taxing on database performance. I will try to start it up from home after office hours to reduce interference with regular Specify work. Hopefully it will be done over the weekend.

The following task is a separate and big issue and needs its own ticket:

Process to remove duplicates from taxonomic spine stored in local app database

FedorSteeman · 2022-12-15T11:57:08Z

Added the option of iterating a pre-collected list of taxon ids that are known duplicates for speeding up the process. It works well.

FedorSteeman · 2022-12-19T06:43:38Z

The run has finished, merging all obvious duplicates and producing an export file of about 30 ambivalent cases. A lot of these are "cf." taxa and were handled accordingly using a SQL script.
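The "cf." cases here were handled with a SQL script that is not shown in this thread. As a hedged illustration of the idea only, the sketch below separates a "cf.", "cfr." or "aff." qualifier from a taxon name, so that the clean name can be matched against the tree and the qualifier stored on the determination instead. The function name and the recognised spellings are assumptions.

```python
import re

# Matches "cf", "cfr" or "aff" as a whole word, with an optional dot,
# followed by whitespace. "affinis" is not matched because no space follows "aff".
_QUALIFIER = re.compile(r"\b(cf|cfr|aff)\.?\s+", re.IGNORECASE)


def split_qualifier(name):
    """Return (clean_name, qualifier); qualifier is None if none was found."""
    match = _QUALIFIER.search(name)
    if not match:
        return " ".join(name.split()), None
    clean = name[:match.start()] + name[match.end():]
    return " ".join(clean.split()), match.group(1).lower() + "."
```

For example, `split_qualifier("Quercus cf. robur")` gives the clean name `Quercus robur` together with the qualifier `cf.`.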

However, inspecting the database directly, it appears there are at least 3000 name duplicates left, mainly distinguished by parent taxon, it seems. There are also about 400+ taxa with problematic characters in their name. More work needs to be done to eliminate these.

FedorSteeman · 2022-12-19T12:47:20Z

Cleaned up cf. cases, so we just have the following:

185 problematic taxa with punctuation in their name
4558 possible duplicates though with different parent taxa
540 possible duplicates with the same parent taxon

The last set of cases will be thrown to the script one more time.
The first set of cases will have to be handled manually.
The middle set may have to be subjected to an expansion of the script that looks up the accepted parent at GBIF and merges accordingly.
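The three sets above can be separated mechanically. The sketch below shows one way to do it, assuming each taxon is available as a dict with `id`, `name`, `rank` and `parent` keys. The rule for "problematic" characters (anything other than letters, digits, underscores, whitespace and hyphens) is a guess, not the rule used for the counts above.

```python
import re
from collections import defaultdict

# Assumed rule: flag any character that is not a word character, space or hyphen.
_PROBLEM_CHARS = re.compile(r"[^\w\s\-]")


def classify(taxa):
    """Split taxa into punctuation cases and same/different-parent duplicates."""
    result = {"punctuation": [], "same_parent": [], "different_parent": []}
    by_name = defaultdict(list)
    for taxon in taxa:
        if _PROBLEM_CHARS.search(taxon["name"]):
            result["punctuation"].append(taxon["id"])  # needs manual handling
        else:
            by_name[(taxon["name"], taxon["rank"])].append(taxon)

    for group in by_name.values():
        if len(group) < 2:
            continue  # not a duplicate
        by_parent = defaultdict(list)
        for taxon in group:
            by_parent[taxon["parent"]].append(taxon["id"])
        for ids in by_parent.values():
            if len(ids) > 1:
                result["same_parent"].append(sorted(ids))  # safe to re-run the script on
        if len(by_parent) > 1:
            # namesakes under different parents: candidates for a GBIF parent lookup
            result["different_parent"].append(sorted(t["id"] for t in group))
    return result
```

A name can appear in both duplicate lists when some copies share a parent and others do not.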

FedorSteeman · 2022-12-21T12:32:05Z

All taxa that could be auto-merged have been merged, plus some additional cleaning of cases that fell a bit outside of the different algorithms.

The results are the following:

A spreadsheet with the 137 remaining "problematic taxa" most of which would need a closer look by an expert or even having the original herbarium sheet looked at, because of difficulties with the transcription:
Botany problematic taxa.xlsx
This has been relegated to ticket Fixing taxa with transcription errors #73
A spreadsheet with 5600+ taxa that are duplicate namesakes, but placed under different parent taxa:
Botany Duplicates Diff Parents.xlsx
We need to discuss how this could be resolved, because Workbench may have issues picking the right taxon for the import. I could e.g. still merge on the basis of GBIF, but then log the decisions to be presented to curators to veto, so these can be reversed.
A spreadsheet with ambivalent cases that could not be auto-merged due to various reasons:
(Coming later)
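The idea of logging GBIF-based merge decisions for curators to veto could be sketched as a simple CSV round trip. This is only an illustration: the `MergeDecision` fields and the `vetoed` column are hypothetical, and how a vetoed merge would actually be reversed in Specify is not covered.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass
class MergeDecision:
    source_id: int      # taxon that was merged away
    target_id: int      # taxon that was kept
    name: str
    gbif_parent: str    # accepted parent reported by GBIF, the basis for the merge
    vetoed: bool = False  # curators set this to true in the spreadsheet


def write_log(decisions, handle):
    """Write one row per decision, with a header, to an open text file."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(MergeDecision)])
    writer.writeheader()
    for decision in decisions:
        writer.writerow(asdict(decision))


def read_vetoed(handle):
    """Return the source ids of all decisions a curator marked as vetoed."""
    truthy = {"true", "1", "yes"}
    return [
        int(row["source_id"])
        for row in csv.DictReader(handle)
        if row["vetoed"].strip().lower() in truthy
    ]
```

The list returned by `read_vetoed` would be the starting point for undoing the merges that curators disagree with.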

This should be closed and any followup regarding above cases should become separate tickets.

FedorSteeman added backend backend Specify Related to (interactions with) the Specify SW system synchronization ... taxonomy labels Aug 17, 2022

FedorSteeman added this to the Sprint 1 milestone Aug 17, 2022

FedorSteeman self-assigned this Aug 17, 2022

jlegind removed this from the Sprint 1 milestone Aug 19, 2022

FedorSteeman added a commit that referenced this issue Aug 19, 2022

Attempt at auto-merging taxa #74

e9c0264

FedorSteeman added 3 priority 3 and removed backend backend labels Sep 1, 2022

jlegind added 1 priority 1 and removed 3 priority 3 labels Sep 7, 2022

jlegind added this to the Sprint6 milestone Sep 7, 2022

FedorSteeman self-assigned this Nov 11, 2022

FedorSteeman modified the milestones: Sprint 8, Sprint 11 Nov 23, 2022

FedorSteeman added a commit that referenced this issue Nov 28, 2022

Finished up merging of taxon duplicates #74

571c6d7

FedorSteeman modified the milestones: Sprint 11, Sprint 12 Dec 2, 2022

FedorSteeman mentioned this issue Dec 9, 2022

Remove duplicates from taxonomic spine #214

Closed

FedorSteeman added a commit that referenced this issue Dec 9, 2022

adjustments merge duplicates #74

158fd03

FedorSteeman modified the milestones: Sprint 12, Sprint 13 Dec 21, 2022

FedorSteeman mentioned this issue Dec 21, 2022

Fixing taxa with transcription errors #73

Closed

3 tasks

FedorSteeman added a commit that referenced this issue Dec 21, 2022

Final tweaks #74

fb04786

FedorSteeman closed this as completed Dec 21, 2022

PipBrewer added this to Mass Digitization App Jul 23, 2024

PipBrewer moved this to Done in Mass Digitization App Jul 23, 2024
