Automated merging of duplicates #74

Closed
7 tasks done
FedorSteeman opened this issue Aug 17, 2022 · 29 comments
Assignees
Labels
1 priority 1 Specify Related to (interactions with) the Specify SW system synchronization ... taxonomy
Milestone

Comments

@FedorSteeman
Contributor

FedorSteeman commented Aug 17, 2022

Issue

For documentation on the solution please see comments. Problematic taxon names have been separated out into #73

Many duplicates still remain in Specify. Merging them could be automated with a small piece of code that makes direct API calls. Using some clever SQL, we can pull out the keys (TaxonIDs) of any duplicated names of the same rank in the same taxon tree. These keys can then be fed to the API by the code, enabling automated merging of duplicates.
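A minimal sketch of the kind of query meant here, run against an in-memory SQLite table so it is self-contained. The table is a heavily simplified, hypothetical stand-in for the Specify taxon table; the column names and the genus rank ID of 180 are assumptions that should be checked against the real schema before use.

```python
import sqlite3

# Hypothetical, simplified stand-in for the Specify taxon table (assumption).
SCHEMA = """
CREATE TABLE taxon (
    TaxonID INTEGER PRIMARY KEY,
    FullName TEXT NOT NULL,
    RankID INTEGER NOT NULL,
    TaxonTreeDefID INTEGER NOT NULL
)
"""

# Names that occur more than once with the same rank in the same tree.
DUPLICATES_SQL = """
SELECT FullName, RankID, GROUP_CONCAT(TaxonID) AS ids
FROM taxon
WHERE TaxonTreeDefID = ? AND RankID >= ?
GROUP BY FullName, RankID
HAVING COUNT(*) > 1
ORDER BY FullName
"""


def find_duplicate_groups(conn, tree_def_id, min_rank_id=180):
    """Return (name, rank, [taxon ids]) for every duplicated name.

    min_rank_id=180 assumes that genus has rank ID 180 and that lower
    ranks have higher IDs.
    """
    rows = conn.execute(DUPLICATES_SQL, (tree_def_id, min_rank_id)).fetchall()
    # Sort the IDs in Python because GROUP_CONCAT gives no order guarantee.
    return [(name, rank, sorted(int(i) for i in ids.split(",")))
            for name, rank, ids in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?)", [
    (1, "Bellis", 180, 1),
    (2, "Bellis perennis", 220, 1),
    (3, "Bellis perennis", 220, 1),  # duplicate of 2
    (4, "Bellis perennis", 220, 2),  # same name, but in another tree
    (5, "Asteraceae", 140, 1),
    (6, "Asteraceae", 140, 1),       # duplicate above genus level, skipped
])
groups = find_duplicate_groups(conn, tree_def_id=1)
print(groups)
```

Each returned group holds the keys that would be handed to the merge call.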

The code should recognise records as duplicates even if one has an author and the other doesn't. After merging, the author should be retained.
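One way to express this rule is sketched below. `TaxonRecord` and `pick_merge_target` are hypothetical names, and the tie-breaker (keep the lowest ID) is an assumption, not something decided in this ticket. Cases with conflicting authors are returned as undecided so they can be tagged for later investigation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaxonRecord:
    taxon_id: int
    full_name: str
    author: Optional[str] = None


def _has_author(record: TaxonRecord) -> bool:
    return bool(record.author and record.author.strip())


def pick_merge_target(duplicates: Sequence[TaxonRecord]) -> Optional[TaxonRecord]:
    """Return the record to keep, or None if the case cannot be decided."""
    if not duplicates:
        return None
    authors = {d.author.strip() for d in duplicates if _has_author(d)}
    if len(authors) > 1:
        # Different authors: leave this case for manual review.
        return None
    # Prefer a record that carries the author so it survives the merge.
    candidates = [d for d in duplicates if _has_author(d)] or list(duplicates)
    # Assumed tie-breaker: keep the oldest record (lowest ID).
    return min(candidates, key=lambda d: d.taxon_id)


without_author = TaxonRecord(10, "Bellis perennis")
with_author = TaxonRecord(11, "Bellis perennis", "L.")
print(pick_merge_target([without_author, with_author]))
```

All other records in the group would then be merged into the returned one.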

Requirements:

  • Code for calling the merge function via the Specify7 API
  • Executable function that scans for and merges duplicates at genus level & below in given taxon tree
  • Smart merging of duplicates by taking into account author & parent taxon
  • Tagging indecisive cases for later investigation
  • Review of indecisive cases and adding to code if decided
  • Script for interfacing with GBIF for checking authorship
  • Code for handling "cfr" and "aff" taxa
  • [ ] Process to remove duplicates from the taxonomic spine stored in the local app database
@FedorSteeman FedorSteeman added backend backend Specify Related to (interactions with) the Specify SW system synchronization ... taxonomy labels Aug 17, 2022
@FedorSteeman FedorSteeman added this to the Sprint 1 milestone Aug 17, 2022
@FedorSteeman FedorSteeman self-assigned this Aug 17, 2022
@FedorSteeman
Contributor Author

Wrote some code, but my requests keep on bouncing back from the API with a "403 Forbidden" response. I reached out to Kansas for help.

@jlegind jlegind removed this from the Sprint 1 milestone Aug 19, 2022
FedorSteeman added a commit that referenced this issue Aug 19, 2022
@FedorSteeman
Contributor Author

Feedback from Kansas:

Our junior programmer believes the 500 error is because you are sending the requests body as JSON rather than form data. Does this sound like it could be the issue to you?
Thank you!
Grant Fitzsimmons

And:

More information from Max:
Some API endpoints accept form data payload rather than JSON object payload.
This endpoint is one of them.
Documentation on how to send form data using the python "requests" library: https://stackoverflow.com/q...
We have a GitHub ticket for fixing this inconsistency and providing more user-friendly errors: specify/specify7#2023
Let us know if that helps or if you're still having issues.
Theresa
Specify Collections Consortium
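To make the difference concrete, here is a sketch that builds (but does not send) a form-encoded POST using only the standard library. With the `requests` library, the same distinction is between passing the payload as `data=` (form data) and as `json=` (JSON body). The URL path and the `target` field name are assumptions for illustration and should be checked against the Specify 7 API; `build_merge_request` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_merge_request(base_url: str, source_id: int, target_id: int) -> Request:
    """Build a POST whose body is form data rather than JSON."""
    # Assumed endpoint path and field name; verify against the Specify 7 API.
    url = f"{base_url.rstrip('/')}/api/specify_tree/taxon/{source_id}/merge/"
    body = urlencode({"target": target_id}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_merge_request("https://specify.example.org", 101, 42)
print(req.data)  # form-encoded body

# For comparison, the JSON body that this endpoint reportedly does not accept:
json_body = json.dumps({"target": 42}).encode("ascii")
print(json_body)
```

A real call would additionally need an authenticated session; that part is left out here.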

@FedorSteeman FedorSteeman added 3 priority 3 and removed backend backend labels Sep 1, 2022
@jlegind jlegind added 1 priority 1 and removed 3 priority 3 labels Sep 7, 2022
@jlegind jlegind added this to the Sprint6 milestone Sep 7, 2022
@FedorSteeman FedorSteeman self-assigned this Nov 11, 2022
@FedorSteeman
Contributor Author

Tip from @jlegind: Too many requests in a row to the GBIF API might overload it. Adding a few print statements, for example, would already introduce enough delay.
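A more explicit alternative to relying on print statements is a small throttle that enforces a minimum pause between calls. This is a sketch only: `Throttle` is a hypothetical helper, and the interval used is an arbitrary assumption, not a limit published by GBIF.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested
        self._sleep = sleep
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


throttle = Throttle(min_interval=0.05)  # arbitrary value for the demo
for name in ["Bellis perennis", "Taraxacum officinale", "Poa annua"]:
    throttle.wait()
    print("would look up", name)  # the GBIF request would go here
```

Calling `throttle.wait()` right before each request keeps the delay in one place, independent of how much the script prints.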

@FedorSteeman FedorSteeman modified the milestones: Sprint 8, Sprint 11 Nov 23, 2022
@FedorSteeman
Contributor Author

FedorSteeman commented Nov 24, 2022

Got the authorship lookup to GBIF working, but added a new task that I just remembered from personal communication with @PipBrewer :

  • Code for handling "cfr" and "aff" taxa, amending the corresponding determinations accordingly
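For reference, the authorship lookup mentioned above could look roughly like the sketch below. It assumes the GBIF species match endpoint and derives the author by stripping the canonical name from the full scientific name; whether the actual script works this way is not stated in this ticket. The sample response is hand-written for illustration, not a recorded API reply, and the helper names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def match_url(name: str) -> str:
    """Build the lookup URL for a scientific name."""
    return f"{GBIF_MATCH_URL}?{urlencode({'name': name})}"


def fetch_match(name: str) -> dict:
    """Query GBIF (network access needed; not called in this sketch)."""
    with urlopen(match_url(name), timeout=30) as response:
        return json.load(response)


def extract_author(match: dict) -> Optional[str]:
    """Return the author part of an exact match, or None if undecided."""
    if match.get("matchType") != "EXACT":
        return None  # fuzzy or higher-rank matches are left for review
    canonical = match.get("canonicalName", "")
    scientific = match.get("scientificName", "")
    if not canonical or not scientific.startswith(canonical):
        return None
    author = scientific[len(canonical):].strip()
    return author or None


# Hand-written sample in the shape the sketch expects (illustrative only).
sample = {
    "matchType": "EXACT",
    "canonicalName": "Bellis perennis",
    "scientificName": "Bellis perennis L.",
}
print(match_url("Bellis perennis"))
print(extract_author(sample))
```

Returning `None` for anything but an exact match keeps doubtful cases out of the automatic merge.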

@FedorSteeman
Contributor Author

Runs on the test database were highly successful. The script is ready to be unleashed onto the live database.

@PipBrewer
Collaborator

Rather than removing duplicates, the taxon spine in the local app db is actually going to be recreated. Eventually, a solution involving direct synchronization via the Specify7 API would be best.

@FedorSteeman FedorSteeman modified the milestones: Sprint 11, Sprint 12 Dec 2, 2022
@FedorSteeman
Contributor Author

Running the new code against the test database that was successfully created from a direct mysql-dump of the live database looks promising. It seems that no references are broken after all, so the merge call to the API does sort out things neatly. The problem apparently lies with the crappy Specify6 Backup & Restore tool.

I've discussed with @Sosannah that we're gonna do our own backups in the direct way going forward. I will write up a technical manual for this.

@FedorSteeman
Contributor Author

The merging runs well but is really slow and taxing on database performance. I will try to start it up from home after office hours to reduce interference with regular Specify work. Hopefully it will be done over the weekend.

The following task is a separate and big issue and needs its own ticket:

  • Process to remove duplicates from taxonomic spine stored in local app database

@FedorSteeman
Contributor Author

Added the option of iterating over a pre-collected list of taxon ids that are known duplicates, to speed up the process. It works well.

@FedorSteeman
Contributor Author

FedorSteeman commented Dec 19, 2022

The run has finished, merging all obvious duplicates and producing an export file of about 30 ambivalent cases. A lot of these are "cf." taxa and were handled accordingly using a SQL script.
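The actual handling was done in SQL, which is not shown here. Purely as an illustration of the string rule involved, the following Python sketch separates a "cf."/"cfr."/"aff." qualifier from a taxon name, so that the clean name can be matched to an existing taxon and the qualifier stored with the determination. The pattern and the `split_qualifier` helper are assumptions, not the script that was run.

```python
import re
from typing import Optional, Tuple

# Matches a standalone "cf", "cfr" or "aff" (with or without a dot) that
# sits between two name parts. Assumed pattern, for illustration only.
QUALIFIER_RE = re.compile(r"\s+(cfr?|aff)\.?\s+", re.IGNORECASE)


def split_qualifier(name: str) -> Tuple[str, Optional[str]]:
    """Return (name without qualifier, qualifier or None)."""
    match = QUALIFIER_RE.search(name)
    if match is None:
        return name.strip(), None
    clean = f"{name[:match.start()]} {name[match.end():]}".strip()
    return clean, match.group(1).lower() + "."


for raw in ["Bellis cf. perennis", "Bellis aff perennis", "Bellis affinis"]:
    print(raw, "->", split_qualifier(raw))
```

Because the pattern requires whitespace on both sides, an epithet such as "affinis" is left untouched.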

However, inspecting the database directly shows that at least 3000 name duplicates are left, which seem to be distinguished mainly by parent taxon. There are also more than 400 taxa with problematic characters in their name. More work needs to be done to eliminate these.

@FedorSteeman
Contributor Author

Cleaned up cf. cases, so we just have the following:

  • 185 problematic taxa with punctuation in their name
  • 4558 possible duplicates though with different parent taxa
  • 540 possible duplicates with the same parent taxon

The last set of cases will be thrown to the script one more time.
The first set of cases will have to be handled manually.
The middle set may have to be subjected to an expansion of the script that looks up the accepted parent at GBIF and merges accordingly.
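The three sets above can be told apart with a simple triage step, sketched here. What counts as "problematic" punctuation is an assumption (anything other than letters, spaces, dots and hyphens), and the `(taxon_id, name, parent_id)` record shape and the `triage` function are hypothetical.

```python
import re
from collections import defaultdict

# Assumed rule: digits, underscores and any character that is not a letter,
# whitespace, dot or hyphen make a name suspicious. Note that this would
# also flag the hybrid sign.
SUSPICIOUS = re.compile(r"[^\w\s.\-]|[\d_]")


def triage(records):
    """Sort (taxon_id, name, parent_id) records into the three sets."""
    result = {"punctuation": [], "same_parent": [], "different_parent": []}
    by_name = defaultdict(list)
    for taxon_id, name, parent_id in records:
        if SUSPICIOUS.search(name):
            result["punctuation"].append(taxon_id)  # manual handling
        else:
            by_name[name].append((taxon_id, parent_id))
    for entries in by_name.values():
        if len(entries) < 2:
            continue  # not a duplicate
        ids = sorted(taxon_id for taxon_id, _ in entries)
        parents = {parent_id for _, parent_id in entries}
        # A group counts as "same parent" only if all records share it.
        key = "same_parent" if len(parents) == 1 else "different_parent"
        result[key].append(ids)
    return result


records = [
    (1, "Bellis perennis", 100),
    (2, "Bellis perennis", 100),
    (3, "Carex flava", 200),
    (4, "Carex flava", 201),
    (5, "Rosa canina?", 300),
    (6, "Poa annua", 400),
]
print(triage(records))
```

Groups under "same_parent" are candidates for another run of the merge script, while "different_parent" groups would need the parent lookup at GBIF first.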

@FedorSteeman
Contributor Author

FedorSteeman commented Dec 21, 2022

All taxa that could be auto-merged have been merged, plus some additional cleaning of cases that fell a bit outside of the different algorithms.

The results are the following:

  1. A spreadsheet with the 137 remaining "problematic taxa", most of which need a closer look by an expert, or even a look at the original herbarium sheet, because of difficulties with the transcription:
    Botany problematic taxa.xlsx
    This has been relegated to ticket Fixing taxa with transcription errors #73

  2. A spreadsheet with 5600+ taxa that are duplicate namesakes, but placed under different parent taxa:
    Botany Duplicates Diff Parents.xlsx
    We need to discuss how this could be resolved, because Workbench may have issues picking the right taxon for the import. I could e.g. still merge on the basis of GBIF, but then log the decisions and present them to curators for veto, so that they can be reversed.

  3. A spreadsheet with ambivalent cases that could not be auto-merged due to various reasons:
    (Coming later)
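Regarding the idea under item 2 of logging GBIF-based merge decisions for curators to veto: a minimal sketch of such a log is given below. The column set is an assumption about what a curator would need in order to judge and, if necessary, reverse a merge; `MergeDecision` and `write_decision_log` are hypothetical names.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class MergeDecision:
    source_id: int      # taxon that was merged away
    target_id: int      # taxon that was kept
    name: str
    old_parent_id: int  # parent of the merged taxon, needed for a reversal
    reason: str
    vetoed: str = ""    # left empty, to be filled in by a curator


def write_decision_log(decisions, stream):
    """Write one CSV row per merge decision."""
    columns = [f.name for f in fields(MergeDecision)]
    writer = csv.DictWriter(stream, fieldnames=columns)
    writer.writeheader()
    for decision in decisions:
        writer.writerow(asdict(decision))


buffer = io.StringIO()  # a real run would write to a file instead
write_decision_log(
    [MergeDecision(4, 3, "Carex flava", 201, "GBIF accepted parent")],
    buffer,
)
print(buffer.getvalue())
```

A CSV file like this opens directly in a spreadsheet, which fits how the other result sets are being shared.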

This should be closed, and any follow-up regarding the above cases should become separate tickets.


4 participants