GNparser with OpenRefine
Version 0.1 | 2023-09-26
Contributors: @amandawhitmire, @dimus
- OpenRefine user manual
- OpenRefine documentation on reconciliation
- GBIF documentation about loading data and basic OpenRefine functionality (faceting, editing, etc.): [Use of OpenRefine](https://docs.gbif.org/course-data-mobilization/course-docs/OpenRefine-Exercise3c-EN.pdf)
- Name reconciliation with GNverifier and OpenRefine
OpenRefine is a powerful tool for cleaning, normalizing, combining, and transforming data. For scientific names, it is very useful to normalize them using parsers. In this example we extract the canonical form to use it for reconciliation with Wikidata; however, the same approach can be used to extract authorship, years, etc. from scientific names.
This documentation starts at the step where you have already created an OpenRefine project and have a column of scientific names (e.g., “Verbatim Name” below). If needed, see the documentation linked above to get started with installing OpenRefine and creating a project.
Wikidata endpoint: https://wikidata.reconci.link/en/api
If you want to reconcile names via the Wikidata endpoint, you have to use the canonical form of each name (without authorship).
When a scientific name contains authorship, you can transform it using the gnparser API.
gnparser understands the inner structure of scientific names and splits their components into separate fields.
First, create a new column using Add column by fetching URL...
Change the expression to the following line: "https://parser.globalnames.org/api/v1/" + escape(value, "url")
Also change Throttle delay from 500 to 5.
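To check what the expression above produces for a given name, the same URL can be built outside OpenRefine. This is a minimal Python sketch, not part of the OpenRefine workflow; the helper name `gnparser_url` is hypothetical, and it assumes percent-encoding is acceptable to the API (GREL's `escape(value, "url")` may encode some characters, such as spaces, differently).

```python
from urllib.parse import quote

# Base URL taken from the OpenRefine expression above.
GNPARSER_API = "https://parser.globalnames.org/api/v1/"


def gnparser_url(name: str) -> str:
    """Build the request URL for one verbatim scientific name (hypothetical helper)."""
    # safe="" percent-encodes every reserved character, including "/" and ",".
    return GNPARSER_API + quote(name, safe="")


print(gnparser_url("Bulimus canarius Philippi, in Pfeiffer, 1867"))
```

Pasting the printed URL into a browser is a quick way to inspect the response for a single name before fetching a whole column.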
At the end of the column creation, it will be filled with JSON-encoded data that looks like this:
[{"parsed":true,"quality":2,"qualityWarnings":[{"quality":2,"warning":"Ex authors are not required (ICZN only)"}],"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867","normalized":"Bulimus canarius Philippi ex Pfeiffer 1867","canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius","full":"Bulimus canarius"},"cardinality":2,"authorship":{"verbatim":"Philippi, in Pfeiffer, 1867","normalized":"Philippi ex Pfeiffer 1867","year":"1867","authors":["Philippi","Pfeiffer"]},"id":"eb69fa41-52ca-5d6b-9d16-d76d97a6dddc","parserVersion":"v1.7.4"}]
Now we need to extract the canonical form from the parsed JSON-encoded data. Use the "Edit cells->Transform" menu.
In the Expression field put parseJson(value)[0]["canonical"]["simple"]
This new column will now contain the canonical form of each name and can be used for reconciliation via Wikidata.
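For reference, the extraction step can be sketched in Python against a shortened copy of the sample response shown earlier. The function names are hypothetical, and the sketch assumes that a name gnparser could not parse simply lacks the "canonical" and "authorship" fields, in which case it returns None.

```python
import json

# Shortened copy of the sample response shown above (one parsed name per array item).
response_text = (
    '[{"parsed":true,'
    '"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867",'
    '"canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius",'
    '"full":"Bulimus canarius"},'
    '"authorship":{"normalized":"Philippi ex Pfeiffer 1867","year":"1867",'
    '"authors":["Philippi","Pfeiffer"]}}]'
)


def canonical_simple(text):
    """Mirror parseJson(value)[0]["canonical"]["simple"], returning None if absent."""
    record = json.loads(text)[0]
    canonical = record.get("canonical")
    return canonical["simple"] if canonical else None


def authorship_year(text):
    """Same approach for another field: the authorship year, or None if absent."""
    record = json.loads(text)[0]
    authorship = record.get("authorship")
    return authorship.get("year") if authorship else None


print(canonical_simple(response_text))
print(authorship_year(response_text))
```

The second function shows that other components, such as the year, can be pulled from the same JSON by changing only the field path.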