GNparser with OpenRefine

Reconciling taxonomic names in OpenRefine via Global Names

Version 0.1 | 2023-09-26

Contributors: @amandawhitmire, @dimus

Existing OpenRefine Documentation

OpenRefine user manual
OpenRefine documentation on reconciliation
GBIF documentation about loading data and basic OpenRefine functionality (faceting, editing, etc.): [Use of OpenRefine](https://docs.gbif.org/course-data-mobilization/course-docs/OpenRefine-Exercise3c-EN.pdf)
Name reconciliation with GNverifier and OpenRefine

Parse scientific names in OpenRefine with GNparser

OpenRefine is a powerful tool for cleaning, normalizing, combining, and transforming data. Scientific names in particular benefit from being normalized with a parser. In this example we extract the canonical form of each name to use it for reconciliation with Wikidata; the same approach can be used to extract authorship, years, and other components of scientific names.

This documentation starts at the step where you have already created an OpenRefine project and have a column of scientific names (e.g., “Verbatim Name” below). If needed, see the documentation linked above to get started with installing OpenRefine and creating a project.

Wikidata endpoint: https://wikidata.reconci.link/en/api

To reconcile names via the Wikidata endpoint, we have to use the canonical form of a name (without authorship). When a scientific name contains authorship, it can be converted to its canonical form with the GNparser API. GNparser understands the inner structure of scientific names and splits their components into separate fields.

First, create a new column from the column of scientific names using "Edit column->Add column by fetching URLs...".

Change expression to the following line: "https://parser.globalnames.org/api/v1/" + escape(value, "url"). Also change Throttle delay from 500 to 5.
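To spot-check a few names outside OpenRefine, the same request URL can be built in Python. This is a minimal sketch and not part of the OpenRefine workflow: the function name `gnparser_url` is hypothetical, and `urllib.parse.quote` stands in for GREL's `escape(value, "url")`, so the exact encoding of some characters (spaces in particular) may differ from what OpenRefine sends.

```python
from urllib.parse import quote

# Base URL copied from the OpenRefine expression above.
GNPARSER_BASE = "https://parser.globalnames.org/api/v1/"


def gnparser_url(name: str) -> str:
    """Return the GNparser request URL for one scientific name (hypothetical helper)."""
    # safe="" percent-encodes every reserved character, including "/" and ",".
    return GNPARSER_BASE + quote(name, safe="")


if __name__ == "__main__":
    print(gnparser_url("Bulimus canarius Philippi, in Pfeiffer, 1867"))
```

Pasting the printed URL into a browser should show the JSON that OpenRefine would store in the new column for that name.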

Once the column has been created, it is filled with JSON-encoded data that looks like this:

[{"parsed":true,"quality":2,"qualityWarnings":[{"quality":2,"warning":"Ex authors are not required (ICZN only)"}],"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867","normalized":"Bulimus canarius Philippi ex Pfeiffer 1867","canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius","full":"Bulimus canarius"},"cardinality":2,"authorship":{"verbatim":"Philippi, in Pfeiffer, 1867","normalized":"Philippi ex Pfeiffer 1867","year":"1867","authors":["Philippi","Pfeiffer"]},"id":"eb69fa41-52ca-5d6b-9d16-d76d97a6dddc","parserVersion":"v1.7.4"}]

Now we need to extract the canonical form from the parsed JSON-encoded data. On the new column, use the "Edit cells->Transform" menu.

In the Expression field put parseJson(value)[0]["canonical"]["simple"]
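The same extraction can be sketched in Python to show which parts of the response the expression reads, and how authorship and year could be pulled out in the same way. The sample string is abridged from the response shown above. The function name `extract_fields` is hypothetical, and the sketch assumes that a name GNparser cannot parse comes back with `"parsed": false` and no `"canonical"` object.

```python
import json

# Abridged copy of the sample response shown earlier on this page.
SAMPLE = (
    '[{"parsed":true,"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867",'
    '"canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius",'
    '"full":"Bulimus canarius"},'
    '"authorship":{"verbatim":"Philippi, in Pfeiffer, 1867",'
    '"normalized":"Philippi ex Pfeiffer 1867","year":"1867",'
    '"authors":["Philippi","Pfeiffer"]}}]'
)


def extract_fields(cell: str) -> dict:
    """Pull canonical form, authorship, and year out of one JSON cell (hypothetical helper)."""
    records = json.loads(cell)
    # The response is a list with one record per submitted name.
    record = records[0] if records else {}
    if not record.get("parsed"):
        # Assumption: unparsed names carry no "canonical" object.
        return {"canonical": None, "authorship": None, "year": None}
    authorship = record.get("authorship", {})
    return {
        # Same path as parseJson(value)[0]["canonical"]["simple"] in GREL.
        "canonical": record["canonical"]["simple"],
        "authorship": authorship.get("normalized"),
        "year": authorship.get("year"),
    }


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

In OpenRefine itself, the corresponding paths would be `["authorship"]["normalized"]` and `["authorship"]["year"]` in place of `["canonical"]["simple"]`.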

The new column now contains the canonical form of each name and can be used for reconciliation via Wikidata.
