GNparser with OpenRefine

Dmitry Mozzherin edited this page Sep 26, 2023 · 4 revisions

Reconciling taxonomic names in OpenRefine via Global Names

Version 0.1 | 2023-09-26

Contributors: @amandawhitmire, @dimus

Existing OpenRefine Documentation

Parse scientific names in OpenRefine with GNparser

OpenRefine is a powerful tool for massaging, normalizing, combining, and transforming data. For scientific names, normalizing them with a parser is especially useful. In this example we extract the canonical form to use it for reconciliation with Wikidata; however, the same approach can be used to extract authorship, years, etc. from scientific names.

This documentation starts at the step where you have already created an OpenRefine project and have a column of scientific names (e.g., “Verbatim Name” below). If needed, see the documentation linked above to get started with installing OpenRefine and creating a project.

Wikidata endpoint: https://wikidata.reconci.link/en/api

If reconciliation via the Wikidata endpoint is desired, we have to use the canonical form of a name (without authorship). When a scientific name contains authorship, it can be transformed using the gnparser API. gnparser understands the inner structure of scientific names and splits their components into separate fields.

First, create a new column using "Add column by fetching URL...".


Change the expression to the following line: "https://parser.globalnames.org/api/v1/" + escape(value, "url"). Also change the Throttle delay from 500 to 5.
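For readers who want to check outside OpenRefine what URL the expression produces, here is a minimal Python sketch of the same idea. The function name `build_gnparser_url` is hypothetical, and the exact percent-encoding may differ in small details from what GREL's escape(value, "url") emits (for example, in how spaces are encoded).

```python
from urllib.parse import quote

# Base URL taken from the expression above.
GNPARSER_API = "https://parser.globalnames.org/api/v1/"


def build_gnparser_url(name: str) -> str:
    """Append a percent-encoded scientific name to the gnparser API base URL.

    safe="" makes quote() encode every reserved character, including
    commas and slashes, so the whole name stays in one path segment.
    """
    return GNPARSER_API + quote(name, safe="")


url = build_gnparser_url("Bulimus canarius Philippi, in Pfeiffer, 1867")
print(url)
```

Opening the printed URL in a browser is a quick way to preview the JSON that OpenRefine will place in the new column.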


Once the column has been created, it is filled with JSON-encoded data that looks like this:

[{"parsed":true,"quality":2,"qualityWarnings":[{"quality":2,"warning":"Ex authors are not required (ICZN only)"}],"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867","normalized":"Bulimus canarius Philippi ex Pfeiffer 1867","canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius","full":"Bulimus canarius"},"cardinality":2,"authorship":{"verbatim":"Philippi, in Pfeiffer, 1867","normalized":"Philippi ex Pfeiffer 1867","year":"1867","authors":["Philippi","Pfeiffer"]},"id":"eb69fa41-52ca-5d6b-9d16-d76d97a6dddc","parserVersion":"v1.7.4"}]
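To see which fields this response offers beyond the canonical form, the following Python sketch reads the sample above and collects the authorship, year, and quality warnings. It is only an illustration: the helper name `summarize` is hypothetical, and the sketch assumes the response is a JSON array with one object per name, as in the sample.

```python
import json

# Sample response copied verbatim from the example above.
SAMPLE = '[{"parsed":true,"quality":2,"qualityWarnings":[{"quality":2,"warning":"Ex authors are not required (ICZN only)"}],"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867","normalized":"Bulimus canarius Philippi ex Pfeiffer 1867","canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius","full":"Bulimus canarius"},"cardinality":2,"authorship":{"verbatim":"Philippi, in Pfeiffer, 1867","normalized":"Philippi ex Pfeiffer 1867","year":"1867","authors":["Philippi","Pfeiffer"]},"id":"eb69fa41-52ca-5d6b-9d16-d76d97a6dddc","parserVersion":"v1.7.4"}]'


def summarize(response_text: str) -> dict:
    """Collect a few fields from the first record of a gnparser response."""
    record = json.loads(response_text)[0]
    # Use .get() with defaults so names without authorship do not raise.
    authorship = record.get("authorship", {})
    return {
        "canonical": record["canonical"]["simple"],
        "authors": authorship.get("authors", []),
        "year": authorship.get("year"),
        "warnings": [w["warning"] for w in record.get("qualityWarnings", [])],
    }


summary = summarize(SAMPLE)
print(summary)
```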

Now we need to extract the canonical form from the parsed JSON-encoded data. Use the "Edit cells->Transform" menu.


In the Expression field, enter parseJson(value)[0]["canonical"]["simple"]
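The same lookup can be sketched in Python to make clear what the expression does with each cell. The function name `canonical_simple` is hypothetical, the first sample cell is trimmed from the response shown earlier, and the second cell is an invented example that assumes an unparsable name comes back without a "canonical" field.

```python
import json
from typing import Optional


def canonical_simple(cell_value: str) -> Optional[str]:
    """Mirror parseJson(value)[0]["canonical"]["simple"], returning None
    instead of raising when the cell is empty or has no canonical form."""
    if not cell_value:
        return None
    results = json.loads(cell_value)
    if not results:
        return None
    canonical = results[0].get("canonical") or {}
    return canonical.get("simple")


# Trimmed from the sample response shown earlier.
parsed_cell = '[{"parsed":true,"verbatim":"Bulimus canarius Philippi, in Pfeiffer, 1867","canonical":{"stemmed":"Bulimus canar","simple":"Bulimus canarius","full":"Bulimus canarius"}}]'
# Hypothetical cell for a string that could not be parsed (assumed shape).
unparsed_cell = '[{"parsed":false,"verbatim":"???"}]'

print(canonical_simple(parsed_cell))
print(canonical_simple(unparsed_cell))
```

Returning None for missing values keeps such rows easy to spot before reconciliation.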


This new column will now contain the canonical form of each name and can be used for reconciliation via Wikidata.
