Names of subgenera don't get parsed if subgen. is included in the scientific name value #232
Comments
Hm, interesting. @gdower, do you know how often such names occur?
It's not super-common in the datasets that I work with, but I do see it occasionally. In World Plants/World Ferns, 0.83% of the lines include "subgen." In that source it often appears like this: subgen. Filago (without the genus included).
Sorry, I should have provided more context. There are 38 subgenera with this scientificName structure in the current version of COL (2022-07-12). These are derived from two different sources, both entomological:
@KatjaSchulz and @gdower, thank you for the information! I am on the fence about this particular parsing. If there are so few of these names, does it make sense to slow down parsing for the vast majority of other names by checking for this specific case? I'll try to figure out a faster approach to check the first word.
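To illustrate what a cheap first-word check could look like, here is a minimal Python sketch. It is not the parser's actual implementation: the function name `starts_with_rank` and the one-item marker list are assumptions made for illustration. The idea is that well-formed names start with an uppercase letter, so a single character test rejects almost every input before any prefix comparison runs.

```python
# Hypothetical marker list for illustration; a real parser may recognize more.
RANK_PREFIXES = ("subgen.",)


def starts_with_rank(name: str) -> bool:
    """Return True if the name string begins with a known rank marker.

    The lowercase test on the first character is the cheap gate: ordinary
    names start with an uppercase letter and return early.
    """
    if not name or not name[0].islower():
        return False
    # Only the rare lowercase-initial inputs reach the prefix comparison.
    return name.startswith(RANK_PREFIXES)
```

With such a gate, the extra prefix comparison only runs for the rare inputs that begin with a lowercase letter.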
I'm not really in a position to evaluate whether it would be worth it. For the time being, we can handle these through post-processing. We can revisit if we encounter more cases with this usage.
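One way such post-processing might look, sketched in Python. The helper `split_leading_rank` is hypothetical and handles only the `subgen.` marker discussed in this thread. It splits the marker off so that the remainder can be passed to the parser and the rank recorded separately.

```python
from typing import Optional, Tuple

RANK_MARKER = "subgen."  # the only marker discussed in this thread


def split_leading_rank(name: str) -> Tuple[Optional[str], str]:
    """Split 'subgen. X Author, Year' into ('subgen.', 'X Author, Year').

    Names without the leading marker are returned with rank None.
    """
    stripped = name.strip()
    if stripped.startswith(RANK_MARKER + " "):
        # Drop the marker and the whitespace that follows it.
        return RANK_MARKER, stripped[len(RANK_MARKER):].lstrip()
    return None, stripped


rank, rest = split_leading_rank("subgen. Psammophrynopsis Koch, 1953")
```

Here `rest` is the string that would be handed to the parser, and `rank` would be stored next to the parsed result.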
@yroskov and I are aiming to fix those names in the Sept release of CoL.
It now parses like this:
{
"parsed": true,
"quality": 2,
"qualityWarnings": [
{
"quality": 2,
"warning": "Uninomial prepended by its rank"
}
],
"verbatim": "subgen. Psammophrynopsis Koch, 1953",
"normalized": "subgen. Psammophrynopsis Koch 1953",
"canonical": {
"stemmed": "Psammophrynopsis",
"simple": "Psammophrynopsis",
"full": "subgen. Psammophrynopsis"
},
"cardinality": 1,
"rank": "subgen.",
"authorship": {
"verbatim": "Koch, 1953",
"normalized": "Koch 1953",
"year": "1953",
"authors": [
"Koch"
],
"originalAuth": {
"authors": [
"Koch"
],
"year": {
"year": "1953"
}
}
},
"details": {
"uninomial": {
"uninomial": "Psammophrynopsis",
"rank": "subgen.",
"authorship": {
"verbatim": "Koch, 1953",
"normalized": "Koch 1953",
"year": "1953",
"authors": [
"Koch"
],
"originalAuth": {
"authors": [
"Koch"
],
"year": {
"year": "1953"
}
}
}
}
},
"words": [
{
"verbatim": "subgen.",
"normalized": "subgen.",
"wordType": "RANK",
"start": 0,
"end": 7
},
{
"verbatim": "Psammophrynopsis",
"normalized": "Psammophrynopsis",
"wordType": "UNINOMIAL",
"start": 8,
"end": 24
},
{
"verbatim": "Koch",
"normalized": "Koch",
"wordType": "AUTHOR_WORD",
"start": 25,
"end": 29
},
{
"verbatim": "1953",
"normalized": "1953",
"wordType": "YEAR",
"start": 31,
"end": 35
}
],
"id": "1b8f7c8c-16c8-5411-a992-f7945f0e3838",
"parserVersion": "v1.9.2-5-g93e2782"
}
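For anyone consuming this output downstream, here is a small Python sketch of reading the relevant fields. It uses an abridged copy of the JSON above. The helper names `summarize` and `offsets_match` are made up for illustration and are not part of the parser.

```python
import json

# Abridged copy of the output above; only the fields used here are kept.
output = json.loads("""
{
  "verbatim": "subgen. Psammophrynopsis Koch, 1953",
  "canonical": {"simple": "Psammophrynopsis", "full": "subgen. Psammophrynopsis"},
  "rank": "subgen.",
  "qualityWarnings": [{"quality": 2, "warning": "Uninomial prepended by its rank"}],
  "words": [
    {"verbatim": "subgen.", "wordType": "RANK", "start": 0, "end": 7},
    {"verbatim": "Psammophrynopsis", "wordType": "UNINOMIAL", "start": 8, "end": 24},
    {"verbatim": "Koch", "wordType": "AUTHOR_WORD", "start": 25, "end": 29},
    {"verbatim": "1953", "wordType": "YEAR", "start": 31, "end": 35}
  ]
}
""")


def summarize(parsed: dict) -> dict:
    """Collect the name, its rank, and any quality warnings."""
    return {
        "name": parsed["canonical"]["simple"],
        "rank": parsed.get("rank"),
        "warnings": [w["warning"] for w in parsed.get("qualityWarnings", [])],
    }


def offsets_match(parsed: dict) -> bool:
    """Check that each word's start/end offsets slice its text out of verbatim."""
    text = parsed["verbatim"]
    return all(text[w["start"]:w["end"]] == w["verbatim"] for w in parsed["words"])


summary = summarize(output)
```

The offset check confirms that `start` and `end` index into the verbatim string, so the rank marker can be located or removed without re-parsing.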
This is now part of the v1.10.0 release.
Psammophanes subgen. Psammophrynopsis Koch, 1953 parses fine, with a warning.
subgen. Psammophrynopsis Koch, 1953 doesn't get parsed at all.
It would be good if names preceded by subgen. were parsed. They do occur in the wild, e.g. in current Catalogue of Life files.