any chances to search datasources by id to get accepted names? #85
@abubelinha, first of all, thanks a lot for reporting #83; it would have been very hard to find without your spotting of the problem. Q1: Currently the gnverifier database is rebuilt from the resolver database, so they are almost always in sync. At the moment I am waiting for fixes in the Parasite Tracker dataset, and I will sync the databases when I get an update from them. Q2: I would prefer to solve the problem by fixing gnverifier; would that be a reasonable solution for you? Such manual one-time tasks of matching legacy and current tools are not a good use of time, I think. It is better to solve the problems for all users. Related issues are:
Spotting #83 was just pure chance, but a consequence of me comparing results returned by verifier & resolver. I did it blindly assuming that, except for their API differences, the results should be identical (same data sources, algorithms and matching rules, just implemented in a faster programming language). I hadn't made any comparisons by that time, and trusted that the same list processed by the two services would keep returning exactly the same matches. But now I am finding many examples proving that was not the case. And to my surprise, when I found differences the winner was always resolver (I mean that resolver's matching results were much closer to my human matching criteria than verifier's; there may be examples where resolver loses, but I still haven't found one). I am just guessing, but from your recent answers I understand those differences may not always be "bugs", as I thought, but expected behaviour? (a different design, focused on a speedier matching process).
Indeed, I agree. But I am not sure if mine is actually a problem for other users. I also love many aspects of verifier compared to resolver (extra information about the accepted/synonym status of names in some data sources, which is crucial for my use case; plus more atomized information about matching scores, which is so good to know). It was in this context that I asked about the possibility of sticking to resolver's "goodness" and not bothering with "verifier issues" which might not be issues for most people. Might those IMHO wrong "PartialExact" matches sometimes be related to an "overloadDetected" warning? ("Too many variants (possibly strains), some results are truncated", i.e. Quercus or Pinus for the same queries above.) Resolver does not have this overload problem, and just returns the correct species by "fuzzy match".
Indeed, but I don't want to be a bother or go against the majority.
Sometimes the differences are, indeed, due to speed requirements, but in other cases it is just a matter of tuning parameters. I will go through your examples and analyze their behavior in gnverifier. For cases where you know that letters are missing, you can try to use search instead of verify:
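As a concrete illustration, something like the following could query the search endpoint when only the start of an epithet is certain. The endpoint path and the `g:`/`sp:` query syntax (with a trailing `.` assumed to act as a truncation wildcard) are my reading of the gnverifier search feature, not confirmed here; check the live API docs before relying on them:

```python
from urllib.parse import quote

# Hypothetical sketch: build a gnverifier faceted-search URL for a name
# where only the beginning of the epithet is known.  The endpoint path and
# the "g:<genus> sp:<prefix>." query syntax are assumptions; consult the
# gnverifier API documentation for the real contract.
BASE = "https://verifier.globalnames.org/api/v1/search"

def build_search_url(genus: str, epithet_prefix: str) -> str:
    # Trailing "." is assumed to act as a truncation wildcard.
    query = f"g:{genus} sp:{epithet_prefix}."
    return f"{BASE}/{quote(query)}"

print(build_search_url("Ionopsidium", "abul"))
# To actually run the search (needs network access):
# import json, urllib.request
# results = json.load(urllib.request.urlopen(build_search_url("Ionopsidium", "abul")))
```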
The goal with resolver was to make it work. So your feedback is very important, and it helps to find where the algorithms in verifier need to be tweaked.
OK understood. Thanks.
Well, those weren't real cases. I was showing a simple example of "easy fuzzy matching tests" where resolver succeeded while verifier failed (to my surprise). It's easy to find other similar examples by making different tiny changes in the epithet string.

Another example of the relevance of catching these single-letter changes is the orthographic variants of botanical names, not uncommon in literature, herbarium labels and databases. Again, resolver handles them (by fuzzy matches) much better than verifier (i.e. if I try to match Jonopsidium abulense, resolver correctly fuzzy-matches Ionopsidium abulense in all 3 of my preferred data sources; but verifier does not match it in any of them, and just returns a bestResult from those data sources which used the "J" orthovariant).

But I wonder why this resolver behaviour is not always the same. For some other names, if I change just the first letter, resolver does not fuzzy-match them either (example: Ulva lactuca matches in ds 195, but changed to Vlva lactuca there is no resolver fuzzy match). I am pretty confused about that. Anyway ... IMHO those one-letter changes are cases that should always return a fuzzy match in either resolver or verifier.

I had thought about making a simple comparator between resolver and verifier outputs. Some script which followed these steps:
But I was already finding differences quite easily by hand, so this script became unnecessary (you might be interested in writing it, though).
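Such a comparator could be sketched roughly as follows. The JSON shapes are assumptions based on the fields discussed in this thread (resolver entries carrying `results` with `name_string`, verifier entries carrying `bestResult` with `matchedName`); the real API payloads should be checked before relying on them:

```python
# Sketch of the resolver-vs-verifier comparator mentioned above.  The JSON
# field names are assumptions taken from this discussion, not a confirmed
# API contract; adjust them to the real payloads.
def best_resolver_match(entry: dict):
    """First matched name_string from one resolver data entry, if any."""
    results = entry.get("results") or []
    return results[0].get("name_string") if results else None

def best_verifier_match(entry: dict):
    """matchedName from verifier's bestResult, if any."""
    best = entry.get("bestResult")
    return best.get("matchedName") if best else None

def compare(query: str, resolver_entry: dict, verifier_entry: dict) -> dict:
    r = best_resolver_match(resolver_entry)
    v = best_verifier_match(verifier_entry)
    return {"query": query, "resolver": r, "verifier": v, "agree": r == v}

# Toy payloads mirroring the Jonopsidium example discussed above:
example = compare(
    "Jonopsidium abulense",
    {"results": [{"name_string": "Ionopsidium abulense"}]},
    {"bestResult": {"matchedName": "Jonopsidium abulense"}},
)
print(example)
```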
Try Quercus robur, Pinus pinaster, Zea mays, Oryza sativa ... but in verifier. Some of these do not work, for example, because fuzzy match works with stemmed versions. In general this can be solved by using edit distance 2 as a threshold (about 10 times slower than 1). Besides being slow, edit distance 2 also introduces many more false positives. But in cases like yours, where you check the results manually, it would not be a problem either. Note that the 1 or 2 edit distance applies to stems, so the final edit distance can actually be 2, 3, or even 4. Probably the parameter should be called
This happens because there is a quota on how many errors are allowed per so many letters. I recall that for resolver the quota is 1 error per 6 letters, so Vlva does not get through. The purpose of the quota is to remove false positives. For gnverifier the quota is 1 error per 5 letters. I will try to reduce it to 4 for the next release.
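The quota idea can be illustrated with a small sketch. The real implementations in resolver/gnverifier certainly differ in details (stemming, how multi-word names are split); here the quota is assumed to apply per word:

```python
# Illustration of the "1 error per N letters" quota described above.
# Assumption: the quota is applied per word, on plain (unstemmed) strings.
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def passes_quota(candidate: str, reference: str, letters_per_error: int) -> bool:
    allowed = len(reference) // letters_per_error
    return levenshtein(candidate, reference) <= allowed

# "Vlva" vs "Ulva" is 1 edit.  With 1 error per 6 letters, a 4-letter word
# allows 0 edits, so the match is rejected; with 1 per 4 it is accepted.
print(passes_quota("Vlva", "Ulva", 6))  # False
print(passes_quota("Vlva", "Ulva", 4))  # True
```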
Comparing resolver and gnverifier: feedback from you and others is the most useful input for me in understanding use cases, and in trying to bring functionality that covers as many use cases as possible within the existing constraints.
I see ... it is quite complicated to tune the tool for every need. As for my use case: I want to validate a draft list of species known to exist in a particular region:
It would be great if gndiff eventually had options to fine-tune its fuzziness with all those numerical parameters you mention, plus let the user select the output detail (i.e. choosing parsed info columns and so on). I think that would be an easier solution than changing server-side variables, which could improve my results but be wrong for other use cases.
Indeed. My goal is rather to figure out, from existing use cases, an intersection space where a gn tool can be of help, and to move towards that space.
BTW, gnverifier already checks authors. The field scoreDetails -> authorMatchScore can probably help you.
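For example, a defensive accessor for that field might look like this. The exact nesting (`bestResult -> scoreDetails -> authorMatchScore`) is taken from the field names mentioned in this thread and should be verified against a live API response:

```python
# Defensive accessor for gnverifier's author-match score.  The nesting of
# keys below is an assumption based on the field names in this discussion.
def author_match_score(name_entry: dict):
    best = name_entry.get("bestResult") or {}
    details = best.get("scoreDetails") or {}
    return details.get("authorMatchScore")

sample = {"bestResult": {"scoreDetails": {"authorMatchScore": 1.0}}}
print(author_match_score(sample))  # 1.0
print(author_match_score({}))      # None
```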
Would you make an issue at gndiff describing this?
I did it two weeks ago ... but perhaps it was toooooo verbose?
Ah, sorry about that. I thought it was general thoughts, and I just did not get to it yet :) I'll take a look. As a rule of thumb, when there is a concrete task, it is better to keep it separate from the rest; that allows me to create a focused commit that addresses that particular problem.
OK, I'll try to do that.
Thanks! I opened issue #86 about this.
I have a new concern about the many use cases: #87. I also see gndiff as a great opportunity for letting users tune default parameters without affecting API development (scoring behaviour, return parameters required, bandwidth used ...).
I am considering temporarily going back to the resolver.globalnames.org API, as long as it keeps returning better fuzzy/stem matches (at least for the few example names I have tested).
However, doing this creates other problems, because my final objective is creating a regional checklist of accepted names (based on some trusted data sources).
Question 1: I believe the data source versions are identical in both resolver and gnverifier. Correct? (I'm not sure, because resolver shows a no. 200 which is not in gnverifier.)
Anyway, I have realized that resolver returns a (matched) "name_string" and a (matched) "taxon_id", but it does not provide any information about that name's status according to the data source (accepted, synonym, ...).
verifier, on the other hand, returns a "matchedName" and a "recordId" (equivalent to those of resolver), PLUS a "currentName" and "currentRecordId", at least when the data source provides them.
So, question 2: is there any way to reprocess resolver's output, generating a list of "taxon_id" + "data_source_id" pairs (I'll do this myself, of course), and send this list to gnverifier or any other gnames product which can return me a list with "currentName" and "currentRecordId"?
If that's not possible, any other suggestion on how to do this?
i.e., is there any chance to download a full given dataset from gnames in order to run this match locally? (By "full dataset" I mean a simple datasourceID.csv with at least these 4 columns: name, recordId, currentName, currentRecordId.)
I guess there might be other APIs out there (GBIF, COL), but I prefer to avoid the possible problems which would arise if data source versions differ from those of gnames (see #81), since some records and name statuses could be different between the two versions (or even missing in one of them).
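If such a per-datasource CSV dump existed (it is only proposed here, not confirmed to be available), the local reprocessing step could look roughly like this; the file layout and column names are the hypothetical ones from the question above:

```python
import csv
import io

# Hypothetical local reprocessing: map resolver's matched taxon_ids to
# accepted names using a per-datasource CSV dump with the four columns
# proposed above (name, recordId, currentName, currentRecordId).  No such
# download is confirmed; this only sketches the mapping step.
def load_accepted_index(csv_text: str) -> dict:
    """Build recordId -> (currentName, currentRecordId) from the dump."""
    index = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        index[row["recordId"]] = (row["currentName"], row["currentRecordId"])
    return index

def to_accepted(resolver_rows, index):
    """resolver_rows: iterable of (taxon_id, matched_name) pairs."""
    out = []
    for taxon_id, matched_name in resolver_rows:
        current = index.get(taxon_id)
        out.append({
            "matched": matched_name,
            "currentName": current[0] if current else None,
            "currentRecordId": current[1] if current else None,
        })
    return out

dump = """name,recordId,currentName,currentRecordId
Ionopsidium abulense,123,Ionopsidium abulense,123
Jonopsidium abulense,456,Ionopsidium abulense,123
"""
index = load_accepted_index(dump)
print(to_accepted([("456", "Jonopsidium abulense")], index))
```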
Thanks a lot in advance