-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add curation_rule to SSSOM #258
Conversation
Hi. The value type for curation rule, i.e. EntityReference (namely, a URI), turns to be too restrictive. It is particularly inadequate for users publishing their mapping set in the TSV format, and potentially not speaking RDF. This is what D2KAB parterns expressed. I'd suggest, if possible, to have either text or a URI. This would allow to publish rules either directly in the mapping set, as a code refering to a piece of text in a side document, or as a RDF resource. |
I hear your argument @saubin78 and you are not the only one that suggested this. The reason why I at least here opted for a resource based solution was to open the door for sharing curation rules in the wider future. Machines will not be able to do anything with a natural language curation rule (so far these have gone into the "comment" field). What about this as a compromise: we add two fields, curation_rule_id and curation_rule_text. this is analogous to author_id and author_label etc. This keeps the door open for structured curation rule objects. We can even employ tools that extract the text from a structured curation rule and adds it to the curation_rule_text field. |
@saubin78 what do you think now? |
About curation rules, in the future it would be great to share them in order to reuse them. Unfortunatelly we are at the beginning so we do not know if people will express curation rules in the same way, that the reason why it is first necessary to collect some textual example of curation rules before setting them more formally using RDF language. |
@croussey thank you very much for your input. I think I have come around to agree with you. This suggestion is now reflected in this PR by having two elements. Do you like the proposed solution? |
I am super hesitant about adding any new rules without some convincing and well-described examples (both on the human prose level and also actual example SSSOM files!). What I've seen right now are nice ideas, but I can't see how they will translate to SSSOM in reality. For example, I have yet to see and example that faithfully uses a controlled vocabulary to refer to a curation rule. Do such vocabularies exist? Or will this just be another field that gets abused? |
@cthoyt and you should be hesitant. Here is an issue that illustrates "shareable curation rules": mapping-commons/disease-mappings#16 Here is another, complex discussion on the matter: mapping-commons/mh_mapping_initiative#14 The main, but not only, gap that curation rules fill is where "a human said so" is not enough. You want to capture some of the intuition behind the human curation, like: "From the labels and definitions provided, I made a call", or: two chemicals from different databases are considered "the same" if they have the same molecular structure, or: we determine that two disease entities refer to the same thing based on their identical phenotypic profile. There is no such vocabulary, and building one will take years - I suggest not to worry about this for now from the sssom side. Users have been asking for months now to get a way to capture these rules, so my idea is to see what will practically be curated, and if it is worth it, we can, in the mid-term, make a review of existing mapping sets and capture the curation rules in a specific vocabulary. I hope that makes some sense. Thank you for taking the time to looking at this! |
Due to popular demand I will merge this on Wednesday the 15th of March next week, and make a new release. |
I still think that there should be several examples of this being used included in the PR before even considering that it can be merged -> like actual examples, appearing in SSSOM files. |
I think I will start building a library of examples for other justifications as well. @saubin78 would you be able to help me by creating a small sssom file with two or three curation rules in it? We could put it here: https://github.com/mapping-commons/sssom/tree/master/examples/external Or you can just share it as a google sheet and I will add it there. I will add the ones from the phenotype ticket above as well. |
#creator_id: orcid:0000-0002-7356-1779 | ||
#license: "https://creativecommons.org/publicdomain/zero/1.0/" | ||
#mapping_provider: "https://w3id.org/sssom/core_team" | ||
#comment: This is an example file for the SSSOM for illustration only. Its contents are entirely fabricated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i want to go as far as saying that we should enforce that the examples given are REAL and not fabricated... Or am I asking too much that people should explain in detail and give real concrete use cases before SSSOM gets polluted with lots of indecipherable fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, https://w3id.org/sssom/commons/disease/curation-rules/MPR2 does not resolve to anything, therefore I can not understand what this means, and can not review the merits of this field based on this example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By fabricated I mean: this is not to be used / maintained for any practical purposes. I am totally out of steam on this one - we can push for incremental improvements moving forward. This example has real curation rules, I just didn't apply them to real data..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I tried to explain in our last call - PURLs should, but don't have to, resolve to something. Its not part of SSSOM to prescribe what goes in these curation rules. Different communities will decide to create shareable representations, and they will decide to provide resolveable resources and examples. When you review a PR of a mapping set for use, you can, in your organisation, apply whatever quality thresholds you want during the review. On SSSOM metadata level we just say: there is an element to represent curation rules, this is how you do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:( I am sorry that my feedback has exhausted you.
we can push for incremental improvements moving forward
You and I both know what this means :p
How can we move forwards so the burden of making good, actionable improvements to SSSOM is distributed from you to community members who are requesting them? E.g., improving the new field request template, adding more CI/CD is a great start.
@saubin78 for a start, can you help alleviate some of this burden? Can you help provide actionable examples of how this might work (e.g., improve the example files Nico made to actually be meaningful examples)? Then, Nico won't feel so much burden from me giving important (but ultimately difficult to address) feedback in addition to the project-based pressure to just "get this done" that if done prematurely, could erode trust and sustainability of SSSOM (and more burn out)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I tried to explain in our last call - PURLs should, but don't have to, resolve to something. Its not part of SSSOM to prescribe what goes in these curation rules. Different communities will decide to create shareable representations, and they will decide to provide resolveable resources and examples. When you review a PR of a mapping set for use, you can, in your organisation, apply whatever quality thresholds you want during the review. On SSSOM metadata level we just say: there is an element to represent curation rules, this is how you do it.
I don't disagree with this. But I think it's reasonable to ask that people who are making proposals of new fields to go above and beyond the minimum requirements of PURLs to give context to these fields for the purpose of an example. As it is, without a herculean effort of reading through meandering GitHub issue conversations and documentation in various places, it's really not obvious to understand what's going on here. If we can't agree on this, can we at least agree that there should be a detailed explanation in the preamble of the SSSOM document explaining what the new predicates are supposed to be for?
Hi. Is it ok if I add an example file this afternoon? |
here is a real mapping example file. For the sake of simplicity, I copied the simplified rules directly in the TSV file. In the real complete mapping set, we'll prefer to have them in a separate file with examples. |
Thank you @saubin78, the formatting is great, but I think I don't understand the mapping rule exactly:
|
Right! Just let me clarify this with the mapping authors for the few examples I sent you. |
See https://bioregistry.io/registry/co_321 (it's also in the prefix map) |
The example as given is very un-self-explanatory, sorry. I'm not sure if I've gotten the wrong end of the multiple curation_rule sticks in play, or if it's just that it's still in development, of if it's extraordinarily domain-specific, or if it's just badly worded in the original. It is perhaps an example of why made-up examples are often better than actual examples. TLDR version (to skip the mild rant that follows): Don't overspecify this, and don't overexemplify it either. Allow it to breathe.
I don't think I agree with what you think this means. I think it is not only typical, but sometimes necessary, to leave specifications deliberately vague. When people get more experience they know much better what they want and the spec can be, and does get, improved. Most specs that get used heavily (as this one already is) get updated, improved, extended, and profiled. This will be that.
I think that it is reasonable to consider what a realistic specification would have to go into in order to make the requested capability function in a consistent way, and to recognize that there are multiple requirements people have in mind for these two parameters, and that for those reasons maybe they are a bridge too far for this level of the specification. Because I promise you that consistency in the definition (especially computational consistency from resolved IRIs) would be a long, long ride from where we are today. But if proposed fields are deemed so useful that we must allow for their possible use(s), then it is entirely reasonable to provide a fuzzy solution. For example, curation_rule_text_comment is about as fuzzy a field as you could imagine, and therefore IMHO not more useful than a With that definition the curation_id is fully adoptable by a variety of different communities, each with their own goals and implementation strategies. That in no way lessens the specification's value—it is very clear what is, and is not, being specified. (And going the other route, detailed specs and examples often don't control what happens in the wild. Consider how people use standards generally (DOIs, 'same_as', 'xref') to see how supposedly fully specified and exemplified standards get used and abused.) So I'm not sure I love curation_x fields in general, but I argue the only way you can make them viable and appropriately general-purpose in the short term is to specify them in a simple way that allows them to evolve. Otherwise, without the community effort to really work them through, a lot of the community will end up using them differently than specified or desired, which will make that part of the spec ineffective. |
@graybeal it is a little bit of each. Here are the same examples (real) with a rewording of the curation rules so that they are more self-explanatory, and hopefully more useful for the SSSOM users. |
@saubin78 thank you very much; you have added these examples:
I think your curation rules makes sense, especially paired with skos:closeMatch. I will give other an opportunity to react. @graybeal I full-heartedly agree with your sentiment ("Don't overspecify this, and don't overexemplify it either") - it is my approach as well; I value however the challenging questions @cthoyt is raising by being a bit pedantic as well. I think iteration is always better than perfection, but the step-size for every iteration must be reasonably large, which means that perfection and "iterate fast" need to keep negotiating no matter how painful. All your comments on this PR (@graybeal, @saubin78, @croussey and @cthoyt) have already led to significant improvements of SSSOM:
There are some things we have not achieved:
I would like to call it a day on this one: |
Thanks Nico for the recap and clarification and @ALL for the great discussion. |
Example is much improved, thank you. |
Thank you everyone, we made a new release: https://github.com/mapping-commons/sssom/releases/tag/0.11.0 @saubin78 we really appreciate your contributions (not just on this ticket, but various ones across the tracker); if you like to be added to SSSOM core team to also be included in publications and mentioned on presentations let me know! |
@matentzn Thanks you so much for your awesome work on this new release! I'd be glad to join the SSSOM core team and help spreading the word at INRAE and within the Agricultural, French and EOSC communities (with my participation to the FAIR IMPACT project) |
I have added you! Welcome to the 20th member of the core team! If you send me an email I will also invite you to OBO slack where we have a sssom channel. |
Fixes #166
docs/
have been added/updated if necessarymake test
has been run locallyThe need for representing specific curation rules is everywhere, see #166. It will be very difficult, if not impossible, to standardise curation rules, so I would advocate we leave this totally open for now. Representing the rules as a resource basically makes them an open ended enum - which gives us more flexibility for adding structure later. @saubin78 suggested to create a class "MappingRule" and have curation rules being instances of mapping rules.
We also should decide reasonably soon if we want to rename curation rule into mapping rule altogether, if we agree with @saubin78 assertion in #166 that computational rules can also be curation rules (I think that's fair!).