Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add curation_rule to SSSOM #258

Merged
merged 12 commits into from
Mar 16, 2023
Merged

Add curation_rule to SSSOM #258

merged 12 commits into from
Mar 16, 2023

Conversation

matentzn
Copy link
Collaborator

@matentzn matentzn commented Feb 16, 2023

Fixes #166

  • docs/ have been added/updated if necessary
  • make test has been run locally
  • tests have been added/updated (if applicable)
  • CHANGELOG.md has been updated.

The need for representing specific curation rules is everywhere, see #166. It will be very difficult, if not impossible, to standardise curation rules, so I would advocate we leave this totally open for now. Representing the rules as a resource basically makes them an open ended enum - which gives us more flexibility for adding structure later. @saubin78 suggested to create a class "MappingRule" and have curation rules being instances of mapping rules.

We also should decide reasonably soon if we want to rename curation rule into mapping rule altogether, if we agree with @saubin78 assertion in #166 that computational rules can also be curation rules (I think that's fair!).

@saubin78
Copy link

saubin78 commented Mar 2, 2023

Hi. The value type for curation rule, i.e. EntityReference (namely, a URI), turns to be too restrictive. It is particularly inadequate for users publishing their mapping set in the TSV format, and potentially not speaking RDF. This is what D2KAB parterns expressed.

I'd suggest, if possible, to have either text or a URI. This would allow to publish rules either directly in the mapping set, as a code refering to a piece of text in a side document, or as a RDF resource.

@matentzn
Copy link
Collaborator Author

matentzn commented Mar 2, 2023

I hear your argument @saubin78 and you are not the only one that suggested this. The reason why I at least here opted for a resource based solution was to open the door for sharing curation rules in the wider future. Machines will not be able to do anything with a natural language curation rule (so far these have gone into the "comment" field).

What about this as a compromise: we add two fields, curation_rule_id and curation_rule_text. this is analogous to author_id and author_label etc. This keeps the door open for structured curation rule objects. We can even employ tools that extract the text from a structured curation rule and adds it to the curation_rule_text field.

@matentzn
Copy link
Collaborator Author

matentzn commented Mar 3, 2023

@saubin78 what do you think now?

@croussey
Copy link

croussey commented Mar 6, 2023

About curation rules, in the future it would be great to share them in order to reuse them. Unfortunatelly we are at the beginning so we do not know if people will express curation rules in the same way, that the reason why it is first necessary to collect some textual example of curation rules before setting them more formally using RDF language.

@matentzn
Copy link
Collaborator Author

matentzn commented Mar 6, 2023

@croussey thank you very much for your input. I think I have come around to agree with you. This suggestion is now reflected in this PR by having two elements. Do you like the proposed solution?

@cthoyt
Copy link
Member

cthoyt commented Mar 7, 2023

I am super hesitant about adding any new rules without some convincing and well-described examples (both on the human prose level and also actual example SSSOM files!). What I've seen right now are nice ideas, but I can't see how they will translate to SSSOM in reality. For example, I have yet to see and example that faithfully uses a controlled vocabulary to refer to a curation rule. Do such vocabularies exist? Or will this just be another field that gets abused?

@matentzn
Copy link
Collaborator Author

matentzn commented Mar 8, 2023

@cthoyt and you should be hesitant.

Here is an issue that illustrates "shareable curation rules": mapping-commons/disease-mappings#16

Here is another, complex discussion on the matter: mapping-commons/mh_mapping_initiative#14

The main, but not only, gap that curation rules fill is where "a human said so" is not enough. You want to capture some of the intuition behind the human curation, like: "From the labels and definitions provided, I made a call", or: two chemicals from different databases are considered "the same" if they have the same molecular structure, or: we determine that two disease entities refer to the same thing based on their identical phenotypic profile.

There is no such vocabulary, and building one will take years - I suggest not to worry about this for now from the sssom side. Users have been asking for months now to get a way to capture these rules, so my idea is to see what will practically be curated, and if it is worth it, we can, in the mid-term, make a review of existing mapping sets and capture the curation rules in a specific vocabulary.

I hope that makes some sense. Thank you for taking the time to looking at this!

@matentzn
Copy link
Collaborator Author

Due to popular demand I will merge this on Wednesday the 15th of March next week, and make a new release.

@cthoyt
Copy link
Member

cthoyt commented Mar 10, 2023

I still think that there should be several examples of this being used included in the PR before even considering that it can be merged -> like actual examples, appearing in SSSOM files.

@matentzn
Copy link
Collaborator Author

I think I will start building a library of examples for other justifications as well.

@saubin78 would you be able to help me by creating a small sssom file with two or three curation rules in it? We could put it here: https://github.com/mapping-commons/sssom/tree/master/examples/external

Or you can just share it as a google sheet and I will add it there.

I will add the ones from the phenotype ticket above as well.

@matentzn matentzn requested a review from cthoyt March 14, 2023 11:21
#creator_id: orcid:0000-0002-7356-1779
#license: "https://creativecommons.org/publicdomain/zero/1.0/"
#mapping_provider: "https://w3id.org/sssom/core_team"
#comment: This is an example file for the SSSOM for illustration only. Its contents are entirely fabricated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i want to go as far as saying that we should enforce that the examples given are REAL and not fabricated... Or am I asking too much that people should explain in detail and give real concrete use cases before SSSOM gets polluted with lots of indecipherable fields?

Copy link
Member

@cthoyt cthoyt Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, https://w3id.org/sssom/commons/disease/curation-rules/MPR2 does not resolve to anything, therefore I can not understand what this means, and can not review the merits of this field based on this example

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By fabricated I mean: this is not to be used / maintained for any practical purposes. I am totally out of steam on this one - we can push for incremental improvements moving forward. This example has real curation rules, I just didn't apply them to real data..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I tried to explain in our last call - PURLs should, but don't have to, resolve to something. Its not part of SSSOM to prescribe what goes in these curation rules. Different communities will decide to create shareable representations, and they will decide to provide resolveable resources and examples. When you review a PR of a mapping set for use, you can, in your organisation, apply whatever quality thresholds you want during the review. On SSSOM metadata level we just say: there is an element to represent curation rules, this is how you do it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( I am sorry that my feedback has exhausted you.

we can push for incremental improvements moving forward

You and I both know what this means :p

How can we move forwards so the burden of making good, actionable improvements to SSSOM is distributed from you to community members who are requesting them? E.g., improving the new field request template, adding more CI/CD is a great start.

@saubin78 for a start, can you help alleviate some of this burden? Can you help provide actionable examples of how this might work (e.g., improve the example files Nico made to actually be meaningful examples)? Then, Nico won't feel so much burden from me giving important (but ultimately difficult to address) feedback in addition to the project-based pressure to just "get this done" that if done prematurely, could erode trust and sustainability of SSSOM (and more burn out)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I tried to explain in our last call - PURLs should, but don't have to, resolve to something. Its not part of SSSOM to prescribe what goes in these curation rules. Different communities will decide to create shareable representations, and they will decide to provide resolveable resources and examples. When you review a PR of a mapping set for use, you can, in your organisation, apply whatever quality thresholds you want during the review. On SSSOM metadata level we just say: there is an element to represent curation rules, this is how you do it.

I don't disagree with this. But I think it's reasonable to ask that people who are making proposals of new fields to go above and beyond the minimum requirements of PURLs to give context to these fields for the purpose of an example. As it is, without a herculean effort of reading through meandering GitHub issue conversations and documentation in various places, it's really not obvious to understand what's going on here. If we can't agree on this, can we at least agree that there should be a detailed explanation in the preamble of the SSSOM document explaining what the new predicates are supposed to be for?

@saubin78
Copy link

I think I will start building a library of examples for other justifications as well.

@saubin78 would you be able to help me by creating a small sssom file with two or three curation rules in it? We could put it here: https://github.com/mapping-commons/sssom/tree/master/examples/external

Or you can just share it as a google sheet and I will add it there.

I will add the ones from the phenotype ticket above as well.

Hi. Is it ok if I add an example file this afternoon?

@saubin78
Copy link

here is a real mapping example file. For the sake of simplicity, I copied the simplified rules directly in the TSV file. In the real complete mapping set, we'll prefer to have them in a separate file with examples.
I let you check that the format is correct before adding it to the example directory. Sorry but I am not familiar with pushing files on github, yet I'm learning...
wheat_CR_example.txt

@matentzn
Copy link
Collaborator Author

Thank you @saubin78, the formatting is great, but I think I don't understand the mapping rule exactly:

Rule 4 is about the tolerance and resistance to abiotic environmental conditions / environmental condition + resistance  (WTO) <-> environmental condition + tolerance.  (CO321)
  • What exactly is: CO321

  • You say Rule 4, but what exactly is are the conditions for this rule? is about is a bit too weak to be meaningful - If it is a rule it should something like: "these are the assumptions / the set of assumptions that lead me, the mapping author, to record this mapping."

@saubin78
Copy link

Right! Just let me clarify this with the mapping authors for the few examples I sent you.

@cthoyt
Copy link
Member

cthoyt commented Mar 14, 2023

See https://bioregistry.io/registry/co_321 (it's also in the prefix map)

@graybeal
Copy link

graybeal commented Mar 14, 2023

The example as given is very un-self-explanatory, sorry. I'm not sure if I've gotten the wrong end of the multiple curation_rule sticks in play, or if it's just that it's still in development, of if it's extraordinarily domain-specific, or if it's just badly worded in the original. It is perhaps an example of why made-up examples are often better than actual examples.

TLDR version (to skip the mild rant that follows): Don't overspecify this, and don't overexemplify it either. Allow it to breathe.

we can push for incremental improvements moving forward

You and I both know what this means :p

I don't think I agree with what you think this means. I think it is not only typical, but sometimes necessary, to leave specifications deliberately vague. When people get more experience they know much better what they want and the spec can be, and does get, improved. Most specs that get used heavily (as this one already is) get updated, improved, extended, and profiled. This will be that.

I think it's reasonable to ask that people who are making proposals of new fields to go above and beyond the minimum requirements of PURLs to give context to these fields for the purpose of an example.

I think that it is reasonable to consider what a realistic specification would have to go into in order to make the requested capability function in a consistent way, and to recognize that there are multiple requirements people have in mind for these two parameters, and that for those reasons maybe they are a bridge too far for this level of the specification. Because I promise you that consistency in the definition (especially computational consistency from resolved IRIs) would be a long, long ride from where we are today.

But if proposed fields are deemed so useful that we must allow for their possible use(s), then it is entirely reasonable to provide a fuzzy solution. For example, curation_rule_text_comment is about as fuzzy a field as you could imagine, and therefore IMHO not more useful than a comment field that says "Curation Rule: follows xyz pattern"). Not that it's bad, it's just not computable and not likely to produce consistently human-friendly results for a while, until practices evolve. Another fuzzy solution example: a curation_id solution that is defined as "an IRI that uniquely identifies a curation rule, ideally in a resolvable and computable resource."

With that definition the curation_id is fully adoptable by a variety of different communities, each with their own goals and implementation strategies. That in no way lessens the specification's value—it is very clear what is, and is not, being specified. (And going the other route, detailed specs and examples often don't control what happens in the wild. Consider how people use standards generally (DOIs, 'same_as', 'xref') to see how supposedly fully specified and exemplified standards get used and abused.)

So I'm not sure I love curation_x fields in general, but I argue the only way you can make them viable and appropriately general-purpose in the short term is to specify them in a simple way that allows them to evolve. Otherwise, without the community effort to really work them through, a lot of the community will end up using them differently than specified or desired, which will make that part of the spec ineffective.

@saubin78
Copy link

The example as given is very un-self-explanatory, sorry. I'm not sure if I've gotten the wrong end of the multiple curation_rule sticks in play, or if it's just that it's still in development, of if it's extraordinarily domain-specific, or if it's just badly worded in the original. It is perhaps an example of why made-up examples are often better than actual examples.

@graybeal it is a little bit of each. Here are the same examples (real) with a rewording of the curation rules so that they are more self-explanatory, and hopefully more useful for the SSSOM users.
wheat_CR_example.txt

@matentzn
Copy link
Collaborator Author

matentzn commented Mar 15, 2023

@saubin78 thank you very much; you have added these examples:

subject_id subject_label predicate_id object_id object_label curation_rule_text
wto:WTO_0000304 cold resistance skos:closeMatch CO_321:0000080 Cold tolerance Rule 4: We consider that "tolerance" and "resistance" are almost equivalent when applied to abiotic environmental conditions.
wto:WTO_0000450 aluminium toxicity skos:closeMatch CO321:0000079 Aluminum tolerance Rule 3: We consider that the user of the information retrieval function interested in plant traits related to metal toxicity (WTO) also wants to retrieve observational data measuring the plant tolerance to the same metal (CO_321). The rule metal + toxicity (WTO) <-> metal + tolerance (CO321) is valid for any kind of metal.
wto:WTO_0000296 aphid resistance skos:closeMatch CO321:0000085 Aphid damage Rule 2: We consider that the user of the information retrieval function interested in plant traits related to damages caused by some animal, insect, nematode, etc. also wants to retrieve observational data mentioning resistance to the same living organism.
wto:WTO_0000281 Armyworm resistance skos:closeMatch CO321:0000086 Armyworm damage Rule 2: We consider that the user of the information retrieval function interested in plant traits related to damages caused by some animal, insect, nematode, etc. also wants to retrieve observational data mentioning resistance to the same living organism.
wto:WTO_0000452 bacterial leaf blight resistance skos:closeMatch CO321:0000932 Bacterial leaf blight severity Rule 1.3: We consider that the user of the information retrieval function, given a pathogen or a disease, would like to retrieve all data, independently of the way the affection is observed. In observational data, a severity score is represented by two digits representing the vertical disease progress and an estimate of severity. The capacity of resistance to a disease would be deduced from the severity of this one on the plant.

I think your curation rules makes sense, especially paired with skos:closeMatch. I will give other an opportunity to react.

@graybeal I full-heartedly agree with your sentiment ("Don't overspecify this, and don't overexemplify it either") - it is my approach as well; I value however the challenging questions @cthoyt is raising by being a bit pedantic as well. I think iteration is always better than perfection, but the step-size for every iteration must be reasonably large, which means that perfection and "iterate fast" need to keep negotiating no matter how painful. All your comments on this PR (@graybeal, @saubin78, @croussey and @cthoyt) have already led to significant improvements of SSSOM:

  1. A reasonable definition of curation rule ("not overspecified") has been found
  2. Some reasonable examples have been proposed (not "great" or "correct", just "reasonable" in the sense that they convey the intention of the field)
  3. An improvement to documentation (new element proposals must now be accompanied but some kind of reasonable example, and metadata elements should link to their respective public issue tracker discussions
  4. QC has been activated to validate any SSSOM example files checked into this repository

There are some things we have not achieved:

  1. A formal definition of curation rules, such that for every curation rule CR, everybody will joyfully exclaim: "Yes, this is a valid curation rule" (or, universe forbid, a "great" one). We need to start collecting cases for curation rules over time, and hash out more clear guidelines once we understand how the community intends to use them.
  2. A formal bridge between curation_rule and curation_rule_text, which now live side by side without any significant connection between them (in the metamodel). We want this for now to allow for the flexibility needed for a better more formal definition in the future.

I would like to call it a day on this one:

  1. @croussey @saubin78 @cthoyt @graybeal as the main contributors to this discussion, are we a (albeit perhaps grudging) go on this PR?
  2. Is there any concern for me checking in @saubin78 examples into the examples directory?

@saubin78
Copy link

Thanks Nico for the recap and clarification and @ALL for the great discussion.
1 . This is exactly what I have in mind and the level of flexibility that we need to make the mapping set producers adhere to providing curation rules (are they for human reading or for machines).
2. Please add my example in the directory if it is fine for everyone.

@graybeal
Copy link

Example is much improved, thank you.

@matentzn matentzn merged commit 70fd252 into master Mar 16, 2023
@matentzn matentzn deleted the issue166-curationrule branch March 16, 2023 12:23
@matentzn
Copy link
Collaborator Author

matentzn commented Mar 18, 2023

Thank you everyone, we made a new release:

https://github.com/mapping-commons/sssom/releases/tag/0.11.0

@saubin78 we really appreciate your contributions (not just on this ticket, but various ones across the tracker); if you like to be added to SSSOM core team to also be included in publications and mentioned on presentations let me know!

@saubin78
Copy link

@matentzn Thanks you so much for your awesome work on this new release! I'd be glad to join the SSSOM core team and help spreading the word at INRAE and within the Agricultural, French and EOSC communities (with my participation to the FAIR IMPACT project)

@matentzn
Copy link
Collaborator Author

I have added you! Welcome to the 20th member of the core team! If you send me an email I will also invite you to OBO slack where we have a sssom channel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[New metadata element]: curation_rule
5 participants