Add curation_rule to SSSOM #258

matentzn · 2023-02-16T14:32:23Z

Fixes #166

docs/ have been added/updated if necessary
make test has been run locally
tests have been added/updated (if applicable)
CHANGELOG.md has been updated.

The need for representing specific curation rules is everywhere, see #166. It will be very difficult, if not impossible, to standardise curation rules, so I would advocate we leave this totally open for now. Representing the rules as a resource basically makes them an open ended enum - which gives us more flexibility for adding structure later. @saubin78 suggested to create a class "MappingRule" and have curation rules being instances of mapping rules.

We also should decide reasonably soon if we want to rename curation rule into mapping rule altogether, if we agree with @saubin78 assertion in #166 that computational rules can also be curation rules (I think that's fair!).

saubin78 · 2023-03-02T10:59:17Z

Hi. The value type for curation rule, i.e. EntityReference (namely, a URI), turns to be too restrictive. It is particularly inadequate for users publishing their mapping set in the TSV format, and potentially not speaking RDF. This is what D2KAB parterns expressed.

I'd suggest, if possible, to have either text or a URI. This would allow to publish rules either directly in the mapping set, as a code refering to a piece of text in a side document, or as a RDF resource.

matentzn · 2023-03-02T12:56:37Z

I hear your argument @saubin78 and you are not the only one that suggested this. The reason why I at least here opted for a resource based solution was to open the door for sharing curation rules in the wider future. Machines will not be able to do anything with a natural language curation rule (so far these have gone into the "comment" field).

What about this as a compromise: we add two fields, curation_rule_id and curation_rule_text. this is analogous to author_id and author_label etc. This keeps the door open for structured curation rule objects. We can even employ tools that extract the text from a structured curation rule and adds it to the curation_rule_text field.

matentzn · 2023-03-03T15:11:15Z

@saubin78 what do you think now?

croussey · 2023-03-06T12:34:50Z

About curation rules, in the future it would be great to share them in order to reuse them. Unfortunatelly we are at the beginning so we do not know if people will express curation rules in the same way, that the reason why it is first necessary to collect some textual example of curation rules before setting them more formally using RDF language.

matentzn · 2023-03-06T18:36:21Z

@croussey thank you very much for your input. I think I have come around to agree with you. This suggestion is now reflected in this PR by having two elements. Do you like the proposed solution?

cthoyt · 2023-03-07T14:23:20Z

I am super hesitant about adding any new rules without some convincing and well-described examples (both on the human prose level and also actual example SSSOM files!). What I've seen right now are nice ideas, but I can't see how they will translate to SSSOM in reality. For example, I have yet to see and example that faithfully uses a controlled vocabulary to refer to a curation rule. Do such vocabularies exist? Or will this just be another field that gets abused?

matentzn · 2023-03-08T09:54:39Z

@cthoyt and you should be hesitant.

Here is an issue that illustrates "shareable curation rules": mapping-commons/disease-mappings#16

Here is another, complex discussion on the matter: mapping-commons/mh_mapping_initiative#14

The main, but not only, gap that curation rules fill is where "a human said so" is not enough. You want to capture some of the intuition behind the human curation, like: "From the labels and definitions provided, I made a call", or: two chemicals from different databases are considered "the same" if they have the same molecular structure, or: we determine that two disease entities refer to the same thing based on their identical phenotypic profile.

There is no such vocabulary, and building one will take years - I suggest not to worry about this for now from the sssom side. Users have been asking for months now to get a way to capture these rules, so my idea is to see what will practically be curated, and if it is worth it, we can, in the mid-term, make a review of existing mapping sets and capture the curation rules in a specific vocabulary.

I hope that makes some sense. Thank you for taking the time to looking at this!

matentzn · 2023-03-10T14:49:52Z

Due to popular demand I will merge this on Wednesday the 15th of March next week, and make a new release.

cthoyt · 2023-03-10T14:56:50Z

I still think that there should be several examples of this being used included in the PR before even considering that it can be merged -> like actual examples, appearing in SSSOM files.

matentzn · 2023-03-10T15:01:09Z

I think I will start building a library of examples for other justifications as well.

@saubin78 would you be able to help me by creating a small sssom file with two or three curation rules in it? We could put it here: https://github.com/mapping-commons/sssom/tree/master/examples/external

Or you can just share it as a google sheet and I will add it there.

I will add the ones from the phenotype ticket above as well.

cthoyt · 2023-03-14T11:21:24Z

examples/schema/curation_rule.sssom.tsv

+#creator_id: orcid:0000-0002-7356-1779
+#license: "https://creativecommons.org/publicdomain/zero/1.0/"
+#mapping_provider: "https://w3id.org/sssom/core_team"
+#comment: This is an example file for the SSSOM for illustration only. Its contents are entirely fabricated.


i want to go as far as saying that we should enforce that the examples given are REAL and not fabricated... Or am I asking too much that people should explain in detail and give real concrete use cases before SSSOM gets polluted with lots of indecipherable fields?

Further, https://w3id.org/sssom/commons/disease/curation-rules/MPR2 does not resolve to anything, therefore I can not understand what this means, and can not review the merits of this field based on this example

By fabricated I mean: this is not to be used / maintained for any practical purposes. I am totally out of steam on this one - we can push for incremental improvements moving forward. This example has real curation rules, I just didn't apply them to real data..

This is what I tried to explain in our last call - PURLs should, but don't have to, resolve to something. Its not part of SSSOM to prescribe what goes in these curation rules. Different communities will decide to create shareable representations, and they will decide to provide resolveable resources and examples. When you review a PR of a mapping set for use, you can, in your organisation, apply whatever quality thresholds you want during the review. On SSSOM metadata level we just say: there is an element to represent curation rules, this is how you do it.

:( I am sorry that my feedback has exhausted you.

we can push for incremental improvements moving forward

You and I both know what this means :p

How can we move forwards so the burden of making good, actionable improvements to SSSOM is distributed from you to community members who are requesting them? E.g., improving the new field request template, adding more CI/CD is a great start.

@saubin78 for a start, can you help alleviate some of this burden? Can you help provide actionable examples of how this might work (e.g., improve the example files Nico made to actually be meaningful examples)? Then, Nico won't feel so much burden from me giving important (but ultimately difficult to address) feedback in addition to the project-based pressure to just "get this done" that if done prematurely, could erode trust and sustainability of SSSOM (and more burn out)?

This is what I tried to explain in our last call - PURLs should, but don't have to, resolve to something. Its not part of SSSOM to prescribe what goes in these curation rules. Different communities will decide to create shareable representations, and they will decide to provide resolveable resources and examples. When you review a PR of a mapping set for use, you can, in your organisation, apply whatever quality thresholds you want during the review. On SSSOM metadata level we just say: there is an element to represent curation rules, this is how you do it.

I don't disagree with this. But I think it's reasonable to ask that people who are making proposals of new fields to go above and beyond the minimum requirements of PURLs to give context to these fields for the purpose of an example. As it is, without a herculean effort of reading through meandering GitHub issue conversations and documentation in various places, it's really not obvious to understand what's going on here. If we can't agree on this, can we at least agree that there should be a detailed explanation in the preamble of the SSSOM document explaining what the new predicates are supposed to be for?

saubin78 · 2023-03-14T11:23:41Z

I think I will start building a library of examples for other justifications as well.

@saubin78 would you be able to help me by creating a small sssom file with two or three curation rules in it? We could put it here: https://github.com/mapping-commons/sssom/tree/master/examples/external

Or you can just share it as a google sheet and I will add it there.

I will add the ones from the phenotype ticket above as well.

Hi. Is it ok if I add an example file this afternoon?

saubin78 · 2023-03-14T15:40:45Z

here is a real mapping example file. For the sake of simplicity, I copied the simplified rules directly in the TSV file. In the real complete mapping set, we'll prefer to have them in a separate file with examples.
I let you check that the format is correct before adding it to the example directory. Sorry but I am not familiar with pushing files on github, yet I'm learning...
wheat_CR_example.txt

matentzn · 2023-03-14T16:05:16Z

Thank you @saubin78, the formatting is great, but I think I don't understand the mapping rule exactly:

Rule 4 is about the tolerance and resistance to abiotic environmental conditions / environmental condition + resistance  (WTO) <-> environmental condition + tolerance.  (CO321)

What exactly is: CO321
You say Rule 4, but what exactly is are the conditions for this rule? is about is a bit too weak to be meaningful - If it is a rule it should something like: "these are the assumptions / the set of assumptions that lead me, the mapping author, to record this mapping."

saubin78 · 2023-03-14T16:26:13Z

Right! Just let me clarify this with the mapping authors for the few examples I sent you.

cthoyt · 2023-03-14T16:32:43Z

See https://bioregistry.io/registry/co_321 (it's also in the prefix map)

graybeal · 2023-03-14T23:29:53Z

The example as given is very un-self-explanatory, sorry. I'm not sure if I've gotten the wrong end of the multiple curation_rule sticks in play, or if it's just that it's still in development, of if it's extraordinarily domain-specific, or if it's just badly worded in the original. It is perhaps an example of why made-up examples are often better than actual examples.

TLDR version (to skip the mild rant that follows): Don't overspecify this, and don't overexemplify it either. Allow it to breathe.

we can push for incremental improvements moving forward

You and I both know what this means :p

I don't think I agree with what you think this means. I think it is not only typical, but sometimes necessary, to leave specifications deliberately vague. When people get more experience they know much better what they want and the spec can be, and does get, improved. Most specs that get used heavily (as this one already is) get updated, improved, extended, and profiled. This will be that.

I think it's reasonable to ask that people who are making proposals of new fields to go above and beyond the minimum requirements of PURLs to give context to these fields for the purpose of an example.

I think that it is reasonable to consider what a realistic specification would have to go into in order to make the requested capability function in a consistent way, and to recognize that there are multiple requirements people have in mind for these two parameters, and that for those reasons maybe they are a bridge too far for this level of the specification. Because I promise you that consistency in the definition (especially computational consistency from resolved IRIs) would be a long, long ride from where we are today.

But if proposed fields are deemed so useful that we must allow for their possible use(s), then it is entirely reasonable to provide a fuzzy solution. For example, curation_rule_text_comment is about as fuzzy a field as you could imagine, and therefore IMHO not more useful than a comment field that says "Curation Rule: follows xyz pattern"). Not that it's bad, it's just not computable and not likely to produce consistently human-friendly results for a while, until practices evolve. Another fuzzy solution example: a curation_id solution that is defined as "an IRI that uniquely identifies a curation rule, ideally in a resolvable and computable resource."

With that definition the curation_id is fully adoptable by a variety of different communities, each with their own goals and implementation strategies. That in no way lessens the specification's value—it is very clear what is, and is not, being specified. (And going the other route, detailed specs and examples often don't control what happens in the wild. Consider how people use standards generally (DOIs, 'same_as', 'xref') to see how supposedly fully specified and exemplified standards get used and abused.)

So I'm not sure I love curation_x fields in general, but I argue the only way you can make them viable and appropriately general-purpose in the short term is to specify them in a simple way that allows them to evolve. Otherwise, without the community effort to really work them through, a lot of the community will end up using them differently than specified or desired, which will make that part of the spec ineffective.

saubin78 · 2023-03-15T09:26:16Z

The example as given is very un-self-explanatory, sorry. I'm not sure if I've gotten the wrong end of the multiple curation_rule sticks in play, or if it's just that it's still in development, of if it's extraordinarily domain-specific, or if it's just badly worded in the original. It is perhaps an example of why made-up examples are often better than actual examples.

@graybeal it is a little bit of each. Here are the same examples (real) with a rewording of the curation rules so that they are more self-explanatory, and hopefully more useful for the SSSOM users.
wheat_CR_example.txt

matentzn · 2023-03-15T10:14:39Z

@saubin78 thank you very much; you have added these examples:

subject_id	subject_label	predicate_id	object_id	object_label	curation_rule_text
wto:WTO_0000304	cold resistance	skos:closeMatch	CO_321:0000080	Cold tolerance	Rule 4: We consider that "tolerance" and "resistance" are almost equivalent when applied to abiotic environmental conditions.
wto:WTO_0000450	aluminium toxicity	skos:closeMatch	CO321:0000079	Aluminum tolerance	Rule 3: We consider that the user of the information retrieval function interested in plant traits related to metal toxicity (WTO) also wants to retrieve observational data measuring the plant tolerance to the same metal (CO_321). The rule metal + toxicity (WTO) <-> metal + tolerance (CO321) is valid for any kind of metal.
wto:WTO_0000296	aphid resistance	skos:closeMatch	CO321:0000085	Aphid damage	Rule 2: We consider that the user of the information retrieval function interested in plant traits related to damages caused by some animal, insect, nematode, etc. also wants to retrieve observational data mentioning resistance to the same living organism.
wto:WTO_0000281	Armyworm resistance	skos:closeMatch	CO321:0000086	Armyworm damage	Rule 2: We consider that the user of the information retrieval function interested in plant traits related to damages caused by some animal, insect, nematode, etc. also wants to retrieve observational data mentioning resistance to the same living organism.
wto:WTO_0000452	bacterial leaf blight resistance	skos:closeMatch	CO321:0000932	Bacterial leaf blight severity	Rule 1.3: We consider that the user of the information retrieval function, given a pathogen or a disease, would like to retrieve all data, independently of the way the affection is observed. In observational data, a severity score is represented by two digits representing the vertical disease progress and an estimate of severity. The capacity of resistance to a disease would be deduced from the severity of this one on the plant.

I think your curation rules makes sense, especially paired with skos:closeMatch. I will give other an opportunity to react.

@graybeal I full-heartedly agree with your sentiment ("Don't overspecify this, and don't overexemplify it either") - it is my approach as well; I value however the challenging questions @cthoyt is raising by being a bit pedantic as well. I think iteration is always better than perfection, but the step-size for every iteration must be reasonably large, which means that perfection and "iterate fast" need to keep negotiating no matter how painful. All your comments on this PR (@graybeal, @saubin78, @croussey and @cthoyt) have already led to significant improvements of SSSOM:

A reasonable definition of curation rule ("not overspecified") has been found
Some reasonable examples have been proposed (not "great" or "correct", just "reasonable" in the sense that they convey the intention of the field)
An improvement to documentation (new element proposals must now be accompanied but some kind of reasonable example, and metadata elements should link to their respective public issue tracker discussions
QC has been activated to validate any SSSOM example files checked into this repository

There are some things we have not achieved:

A formal definition of curation rules, such that for every curation rule CR, everybody will joyfully exclaim: "Yes, this is a valid curation rule" (or, universe forbid, a "great" one). We need to start collecting cases for curation rules over time, and hash out more clear guidelines once we understand how the community intends to use them.
A formal bridge between curation_rule and curation_rule_text, which now live side by side without any significant connection between them (in the metamodel). We want this for now to allow for the flexibility needed for a better more formal definition in the future.

I would like to call it a day on this one:

@croussey @saubin78 @cthoyt @graybeal as the main contributors to this discussion, are we a (albeit perhaps grudging) go on this PR?
Is there any concern for me checking in @saubin78 examples into the examples directory?

saubin78 · 2023-03-15T11:16:50Z

Thanks Nico for the recap and clarification and @ALL for the great discussion.
1 . This is exactly what I have in mind and the level of flexibility that we need to make the mapping set producers adhere to providing curation rules (are they for human reading or for machines).
2. Please add my example in the directory if it is fine for everyone.

graybeal · 2023-03-15T22:47:48Z

Example is much improved, thank you.

matentzn · 2023-03-18T11:43:27Z

Thank you everyone, we made a new release:

https://github.com/mapping-commons/sssom/releases/tag/0.11.0

@saubin78 we really appreciate your contributions (not just on this ticket, but various ones across the tracker); if you like to be added to SSSOM core team to also be included in publications and mentioned on presentations let me know!

saubin78 · 2023-03-20T09:34:34Z

@matentzn Thanks you so much for your awesome work on this new release! I'd be glad to join the SSSOM core team and help spreading the word at INRAE and within the Agricultural, French and EOSC communities (with my participation to the FAIR IMPACT project)

matentzn · 2023-03-20T10:02:01Z

I have added you! Welcome to the 20th member of the core team! If you send me an email I will also invite you to OBO slack where we have a sssom channel.

Update sssom_schema.yaml

7ddcf62

matentzn added 2 commits March 3, 2023 17:06

Merge branch 'master' into issue166-curationrule

8c9b65e

Update sssom_schema.yaml

097a6bc

matentzn added 4 commits March 14, 2023 12:16

Merge branch 'master' into issue166-curationrule

b2bb43d

Update sssom_schema.yaml

0d3ccec

Create curation_rule.sssom.tsv

bbf4188

Add more examples

5c6b622

matentzn mentioned this pull request Mar 14, 2023

Adding QC checks for example SSSOM files #268

Merged

adding link to example files to schema

c8c97ee

matentzn requested a review from cthoyt March 14, 2023 11:21

cthoyt reviewed Mar 14, 2023

View reviewed changes

matentzn mentioned this pull request Mar 14, 2023

Add some examples of using see_also to SSSOM schema #267

Merged

matentzn added 4 commits March 16, 2023 13:08

Merge branch 'master' into issue166-curationrule

e97d0bb

Update examples

9243fa9

Merge branch 'master' into issue166-curationrule

09c6543

Update curation_rule_text2.sssom.tsv

a301cfe

matentzn merged commit 70fd252 into master Mar 16, 2023

matentzn deleted the issue166-curationrule branch March 16, 2023 12:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add curation_rule to SSSOM #258

Add curation_rule to SSSOM #258

matentzn commented Feb 16, 2023 •

edited

Loading

saubin78 commented Mar 2, 2023

matentzn commented Mar 2, 2023

matentzn commented Mar 3, 2023

croussey commented Mar 6, 2023

matentzn commented Mar 6, 2023

cthoyt commented Mar 7, 2023

matentzn commented Mar 8, 2023

matentzn commented Mar 10, 2023

cthoyt commented Mar 10, 2023 •

edited

Loading

matentzn commented Mar 10, 2023

cthoyt Mar 14, 2023

cthoyt Mar 14, 2023 •

edited

Loading

matentzn Mar 14, 2023

matentzn Mar 14, 2023

cthoyt Mar 14, 2023

cthoyt Mar 14, 2023

saubin78 commented Mar 14, 2023

saubin78 commented Mar 14, 2023

matentzn commented Mar 14, 2023

saubin78 commented Mar 14, 2023

cthoyt commented Mar 14, 2023 •

edited

Loading

graybeal commented Mar 14, 2023 •

edited

Loading

saubin78 commented Mar 15, 2023

matentzn commented Mar 15, 2023 •

edited

Loading

saubin78 commented Mar 15, 2023

graybeal commented Mar 15, 2023

matentzn commented Mar 18, 2023 •

edited

Loading

saubin78 commented Mar 20, 2023

matentzn commented Mar 20, 2023

Add curation_rule to SSSOM #258

Add curation_rule to SSSOM #258

Conversation

matentzn commented Feb 16, 2023 • edited Loading

saubin78 commented Mar 2, 2023

matentzn commented Mar 2, 2023

matentzn commented Mar 3, 2023

croussey commented Mar 6, 2023

matentzn commented Mar 6, 2023

cthoyt commented Mar 7, 2023

matentzn commented Mar 8, 2023

matentzn commented Mar 10, 2023

cthoyt commented Mar 10, 2023 • edited Loading

matentzn commented Mar 10, 2023

cthoyt Mar 14, 2023

Choose a reason for hiding this comment

cthoyt Mar 14, 2023 • edited Loading

Choose a reason for hiding this comment

matentzn Mar 14, 2023

Choose a reason for hiding this comment

matentzn Mar 14, 2023

Choose a reason for hiding this comment

cthoyt Mar 14, 2023

Choose a reason for hiding this comment

cthoyt Mar 14, 2023

Choose a reason for hiding this comment

saubin78 commented Mar 14, 2023

saubin78 commented Mar 14, 2023

matentzn commented Mar 14, 2023

saubin78 commented Mar 14, 2023

cthoyt commented Mar 14, 2023 • edited Loading

graybeal commented Mar 14, 2023 • edited Loading

saubin78 commented Mar 15, 2023

matentzn commented Mar 15, 2023 • edited Loading

saubin78 commented Mar 15, 2023

graybeal commented Mar 15, 2023

matentzn commented Mar 18, 2023 • edited Loading

saubin78 commented Mar 20, 2023

matentzn commented Mar 20, 2023

matentzn commented Feb 16, 2023 •

edited

Loading

cthoyt commented Mar 10, 2023 •

edited

Loading

cthoyt Mar 14, 2023 •

edited

Loading

cthoyt commented Mar 14, 2023 •

edited

Loading

graybeal commented Mar 14, 2023 •

edited

Loading

matentzn commented Mar 15, 2023 •

edited

Loading

matentzn commented Mar 18, 2023 •

edited

Loading