mapping-commons · matentzn · Jul 25, 2023 · Jul 20, 2023
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -36,7 +36,9 @@ nav:
     - How to contribute?: contributing.md
     - Code of Conduct: code_of_conduct.md
   - Resources for users:
-    - Use cases: usecases.md
+    - Use cases: 
+      - Overview: usecases.md
+      - How to gradually enrich OMOP mappings with SSSOM: tutorials/omop-mappings.md
     - Workshops: workshops.md
     - Presentations: presentations.md
     - Basic Tutorial: tutorial.md

diff --git a/src/docs/tutorials/omop-mappings.md b/src/docs/tutorials/omop-mappings.md
@@ -0,0 +1,73 @@
+# How to gradually enrich OMOP mappings with SSSOM
+
+This document is a guide for OMOP ETL developers to think about gradually improving the (documentation of the) strength of evidence for their vocabulary mappings.
+
+## Example table from OMOP
+
+Generated manually with Athena on the 20th July 2023. The start and end dates are invented.
+
+| concept_id_1 | concept_id_2 | relationship_id | valid_start_date | valid_end_date | invalid_reason |
+|--------------|--------------|-----------------|------------------|----------------|----------------|
+| 44499396        | 4028717        | Maps to         | 19700101         | 20991231       |                |
+| 45586281        | 4028717        | Maps to         | 73754         | 20991231       |                |
+
+## Level 1: Basic mapping table, basic provenance
+
+The SSSOM metadata provided is conceptually correct, but fictitious.
+
+The reader should imagine this being provided as a separate `CONCEPT_MAPPINGS.CSV` table that can be joined on `subject_id`->`concept_id_1`, `object_id`->`concept_id_2` for all rows with a `Maps to` `relationship_id` (this assumes that the `concept_id_1`, `concept_id_2` tuple is unique for `Maps to`).
+
+| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id |
+|---|---|---|---|---|---|---|---|---|
+| OMOP:44499396 | OMOP:4028717 | omoprel:mapsTo | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 |
+| OMOP:45586281 | OMOP:4028717 | omoprel:mapsTo | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | |
+| OMOP:45610575 | OMOP:441554 | omoprel:mapsTo | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | |
+
+What we see here:
+
+1. All identifiers are prefixed to make sure they are interpreted correctly when they are reused. This includes OMOP ids (e.g. `OMOP:44499396`) as well as ORCIDs (OPTIONAL).
+1. "Maps to" is encoded using a proper identifier rather than a string (OPTIONAL)
+1. All three mappings have a `mapping_justification` to distinguish for example if the mapping was determined by human manual curation (`semapv:ManualMappingCuration`) or lexical matching (`semapv:LexicalMatching`). Many other justifications exist and/or can be created. If the justification for the mapping is unknown, we can make our lack of knowledge transparent by using `semapv:UnspecifiedMatching`.
+1. `author_id`, in the case of `semapv:ManualMappingCuration`, tells us who determined the mapping. This is basic provenance. If the identity of the author can be connected with a public record such as ORCID, this can help mapping users to increase trust in a mapping. `reviewer_id` tells us that some human looked at the mapping after it was proposed by a tool, and "signed off" on it. This can be valuable, again, to increase trust.
+1. If the match was generated by a tool, some basic provenance is added (`mapping_tool`, `mapping_tool_version`).
+
+## Level 2: Curate semantic mapping predicate
+
+| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id |
+|---|---|---|---|---|---|---|---|---|
+| OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 |
+| OMOP:45586281 | OMOP:4028717 | skos:exactMatch | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | |
+| OMOP:45610575 | OMOP:441554 | skos:exactMatch | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | |
+
+What do we see here?
+
+1. Rather than `Maps to`, the mapping predicate (e.g. `skos:exactMatch`) is a semantic mapping predicate from a standardised vocabulary ([SKOS](https://www.w3.org/TR/skos-reference)). Here, we distinguish between `skos:exactMatch` and `skos:broadMatch`, but there are other predicates; see for example the [Semantic Mapping Vocabulary](https://github.com/mapping-commons/semantic-mapping-vocabulary/blob/main/semapv-properties.tsv).
+
+## Level 3: Document confidence widely
+
+`confidence` is an incredibly useful metric for downstream users, including ETL engineers and data analysts. In an ideal world, all mappings have some kind of `confidence` associated with them. `confidence` scores should be read as "the strength of evidence provided in this record/table row (i.e. the mapping justification) leads us to believe the mapping (e.g. `OMOP:44499396 --[skos:broadMatch]--> OMOP:4028717`) is correct with 90% confidence".
+
+| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id | confidence |
+|---|---|---|---|---|---|---|---|---|---|
+| OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 | 0.9 |
+| OMOP:45586281 | OMOP:4028717 | skos:exactMatch | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | | 0.8 |
+| OMOP:45610575 | OMOP:441554 | skos:exactMatch | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | | 0.6 |
+
+What do we see here?
+
+- For matching tools, confidence can be calculated by proxies such as "lexical similarity", "edit distance", "cosine similarity of node embeddings" and other metrics. In the example above, Usagi has determined that the subject and object match, but it was only 80% sure (we don't know why; this is [more advanced SSSOM](mapping-justifications.md)).
+- For cases where an external mapping is re-used using ETL, `confidence` describes the level of trust you as an ETL expert have in the fidelity of the mapping provided by the source.
+
+## Level 4: Document curation rules
+
+| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id | confidence | curation_rule |
+|---|---|---|---|---|---|---|---|---|---|---|
+| OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 | 0.9 | OHDSI_CURATION_RULE:19 |
+
+What do we see here?
+
+- For manual matches, it is often unclear by what criteria a match was established. Documenting the curation rules can help increase consistency for manual curation, and transparency for downstream users.
+- `OHDSI_CURATION_RULE:19` is a rule defined in your own curation rulebook. This can be _anything_. For example, `OHDSI_CURATION_RULE:19` could correspond to the following rule:
+```
+OHDSI_CURATION_RULE:19 = If the subject concept does not have an exact match in the object source vocabulary, we select the nearest broad ("up-hill") concept applicable. Conceptually, if both terms existed in the same terminology, the subject concept could be defined as a subconcept of the object concept. The determination for both criteria (nearest broad, conceptually subconcept) is performed through medical expert judgement.
+```
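
The join described under Level 1, combined with the `confidence` column from Level 3, could be sketched roughly as follows. This is an illustration only, not part of the tutorial or of any SSSOM tooling: the function names `strip_prefix` and `enrich`, the `min_confidence` parameter, and the reduced column set in the CSV text are all assumptions made for this example. It also assumes that the `OMOP:` prefix has to be stripped before the SSSOM identifiers can be compared with the bare OMOP concept ids.

```python
import csv
import io

# Rows from the example OMOP table above (date columns omitted for brevity).
OMOP_ROWS = [
    {"concept_id_1": "44499396", "concept_id_2": "4028717", "relationship_id": "Maps to"},
    {"concept_id_1": "45586281", "concept_id_2": "4028717", "relationship_id": "Maps to"},
]

# Hypothetical CONCEPT_MAPPINGS.CSV content: a subset of the Level 3 columns.
SSSOM_CSV = """subject_id,object_id,predicate_id,mapping_justification,confidence
OMOP:44499396,OMOP:4028717,skos:broadMatch,semapv:ManualMappingCuration,0.9
OMOP:45586281,OMOP:4028717,skos:exactMatch,semapv:LexicalMatching,0.8
OMOP:45610575,OMOP:441554,skos:exactMatch,semapv:UnspecifiedMatching,0.6
"""


def strip_prefix(curie, prefix="OMOP:"):
    """Turn 'OMOP:44499396' into '44499396' so it matches a bare concept id."""
    return curie[len(prefix):] if curie.startswith(prefix) else curie


def enrich(omop_rows, sssom_text, min_confidence=0.0):
    """Inner-join 'Maps to' rows with SSSOM metadata on (concept_id_1, concept_id_2).

    Rows without SSSOM metadata, or below min_confidence, are dropped.
    Assumes the (concept_id_1, concept_id_2) pair is unique for 'Maps to'.
    """
    index = {}
    for row in csv.DictReader(io.StringIO(sssom_text)):
        key = (strip_prefix(row["subject_id"]), strip_prefix(row["object_id"]))
        index[key] = row

    enriched = []
    for omop in omop_rows:
        if omop["relationship_id"] != "Maps to":
            continue
        meta = index.get((omop["concept_id_1"], omop["concept_id_2"]))
        if meta is None or float(meta["confidence"]) < min_confidence:
            continue
        enriched.append({**omop, **meta})
    return enriched


if __name__ == "__main__":
    # Keep only mappings the provider is at least 85% confident in.
    for row in enrich(OMOP_ROWS, SSSOM_CSV, min_confidence=0.85):
        print(row["concept_id_1"], row["predicate_id"], row["concept_id_2"], row["confidence"])
```

Because the metadata lives in a separate table, an ETL pipeline can choose its own confidence threshold at query time without altering the OMOP relationship table itself.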