Proposal for EPC representation for the text-mined association KG #78

Open
bill-baumgartner opened this issue Jan 13, 2021 · 1 comment
Labels
in review status - This issue has been addressed and is now undergoing review

bill-baumgartner commented Jan 13, 2021

For each text-mined Biolink association, we would like to provide relevant EPC data including:

  • The sentence from which the assertion was mined
  • An identifier for the document that contains the sentence
  • The character offsets (relative to the sentence) for the text mentions of the subject and object of the assertion
  • A confidence score for this specific text-mined assertion (currently the score reported by the classifier that identified the relation)

The goal of this issue is to discuss how to represent the EPC data using the Attribute object that is defined in the TRAPI specification.

An initial proposal for Attribute representation is available in this document.

The proposal in this issue builds on the original and specifically addresses the need to group EPC data into individual packets, each containing a sentence and its related information, so that multiple EPC packets can be associated with a single assertion.

Data for a text-mined assertion

"graph": {
  "nodes": [
    {
      "id": "n0",
      "type": "biolink:ChemicalSubstance",
      "curie": "CHEBI:3215"      # bupivacaine
    },
    {
      "id": "n1",
      "type": "biolink:GeneOrGeneProduct",
      "curie": "PR:000031567"    # LRRC3B
    }
  ],
  "edges": [
    {
      "id": "e0",
      "source_id": "n0",
      "target_id": "n1",
      "type": "biolink:negatively_regulates_entity_to_entity"
    }
  ]
}
# This assertion is supported by two sentences in the literature
      {
        'publication': 'PMID:29085514', 
        'score': '0.99956816', 
        'sentence': 'The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.', 
        'subject_spans': 'start: 31, end: 42', 
        'object_spans': 'start: 104, end: 110', 
        'provided_by': 'TMProvider'
      }

      {
        'publication': 'PMID:12345678', 
        'score': '0.876', 
        'sentence': 'This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.', 
        'subject_spans': 'start: 42, end: 53', 
        'object_spans': 'start: 75, end: 81', 
        'provided_by': 'TMProvider'
      }
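
As a quick sanity check on the offsets above, the following Python sketch parses a span string and slices the sentence with it. It assumes the offsets are 0-based with an exclusive end and are counted in Unicode code points (so "µ" counts as one character). The helper names `parse_span` and `mention_text` are hypothetical and are not part of any Text Mining Provider code.

```python
import re
from typing import Tuple


def parse_span(span: str) -> Tuple[int, int]:
    """Parse a span string such as 'start: 31, end: 42' into (start, end)."""
    match = re.fullmatch(r"start:\s*(\d+),\s*end:\s*(\d+)", span.strip())
    if match is None:
        raise ValueError(f"unrecognized span format: {span!r}")
    return int(match.group(1)), int(match.group(2))


def mention_text(sentence: str, span: str) -> str:
    """Return the substring of the sentence covered by the span."""
    start, end = parse_span(span)
    return sentence[start:end]


# First evidence record from the example above.
record = {
    "publication": "PMID:29085514",
    "sentence": (
        "The administration of 50 µg/ml bupivacaine promoted maximum breast "
        "cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
    ),
    "subject_spans": "start: 31, end: 42",
    "object_spans": "start: 104, end: 110",
}

subject = mention_text(record["sentence"], record["subject_spans"])
obj = mention_text(record["sentence"], record["object_spans"])
print(subject, obj)  # bupivacaine LRRC3B
```

Under these assumptions, the spans in both example records select exactly the subject and object mentions.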

Proposed Attribute representation

The proposed Attribute representation models this assertion as a single edge between bupivacaine and LRRC3B with two accompanying Attributes representing the EPC data. Nested Attributes are used to allow each packet of sentence information to be self-contained. Also demonstrated are attributes representing a confidence score for the concept recognition of each node (concept), and an aggregate confidence score computed for each edge.

nodes:
  - id: CHEBI:3215
    category: biolink:ChemicalSubstance
    name: "bupivacaine"
    attributes:
      - attribute_type_id: SEPIO:0000168  # confidence_score
        attribute_from_source: "has confidence score"
        value: 0.7578
        value_type_id: biolink:ConfidenceLevel
        value_type_from_source: "confidence score"
        value_source: TMProvider

  - id: PR:000031567
    category: biolink:GeneOrGeneProduct
    name: "LRRC3B"
    attributes:
      - attribute_type_id: SEPIO:0000168  # confidence_score
        attribute_from_source: "has confidence score"
        value: 0.5467
        value_type_id: biolink:ConfidenceLevel
        value_type_from_source: "confidence score"
        value_source: TMProvider
            

edges:
  - id: tmkp.Association001
    category: biolink:ChemicalToGeneAssociation
    subject: CHEBI:3215          # bupivacaine
    predicate: biolink:negatively_regulates_entity_to_entity
    object: PR:000031567         # LRRC3B
    attributes:

      - attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
        attribute_from_source: "source publication"  # what the source might have called the relationship
        value: PMID:29085514
        value_type_id: biolink:Publication  # here a biolink term is used to type the value
        value_type_from_source: "PMID"
        value_source: TMProvider
        attributes:
          - attribute_type_id: SIO:000028  # has part
            value: "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
            value_type_id: EDAM:data_3671  # text, or SIO:000113 'sentence'
            value_type_from_source: sentence text
            attributes:
              - attribute_type_id: SIO:000028  # has part
                value: '31|42'
                value_type_id: SIO:001056  # character position
                value_type_from_source: subject span
              - attribute_type_id: SIO:000028  # has part
                value: '104|110'
                value_type_id: SIO:001056  # character position
                value_type_from_source: object span
              - attribute_type_id: SEPIO:0000440  # has_supporting_evidence
                value: 0.99956816
                value_type_id: EDAM:data_1772  # score
                value_type_from_source: sentence confidence score
                value_source: TMProvider BERT model v0.1

      - attribute_type_id: SEPIO:0000438  # has_supporting_evidence_from_source
        attribute_from_source: "source publication"  # what the source might have called the relationship
        value: PMID:12345678
        value_type_id: biolink:Publication  # here a biolink term is used to type the value
        value_type_from_source: "PMID"
        value_source: TMProvider
        attributes:
          - attribute_type_id: SIO:000028  # has part
            value: "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
            value_type_id: EDAM:data_3671  # text, or SIO:000113 'sentence'
            value_type_from_source: sentence text
            attributes:
              - attribute_type_id: SIO:000028  # has part
                value: '42|53'
                value_type_id: SIO:001056  # character position
                value_type_from_source: subject span
              - attribute_type_id: SIO:000028  # has part
                value: '75|81'
                value_type_id: SIO:001056  # character position
                value_type_from_source: object span
              - attribute_type_id: SEPIO:0000440  # has_supporting_evidence
                value: 0.876
                value_type_id: EDAM:data_1772  # score
                value_type_from_source: sentence confidence score
                value_source: TMProvider BERT model v0.1

      - attribute_type_id: SEPIO:0000168  # confidence_score
        attribute_from_source: "has aggregate confidence score"
        value: 0.64711234
        value_type_id: biolink:ConfidenceLevel
        value_type_from_source: "aggregate confidence score"
        value_source: TMProvider
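
To make the nesting concrete, here is a minimal Python sketch that builds one self-contained EPC packet in the shape proposed above: publication, then sentence, then spans and score. The function names `span_attribute` and `evidence_attribute` are hypothetical, the nested `attributes` key is the extension under discussion in this issue rather than an established part of TRAPI, and the identifiers are simply copied from the example.

```python
from typing import Any, Dict, Tuple

Attribute = Dict[str, Any]


def span_attribute(span: Tuple[int, int], label: str) -> Attribute:
    """Encode a (start, end) character span as a 'start|end' Attribute."""
    start, end = span
    return {
        "attribute_type_id": "SIO:000028",  # has part
        "value": f"{start}|{end}",
        "value_type_id": "SIO:001056",  # character position
        "value_type_from_source": label,
    }


def evidence_attribute(
    publication: str,
    sentence: str,
    subject_span: Tuple[int, int],
    object_span: Tuple[int, int],
    score: float,
    provider: str = "TMProvider",
    model: str = "TMProvider BERT model v0.1",
) -> Attribute:
    """Build one EPC packet: publication -> sentence -> spans and score."""
    sentence_attr: Attribute = {
        "attribute_type_id": "SIO:000028",  # has part
        "value": sentence,
        "value_type_id": "EDAM:data_3671",  # text
        "value_type_from_source": "sentence text",
        "attributes": [
            span_attribute(subject_span, "subject span"),
            span_attribute(object_span, "object span"),
            {
                "attribute_type_id": "SEPIO:0000440",  # has_supporting_evidence
                "value": score,
                "value_type_id": "EDAM:data_1772",  # score
                "value_type_from_source": "sentence confidence score",
                "value_source": model,
            },
        ],
    }
    return {
        "attribute_type_id": "SEPIO:0000438",  # has_supporting_evidence_from_source
        "attribute_from_source": "source publication",
        "value": publication,
        "value_type_id": "biolink:Publication",
        "value_type_from_source": "PMID",
        "value_source": provider,
        "attributes": [sentence_attr],
    }


packet = evidence_attribute(
    "PMID:12345678",
    "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B.",
    (42, 53),
    (75, 81),
    0.876,
)
```

Because each packet carries its own sentence, spans, and score, an edge can hold any number of packets without relying on their order.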

@bill-baumgartner bill-baumgartner added incoming status - This issue has been submitted and is awaiting approval/triage in review status - This issue has been addressed and is now undergoing review and removed incoming status - This issue has been submitted and is awaiting approval/triage labels Jan 13, 2021
@bill-baumgartner

For comparison, shown below is an alternative approach that uses no nesting of Attributes and instead uses arrays to hold attribute values. For a given EPC packet, the sentence, score, subject and object spans, and PMID are linked only implicitly, by sharing the same array index.

Note: This is the current output format used by the Service Provider to serve up the Text Mining Provider text-mined Biolink association KG.

edges:
  - id: 9445e98f72ada21aa572559e303e4d5ac414650f
    predicate: biolink:negatively_regulates
    subject: CHEBI:3215          # bupivacaine
    object: PR:000031567       # LRRC3B
    attributes:
      - type: biolink:provided_by
        name: provided_by
        value: Text Mining KP
      - type: bts:api
        name: api
        value: Text Mining Targeted Association API
      - type: bts:score
        name: score
        value: 
          - 0.99956816
          - 0.876
      - type: bts:sentence
        name: sentence
        value: 
          - "The administration of 50 µg/ml bupivacaine promoted maximum breast cancer cell invasion, and suppressed LRRC3B mRNA expression in cells."
          - "This is a second sentence indicating that bupivacaine negatively regulates LRRC3B."
      - type: bts:subject_spans
        name: subject_spans
        value: 
          - "31|42"
          - "42|53"
      - type: bts:object_spans
        name: object_spans
        value: 
          - "104|110"
          - "75|81"
      - type: bts:publications
        name: publications
        value: 
          - PMID:29085514
          - PMID:12345678             
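
The index-based coupling in the flat format can be illustrated with a short Python sketch that regroups the parallel arrays into one record per EPC packet. The function name `packets_from_flat` is hypothetical, and the sketch assumes the attribute names shown above and that every array has one entry per supporting sentence.

```python
from typing import Any, Dict, List

# Attribute names whose values are parallel arrays, one entry per sentence.
PACKET_FIELDS = ["publications", "sentence", "subject_spans", "object_spans", "score"]


def packets_from_flat(attributes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Regroup parallel attribute arrays into one dict per EPC packet."""
    by_name = {attr["name"]: attr["value"] for attr in attributes}
    columns = [by_name[field] for field in PACKET_FIELDS]
    # The flat format is only meaningful if all arrays have the same length.
    if len({len(column) for column in columns}) != 1:
        raise ValueError("parallel attribute arrays differ in length")
    return [dict(zip(PACKET_FIELDS, row)) for row in zip(*columns)]


flat_attributes = [
    {"type": "biolink:provided_by", "name": "provided_by", "value": "Text Mining KP"},
    {"type": "bts:score", "name": "score", "value": [0.99956816, 0.876]},
    {
        "type": "bts:sentence",
        "name": "sentence",
        "value": [
            "The administration of 50 µg/ml bupivacaine promoted maximum breast "
            "cancer cell invasion, and suppressed LRRC3B mRNA expression in cells.",
            "This is a second sentence indicating that bupivacaine negatively "
            "regulates LRRC3B.",
        ],
    },
    {"type": "bts:subject_spans", "name": "subject_spans", "value": ["31|42", "42|53"]},
    {"type": "bts:object_spans", "name": "object_spans", "value": ["104|110", "75|81"]},
    {
        "type": "bts:publications",
        "name": "publications",
        "value": ["PMID:29085514", "PMID:12345678"],
    },
]

packets = packets_from_flat(flat_attributes)
```

The length check highlights the main fragility of this format: if one array is reordered or loses an entry, the packets silently misalign unless a consumer validates them.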
 

@bill-baumgartner bill-baumgartner pinned this issue Jan 26, 2021