Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge Merging Proposal #70

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

uhbrar
Copy link

@uhbrar uhbrar commented Apr 25, 2022

Here is the prepared edge merging proposal and example. As you can see, there are many issues that must be considered when addressing edge merging. This proposal discusses some of these issues and offers possible solutions to them. Relevant issues will also need to be discussed by the necessary working groups.

Please let me know what concerns anyone might have regarding this proposal. I will also be happy to answer questions regarding it.

For comment only. Not to merge.

@uhbrar uhbrar marked this pull request as draft April 25, 2022 03:17
Copy link

@ehinderer ehinderer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under "attributes" would it not make more sense to link to the unique provenance ID which produced the attribute (e.g. "p4"), rather than the infores ID of the source for "attribute_source"? It's a subtle distinction, but it would disambiguate at which point in the provenance path an attribute was added and would track how an attribute was added in cases where an ARA might be doing more complex operations.

@gregr
Copy link
Member

gregr commented Apr 26, 2022

Thanks for writing this proposal, @uhbrar. Here are some concerns and comments I originally posted in Slack.

When a service does decide to merge, I agree that it is important to have a standard set of edge components that determine which edges are eligible for merging.

For the sake of ARA flexibility and Translator robustness, the decision of whether to merge should be left to a service's discretion, and not be required. Downstream services should be able to cope with unmerged results. I can imagine ARAs that choose to perform normalization differently, as part of their approach to reasoning. They may choose to perform fewer or more forms of normalization. Such deviation won't be compatible with downstream services expecting globally-mandated merging guarantees, though it should still be compatible with global agreement on edge key components.

Communicating hash values directly as part of responses seems fragile, and does not seem necessary for efficient lookup, despite the claim in the proposal.

Fragility concerns:

  • All services will have to agree on a single hash function.
    • If we decide to change how hashing works, successfully upgrading Translator will require global coordination. Otherwise there will be a period where services are no longer able to understand each other's hash values.
  • A service which inadvertently or intentionally normalizes in a different way may compute different hash values.

The proposal's efficiency claim is that hash values will prevent a service from having to look through all the results in order to bucket them appropriately. But this can't be true for a couple of reasons:

  • A service already has to parse the entire JSON response text. This means the service is guaranteed to be doing work proportional to the size of every component of the response.
  • Even if each edge has a hash value attached, each edge still needs to be considered for insertion/merging into the hash table. This requires a service to again perform work proportional to the number of edges.

Without transmitting hash values, each service can still locally hash the components that make up an edge's "key", with work proportional to the size of each key, in order to quickly identify edge buckets. No global agreement on hash function is necessary for this. The cost of this local hashing should not dominate any of the other already-required work as described above (parsing JSON and iterating over edges).

@jeffhhk
Copy link

jeffhhk commented Apr 26, 2022

I've provided my comments in the form of a document:

https://docs.google.com/document/d/1zL37e5LZOLw0hOtCcSFgUzLuj5zZAi8xb-XZYjEIXsc/edit#

@edeutsch
Copy link
Collaborator

Qualifiers may be elevated out of edge attributes to be a new top level predicate for edges, based on recent discussiong from the TRAPI group. However, with qualifiers, a decision must be made on what qualifiers can be merged with one another, if any. The simplest solution would be to not merge any edges with different qualifiers, but that might not necessarily be the best idea. This would require investigation by the Data Modelling Working Group before a final decision can be made.

In my current understanding, qualifiers will be a new level of semantic meaning more complex than subject-object-predicate, so when considering to merge edges, it will be crucial to consider qualifiers as being just as important as subject-object-predicate. i.e. two assertions with the same subject-object-predicate but different qualifiers MUST be treated as different assertions.

@uhbrar
Copy link
Author

uhbrar commented Apr 28, 2022

@edeutsch - Yes, qualifiers are required for equivalence. In this proposal, they are one of the 5 values that must be the equal to merge any equivalent edges. The proposal dictates that you must have the same subject, predicate, object, qualifiers, and original/primary knowledge source.

@ehinderer - I actually like this idea a bit, but at the same time, that unique provenance id may actually be the infores id of the service anyway. I'm actually leaning towards that, since I don't see why a kp would report different methods of attaining an edge to two different ARA's.

@jeffhhk There's a few points I'd like to address:

  1. I'm not sure if containment would be easier to read and understand. However, the main reason I avoided containment was to avoid things being too deeply nested. If we use the infores id for the provenance keys, I do think that it would be fairly straightforward to understand the path taken if we use infores id, and it would prevent unnecessary repetition of information. Additionally, with the idea that this would include merging responses from ARA's, referencing in true graph form seems to me like it would be even more effective. This makes it very clear that the ARA's both received this edge from the same KP's, and the KP's received it from the same knowledge source.
  2. I agree that versioning is something we should consider adding. The fields I listed in the provenance aren't a totality of the fields that can or should be included, just the ones that were necessary to illustrate the proposed schema.
  3. The method field isn't strictly necessary, and can be removed. It's really just meant to make clear how the information was attained, but really the only field that is strictly necessary is "parent". I don't see any issue with removing it, if it's believed that it does more harm than good, or doesn't do any good at all.
  4. When we were explaining growth when presenting on merging, it wasn't to do with performance, but with message size. The size of the message would grow much larger than necessary because we'd have the same edge returned by different ARA's, that had queried the same KP's. Of course, different KP's would also return the same edge, even though they'd retrieved it from the same knowledge source.

OR
* "biolink: primary_knowledge_source"

All edges must have one (and only one) of these attributes listed, and to merge two edges, they must share the same value for the attribute.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: "must have one (and only one) of these attributes listed" . . . there will be 0, 1, or many qualifiers on a given Edge . . . so the one and only one rule cannot apply here.

@vdancik
Copy link
Collaborator

vdancik commented May 10, 2022

There is no need to polute TRAPI schema since the retrieval chain can easily and cleanly be expressed using attributes:

KP1

{
  "edges": {
    "e719492": {
      "subject": "RXCUI:1544384",
      "predicate": "biolink:correlated_with",
      "object": "MONDO:0008383",
      "attributes": [
          {
              "attribute_type_id": "original_knowledge_source",
              "value": "infores:DB",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:DB"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_1",
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
      ],
    }
  }
}

KP2

{
  "edges": {
    "e719492": {
      "subject": "RXCUI:1544384",
      "predicate": "biolink:correlated_with",
      "object": "MONDO:0008383",
      "attributes": [
          {
              "attribute_type_id": "original_knowledge_source",
              "value": "infores:DB",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:DB"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_2",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_2"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
      ],
    }
  }
}

KP3

{
  "edges": {
    "e719492": {
      "subject": "RXCUI:1544384",
      "predicate": "biolink:correlated_with",
      "object": "MONDO:0008383",
      "attributes": [
          {
              "attribute_type_id": "original_knowledge_source",
              "value": "infores:DB",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:DB"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_3",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_3"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          }
      ],
    }
  }
}

ARA1

{
  "edges": {
    "e719492": {
      "subject": "RXCUI:1544384",
      "predicate": "biolink:correlated_with",
      "object": "MONDO:0008383",
      "attributes": [
          {
              "attribute_type_id": "original_knowledge_source",
              "value": "infores:DB",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:DB"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_1",
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_2",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_2"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:ARA_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:ARA_1"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_1",
                     "value_type_id": "biolink:InformationResource",
                 },
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_2",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
      ],
    }
  }
}

ARA2

{
  "edges": {
    "e719492": {
      "subject": "RXCUI:1544384",
      "predicate": "biolink:correlated_with",
      "object": "MONDO:0008383",
      "attributes": [
          {
              "attribute_type_id": "original_knowledge_source",
              "value": "infores:DB",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:DB"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_1",
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_3",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_3"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:ARA_2",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:ARA_2"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_1",
                     "value_type_id": "biolink:InformationResource",
                 },
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_3",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
      ],
    }
  }
}

WR

{
  "edges": {
    "e719492": {
      "subject": "RXCUI:1544384",
      "predicate": "biolink:correlated_with",
      "object": "MONDO:0008383",
      "attributes": [
          {
              "attribute_type_id": "original_knowledge_source",
              "value": "infores:DB",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:DB"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_1",
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_2",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_2"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:ARA_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:ARA_1"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_1",
                     "value_type_id": "biolink:InformationResource",
                 },
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_2",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_3",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_3"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:DB",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:ARA_2",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:ARA_2"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_1",
                     "value_type_id": "biolink:InformationResource",
                 },
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:KP_3",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:WR",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:WR"
              "attributes": [
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:ARA_1",
                     "value_type_id": "biolink:InformationResource",
                 },
                 {
                     "attribute_type_id": "previous_knowledge_source",
                     "value": "infores:ARA_2",
                     "value_type_id": "biolink:InformationResource",
                 }
              ]
          },
      ],
    }
  }
}

We can add previous_knowledge_source for those who desire the encode the retrieval chain. But I believe such complication is not necessary and current way of expressing the knowledge sources works just fine, e.g. WR it would be:

{
  "edges": {
    "e719492": {
      "subject": "RXCUI:1544384",
      "predicate": "biolink:correlated_with",
      "object": "MONDO:0008383",
      "attributes": [
          {
              "attribute_type_id": "original_knowledge_source",
              "value": "infores:DB",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:DB"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_1",
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_2",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_2"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:ARA_1",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:ARA_1"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:KP_3",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:KP_3"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:ARA_2",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:ARA_2"
          },
          {
              "attribute_type_id": "aggregator_knowledge_source",
              "value": "infores:WR",
              "value_type_id": "biolink:InformationResource",
              "attribute_source": "infores:WR"
          },
      ],
    }
  }
}

@mbrush
Copy link

mbrush commented May 10, 2022

Thanks Vlado - this is exactly how I envision an attribute-based representation using the current source retrieval properties looking. Level 1 attributes can provide a list of all Resources through which the knowledge expressed in the Edge passed at some point - and indicate which was primary/original. If more info about the order/graph of retrieval is needed, a second level could be introduced using nested attributes. These additional levels could hold more details about when, by whom, how retrievals were performed, if this proves useful.

Like you, I would like to better understand why the current, simple flat list of Attributes holding Resources and their roles as primary/original vs aggregator is insufficient. Folks have pointed out that this representation is 'lossy' - but that is only a problem if the information lost is needed for a compelling use case. Do we have use cases requiring a full/ordered retrieval graph? As I understand, this complexity is not required to support Edge Merging - where we only need to know which Resource was primary/original.

Regardless of what level of detail we end up providing, I think there are a couple arguments for defining a dedicated structure to hold Source Provenance info that doesn't use Attributes. Namely, this metadata is of high import and capturing it in a dedicated structure highlights this critical role, and makes it easier to find/operate on. And more generally does there come a time that specific Edge metadata rises to a level of import that adding dedicated fields/objects to the TRAPI spec is warranted?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants