Collections #719

baiirun · 2024-06-03T20:24:46Z

baiirun
Jun 3, 2024
Maintainer

Collections are a new data model for organizing lists of entities in the knowledge graph.

Motivation

There are many features that require grouping entities together into a list, both ordered and unordered. You might want a list of rich content blocks to render on a page, a list of your favorite places, a list of images, a list of navigation items for your app, or a list of anything else. Users may want to curate lists or reference other users' lists. Collections provide a new primitive both for listing things and for building other systems on top of these lists.

Constraints

Entities grouped into a list should not be affected by the list itself, i.e., an entity should be decoupled from any list that contains it. This is so that triples describing the list don't pollute the entity.
Ordering should be a first-class property of the list.
Lists should be searchable and filterable.

Design

A Collection in the knowledge graph is built from three components:

Collection entity: the entity representing the list. It primarily acts as an id that collection items reference.
Collection item: an in-between entity that connects the Collection entity and the entity that is added to the collection. Any metadata relevant to the list but not the entity is added here, such as ordering.
Entity: This is any arbitrary entity that is added to the collection. Could be an image, a person, a place, an idea, or anything, as long as it's an entity. Theoretically you could make a collection of collections since a collection is itself an entity.

A list of entities in a collection effectively works like a Junction Table or Associative Entity in a relational database. You can think of each Collection Item as a row in the junction table where the collection item references both the collection the item belongs to, and the entity the item is meant to represent in the list.

Collection items are useful because we can attach any usage-specific metadata about an entity in the list to the collection item without polluting the entity itself. An example of this is the order of the entity in the list. If an entity is in many lists at once, you don't want a triple representing the order of the entity for each list stored on the entity itself. It makes more sense to store the order on the associative entity, since that entity is context/usage-aware.
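As a rough illustration of the three components, here is a minimal sketch of the triples that might describe a small collection. The `Triple` shape, the attribute ids (`types`, `collection-item/collection`, and so on), and the helper names are all hypothetical placeholders, not Geo's actual system ids or API.

```typescript
// Hypothetical triple shape and attribute ids, for illustration only.
type Triple = { entityId: string; attributeId: string; value: string };

const TYPES = "types";
const COLLECTION_TYPE = "collection";
const COLLECTION_ITEM_TYPE = "collection-item";
const COLLECTION_REF = "collection-item/collection";
const ENTITY_REF = "collection-item/entity";
const ORDER_INDEX = "collection-item/index";

// The collection entity is little more than an id with a type.
function createCollection(collectionId: string): Triple[] {
  return [{ entityId: collectionId, attributeId: TYPES, value: COLLECTION_TYPE }];
}

// A collection item plays the role of a junction-table row: it points at the
// collection and at the listed entity, and carries list-specific metadata
// such as the order index.
function createCollectionItem(
  itemId: string,
  collectionId: string,
  entityId: string,
  index: string
): Triple[] {
  return [
    { entityId: itemId, attributeId: TYPES, value: COLLECTION_ITEM_TYPE },
    { entityId: itemId, attributeId: COLLECTION_REF, value: collectionId },
    { entityId: itemId, attributeId: ENTITY_REF, value: entityId },
    { entityId: itemId, attributeId: ORDER_INDEX, value: index },
  ];
}

const triples: Triple[] = [
  ...createCollection("favorite-places"),
  ...createCollectionItem("item-1", "favorite-places", "paris", "a0"),
  ...createCollectionItem("item-2", "favorite-places", "tokyo", "a1"),
];

// Nothing is ever written onto the listed entities themselves.
const triplesOnParis = triples.filter((t) => t.entityId === "paris");
```

In this sketch `triplesOnParis` is empty: adding "paris" to the list only produced triples on the collection item, which is the decoupling the constraints ask for.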

Referencing/consuming a collection

Collections are entities, so anywhere that you can reference an entity can also reference a collection. There might be situations where we need to know that the entity we're referencing is a collection – and not some other type of entity – which is where the COLLECTION Native Value Type and Universal Renderable Entity Types come in to provide hints as to what the entity is.

For v1.0.0, collections will primarily be referenced by triples using the COLLECTION NVT. In the near term, Collections will also be usable as a data source for a Data Block. Collections are simply a way to create a list of arbitrary entities, so any data model that consumes a list of entities can theoretically consume a collection for any use case.

Indexing a collection

In the Geo data service we store all collections and collection items in special system tables. When the data service encounters a triple with the Collection or Collection Item type, its entity id gets inserted into the appropriate table. Conversely, when these types are deleted, the row is deleted from the appropriate table.

Collection items are handled specially, as several triples are required for a collection item to be "valid." We need the triple pointing to the collection, the triple pointing to the entity, and the triple defining the order of the item within the collection. The value of each of these triples is validated and stored in its own column on the collection item row.

Storing collections and collection items this way means that we can aggregate the triples data for each into a representation that's easier to query and render. Rather than having to parse each triple on the collection item, all of the values are automatically exposed as part of the query.

This approach does add some complexity at index-time, as we need to listen to every triple processed in the substream to see if it's part of a collection or collection item. If it's a collection item, we have to check that all of the other required triples exist in the set of triples, and transform them into the shape that the database expects for a collection item.
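The index-time check described above could look roughly like the following. This is a sketch under stated assumptions, not the data service's actual code: the `Triple` and `CollectionItemRow` shapes and the attribute ids are hypothetical, and each entity is assumed to have a single type for simplicity.

```typescript
// Hypothetical shapes and attribute ids, for illustration only.
type Triple = { entityId: string; attributeId: string; value: string };
type CollectionItemRow = { id: string; collectionId: string; entityId: string; index: string };

const TYPES = "types";
const COLLECTION_ITEM_TYPE = "collection-item";
const COLLECTION_REF = "collection-item/collection";
const ENTITY_REF = "collection-item/entity";
const ORDER_INDEX = "collection-item/index";

function toCollectionItemRows(triples: Triple[]): CollectionItemRow[] {
  // Group the processed triples by the entity they describe.
  const byEntity = new Map<string, Map<string, string>>();
  for (const t of triples) {
    const attrs = byEntity.get(t.entityId) ?? new Map<string, string>();
    attrs.set(t.attributeId, t.value);
    byEntity.set(t.entityId, attrs);
  }

  const rows: CollectionItemRow[] = [];
  byEntity.forEach((attrs, id) => {
    // Only entities typed as collection items are candidates.
    if (attrs.get(TYPES) !== COLLECTION_ITEM_TYPE) return;

    const collectionId = attrs.get(COLLECTION_REF);
    const entityId = attrs.get(ENTITY_REF);
    const index = attrs.get(ORDER_INDEX);

    // An item is only "valid" when all three required triples are present.
    if (collectionId === undefined || entityId === undefined || index === undefined) return;

    // Flatten the triples into one row with a column per piece of data.
    rows.push({ id, collectionId, entityId, index });
  });
  return rows;
}

const rows = toCollectionItemRows([
  { entityId: "item-1", attributeId: TYPES, value: COLLECTION_ITEM_TYPE },
  { entityId: "item-1", attributeId: COLLECTION_REF, value: "favorite-places" },
  { entityId: "item-1", attributeId: ENTITY_REF, value: "paris" },
  { entityId: "item-1", attributeId: ORDER_INDEX, value: "a0" },
  // item-2 has no order triple, so it is skipped as invalid.
  { entityId: "item-2", attributeId: TYPES, value: COLLECTION_ITEM_TYPE },
  { entityId: "item-2", attributeId: COLLECTION_REF, value: "favorite-places" },
  { entityId: "item-2", attributeId: ENTITY_REF, value: "tokyo" },
]);
```

Here `rows` contains a single row for `item-1`. What should happen to an incomplete item such as `item-2` (skip, defer, or report) is a design decision this sketch does not settle; it simply skips it.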

Another side effect is that we also need to represent collections somehow for in-flight proposals. A proposal might introduce a new collection. We don't want to add this to the collection table until after the proposal is approved. The same goes for collection items. Additionally, a collection might already exist while a proposal adds a new item to it. For diffs we probably need to merge the proposed collection item with the existing collection items so we can show the entire context of the collection.
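One possible way to produce that merged view for a diff is sketched below. The post leaves this open, so everything here is an assumption: the `CollectionItemRow` shape, the `status` tag, and the `mergeForDiff` helper are hypothetical, and the sketch only covers proposals that add items (not removals or reorders).

```typescript
// Hypothetical row shape; "status" marks where each item came from.
type CollectionItemRow = { id: string; collectionId: string; entityId: string; index: string };
type DiffedItem = CollectionItemRow & { status: "existing" | "proposed" };

function mergeForDiff(
  existing: CollectionItemRow[],
  proposed: CollectionItemRow[],
  collectionId: string
): DiffedItem[] {
  const merged: DiffedItem[] = [
    ...existing
      .filter((i) => i.collectionId === collectionId)
      .map((i) => ({ ...i, status: "existing" as const })),
    ...proposed
      .filter((i) => i.collectionId === collectionId)
      .map((i) => ({ ...i, status: "proposed" as const })),
  ];
  // Order keys are compared as plain strings, not with locale-aware collation.
  return merged.sort((a, b) => (a.index < b.index ? -1 : a.index > b.index ? 1 : 0));
}

const existingItems: CollectionItemRow[] = [
  { id: "item-1", collectionId: "favorite-places", entityId: "paris", index: "a0" },
  { id: "item-2", collectionId: "favorite-places", entityId: "tokyo", index: "a2" },
];
const proposedItems: CollectionItemRow[] = [
  { id: "item-3", collectionId: "favorite-places", entityId: "prague", index: "a1" },
  { id: "item-4", collectionId: "other-list", entityId: "rome", index: "a0" },
];

const diffView = mergeForDiff(existingItems, proposedItems, "favorite-places");
```

The proposed item is shown in its place between the two existing items, while the collection table itself stays untouched until the proposal is approved.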

Querying, searching, and filtering collections

Since collections and collection items are indexed into their own system tables we can do complex queries on them.

We might want to query for a collection directly and return the data for each collection item, including the entire entity that it references.

We might want to query for any collections that contain a specific entity. Or filter a specific collection for the entities whose name starts with a specific string.

The value for a COLLECTION Native Value Type is a foreign key to the Collection table, so any triples with the COLLECTION NVT automatically have the reference to the collection at query time.
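To make the two example queries above concrete, here is an in-memory sketch over collection item rows. In the data service these would be queries against the system tables; the `CollectionItemRow` and `Entity` shapes and both function names are hypothetical.

```typescript
// Hypothetical shapes standing in for rows in the system tables.
type Entity = { id: string; name: string };
type CollectionItemRow = { id: string; collectionId: string; entityId: string; index: string };

// Which collections contain a specific entity?
function collectionsContaining(items: CollectionItemRow[], entityId: string): string[] {
  const ids = items.filter((i) => i.entityId === entityId).map((i) => i.collectionId);
  return Array.from(new Set(ids)); // de-duplicate, keeping first-seen order
}

// Which entities in one collection have a name starting with a given string?
function entitiesWithNamePrefix(
  items: CollectionItemRow[],
  entities: Map<string, Entity>,
  collectionId: string,
  prefix: string
): Entity[] {
  return items
    .filter((i) => i.collectionId === collectionId)
    .map((i) => entities.get(i.entityId))
    .filter((e): e is Entity => e !== undefined && e.name.startsWith(prefix));
}

const entities = new Map<string, Entity>([
  ["paris", { id: "paris", name: "Paris" }],
  ["tokyo", { id: "tokyo", name: "Tokyo" }],
  ["prague", { id: "prague", name: "Prague" }],
]);

const items: CollectionItemRow[] = [
  { id: "item-1", collectionId: "favorite-places", entityId: "paris", index: "a0" },
  { id: "item-2", collectionId: "favorite-places", entityId: "tokyo", index: "a1" },
  { id: "item-3", collectionId: "favorite-places", entityId: "prague", index: "a2" },
  { id: "item-4", collectionId: "europe-trip", entityId: "paris", index: "a0" },
];

const withParis = collectionsContaining(items, "paris");
// → ["favorite-places", "europe-trip"]
const startingWithP = entitiesWithNamePrefix(items, entities, "favorite-places", "P");
// → the Paris and Prague entities
```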

Ordering

A collection can have ordering applied to it. By default, collections are unordered. We store an order on each item anyway, so that if a collection later becomes ordered we don't have to update every collection item in it.

Ordering is calculated using a technique called "Fractional Indexing." The main idea is that you can order a list of items by assigning a value called an "index" to each item in the list. To move an item, you re-calculate only that item's index by taking the mid-point between the index of the item before its new position and the index of the item after its new position.
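The mid-point idea can be sketched with plain numbers. This is a simplification for illustration only: it uses floating point indexes rather than the alphanumeric keys mentioned below, and `indexBetween` and `moveItem` are hypothetical helpers, not part of any library.

```typescript
type Item = { id: string; index: number };

// Index for a slot between two neighbours; null means "no neighbour on that side".
function indexBetween(before: number | null, after: number | null): number {
  if (before === null) return after === null ? 0 : after - 1;
  if (after === null) return before + 1;
  if (before >= after) throw new Error("before must be smaller than after");
  return (before + after) / 2;
}

// Move one item to a new position by re-calculating only that item's index.
function moveItem(items: Item[], id: string, newPosition: number): Item[] {
  const sorted = [...items].sort((a, b) => a.index - b.index);
  const moving = sorted.find((i) => i.id === id);
  if (!moving) throw new Error(`unknown item: ${id}`);

  // Neighbours are looked up in the list without the moving item.
  const rest = sorted.filter((i) => i.id !== id);
  const before = rest[newPosition - 1]?.index ?? null;
  const after = rest[newPosition]?.index ?? null;

  const updated: Item = { ...moving, index: indexBetween(before, after) };
  return [...rest, updated].sort((a, b) => a.index - b.index);
}

const list: Item[] = [
  { id: "a", index: 1 },
  { id: "b", index: 2 },
  { id: "c", index: 3 },
];

// Move "c" between "a" and "b": only "c" gets a new index, (1 + 2) / 2 = 1.5.
const reordered = moveItem(list, "c", 1);
const order = reordered.map((i) => i.id);
// → ["a", "c", "b"]
```

Because only the moved item's index changes, a reorder touches one collection item regardless of how long the list is. Repeatedly halving the same gap eventually exhausts floating point precision, which is the constraint that string-based keys are meant to ease.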

It's a fairly complex topic on its own, so here are some links to other people talking about fractional indexing in depth.

Currently we use a fractional indexing library developed by the folks at Replicache, who build real-time sync systems. This library uses alphanumeric values for indexes, which seems to handle the precision limits of the mid-point calculation better than floating point values do. You can read more about the precision constraint in the above blog posts.

There are some advanced techniques with fractional indexing that we may want to implement, especially for real-time collaborative features in the future. For now, changes to the knowledge graph aren't real-time and are likely slow enough that we don't need to add things like jitter/randomness to the index.
