Python API for embeddings #191

Closed
simonw opened this issue Aug 28, 2023 · 11 comments

@simonw
Owner

simonw commented Aug 28, 2023

Split from:

@simonw
Owner Author

simonw commented Sep 1, 2023

I think there are two parts to this: embedding a string, and managing collections.

For embedding strings the existing get_embedding_model(...) API is most of the way there:

model = llm.get_embedding_model("ada-002")
floats = model.embed("text goes here")

I think the decode and encode functions for turning them into binary could go in the llm namespace directly.
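
A minimal sketch of what those two helpers might look like. The byte layout is an assumption (little-endian 32-bit floats, four bytes per value); the thread only says the functions turn embeddings into binary and back, so treat the format and the exact signatures as illustrative.

```python
import struct
from typing import List, Sequence


def encode(values: Sequence[float]) -> bytes:
    # Assumed format: each value packed as a little-endian 32-bit float.
    return struct.pack("<" + "f" * len(values), *values)


def decode(binary: bytes) -> List[float]:
    # Four bytes per float, so the number of values is len(binary) // 4.
    return list(struct.unpack("<" + "f" * (len(binary) // 4), binary))


# Values chosen to be exactly representable as 32-bit floats.
blob = encode([0.5, -1.25, 3.0])
restored = decode(blob)
```

With 32-bit storage, values that are not exactly representable will come back slightly rounded, which is usually acceptable for similarity comparisons.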

Collections are a bit harder. A collection should be able to store embeddings and run similarity searches (see #190), and eventually manage indexes too.

@simonw
Owner Author

simonw commented Sep 1, 2023

Sketching an initial idea:

collection = llm.Collection(db, "name-of-collection")
# If the collection does not exist it would be created with the default embedding model

if collection.exists():
    # Already exists in the DB
    print("Contains {} items".format(collection.count()))

# Or specify the model explicitly:
model = llm.get_embedding_model("ada-002")
collection = llm.Collection(db, "posts", model)

# Or pass the model ID using a named parameter:
collection = llm.Collection(db, "posts", model_id="ada-002")

Once you've got the collection:

collection.embed("id", "text to embed goes here")
# Add store=True to store the text in the content column

# With metadata:
collection.embed("id", "text to embed goes here", {"metadata": "here"})

# Or for multiple things at once:
collection.embed_multi({
    "id1": "text for id1",
    "id2": "text for id2"
})
# Add store=True to store the text in the content column

But what if you want to store metadata as well? Not 100% sure about that, maybe:

collection.embed_multi({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": "text for id2"
})

Not crazy about an API design where it accepts a dictionary with either strings or tuples as values though.

Maybe this:

collection.embed_multi_with_metadata({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": ("text for id2", {"more": "metadata"}),
})
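
To make the trade-off concrete, here is a sketch of the normalization step a mixed-value embed_multi would need before it could do anything else. The helper name normalize_entries and the type aliases are hypothetical, not part of llm.

```python
from typing import Dict, Optional, Tuple, Union

Metadata = Dict[str, Union[str, int, float]]
Entry = Union[str, Tuple[str, Metadata]]


def normalize_entries(
    entries: Dict[str, Entry],
) -> Dict[str, Tuple[str, Optional[Metadata]]]:
    # Hypothetical helper: coerce every value to a (text, metadata) pair.
    normalized: Dict[str, Tuple[str, Optional[Metadata]]] = {}
    for id_, value in entries.items():
        if isinstance(value, str):
            # Bare string: no metadata supplied.
            normalized[id_] = (value, None)
        else:
            text, metadata = value
            normalized[id_] = (text, metadata)
    return normalized


mixed = {
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": "text for id2",
}
normalized = normalize_entries(mixed)
```

A separate embed_multi_with_metadata method avoids this isinstance branching and keeps each method's type signature simple.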

@simonw
Owner Author

simonw commented Sep 1, 2023

And for retrieval:

ids_and_scores = collection.similar_by_id("id", number=5)

Or:

ids_and_scores = collection.similar("text to be embedded", number=5)
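
A rough sketch of how the (id, score) list could be produced. It assumes cosine similarity as the score and a plain dictionary of stored vectors; the thread does not specify either, and rank_similar is a hypothetical name.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat a zero vector as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],
    number: int = 5,
) -> List[Tuple[str, float]]:
    # Score every stored vector against the query, highest score first.
    scored = [(id_, cosine_similarity(query, vec)) for id_, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number]


stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = rank_similar([1.0, 0.0], stored, number=2)
```

Under these assumptions, similar would embed the text first and then rank, while similar_by_id would look up the stored vector for that ID and use it as the query.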

@simonw
Owner Author

simonw commented Sep 1, 2023

For embedding models that take options (not a thing yet) I think I'll add options=dict parameters to some of these methods, as opposed to using **kwargs which could clash with other keyword arguments like store=True.
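
A small illustration of the clash this avoids. The embed function below is a hypothetical stand-in that only records its arguments; it does no embedding and is not the llm API.

```python
from typing import Any, Dict, Optional


def embed(
    id: str,
    text: str,
    store: bool = False,
    options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    # Hypothetical stand-in: model options travel in their own dict,
    # so a model option called "store" cannot collide with store=True.
    return {
        "id": id,
        "content": text if store else None,
        "options": dict(options or {}),
    }


record = embed("id", "text", store=True, options={"store": "model-specific value"})
```

With **kwargs, the same call would have had to pass store twice, which Python rejects as a repeated keyword argument.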

@simonw
Owner Author

simonw commented Sep 1, 2023

Need to implement the similar methods next:

llm/llm/embeddings.py

Lines 138 to 162 in 6f76170

    def similar_by_id(self, id: str, number: int = 5) -> List[Tuple[str, float]]:
        """
        Find similar items in the collection by a given ID.

        Args:
            id (str): ID to search by
            number (int, optional): Number of similar items to return

        Returns:
            list: List of (id, score) tuples
        """
        raise NotImplementedError

    def similar(self, text: str, number: int = 5) -> List[Tuple[str, float]]:
        """
        Find similar items in the collection by a given text.

        Args:
            text (str): Text to search by
            number (int, optional): Number of similar items to return

        Returns:
            list: List of (id, score) tuples
        """
        raise NotImplementedError

@simonw
Owner Author

simonw commented Sep 1, 2023

mypy errors:

llm/embeddings.py:46: error: Item "View" of "Table | View" has no attribute "insert"  [union-attr]
llm/embeddings.py:50: error: Item "None" of "EmbeddingModel | None" has no attribute "model_id"  [union-attr]
llm/embeddings.py:55: error: Incompatible return value type (got "Any | None", expected "int")  [return-value]
llm/embeddings.py:105: error: Item "None" of "EmbeddingModel | None" has no attribute "embed"  [union-attr]
llm/embeddings.py:106: error: Item "View" of "Table | View" has no attribute "insert"  [union-attr]
llm/default_plugins/openai_models.py:71: error: Return type "list[list[float]]" of "embed_batch" incompatible with return type "Iterator[list[float]]" in supertype "EmbeddingModel"  [override]
llm/default_plugins/openai_models.py:71: error: Argument 1 of "embed_batch" is incompatible with supertype "EmbeddingModel"; supertype defines the argument type as "Iterable[str]"  [override]
llm/default_plugins/openai_models.py:71: note: This violates the Liskov substitution principle
llm/default_plugins/openai_models.py:71: note: See https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides
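
For reference, a sketch of the two usual fixes for errors like these: make the override match the supertype's signature exactly, and narrow Optional values before using them. The classes here are stripped-down stand-ins, not the real llm code, and LengthModel and model_id_or_raise are hypothetical names.

```python
from typing import Iterable, Iterator, List, Optional


class EmbeddingModel:
    model_id = "base"

    def embed_batch(self, texts: Iterable[str]) -> Iterator[List[float]]:
        raise NotImplementedError


class LengthModel(EmbeddingModel):
    model_id = "length"

    # Same argument and return types as the supertype, so the override
    # is compatible; a generator satisfies Iterator[List[float]].
    def embed_batch(self, texts: Iterable[str]) -> Iterator[List[float]]:
        for text in texts:
            yield [float(len(text))]


def model_id_or_raise(model: Optional[EmbeddingModel]) -> str:
    # Narrow "EmbeddingModel | None" before touching attributes,
    # which is what the union-attr errors are asking for.
    if model is None:
        raise ValueError("No embedding model configured")
    return model.model_id


vectors = list(LengthModel().embed_batch(["a", "abc"]))
```

If the subclass already builds a list internally, returning iter(that_list) is another way to keep the declared return type consistent with the base class.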

simonw added a commit that referenced this issue Sep 1, 2023
@simonw
Owner Author

simonw commented Sep 1, 2023

Also need to refactor the embed CLI command to use llm.Collection.

simonw added a commit that referenced this issue Sep 1, 2023
simonw added a commit that referenced this issue Sep 1, 2023
@simonw
Owner Author

simonw commented Sep 2, 2023

I haven't implemented these methods yet:

llm/llm/embeddings.py

Lines 128 to 148 in 212cd61

    def embed_multi(self, id_text_map: Dict[str, str], store: bool = False) -> None:
        """
        Embed multiple texts and store them in the collection with given IDs.

        Args:
            id_text_map (dict): Dictionary mapping IDs to texts
            store (bool, optional): Whether to store the text in the content column
        """
        raise NotImplementedError

    def embed_multi_with_metadata(
        self,
        id_text_metadata_map: Dict[str, Tuple[str, Dict[str, Union[str, int, float]]]],
    ) -> None:
        """
        Embed multiple texts along with metadata and store them in the collection with given IDs.

        Args:
            id_text_metadata_map (dict): Dictionary mapping IDs to (text, metadata) tuples
        """
        raise NotImplementedError
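
One simple way embed_multi could be filled in is by delegating to embed for each entry. The sketch below uses a hypothetical in-memory class and a toy embedding function rather than a database and a real model, so it shows only the shape of the method.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


class InMemoryCollection:
    """Hypothetical stand-in that keeps rows in a dict instead of a database."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self.embed_fn = embed_fn
        self.rows: Dict[str, Tuple[Vector, Optional[str]]] = {}

    def embed(self, id: str, text: str, store: bool = False) -> None:
        # Keep the original text only when store=True.
        self.rows[id] = (self.embed_fn(text), text if store else None)

    def embed_multi(self, id_text_map: Dict[str, str], store: bool = False) -> None:
        # Simplest version: one embed() call per entry.
        for id_, text in id_text_map.items():
            self.embed(id_, text, store=store)


def toy_embed(text: str) -> Vector:
    # Toy embedding for illustration: [length, number of spaces].
    return [float(len(text)), float(text.count(" "))]


collection = InMemoryCollection(toy_embed)
collection.embed_multi({"id1": "text for id1", "id2": "text for id2"}, store=True)
```

A real implementation would probably send the texts to the model in batches instead of one call per item, since the mypy output elsewhere in this thread shows the models expose an embed_batch method.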

@simonw
Owner Author

simonw commented Sep 2, 2023

I also haven't tested and documented the store=True and metadata=... mechanisms.

Plus there's no way to get BACK the metadata/stored content yet.

@simonw
Owner Author

simonw commented Sep 2, 2023

I also haven't tested and documented the store=True and metadata=... mechanisms.

Plus there's no way to get BACK the metadata/stored content yet.

These were both addressed in:

@simonw
Owner Author

simonw commented Sep 2, 2023

OK, this is ready now: https://llm.datasette.io/en/latest/embeddings/python-api.html

@simonw simonw closed this as completed Sep 2, 2023
simonw added a commit that referenced this issue Sep 2, 2023