Python API for embeddings #191

simonw · 2023-08-28T05:28:10Z

Split from:

Initial abstraction for running embeddings #185

simonw · 2023-09-01T15:57:15Z

I think there are two parts to this: embedding a string, and managing collections.

For embedding strings the existing get_embedding_model(...) API is most of the way there:

model = llm.get_embedding_model("ada-002")
floats = model.embed("text goes here")

I think the decode and encode functions for turning them into binary could go in the llm namespace directly.
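As a rough sketch of what those helpers might look like, assuming embeddings are serialized as packed little-endian 32-bit floats. The byte layout and the function bodies below are assumptions for illustration; this thread does not spell them out.

```python
import struct
from typing import List


def encode(values: List[float]) -> bytes:
    # Pack each float as a little-endian 32-bit value (assumed layout)
    return struct.pack("<" + "f" * len(values), *values)


def decode(binary: bytes) -> List[float]:
    # Each 32-bit float occupies 4 bytes, so the count is len // 4
    return list(struct.unpack("<" + "f" * (len(binary) // 4), binary))


blob = encode([1.0, 0.5, -2.0])
floats = decode(blob)
```

Values that are not exactly representable in 32 bits would come back slightly rounded, which is usually acceptable for similarity search.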

Collections are a bit harder. A collection should be able to store embeddings and run similarity searches (see #190), and eventually manage indexes too.

simonw · 2023-09-01T16:03:33Z

Sketching an initial idea:

collection = llm.Collection(db, "name-of-collection")
# If the collection does not exist it would be created with the default embedding model

if collection.exists():
    # Already exists in the DB
    print("Contains {} items".format(collection.count()))

# Or specify the model specifically:
model = llm.get_embedding_model("ada-002")
collection = llm.Collection(db, "posts", model)

# Or pass the model ID using a named parameter:
collection = llm.Collection(db, "posts", model_id="ada-002")

Once you've got the collection:

collection.embed("id", "text to embed goes here")
# Add store=True to store the text in the content column

# With metadata:
collection.embed("id", "text to embed goes here", {"metadata": "here"})

# Or for multiple things at once:
collection.embed_multi({
    "id1": "text for id1",
    "id2": "text for id2"
})
# Add store=True to store the text in the content column

But what if you want to store metadata as well? Not 100% sure about that, maybe:

collection.embed_multi({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": "text for id2"
})

Not crazy about an API design where it accepts a dictionary with either strings or tuples as values though.

Maybe this:

collection.embed_multi_with_metadata({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": ("text for id2", {"more": "metadata"}),
})
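One way to picture the two-method design: the string-only method can be a thin wrapper over the tuple-based one, so neither has to inspect value types. This is a sketch only. `SketchCollection` and `fake_embed` are hypothetical in-memory stand-ins, not the real `llm.Collection` or an actual embedding model.

```python
from typing import Any, Dict, List, Optional, Tuple

Metadata = Optional[Dict[str, Any]]


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding model: NOT a meaningful embedding
    return [float(len(text)), float(text.count(" "))]


class SketchCollection:
    """Hypothetical in-memory stand-in for the proposed llm.Collection."""

    def __init__(self) -> None:
        self.rows: Dict[str, Tuple[List[float], Metadata]] = {}

    def embed_multi_with_metadata(
        self, items: Dict[str, Tuple[str, Metadata]]
    ) -> None:
        # Every value is a (text, metadata) tuple, so no type sniffing is needed
        for id, (text, metadata) in items.items():
            self.rows[id] = (fake_embed(text), metadata)

    def embed_multi(self, items: Dict[str, str]) -> None:
        # Plain strings are wrapped as (text, None) and delegated
        self.embed_multi_with_metadata(
            {id: (text, None) for id, text in items.items()}
        )


collection = SketchCollection()
collection.embed_multi({"id1": "text for id1"})
collection.embed_multi_with_metadata({"id2": ("text for id2", {"more": "metadata"})})
```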

simonw · 2023-09-01T16:07:41Z

And for retrieval:

ids_and_scores = collection.similar_by_id("id", number=5)

Or:

ids_and_scores = collection.similar("text to be embedded", number=5)
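To illustrate the shape of the return value, here is a minimal brute-force ranking that produces (id, score) tuples. It assumes cosine similarity as the score, which this thread does not specify, and `most_similar` is a hypothetical helper that works on an in-memory dict rather than a database.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    # Assumes both vectors are non-zero and the same length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(
    query: List[float], vectors: Dict[str, List[float]], number: int = 5
) -> List[Tuple[str, float]]:
    # Score every stored vector, then keep the top `number` by score
    scored = [(id, cosine_similarity(query, vec)) for id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number]


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ids_and_scores = most_similar([1.0, 0.0], vectors, number=2)
```

Under this sketch, similar() would embed the text first and then rank, while similar_by_id() would look up the stored vector for that ID and rank against it.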

simonw · 2023-09-01T16:08:15Z

For embedding models that take options (not a thing yet) I think I'll add options=dict parameters to some of these methods, as opposed to using **kwargs which could clash with other keyword arguments like store=True.
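A small illustration of the clash being avoided. Both functions are hypothetical and only echo their arguments back; the model option named "store" is invented to show the collision.

```python
from typing import Any, Dict, Optional


def embed_with_kwargs(
    id: str, text: str, store: bool = False, **kwargs: Any
) -> Dict[str, Any]:
    # Model options arrive mixed in with the method's own keyword arguments
    return {"store": store, "model_options": kwargs}


def embed_with_options(
    id: str, text: str, store: bool = False, options: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    # Model options live in their own dict, so names cannot collide
    return {"store": store, "model_options": options or {}}


# A hypothetical model option that happens to be called "store"
clashing = embed_with_kwargs("id", "text", store="disk")
separate = embed_with_options("id", "text", store=True, options={"store": "disk"})
```

With **kwargs the model option is silently captured by the method's own store parameter and never reaches the model; with options=dict both values survive.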

simonw · 2023-09-01T20:04:54Z

Need to implement the similar methods next:

llm/llm/embeddings.py

Lines 138 to 162 in 6f76170

    
    def similar_by_id(self, id: str, number: int = 5) -> List[Tuple[str, float]]:
        """
        Find similar items in the collection by a given ID.

        Args:
            id (str): ID to search by
            number (int, optional): Number of similar items to return

        Returns:
            list: List of (id, score) tuples
        """
        raise NotImplementedError

    def similar(self, text: str, number: int = 5) -> List[Tuple[str, float]]:
        """
        Find similar items in the collection by a given text.

        Args:
            text (str): Text to search by
            number (int, optional): Number of similar items to return

        Returns:
            list: List of (id, score) tuples
        """
        raise NotImplementedError

simonw · 2023-09-01T20:06:12Z

mypy errors:

llm/embeddings.py:46: error: Item "View" of "Table | View" has no attribute "insert"  [union-attr]
llm/embeddings.py:50: error: Item "None" of "EmbeddingModel | None" has no attribute "model_id"  [union-attr]
llm/embeddings.py:55: error: Incompatible return value type (got "Any | None", expected "int")  [return-value]
llm/embeddings.py:105: error: Item "None" of "EmbeddingModel | None" has no attribute "embed"  [union-attr]
llm/embeddings.py:106: error: Item "View" of "Table | View" has no attribute "insert"  [union-attr]
llm/default_plugins/openai_models.py:71: error: Return type "list[list[float]]" of "embed_batch" incompatible with return type "Iterator[list[float]]" in supertype "EmbeddingModel"  [override]
llm/default_plugins/openai_models.py:71: error: Argument 1 of "embed_batch" is incompatible with supertype "EmbeddingModel"; supertype defines the argument type as "Iterable[str]"  [override]
llm/default_plugins/openai_models.py:71: note: This violates the Liskov substitution principle
llm/default_plugins/openai_models.py:71: note: See https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides

Refs #191 (comment)
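For the two [override] errors, one way to satisfy mypy is for the subclass to keep the supertype's argument type and return an iterator, for example by writing the method as a generator. This is a sketch of that general pattern, not necessarily how the fix was made in the repository; both class bodies are simplified stand-ins and `SketchModel` is hypothetical.

```python
from typing import Iterable, Iterator, List


class EmbeddingModel:
    def embed_batch(self, texts: Iterable[str]) -> Iterator[List[float]]:
        raise NotImplementedError


class SketchModel(EmbeddingModel):
    # Same argument type as the supertype; a generator function
    # satisfies the Iterator[List[float]] return annotation
    def embed_batch(self, texts: Iterable[str]) -> Iterator[List[float]]:
        for text in texts:
            # Placeholder vector, not a real embedding
            yield [float(len(text))]


vectors = list(SketchModel().embed_batch(["a", "bcd"]))
```

The [union-attr] errors on "EmbeddingModel | None" are a separate category and are typically resolved by checking for None (or raising) before the attribute access.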

simonw · 2023-09-01T20:18:11Z

Also need to refactor the embed CLI command to use llm.Collection.

Refactor llm embed command to use the new Python API #204

simonw · 2023-09-02T00:26:24Z

I haven't implemented these methods yet:

llm/llm/embeddings.py

Lines 128 to 148 in 212cd61

    
    def embed_multi(self, id_text_map: Dict[str, str], store: bool = False) -> None:
        """
        Embed multiple texts and store them in the collection with given IDs.

        Args:
            id_text_map (dict): Dictionary mapping IDs to texts
            store (bool, optional): Whether to store the text in the content column
        """
        raise NotImplementedError

    def embed_multi_with_metadata(
        self,
        id_text_metadata_map: Dict[str, Tuple[str, Dict[str, Union[str, int, float]]]],
    ) -> None:
        """
        Embed multiple texts along with metadata and store them in the collection with given IDs.

        Args:
            id_text_metadata_map (dict): Dictionary mapping IDs to (text, metadata) tuples
        """
        raise NotImplementedError

simonw · 2023-09-02T00:27:12Z

I also haven't tested and documented the store=True and metadata=... mechanisms.

Plus there's no way to get BACK the metadata/stored content yet.

simonw · 2023-09-02T02:02:09Z

I also haven't tested and documented the store=True and metadata=... mechanisms.

Plus there's no way to get BACK the metadata/stored content yet.

These were both addressed in:

Implement, test and document store=True and metadata=... mechanisms #203

simonw · 2023-09-02T03:18:42Z

OK, this is ready now: https://llm.datasette.io/en/latest/embeddings/python-api.html

Refs #185, #190, #191

simonw added the enhancement, python-api and embeddings labels Aug 28, 2023

simonw mentioned this issue Sep 1, 2023

llm similar command for searching against embeddings #190

Closed

simonw added a commit that referenced this issue Sep 1, 2023

Initial Collection class plus test, refs #191

6f76170

simonw added a commit that referenced this issue Sep 1, 2023

Fix mypy errors

7a4429f

Refs #191 (comment)

simonw added a commit that referenced this issue Sep 1, 2023

Collection.similar methods, refs #191

0ec5165

simonw added a commit that referenced this issue Sep 1, 2023

Fixed tests for similar, refs #191

ae1e7da

simonw added a commit that referenced this issue Sep 2, 2023

Initial Python embeddings API docs, refs #191

212cd61

This was referenced Sep 2, 2023

Implement collection.embed_multi and collection.embed_multi_with_metadata #202

Closed

Implement, test and document store=True and metadata=... mechanisms #203

Closed

simonw added this to the 0.9 - embeddings milestone Sep 2, 2023

simonw closed this as completed Sep 2, 2023

simonw added a commit that referenced this issue Sep 2, 2023

Release 0.9a0

1cd4596

Refs #185, #190, #191
