Python API for embeddings #191
I think there are two parts to this: embedding a string, and managing collections.

For embedding strings, the existing API should do, I think:

    model = llm.get_embedding_model("ada-002")
    floats = model.embed("text goes here")

Collections are a bit harder. A collection should be able to store embeddings and run similarity searches (see #190), and eventually manage indexes as well.
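To make the "run similarity" part concrete, here is a minimal sketch of comparing two embedding vectors with cosine similarity. It assumes cosine similarity is the measure being used, which this thread does not state. The vectors are made-up stand-ins rather than real `model.embed()` output, and `cosine_similarity` is a hypothetical helper, not part of `llm`.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in vectors (assumption); real ones would come from model.embed("...")
floats_a = [0.1, 0.3, 0.5]
floats_b = [0.2, 0.6, 1.0]     # points in the same direction as floats_a
floats_c = [0.5, -0.1, -0.04]  # orthogonal to floats_a

print(cosine_similarity(floats_a, floats_b))
print(cosine_similarity(floats_a, floats_c))
```

Vectors pointing the same way score close to 1 and orthogonal vectors score close to 0, so sorting by this value in descending order gives a "most similar first" ranking.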
Sketching an initial idea:

    collection = llm.Collection(db, "name-of-collection")
    # If the collection does not exist it would be created with the default embedding model
    if collection.exists():
        # Already exists in the DB
        print("Contains {} items".format(collection.count()))
    # Or specify the model explicitly:
    model = llm.get_embedding_model("ada-002")
    collection = llm.Collection(db, "posts", model)
    # Or pass the model ID using a named parameter:
    collection = llm.Collection(db, "posts", model_id="ada-002")

Once you've got the collection:

    collection.embed("id", "text to embed goes here")
    # Add store=True to store the text in the content column
    # With metadata:
    collection.embed("id", "text to embed goes here", {"metadata": "here"})
    # Or for multiple things at once:
    collection.embed_multi({
        "id1": "text for id1",
        "id2": "text for id2"
    })
    # Add store=True to store the text in the content column

But what if you want to store metadata as well? Not 100% sure about that, maybe:

    collection.embed_multi({
        "id1": ("text for id1", {"metadata": "goes here"}),
        "id2": "text for id2"
    })

Not crazy about an API design where it accepts a dictionary with either strings or tuples as values though. Maybe this:

    collection.embed_multi_with_metadata({
        "id1": ("text for id1", {"metadata": "goes here"}),
        "id2": ("text for id2", {"more": "metadata"}),
    })
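To illustrate why the mixed string-or-tuple design is awkward, here is a rough sketch of the type-checking an implementation would need before it could embed anything. `normalize_entries` is a hypothetical helper invented for this example, not something in `llm`.

```python
def normalize_entries(entries):
    """Turn {id: text} or {id: (text, metadata)} into (id, text, metadata) rows.

    Hypothetical helper for illustration only; not part of llm.
    """
    rows = []
    for id_, value in entries.items():
        if isinstance(value, str):
            # Plain string: no metadata supplied
            text, metadata = value, None
        elif isinstance(value, tuple) and len(value) == 2:
            # (text, metadata) pair
            text, metadata = value
        else:
            raise TypeError(
                "value for {!r} must be a string or a (text, metadata) tuple".format(id_)
            )
        rows.append((id_, text, metadata))
    return rows


rows = normalize_entries({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": "text for id2",
})
for row in rows:
    print(row)
```

A separate `embed_multi_with_metadata()` avoids this branching, since every value has the same shape, at the cost of one more method name.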
And for retrieval:

    ids_and_scores = collection.similar_by_id("id", number=5)

Or:

    ids_and_scores = collection.similar("text to be embedded", number=5)
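As a way to try out the shape of these two calls without a database or an API key, here is a small in-memory sketch. Everything in it is an assumption made for illustration: `ToyCollection` and `toy_embed` are hypothetical names, the "embedding" is just letter counts, scoring uses cosine similarity, and `similar_by_id()` is assumed to leave the item itself out of its results.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: counts of the letters a-z. Not a real model."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyCollection:
    """In-memory stand-in for the proposed collection (hypothetical)."""

    def __init__(self):
        self._vectors = {}

    def embed(self, id_, text):
        self._vectors[id_] = toy_embed(text)

    def count(self):
        return len(self._vectors)

    def _rank(self, vector, number, skip_id=None):
        # Score every stored vector, highest similarity first
        scored = [
            (id_, cosine(vector, other))
            for id_, other in self._vectors.items()
            if id_ != skip_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:number]

    def similar(self, text, number=5):
        return self._rank(toy_embed(text), number)

    def similar_by_id(self, id_, number=5):
        # Assumption: the item itself is excluded from its own results
        return self._rank(self._vectors[id_], number, skip_id=id_)


collection = ToyCollection()
collection.embed("cats", "cats and kittens")
collection.embed("dogs", "dogs and puppies")
collection.embed("zzz", "zzzz")

ids_and_scores = collection.similar("kittens", number=2)
print(ids_and_scores)
print(collection.similar_by_id("cats", number=2))
```

Both methods return a list of `(id, score)` pairs sorted from most to least similar, which is one plausible reading of the `ids_and_scores` name above.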
For embedding models that take options (not a thing yet) I think I'll add
Need to implement the Lines 138 to 162 in 6f76170
Also need to refactor the
I haven't implemented these methods yet: Lines 128 to 148 in 212cd61
I also haven't tested and documented the
Plus there's no way to get BACK the metadata/stored content yet.
These were both addressed in:
OK, this is ready now: https://llm.datasette.io/en/latest/embeddings/python-api.html
Split from: