[Feature] Addition of KDB.AI vector database as data store. #386

Open
wants to merge 33 commits into
base: main
33 commits
f6a7995
Initial commit.
bu2 Jul 24, 2023
808f96b
KXI-28991 initial commit
Aug 25, 2023
3be119b
KXI-28991 newclientversion
Aug 29, 2023
67b205d
KXI-28991 working-upsert
Aug 30, 2023
bb4994a
Remove __pycache__ dirs from Git.
bu2 Aug 30, 2023
0ae6dd2
Clean up repo and tweak pyproject.toml.
bu2 Aug 30, 2023
30fead2
Clean up .pyc file and minor refactoring.
bu2 Aug 30, 2023
437213b
Get the ChatGPT Plugin working end to end with the last KDBAI Python …
bu2 Aug 31, 2023
fe40b9f
Add support for KDBAI API key.
bu2 Aug 31, 2023
6b9df03
KXI-28991 working cloud version
Sep 7, 2023
e718e18
KXI-28991 demo notebook
Sep 7, 2023
4613245
KXI-28991 delete
Sep 7, 2023
e65bf81
Merge remote-tracking branch 'second-repo/main' into mergeGbt
alexgiannak Sep 7, 2023
35a878a
kdbai as vectorstore
alexgiannak Sep 7, 2023
d8b1a4c
Fix delete functionality
alexgiannak Sep 7, 2023
c679c9f
update notebook
alexgiannak Sep 7, 2023
1b24214
Update kdbai-client python dependency.
bu2 Sep 8, 2023
aca79fa
Update notebook to try out the KDB.AI ChatGPT Retrieval Plugin.
bu2 Sep 8, 2023
e970c64
Merge branch 'kdbai' into 'KXI-28991'
bu2kx Sep 8, 2023
82a46f9
Fix getting start instructions for the KDB.aI ChatGPT Retrieval Plugin.
bu2 Sep 8, 2023
110ee90
KXI-28991 updated pyproject.toml to kdbai-client=^0.1.1
Sep 12, 2023
6464128
KDB.AI updated QA notebook
Sep 12, 2023
09973ae
notebook and diagram
Sep 12, 2023
2668224
improved diagram
Sep 12, 2023
eda7960
KDB.AI notebook with examples
Sep 13, 2023
9a6d413
Merge last main in KDB.AI branch.
bu2 Nov 9, 2023
005e516
Fix pyproject.toml and refresh poetry.lock.
bu2 Nov 9, 2023
5878832
Upgrade the sample KDB.AI notebook to the new OpenAI API.
bu2 Nov 9, 2023
a75801d
Amend README and add setup documentation for KDB.AI vector database.
bu2 Nov 9, 2023
94af5e1
merge main
alexgiannak Apr 9, 2024
83462f5
Update kdbai_datastore.py
alexgiannak Apr 9, 2024
c3e498a
Update kdb.AI example notebook
alexgiannak Apr 9, 2024
3a07213
Merge pull request #1 from alexgiannak/KDB.AI-CustomActions
bu2kx Apr 25, 2024
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -138,4 +138,4 @@ dmypy.json
.pyre/

# macOS .DS_Store files
-.DS_Store
+.DS_Store
1 change: 1 addition & 0 deletions README
@@ -0,0 +1 @@
TBD
8 changes: 8 additions & 0 deletions README.md
@@ -119,6 +119,10 @@ Follow these steps to quickly set up and run the ChatGPT Retrieval Plugin:
export MILVUS_USER=<your_milvus_username>
export MILVUS_PASSWORD=<your_milvus_password>

# KDB.AI
export KDBAI_ENDPOINT=<KDB.AI_endpoint>
export KDBAI_API_KEY=<KDB.AI_API_key>

# Qdrant
export QDRANT_URL=<your_qdrant_url>
export QDRANT_PORT=<your_qdrant_port>
@@ -388,6 +392,10 @@ For more detailed instructions on setting up and using each vector database prov

[Milvus](https://milvus.io/) is an open-source, cloud-native vector database that scales to billions of vectors. It is the open-source version of Zilliz and shares many of its features, such as various indexing algorithms, distance metrics, scalar filtering, time travel searches, rollback with snapshots, multi-language SDKs, storage and compute separation, and cloud scalability. For detailed setup instructions, refer to [`/docs/providers/milvus/setup.md`](/docs/providers/milvus/setup.md).

#### KDB.AI

[KDB.AI](https://kdb.ai) is a powerful knowledge-based vector database and search engine that allows developers to build scalable, reliable and real-time applications by providing advanced search, recommendation and personalization for AI applications, using real-time data. For detailed setup instructions, refer to [`/docs/providers/kdbai/setup.md`](/docs/providers/kdbai/setup.md).

#### Qdrant

[Qdrant](https://qdrant.tech/) is a vector database capable of storing documents and vector embeddings. It offers both self-hosted and managed [Qdrant Cloud](https://cloud.qdrant.io/) deployment options, providing flexibility for users with different requirements. For detailed setup instructions, refer to [`/docs/providers/qdrant/setup.md`](/docs/providers/qdrant/setup.md).
10 changes: 7 additions & 3 deletions datastore/factory.py
@@ -65,11 +65,15 @@ async def get_datastore() -> DataStore:
         case "elasticsearch":
             from datastore.providers.elasticsearch_datastore import (
                 ElasticsearchDataStore,
-            )
+            )
+            return ElasticsearchDataStore()
+        case "kdbai":
+            from datastore.providers.kdbai_datastore import KDBAIDataStore
+
+            return KDBAIDataStore()

-            return ElasticsearchDataStore()
         case _:
             raise ValueError(
                 f"Unsupported vector database: {datastore}. "
-                f"Try one of the following: llama, elasticsearch, pinecone, weaviate, milvus, zilliz, redis, azuresearch, or qdrant"
+                f"Try one of the following: llama, elasticsearch, pinecone, weaviate, milvus, zilliz, redis, azuresearch, kdbai or qdrant"
             )
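The new `case "kdbai"` branch follows the factory's existing pattern: each provider module is imported inside its own branch, so only the selected provider's dependencies have to be installed. A minimal sketch of that lazy-import dispatch, illustrated with standard-library serializers instead of datastore clients; `get_serializer` is a hypothetical name and not part of the plugin:

```python
from typing import Any, Callable


def get_serializer(name: str) -> Callable[[Any], Any]:
    """Return a serializer function, importing its module only when selected."""
    if name == "json":
        import json  # imported only when this branch is taken

        return json.dumps
    if name == "pickle":
        import pickle  # never imported if the caller asks for "json"

        return pickle.dumps
    # Mirror the factory: an unknown name fails loudly and lists the valid ones.
    raise ValueError(
        f"Unsupported serializer: {name}. "
        "Try one of the following: json or pickle"
    )
```

As in the factory, a new option costs one branch plus one entry in the error message.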
171 changes: 171 additions & 0 deletions datastore/providers/kdbai_datastore.py
@@ -0,0 +1,171 @@
import os
from typing import Dict, List, Optional

from loguru import logger
import pandas as pd

from services.date import to_unix_timestamp
from datastore.datastore import DataStore

from models.models import (
    DocumentChunk,
    DocumentChunkWithScore,
    DocumentMetadataFilter,
    QueryResult,
    QueryWithEmbedding,
)

try:
    import pykx as kx
    logger.info('PyKX version: ' + kx.__version__)

except ImportError:
    raise ImportError(
        'Could not import pykx package. '
        'Please add it to the dependencies.'
    )

try:
    import kdbai_client as kdbai
    logger.info('KDBAI client version: ' + kdbai.__version__)

except ImportError:
    raise ImportError(
        'Could not import kdbai_client package. '
        'Please add it to the dependencies.'
    )


KDBAI_ENDPOINT = os.environ.get('KDBAI_ENDPOINT', 'http://localhost:8082')
# Treat an unset or empty API key as "no key".
KDBAI_API_KEY = os.environ.get('KDBAI_API_KEY') or None

DEFAULT_DIMS = 3072
BATCH_SIZE = 100

DEFAULT_SCHEMA = dict(
    columns=[
        dict(name='document_id', pytype='str'),
        dict(name='text', pytype='bytes'),
        dict(name='vecs', vectorIndex=dict(type='flat', metric='L2', dims=DEFAULT_DIMS)),
    ])

# Caution: a KDBAI_SCHEMA value read from the environment is a plain string,
# not a dict like DEFAULT_SCHEMA.
SCHEMA = os.environ.get('KDBAI_SCHEMA', DEFAULT_SCHEMA)
TABLE = os.environ.get('KDBAI_TABLE', 'documents')
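
`SCHEMA` is read with `os.environ.get('KDBAI_SCHEMA', DEFAULT_SCHEMA)`, so a value supplied through the environment arrives as a string while the default is a dict. One possible way to handle this, sketched under the assumption that the variable would hold JSON (the PR does not say which encoding is intended); `load_schema` is a hypothetical helper and the shortened `DEFAULT_SCHEMA` is for illustration only:

```python
import json
import os
from typing import Any, Dict, Mapping

# Shortened stand-in for the module's DEFAULT_SCHEMA (illustration only).
DEFAULT_SCHEMA: Dict[str, Any] = {
    "columns": [{"name": "document_id", "pytype": "str"}],
}


def load_schema(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Return the table schema as a dict, decoding KDBAI_SCHEMA when it is set."""
    raw = env.get("KDBAI_SCHEMA")
    if not raw:
        return DEFAULT_SCHEMA
    schema = json.loads(raw)  # assumption: the variable holds a JSON object
    if not isinstance(schema, dict) or "columns" not in schema:
        raise ValueError("KDBAI_SCHEMA must be a JSON object with a 'columns' key")
    return schema
```

Validating the decoded value early keeps a malformed schema from surfacing later as an opaque table-creation error.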


class KDBAIDataStore(DataStore):

    def __init__(self) -> None:
        try:
            logger.info('Creating KDBAI datastore...')
            self._session = kdbai.Session(endpoint=KDBAI_ENDPOINT, api_key=KDBAI_API_KEY)

            # Reuse the table if it already exists; otherwise create it from the schema.
            if TABLE in self._session.list():
                self._table = self._session.table(TABLE)
            else:
                self._table = self._session.create_table(TABLE, SCHEMA)

        except Exception as e:
            logger.error(f'Error while initializing KDBAI datastore: {e}.')
            raise


    async def _upsert(self, chunks: Dict[str, List[DocumentChunk]]) -> List[str]:
        """Upsert chunks into the datastore.

        Args:
            chunks (Dict[str, List[DocumentChunk]]): Mapping from document ID to the
                list of DocumentChunks to insert for that document.

        Returns:
            List[str]: The document IDs that were submitted for insertion, or an
                empty list if the insert could not be prepared.
        """
        try:
            docs = []
            for doc_id, chunk_list in chunks.items():
                for chunk in chunk_list:
                    docs.append(dict(
                        document_id=doc_id,
                        text=chunk.text,
                        vecs=chunk.embedding,
                    ))
            df = pd.DataFrame(docs)

            # Insert in batches of BATCH_SIZE rows; a failed batch is logged and skipped.
            for i in range((len(df)-1)//BATCH_SIZE+1):
                batch = df.iloc[i*BATCH_SIZE:(i+1)*BATCH_SIZE]
                try:
                    self._table.insert(batch, warn=False)
                except Exception:
                    logger.exception('Failed to insert the batch of documents into the data store.')

            # One ID per document rather than one per chunk.
            return list(chunks.keys())

        except Exception:
            logger.exception('Failed to insert documents into the data store.')
            return []


    async def _query(
        self,
        queries: List[QueryWithEmbedding],
    ) -> List[QueryResult]:
        """Search the table for the nearest neighbours of each query embedding.

        Metadata filters are not applied.

        Args:
            queries (List[QueryWithEmbedding]): The list of searches to perform.

        Returns:
            List[QueryResult]: Results for each search.
        """
        out = []
        for query in queries:
            try:
                resdf = self._table.search(vectors=[query.embedding], n=query.top_k)[0]
            except Exception:
                logger.exception("Error while processing queries.")
                raise

            docs = []
            for result in resdf.to_dict(orient='records'):
                docs.append(DocumentChunkWithScore(
                    id=result['document_id'],
                    text=result['text'],
                    metadata=dict(),
                    # With the L2 metric this is a distance: smaller means closer.
                    score=result['__nn_distance'],
                ))
            out.append(QueryResult(query=query.query, results=docs))

        return out


    async def delete(
        self,
        ids: Optional[List[str]] = None,
        filter: Optional[DocumentMetadataFilter] = None,
        delete_all: Optional[bool] = None,
    ) -> bool:
        """Delete all vectors and the associated index.

        Only `delete_all` is supported; deleting by `ids` or `filter` is not
        implemented yet.
        """
        try:
            if delete_all:
                self._table.drop()
                logger.info("Deleted all vectors and index successfully")
                return True

            logger.error("Functionality is not implemented yet")
            return False

        except Exception as e:
            logger.error("Failed to delete records, error: {}".format(e))
            return False
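
The batch loop in `_upsert` relies on `(len(df)-1)//BATCH_SIZE+1` to count batches, which is integer ceiling division and yields zero batches for an empty frame. A small sketch of the same arithmetic on a plain list, without pandas; `batch_count` and `iter_batches` are hypothetical names used only for illustration:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_count(n_rows: int, batch_size: int) -> int:
    """Number of batches needed for n_rows, using the formula from `_upsert`."""
    # Floor division rounds toward negative infinity, so n_rows == 0 gives
    # (-1) // batch_size + 1 == 0 and the loop body never runs.
    return (n_rows - 1) // batch_size + 1


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size rows."""
    for i in range(batch_count(len(rows), batch_size)):
        yield rows[i * batch_size:(i + 1) * batch_size]


sizes: List[int] = [len(b) for b in iter_batches(list(range(250)), 100)]
print(sizes)  # [100, 100, 50]
```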

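The default schema declares a `flat` index with the `L2` metric, and `_query` passes the returned `__nn_distance` through as `score`. A flat index is an exhaustive search, so the idea can be sketched in pure Python. This is only an illustration of the technique, not KDB.AI's implementation, and whether the service reports the Euclidean distance or its square is not stated in the diff; `flat_l2_search` is a hypothetical name:

```python
import heapq
import math
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    n: int,
) -> List[Tuple[float, int]]:
    """Return the n (distance, row index) pairs closest to the query vector."""
    # Compare the query against every stored vector (exhaustive, "flat" search).
    scored = ((math.dist(vec, query), idx) for idx, vec in enumerate(vectors))
    # Smaller L2 distance means a closer match, so keep the n smallest.
    return heapq.nsmallest(n, scored)


stored = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
print(flat_l2_search(stored, [0.0, 0.0], n=2))  # [(0.0, 0), (1.0, 2)]
```

Because the value is a distance, results come back in ascending order, the opposite of a similarity score where higher is better.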
16 changes: 16 additions & 0 deletions docs/providers/kdbai/setup.md
@@ -0,0 +1,16 @@
# KDB.AI

[KDB.AI](https://kdb.ai) is a powerful knowledge-based vector database and search engine that allows developers to build scalable, reliable and real-time applications by providing advanced search, recommendation and personalization for AI applications, using real-time data. You can register for a free trial at https://kdb.ai.

A sample notebook that uses the ChatGPT Retrieval Plugin backed by the KDB.AI vector database is available [here](https://github.com/KxSystems/chatgpt-retrieval-plugin/blob/KDB.AI/examples/providers/kdbai/ChatGPT_QA_Demo.ipynb), and instructions to get started are [here](https://code.kx.com/kdbai/integrations/openai.html).

**Environment Variables:**

| Name | Required | Description | Default |
| ------------------- | -------- | ----------------------------------------------------------- | ------------------ |
| `DATASTORE` | Yes | Datastore name, set to `kdbai` | |
| `BEARER_TOKEN` | Yes | Secret token | |
| `OPENAI_API_KEY` | Yes | OpenAI API key | |
| `KDBAI_ENDPOINT` | Yes | KDB.AI endpoint | |
| `KDBAI_API_KEY` | Yes | KDB.AI API key | |
| `KDBAI_TABLE` | Optional | Name of the KDB.AI table that stores the documents | `documents` |
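
Before starting the plugin, it can help to confirm that every variable marked as required in the table is set. A minimal sketch; `missing_variables` is a hypothetical helper, and the list of names is copied from the table above rather than from the plugin code:

```python
import os
from typing import List, Mapping

# Variables the table above marks as required.
REQUIRED = (
    "DATASTORE",
    "BEARER_TOKEN",
    "OPENAI_API_KEY",
    "KDBAI_ENDPOINT",
    "KDBAI_API_KEY",
)


def missing_variables(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```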