The whyhow_rbr package helps create customized RAG pipelines. It is built on top of the following technologies (and their respective Python SDKs):
- OpenAI - text generation
- Milvus - vector database
First, import the essential packages:
from pymilvus import DataType
from src.whyhow_rbr.rag_milvus import ClientMilvus
The central object is a ClientMilvus. It manages all necessary resources and provides a simple interface for all RAG-related tasks.
First of all, to instantiate it one needs to provide the following credentials:
- OPENAI_API_KEY
- Milvus_URI
- Milvus_API_TOKEN
Initialize the ClientMilvus like this:
# Set up your Milvus Cloud information
YOUR_MILVUS_CLOUD_END_POINT="YOUR_MILVUS_CLOUD_END_POINT"
YOUR_MILVUS_CLOUD_TOKEN="YOUR_MILVUS_CLOUD_TOKEN"
# Initialize the ClientMilvus
milvus_client = ClientMilvus(
    milvus_uri=YOUR_MILVUS_CLOUD_END_POINT,
    milvus_token=YOUR_MILVUS_CLOUD_TOKEN
)
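Hard-coding tokens in a script makes them easy to leak. As a minimal sketch, assuming the three credentials are exported as environment variables under the names listed above, they could be collected and validated before the client is constructed. The helper `load_credentials` is hypothetical and not part of whyhow_rbr.

```python
import os
from typing import Mapping, Optional

# Credential names as listed in this tutorial.
REQUIRED_VARS = ("OPENAI_API_KEY", "Milvus_URI", "Milvus_API_TOKEN")


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the required credentials, or raise if any is missing or empty.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be passed on, for example `milvus_uri=creds["Milvus_URI"]` and `milvus_token=creds["Milvus_API_TOKEN"]`.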
In this tutorial, whyhow_rbr uses Milvus for everything related to vector databases.
# Define collection name
COLLECTION_NAME="YOUR_COLLECTION_NAME" # choose your own collection name
# Define vector dimension size
DIMENSION=1536 # determined by the embedding model you use
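DIMENSION must equal the length of the vectors produced by the embedding model. As an illustration, the sketch below lists the default output sizes of some OpenAI embedding models as publicly documented. Which model whyhow_rbr uses is not stated here, so the model choice in the example is an assumption, and the helper `check_dimension` is hypothetical.

```python
# Default output sizes of OpenAI embedding models (verify against the OpenAI
# documentation for the model you actually use).
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_dimension(vector, expected_dim):
    """Raise ValueError if a vector does not match the collection's dimension."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(vector)}")
    return vector


# Assumed model choice, for illustration only.
DIMENSION = EMBEDDING_DIMENSIONS["text-embedding-3-small"]
check_dimension([0.0] * 1536, DIMENSION)  # passes silently
```

A mismatch between the vector length and the dimension declared in the schema is a common cause of insert failures, so a check like this can catch the problem early.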
Before inserting any data into the Milvus database, we first need to define the data fields, which together are called the schema. By creating a CollectionSchema object and adding fields to it through add_field(), we can control the data types and their characteristics. This step is required.
schema = milvus_client.create_schema(auto_id=True) # Let Milvus generate primary key values automatically
schema = milvus_client.add_field(schema=schema, field_name="id", datatype=DataType.INT64, is_primary=True)
schema = milvus_client.add_field(schema=schema, field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)
We only defined id and embedding here because every collection needs a primary field and, for the embedding field, a dimension. We enable enable_dynamic_field, which lets you insert fields that are not declared in the schema, but we still encourage you to define the schema yourself. This method is a thin wrapper around the official Milvus implementation (official docs).
Each schema should have an index so that queries run much more efficiently. To create an index, we first need an IndexParams object, created with prepare_index_params(), and then add index definitions to it.
# Start indexing the data field
index_params = milvus_client.prepare_index_params()
index_params = milvus_client.add_index(
    index_params=index_params, # pass in the index_params object
    field_name="embedding",
    index_type="AUTOINDEX", # use AUTOINDEX rather than a more complex indexing method
    metric_type="COSINE", # L2, COSINE, or IP
)
This method is a thin wrapper around the official Milvus implementation (official docs).
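To make the choice of metric_type more concrete, the sketch below computes the three metrics with their textbook definitions in plain Python. It is an illustration only; Milvus's internal implementation may differ in details (for example, whether the L2 value is reported squared).

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product (IP): larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product of the normalized vectors: larger means more similar."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
doc_same_direction = [3.0, 0.0]  # same direction as the query, larger magnitude
doc_orthogonal = [0.0, 1.0]      # unrelated direction

print(cosine_similarity(query, doc_same_direction))  # 1.0
print(inner_product(query, doc_same_direction))      # 3.0
print(l2_distance(query, doc_same_direction))        # 2.0
print(cosine_similarity(query, doc_orthogonal))      # 0.0
```

COSINE ignores vector magnitude, so only direction matters, which is one reason it is a common choice for text embeddings.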
After defining all the data fields and indexing them, we now need to create the collection so that we can access our data quickly and precisely. Note that enable_dynamic_field is set to true so that you can upload any data freely. The cost is that querying might be less efficient.
# Create Collection
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    index_params=index_params
)
After creating a collection, we are ready to populate it with documents. In whyhow_rbr this is done using the upload_documents method of the ClientMilvus.
It performs the following steps under the hood:
- Preprocessing: Reading and splitting the provided PDF files into chunks
- Embedding: Embedding all the chunks using an OpenAI model
- Inserting: Uploading both the embeddings and the metadata to a Milvus collection
See below an example of how to use it.
# Get the PDFs
pdfs = ["harry-potter.pdf", "game-of-thrones.pdf"] # replace with the paths to your PDFs
# Upload the PDF documents
milvus_client.upload_documents(
    collection_name=COLLECTION_NAME,
    documents=pdfs
)
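The three steps that upload_documents performs can be pictured with a toy pipeline. The sketch below is not the whyhow_rbr implementation: it splits plain strings instead of PDFs, uses a hash-based stand-in for the OpenAI embedding call, and only builds the records rather than inserting them. All function names are hypothetical, and the field names are assumptions.

```python
import hashlib


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (stand-in for PDF splitting)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def fake_embed(chunk, dim=8):
    """Deterministic toy embedding; a real pipeline would call an embedding model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_records(filename, pages):
    """Turn a list of page texts into records ready for insertion."""
    records = []
    for page_number, page_text in enumerate(pages):  # 0-indexed pages
        for chunk in split_into_chunks(page_text):
            records.append({
                "text": chunk,
                "filename": filename,        # assumed metadata field
                "page_number": page_number,  # assumed metadata field
                "embedding": fake_embed(chunk),
            })
    return records


records = build_records("harry-potter.pdf", ["a" * 500, "short page"])
print(len(records))  # 4: three chunks from the first page, one from the second
```

Keeping the metadata next to each embedding is what later makes rule-based filtering possible.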
Now we can finally move to retrieval-augmented generation. In whyhow_rbr with Milvus, it can be done via the search method.
- Simple example:
# Search data and implement RAG!
res = milvus_client.search(
    question='What food does Harry Potter like to eat?',
    collection_name=COLLECTION_NAME,
    anns_field='embedding',
    output_fields='text'
)
print(res['answer'])
print(res['matches'])
The result is a dictionary that has the following keys:
- answer - the answer to the question
- matches - the limit most relevant documents from the index
Note that the number of matches will in general be equal to limit, which can be specified as a parameter.
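Why the number of matches equals limit only "in general" can be seen from a toy top-k search: when fewer than limit chunks are available, fewer are returned. The sketch below uses plain Python and a hypothetical helper `top_matches`. It does not reflect how Milvus performs the search internally, and the default of 5 is an arbitrary value chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vec, chunks, limit=5):
    """Return up to `limit` chunks, most similar first."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:limit]


# Toy data: three chunks with two-dimensional embeddings.
chunks = [
    {"text": "treacle tart", "embedding": [0.9, 0.1]},
    {"text": "quidditch rules", "embedding": [0.1, 0.9]},
    {"text": "pumpkin juice", "embedding": [0.8, 0.3]},
]

matches = top_matches([1.0, 0.0], chunks, limit=5)
print([m["text"] for m in matches])
# ['treacle tart', 'pumpkin juice', 'quidditch rules'] (only 3 matches, although limit=5)
```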
Finally, after completing all the steps, you can clean up the database by calling drop_collection().
# Clean up
milvus_client.drop_collection(
collection_name=COLLECTION_NAME
)
In the previous example, every single document in our index was considered.
However, sometimes it might be beneficial to only retrieve documents satisfying some
predefined conditions (e.g. filename=harry-potter.pdf). In whyhow_rbr through Milvus, this can be done by adjusting the search parameters.
A rule can control the following metadata attributes:
- filename - name of the file
- page_numbers - list of integers corresponding to page numbers (0 indexing)
- id - unique identifier of a chunk (this is the most "extreme" filter)
- Other rules based on Boolean Expressions
Rules Example:
# RULES (search the book harry-potter on page 8):
PARTITION_NAME='harry-potter' # search on books
page_number='page_number == 8'
# first create a partition to store the book, then search within this specific partition:
milvus_client.create_partition(
    collection_name=COLLECTION_NAME,
    partition_name=PARTITION_NAME # separate based on your PDF type
)
# search with rules
res = milvus_client.search(
    question='Tell me about the greedy method',
    collection_name=COLLECTION_NAME,
    partition_names=PARTITION_NAME,
    filter=page_number, # add any rule that follows the boolean expression syntax
    anns_field='embedding',
    output_fields='text'
)
print(res['answer'])
print(res['matches'])
In this example, we first create a partition that stores harry-potter related PDFs; by searching within this partition, we get the most relevant information. We also apply the page number as a filter to specify the exact page we wish to search. Remember, the filter parameter needs to follow the boolean expression rules.
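Filter strings like the one above can be assembled programmatically. The sketch below uses a hypothetical helper `build_filter` that combines optional conditions with `and`. The field names filename, page_number, and id are assumptions about how the metadata is stored, and the generated expressions should be checked against the Milvus boolean expression rules before use.

```python
import json


def build_filter(filename=None, page_numbers=None, ids=None):
    """Combine optional conditions into a single boolean expression string.

    Hypothetical helper; returns an empty string when no condition is given.
    """
    conditions = []
    if filename is not None:
        # json.dumps wraps the value in double quotes and escapes it.
        conditions.append(f"filename == {json.dumps(filename)}")
    if page_numbers:
        conditions.append(f"page_number in {list(page_numbers)}")
    if ids:
        conditions.append(f"id in {list(ids)}")
    return " and ".join(conditions)


print(build_filter(filename="harry-potter.pdf", page_numbers=[8]))
# filename == "harry-potter.pdf" and page_number in [8]
```

The resulting string could be passed as the filter argument of search, in the same way as the page_number variable in the example above.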
That's all for the Milvus implementation of Rule-based Retrieval.