chunk_id and document_id not accessible #50
Comments
Hi @undo76, thanks for submitting this issue! It's actually already possible to get access to the RAG sources as follows:
from raglite import RAGLiteConfig, hybrid_search, rerank_chunks, rag
# Configure RAGLite (here, we use the default config):
my_config = RAGLiteConfig()
# Search for chunks:
prompt = "What does it mean for two events to be simultaneous?"
chunk_ids_hybrid, _ = hybrid_search(prompt, num_results=20, config=my_config)
# Rerank chunks:
chunks_reranked = rerank_chunks(prompt, chunk_ids_hybrid, config=my_config)
# Pass the retrieved chunks as context for RAG:
stream = rag(prompt, search=chunks_reranked, config=my_config)
# Access the RAG sources:
from raglite._database import create_database_engine
from sqlmodel import Session
with Session(create_database_engine()) as session:
    chunks_reranked = [session.merge(chunk) for chunk in chunks_reranked]  # Reattach the chunks to a Session.
    documents = [chunk.document for chunk in chunks_reranked]
That said, this API certainly isn't perfect yet. What do you think about the following improvements?
The v0.3.0 release resulting from #52 fixes this. The README now documents the improved RAG pipeline API and how to access the source documents:
from raglite import create_rag_instruction, rag, retrieve_rag_context
# Retrieve relevant chunk spans with hybrid search and reranking:
user_prompt = "How is intelligence measured?"
chunk_spans = retrieve_rag_context(query=user_prompt, num_chunks=5, config=my_config)
# Append a RAG instruction based on the user prompt and context to the message history:
messages = [] # Or start with an existing message history.
messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))
# Stream the RAG response:
stream = rag(messages, config=my_config)
for update in stream:
    print(update, end="")
# Access the documents cited in the RAG response:
chunks = [chunk for chunk_span in chunk_spans for chunk in chunk_span.chunks]
documents = [chunk_span.document for chunk_span in chunk_spans]
The problem
The current implementations of rag and async_rag return neither the chunk_id nor the document_id. This prevents creating proper citation sources in the response.
Solution
_contexts and retrieve_segments should return the (original) chunk_ids used for composing the segments and the document_ids instead of a list of strings.
A possible solution would be to return tuples (document_id, segment_str).
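To make the tuple idea concrete, here is a minimal sketch rather than RAGLite's actual implementation. The function name `segments_with_document_ids` and the input shape (rows of `(document_id, chunk_index, chunk_text)`) are assumptions made for illustration only. It merges contiguous chunks of the same document and returns `(document_id, segment_str)` tuples instead of bare strings.

```python
from itertools import groupby


def segments_with_document_ids(
    chunks: list[tuple[str, int, str]],
) -> list[tuple[str, str]]:
    """Merge contiguous chunks per document into (document_id, segment_str) tuples.

    Each input row is a hypothetical (document_id, chunk_index, chunk_text) triple.
    """
    segments: list[tuple[str, str]] = []
    # Sort by document, then by position, so contiguous chunks sit next to each other.
    ordered = sorted(chunks, key=lambda c: (c[0], c[1]))
    for document_id, doc_chunks in groupby(ordered, key=lambda c: c[0]):
        run: list[str] = []
        previous_index = None
        for _, index, text in doc_chunks:
            # A gap in the chunk index closes the current segment.
            if previous_index is not None and index != previous_index + 1:
                segments.append((document_id, "\n".join(run)))
                run = []
            run.append(text)
            previous_index = index
        segments.append((document_id, "\n".join(run)))
    return segments


retrieved = [("doc-a", 3, "C"), ("doc-a", 2, "B"), ("doc-b", 0, "X"), ("doc-a", 7, "H")]
result = segments_with_document_ids(retrieved)
```

Note that this sketch orders segments by document rather than by relevance, and it drops the chunk ids, which is one reason a dedicated segment type may be preferable.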
Maybe a better solution would be to create a proper type for Segment similar to Chunk.
Some considerations
We don't want to give as sources all the available segments, just the ones that the model decided to use. Also, we can't just take the list of original chunk_ids and document_ids and zip them with the segments, because the retrieve_segments method merges contiguous chunks, resulting in a many-to-one mapping between chunks and segments that we cannot reverse. In addition, providing the model with the document_id/chunk_id directly will potentially simplify the formatting of sources and allow other use cases (function calling using these ids).
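One way to keep the many-to-one mapping recoverable is to have each segment carry the ids of the chunks it was built from. The following is a rough sketch under stated assumptions, not RAGLite's API: `Chunk` is a simplified stand-in for the real model, and `Segment` and `merge_into_segments` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """Simplified stand-in for a stored chunk (not RAGLite's Chunk model)."""

    id: str
    document_id: str
    index: int  # Position of the chunk within its document.
    body: str


@dataclass
class Segment:
    """Hypothetical type: merged text plus the ids needed for citations."""

    document_id: str
    chunk_ids: list[str] = field(default_factory=list)
    text: str = ""


def merge_into_segments(chunks: list[Chunk]) -> list[Segment]:
    """Merge contiguous chunks of a document while recording their chunk ids."""
    segments: list[Segment] = []
    last_index: Optional[int] = None
    for chunk in sorted(chunks, key=lambda c: (c.document_id, c.index)):
        current = segments[-1] if segments else None
        contiguous = (
            current is not None
            and current.document_id == chunk.document_id
            and last_index is not None
            and chunk.index == last_index + 1
        )
        if contiguous:
            # Extend the open segment and remember which chunk contributed.
            current.chunk_ids.append(chunk.id)
            current.text += "\n" + chunk.body
        else:
            segments.append(Segment(chunk.document_id, [chunk.id], chunk.body))
        last_index = chunk.index
    return segments


chunks = [
    Chunk("c1", "doc-a", 0, "First."),
    Chunk("c2", "doc-a", 1, "Second."),
    Chunk("c9", "doc-b", 4, "Other."),
]
segments = merge_into_segments(chunks)
```

With a structure like this, a citation for any segment the model uses can be resolved back to its `document_id` and its original `chunk_ids` without reversing the merge.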