chunk_id and document_id not accessible #50

undo76 · 2024-11-23T14:28:30Z

The problem

The current implementations of rag and async_rag return neither the chunk_id nor the document_id. This prevents creating proper citation sources in the response.

Solution

Instead of a list of strings, _contexts and retrieve_segments should return the (original) chunk_ids used to compose each segment, together with the corresponding document_ids.

A possible solution would be to return tuples of (document_id, segment_str):

    # Convert the segments into tuples of (document_id, segment_text)
    segments_with_ids = [
        (
            segment[0].document_id,  # Get document_id from first chunk in segment
            segment[0].headings.strip() + "\n\n" + "".join(chunk.body for chunk in segment).strip()
        )
        for segment in segments
    ]

Maybe a better solution would be to create a proper type for Segment similar to Chunk.
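
To make the idea of a dedicated type concrete, here is a minimal sketch. Both ChunkStub and Segment are hypothetical names invented for illustration, not RAGLite's actual classes. The stub carries only the fields used in the snippet above, plus an id field that is assumed to exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkStub:
    """Hypothetical stand-in for a chunk, with only the fields used here."""

    id: str
    document_id: str
    headings: str
    body: str


@dataclass(frozen=True)
class Segment:
    """Hypothetical container for a run of contiguous chunks from one document."""

    chunks: tuple[ChunkStub, ...]

    @property
    def document_id(self) -> str:
        # All chunks in a segment are assumed to share one document.
        return self.chunks[0].document_id

    @property
    def chunk_ids(self) -> list[str]:
        return [chunk.id for chunk in self.chunks]

    def __str__(self) -> str:
        # Same text layout as the tuple-based snippet above.
        heading = self.chunks[0].headings.strip()
        body = "".join(chunk.body for chunk in self.chunks).strip()
        return heading + "\n\n" + body


example_segment = Segment(
    chunks=(
        ChunkStub("c1", "d1", "# Title\n", "Hello "),
        ChunkStub("c2", "d1", "# Title\n", "world.\n"),
    )
)
```

With a type like this, str(segment) still yields the text passed to the model, while segment.document_id and segment.chunk_ids stay available for building citations.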

Some considerations

We don't want to list every available segment as a source, just the ones that the model decided to use. Also, we can't simply zip the list of original chunk_ids and document_ids with the segments, because the retrieve_segments method merges contiguous chunks. This results in a many-to-one mapping between chunks and segments that we cannot reverse. In addition, providing the model with the document_id/chunk_id directly could simplify the formatting of sources and enable other use cases (such as function calling with these ids).
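
The many-to-one mapping can be illustrated with a small sketch. It assumes, hypothetically, that each chunk records its position within its document. ChunkRef and group_contiguous are illustrative names and not part of RAGLite. Keeping each group as a list of chunk records rather than a joined string preserves the ids. Note that this sketch orders the groups by document and position, not by relevance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical minimal chunk record: id, document, position in the document."""

    id: str
    document_id: str
    index: int


def group_contiguous(chunks: list[ChunkRef]) -> list[list[ChunkRef]]:
    """Group chunks into runs that share a document and have consecutive indices."""
    ordered = sorted(chunks, key=lambda c: (c.document_id, c.index))
    groups: list[list[ChunkRef]] = []
    for chunk in ordered:
        last = groups[-1][-1] if groups else None
        if (
            last is not None
            and last.document_id == chunk.document_id
            and chunk.index == last.index + 1
        ):
            groups[-1].append(chunk)  # Extends the current run.
        else:
            groups.append([chunk])  # Starts a new run.
    return groups


retrieved = [
    ChunkRef("a-5", "doc-a", 5),
    ChunkRef("a-0", "doc-a", 0),
    ChunkRef("b-2", "doc-b", 2),
    ChunkRef("a-1", "doc-a", 1),
]
grouped = group_contiguous(retrieved)
segment_sources = [(group[0].document_id, [c.id for c in group]) for group in grouped]
```

Here four chunks collapse into three groups, and each group still knows which document and which chunk ids it came from.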


lsorber · 2024-11-25T08:56:16Z

Hi @undo76, thanks for submitting this issue!

It's actually already possible to get access to the RAG sources as follows:

    from raglite import RAGLiteConfig, hybrid_search, retrieve_chunks, rerank_chunks, rag

    # Configure RAGLite (here, we use the default config):
    my_config = RAGLiteConfig()

    # Search for chunks:
    prompt = "What does it mean for two events to be simultaneous?"
    chunk_ids_hybrid, _ = hybrid_search(prompt, num_results=20, config=my_config)

    # Retrieve the chunks for the returned ids (assumes retrieve_chunks, as in the README at the time):
    chunks_hybrid = retrieve_chunks(chunk_ids_hybrid, config=my_config)

    # Rerank chunks:
    chunks_reranked = rerank_chunks(prompt, chunks_hybrid, config=my_config)

    # Pass the retrieved chunks as context for RAG:
    stream = rag(prompt, search=chunks_reranked, config=my_config)

The chunks_reranked list contains a lot of information on the sources, but if you need more information about the underlying document you could do this:

    # Access the RAG sources:
    from raglite._database import create_database_engine
    from sqlmodel import Session

    with Session(create_database_engine()) as session:
        # Reattach the chunks to a Session so that their documents can be loaded.
        chunks_reranked = [session.merge(chunk) for chunk in chunks_reranked]
        documents = [chunk.document for chunk in chunks_reranked]

That said, this API certainly isn't perfect yet. What do you think about the following improvements?

1. We expose the _max_contexts method to compute the maximum number of Chunks that will fit in the LLM context, given the user prompt, system prompt, and message history.
2. The developer retrieves and reranks Chunks according to the example above.
3. The developer transforms the Chunks to segments with retrieve_segments (which expands Chunks with their neighbours and concatenates them into contiguous segments).
4. We modify rag and async_rag to accept segments.
5. We modify the rag and async_rag prompt to be able to reference segments by number (e.g., "According to [3], ...").
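
Referencing segments by number could look roughly like the sketch below. It is not RAGLite code: number_segments and cited_indices are hypothetical helpers, and the bracketed [n] label format is an assumption taken from the example above. The second helper recovers only the segments the model actually cited.

```python
import re


def number_segments(segments: list[str]) -> str:
    """Prefix each segment with a 1-based reference label such as [1]."""
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(segments, start=1))


def cited_indices(response: str, num_segments: int) -> list[int]:
    """Return the distinct, in-range segment numbers cited, in order of first use."""
    seen: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", response):
        number = int(match.group(1))
        if 1 <= number <= num_segments and number not in seen:
            seen.append(number)
    return seen


example_segments = [
    "Simultaneity is relative.",
    "Clocks are synchronised with light signals.",
    "Unrelated text.",
]
context = number_segments(example_segments)
example_response = "According to [2], clocks are synchronised; see also [1] and [2]."
cited = cited_indices(example_response, len(example_segments))
cited_segments = [example_segments[i - 1] for i in cited]
```

Out-of-range numbers are ignored, so a hallucinated reference such as [9] would not map to any segment.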

lsorber · 2024-12-03T18:34:31Z

The v0.3.0 release resulting from #52 fixes this. The README now documents the improved RAG pipeline API and how to access the source documents:

    from raglite import create_rag_instruction, rag, retrieve_rag_context

    # Retrieve relevant chunk spans with hybrid search and reranking:
    user_prompt = "How is intelligence measured?"
    chunk_spans = retrieve_rag_context(query=user_prompt, num_chunks=5, config=my_config)

    # Append a RAG instruction based on the user prompt and context to the message history:
    messages = []  # Or start with an existing message history.
    messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))

    # Stream the RAG response:
    stream = rag(messages, config=my_config)
    for update in stream:
        print(update, end="")

    # Access the documents cited in the RAG response:
    chunks = [chunk for chunk_span in chunk_spans for chunk in chunk_span.chunks]
    documents = [chunk_span.document for chunk_span in chunk_spans]
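
As a rough sketch of turning chunk spans into a citation table, the helper below groups chunk ids by document id. collect_sources is a hypothetical helper, and it assumes that documents and chunks each expose an id attribute. The .chunks and .document attributes are the ones used in the snippet above. The SimpleNamespace objects are stand-ins for illustration only.

```python
from types import SimpleNamespace


def collect_sources(chunk_spans) -> dict[str, list[str]]:
    """Map each document id to the ids of its retrieved chunks, in first-seen order."""
    sources: dict[str, list[str]] = {}
    for chunk_span in chunk_spans:
        chunk_ids = sources.setdefault(chunk_span.document.id, [])
        chunk_ids.extend(chunk.id for chunk in chunk_span.chunks)
    return sources


# Stand-in objects; real chunk spans would come from retrieve_rag_context.
doc_a = SimpleNamespace(id="doc-a")
doc_b = SimpleNamespace(id="doc-b")
example_spans = [
    SimpleNamespace(document=doc_a, chunks=[SimpleNamespace(id="a-0"), SimpleNamespace(id="a-1")]),
    SimpleNamespace(document=doc_b, chunks=[SimpleNamespace(id="b-2")]),
    SimpleNamespace(document=doc_a, chunks=[SimpleNamespace(id="a-5")]),
]
sources = collect_sources(example_spans)
```

Grouping by document keeps one entry per source even when several chunk spans come from the same document.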

undo76 mentioned this issue Nov 25, 2024

feat: support prompt caching and apply Anthropic's long-context prompt format #52

Merged

lsorber closed this as completed Dec 3, 2024
