A Streamlit RAG Chatbot for querying PDF documents using a structure-aware chunking approach.
The PDF Indexer uses the LLMSherpa API internally to parse the PDF document. The main sections obtained from parsing are split recursively into subsections that fit within a 2048-character chunk size. Because entire sections are returned rather than arbitrary slices of text, the hierarchical structure of the document is preserved. The resulting text chunks are used to build a LlamaIndex query engine on top of an in-memory VectorStoreIndex.
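The recursive splitting step can be sketched as follows. This is a simplified illustration of the idea only, not the project's actual code: the function name and the paragraph-boundary heuristic are assumptions.

```python
def split_section(text: str, chunk_size: int = 2048) -> list[str]:
    """Recursively split a section's text into chunks of at most chunk_size.

    Sections that already fit are returned whole, which is what preserves
    the document's hierarchical structure; oversized sections are halved,
    preferring a paragraph break near the midpoint.
    """
    if len(text) <= chunk_size:
        return [text]
    # Prefer splitting at a paragraph boundary near the midpoint.
    mid = len(text) // 2
    split_at = text.rfind("\n\n", 0, mid)
    if split_at <= 0:
        split_at = mid  # no usable boundary: fall back to a hard split
    left, right = text[:split_at], text[split_at:]
    return split_section(left, chunk_size) + split_section(right, chunk_size)
```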
The Streamlit Chatbot allows users to:
- Input an OpenAI API key
- Upload a PDF document
- Ask questions about the document
- Install the nlm-ingestor server: follow the instructions at https://github.com/nlmatics/nlm-ingestor. The local `llmsherpa_url` will be `http://localhost:5001/api/parseDocument?renderFormat=all`
- Clone the repository:
  ```
  git clone https://github.com/IoanaDragan/rag-insight
  ```
- Install dependencies:
  ```
  pip install -r requirements.txt
  ```
- (Optional) Set up environment variables: create a `.env` file and add your OpenAI API key:
  ```
  OPENAI_API_KEY=your_api_key_here
  ```
- Launch the Streamlit server:
  ```
  cd rag-insight
  streamlit run rag-chatbot.py
  ```
- Access the app at `http://localhost:8501` in your browser
The `index_pdf` method of the `PDFIndexer` accepts the following parameters:

| Parameter | Description | Default |
|---|---|---|
| `chunk_size` | Size of document chunks | 2048 |
| `first_n_chunks` | Number of chunks to index (for testing) | None (all) |
| `add_summary` | Add chunk summaries as metadata | False |
| `retrieve_top_k` | Number of similar documents to retrieve per query | 2 |
| `similarity_threshold` | Minimum similarity score for retrieved documents | 0.8 |
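To illustrate how `retrieve_top_k` and `similarity_threshold` interact at query time, here is a minimal, self-contained sketch. It is not the project's actual retrieval code: the `filter_by_similarity` helper and the `(text, score)` tuple format are assumptions for demonstration.

```python
def filter_by_similarity(results, similarity_threshold=0.8, retrieve_top_k=2):
    """Keep at most retrieve_top_k results whose similarity score meets
    similarity_threshold, best matches first.

    `results` is a list of (text, score) tuples, as a vector retriever
    might return them.
    """
    # Drop everything below the similarity cutoff.
    kept = [r for r in results if r[1] >= similarity_threshold]
    # Sort by score, highest first, and keep only the top k.
    kept.sort(key=lambda r: r[1], reverse=True)
    return kept[:retrieve_top_k]
```

With the defaults above, a query returns at most 2 chunks, and a chunk scoring below 0.8 is never returned, even if fewer than 2 chunks pass the cutoff.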
Distributed under the MIT License. See LICENSE for more information.