feat: simplify document insertion (#6)
lsorber authored Aug 16, 2024
1 parent f9b92cf commit 3dd173a
Showing 2 changed files with 14 additions and 10 deletions.
15 changes: 7 additions & 8 deletions README.md
@@ -10,12 +10,12 @@ RAGLite is a Python package for Retrieval-Augmented Generation (RAG) with SQLite
2. 🔒 Fully local RAG with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) as an LLM provider and [SQLite](https://github.com/sqlite/sqlite) as a local database
3. 🚀 Acceleration with Metal on macOS and with CUDA on Linux and Windows
4. 📖 PDF to Markdown conversion on top of [pdftext](https://github.com/VikParuchuri/pdftext) and [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
-5. ✂️ Optimal [level 4 semantic chunking](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d)
+5. ✂️ Optimal [level 4 semantic chunking](https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d) by solving a [binary integer programming problem](https://en.wikipedia.org/wiki/Integer_programming)
6. 📌 Markdown-based [contextual chunk headings](https://d-star.ai/solving-the-out-of-context-chunk-problem-for-rag)
-7. 🌈 [Multi-vector chunk retrieval](https://python.langchain.com/v0.2/docs/how_to/multi_vector/)
-8. 🌀 Optimal closed-form linear query adapter by solving an [(orthogonal) Procrustes problem](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem)
+7. 🌈 Sub-chunk matching with [multi-vector chunk retrieval](https://python.langchain.com/v0.2/docs/how_to/multi_vector/)
+8. 🌀 Optimal [closed-form linear query adapter](src/raglite/_query_adapter.py) by solving an [orthogonal Procrustes problem](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem)
9. 🔍 [Hybrid search](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) that combines [SQLite's BM25 full-text search](https://sqlite.org/fts5.html) with [PyNNDescent's ANN vector search](https://github.com/lmcinnes/pynndescent)
-10. ✍️ Optional support for automatic conversion of any input document to Markdown with [Pandoc](https://github.com/jgm/pandoc)
+10. ✍️ Optional support for conversion of any input document to Markdown with [Pandoc](https://github.com/jgm/pandoc)

## Installing

@@ -49,11 +49,10 @@ my_config = RAGLiteConfig(db_url="sqlite:///raglite.sqlite")

# Index documents:
from pathlib import Path
-from raglite import insert_document, update_vector_index
+from raglite import insert_document

insert_document(Path("On the Measure of Intelligence.pdf"), config=my_config)
-insert_document(Path("Situational Awareness.pdf"), config=my_config)
-update_vector_index(config=my_config)
+insert_document(Path("Special Relativity.pdf"), config=my_config)

# Search for chunks:
from raglite import hybrid_search, keyword_search, vector_search
@@ -66,7 +65,7 @@ results_hybrid = hybrid_search(prompt, num_results=5, config=my_config)
# Answer questions with RAG:
from raglite import rag

-prompt = "What is a 'SkillProgram'?"
+prompt = "What does it mean for two events to be simultaneous?"
stream = rag(prompt, search=hybrid_search, config=my_config)
for update in stream:
    print(update, end="")
9 changes: 7 additions & 2 deletions src/raglite/_index.py
@@ -54,8 +54,10 @@ def _create_chunk_records(
    return chunk_records


-def insert_document(doc_path: Path, *, config: RAGLiteConfig | None = None) -> None:
-    """Insert a document into the database."""
+def insert_document(
+    doc_path: Path, *, update_index: bool = True, config: RAGLiteConfig | None = None
+) -> None:
+    """Insert a document into the database and update the index."""
    # Use the default config if not provided.
    config = config or RAGLiteConfig()
    # Preprocess the document into chunks.
@@ -96,6 +98,9 @@ def insert_document(doc_path: Path, *, config: RAGLiteConfig | None = None) -> N
continue
session.add(chunk_record)
session.commit()
+    # Update the vector search chunk index.
+    if update_index:
+        update_vector_index(config)


def update_vector_index(config: RAGLiteConfig | None = None) -> None:
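
The new `update_index` flag defaults to `True`, so each `insert_document` call now refreshes the vector index on its own. When loading many documents, one plausible pattern is to pass `update_index=False` and refresh once at the end. The sketch below illustrates only that control flow: `ToyStore` and the `store` parameter are hypothetical stand-ins for the real database and `config`, not part of RAGLite.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyStore:
    """Hypothetical in-memory stand-in for the RAGLite database and vector index."""

    documents: list[Path] = field(default_factory=list)
    indexed: int = 0  # Number of documents covered by the index.
    index_rebuilds: int = 0  # How many times the index was refreshed.


def update_vector_index(store: ToyStore) -> None:
    """Refresh the toy index so that it covers all stored documents."""
    store.indexed = len(store.documents)
    store.index_rebuilds += 1


def insert_document(doc_path: Path, *, update_index: bool = True, store: ToyStore) -> None:
    """Store a document and, by default, refresh the index right away."""
    store.documents.append(doc_path)
    if update_index:
        update_vector_index(store)


paths = [Path("a.pdf"), Path("b.pdf"), Path("c.pdf")]

# Default behaviour after this commit: one index refresh per inserted document.
eager = ToyStore()
for path in paths:
    insert_document(path, store=eager)

# Bulk load: defer the refresh and run it once after the last insert.
deferred = ToyStore()
for path in paths:
    insert_document(path, update_index=False, store=deferred)
update_vector_index(deferred)

print(eager.index_rebuilds, deferred.index_rebuilds)  # prints "3 1"
```

Both stores end up fully indexed; the deferred variant simply does the refresh once instead of once per document, which may matter if refreshing the real index is expensive.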