This repository contains a question-answering system for Lex Fridman Podcast transcripts, built with Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE).
The goal is an AI-based question-answering system that provides accurate, contextually relevant answers based on the content of the podcast transcripts. The system should:
- Retrieve relevant passages from the podcast transcripts based on the user's questions.
- Generate precise and contextually appropriate answers using the retrieved information.
- Enhance the quality of retrieved passages and generated answers by using HyDE to improve the embedding representations.
- HyDE Involvement: Utilizes a large language model (LLM) to generate a "hypothetical document" that captures the essence of the user's query.
- Embedding and Retrieval: The hypothetical document is encoded into a vector representation and used to search the document embedding space of the podcast transcripts, retrieving passages most similar to the hypothetical document.
- Using Retrieved Passages: The retrieved passages from the transcripts are fed into the Generator component of RAG, another LLM.
- Formulating Answers: The LLM analyzes the retrieved passages and the user's original question to formulate a precise and contextually relevant answer.
- Traditional vs. HyDE Retrieval: Traditional retrieval methods might struggle to capture the full context of the question; HyDE's hypothetical documents provide a more nuanced representation of the needed information (see the sketch below).
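The retrieval flow described above can be sketched as follows. This is a conceptual sketch only; the function and variable names are illustrative, not taken from this repository:

```python
# Conceptual HyDE retrieval sketch (names are illustrative, not from this repo).
def hyde_retrieve(query, llm, embedder, vector_store, k=4):
    # 1. An LLM drafts a hypothetical document that "answers" the query.
    hypothetical_doc = llm.invoke(f"Write a passage that answers: {query}")
    # 2. The hypothetical document is embedded instead of the raw query.
    vector = embedder.embed_query(hypothetical_doc)
    # 3. Transcript passages nearest to that vector are retrieved.
    return vector_store.similarity_search_by_vector(vector, k=k)
```

Because the hypothetical document is phrased like an answer rather than a question, its embedding tends to land closer to relevant transcript passages than the raw query's embedding would.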
Ensure that you have the following installed on your machine:
- Python 3.x
- Git
- Clone the Repository
  ```bash
  git clone https://github.com/Darshanroy/-Lex-Fridman-Podcast-Transcript-RAG-ChatBot.git
  ```
- Navigate to the Project Directory
  ```bash
  cd -Lex-Fridman-Podcast-Transcript-RAG-ChatBot
  ```
- Install the Required Dependencies
  ```bash
  pip install -r requirements.txt
  ```
- Run the Streamlit App
  ```bash
  streamlit run streamlit-app.py
  ```
- Initialize Document Embedding: Click the "Initialize Document Embedding" button. This loads the documents and creates the necessary embeddings; a message indicates when the vector store database is ready.
- Enter Your Question: Type your question in the text box labeled "Enter your question based on the documents", then click the "Submit Question" button.
- Response Time: The time taken to process the question is displayed.
- Answer: The answer generated by the model is shown below the response time.
- Document Similarity Search: Expand the "Document Similarity Search" section to view the content of the documents most relevant to your question.
The necessary libraries are imported: Streamlit, dotenv, and the Langchain components. Environment variables (such as API tokens) are loaded with dotenv, and the Streamlit app's title is set to "Lex Fridman Podcast Transcript question-answering system using RAG and HYDE techniques". A sketch of this setup follows.
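A minimal sketch of the app setup, assuming the tokens live in a local .env file (the exact variable names in the repo are not confirmed):

```python
import streamlit as st
from dotenv import load_dotenv

# Load API tokens (e.g., a Hugging Face token) from a local .env file.
load_dotenv()

st.title(
    "Lex Fridman Podcast Transcript question-answering system "
    "using RAG and HYDE techniques"
)
```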
- `HuggingFaceEndpointEmbeddings` are loaded for sentence embedding using the `sentence-transformers/all-MiniLM-L6-v2` model.
- Two `HuggingFaceEndpoint` instances are created:
  - `mistral_llm`: for text generation using the `mistralai/Mistral-7B-Instruct-v0.3` model.
  - `mistral_hyde_embeddings`: for generating hypothetical document embeddings using `mistral_llm` and the loaded sentence embeddings with the `web_search` prompt.
- `qa_prompt_template` defines the question-answering prompt provided to `mistral_llm`; it includes the retrieved context and the user's question. A sketch of this setup follows.
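A minimal sketch of the model setup, using the standard `langchain_huggingface` and `HypotheticalDocumentEmbedder` APIs; the `temperature` value is an illustrative assumption:

```python
from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEndpointEmbeddings
from langchain.chains import HypotheticalDocumentEmbedder

# Sentence embeddings that back the vector store.
base_embeddings = HuggingFaceEndpointEmbeddings(
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# LLM used both for answer generation and for drafting hypothetical documents.
mistral_llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    temperature=0.5,  # assumed value; the repo may use different settings
)

# HyDE wrapper: the LLM writes a hypothetical document (via the built-in
# "web_search" prompt), which is then embedded instead of the raw query.
mistral_hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    mistral_llm, base_embeddings, "web_search"
)
```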
This function initializes the vector store database:
- Stores the `mistral_hyde_embeddings` in the session state.
- Loads documents using `UnstructuredCSVLoader` from `docs/Lex/podcastdata_dataset.csv`.
- Splits documents using `RecursiveCharacterTextSplitter` for efficient processing.
- Creates a Chroma vector store from the first 10,000 split documents and the `mistral_hyde_embeddings`.

A sketch of this step follows.
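A sketch of the initialization step, assuming illustrative chunk sizes (the repo's actual values are not confirmed) and reusing `mistral_hyde_embeddings` from the setup sketch above:

```python
import streamlit as st
from langchain_community.document_loaders import UnstructuredCSVLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

def initialize_vector_store():
    # Keep the HyDE embedder around for later calls.
    st.session_state.embeddings = mistral_hyde_embeddings
    # Load the transcript rows as documents.
    docs = UnstructuredCSVLoader("docs/Lex/podcastdata_dataset.csv").load()
    # Split into overlapping chunks; sizes here are assumed, not from the repo.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)
    # Index only the first 10,000 chunks, as described above.
    st.session_state.vectors = Chroma.from_documents(
        chunks[:10000], st.session_state.embeddings
    )
```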
- `user_question` is a text input field where the user enters their question.
- `contextualize_q_prompt` reformulates the user's question, given the chat history, into a standalone question that can be understood without that history.
- `get_chat_session_history` retrieves the chat history for the current session (identified by `session_id`); a sketch follows.
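A sketch of the per-session history lookup, assuming an in-memory store kept in Streamlit session state (the repo may store it differently):

```python
import streamlit as st
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

def get_chat_session_history(session_id: str) -> BaseChatMessageHistory:
    # One ChatMessageHistory per session_id, created lazily.
    if "store" not in st.session_state:
        st.session_state.store = {}
    if session_id not in st.session_state.store:
        st.session_state.store[session_id] = ChatMessageHistory()
    return st.session_state.store[session_id]
```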
The loop continues until the user enters 'q' to quit. Inside the loop, if the user enters a question (other than 'q'):
- A question-answer chain (`question_answer_chain`) is created using `create_stuff_documents_chain` with the `mistral_llm` and the `qa_prompt_template`.
- A history-aware retriever (`history_aware_retriever`) is created using `create_history_aware_retriever`. This retriever leverages the `mistral_llm` to contextualize the user's question against the document embeddings.
- A retrieval chain (`retrieval_chain`) is created using `create_retrieval_chain` to combine the history-aware retriever and the question-answer chain.
- A conversational RAG chain (`conversational_rag_chain`) is created using `RunnableWithMessageHistory`. This chain manages the chat history and uses the retrieval chain to answer questions based on the documents.
- Response time is measured using `time.process_time` before invoking the `conversational_rag_chain` with the user's question and a session ID.
- The retrieved answer is displayed along with the response time.
- An expander section allows viewing the document similarity search results (the top retrieved documents for the user's question). A sketch of this chain assembly follows.
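A sketch of the chain assembly, reusing the objects from the earlier sketches; the message keys and session ID are assumptions, not confirmed by the repo:

```python
import time
import streamlit as st
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.runnables.history import RunnableWithMessageHistory

# Retriever over the Chroma store built during initialization.
retriever = st.session_state.vectors.as_retriever()

# Stuff retrieved documents into the QA prompt and answer with the LLM.
question_answer_chain = create_stuff_documents_chain(mistral_llm, qa_prompt_template)

# Rewrite the question into standalone form using the chat history.
history_aware_retriever = create_history_aware_retriever(
    mistral_llm, retriever, contextualize_q_prompt
)

retrieval_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

# Wire in per-session chat history.
conversational_rag_chain = RunnableWithMessageHistory(
    retrieval_chain,
    get_chat_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

start = time.process_time()
response = conversational_rag_chain.invoke(
    {"input": user_question},
    config={"configurable": {"session_id": "default"}},  # session ID is illustrative
)
st.write(f"Response time: {time.process_time() - start:.2f}s")
st.write(response["answer"])
```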
- Dataset: Lex Fridman Podcast Transcript (Kaggle)
- Calculated basic context lengths
- Removed unusual symbols and outliers
- Initial exploration: LlamaIndex (LLM), Hugging Face (LLM), Langchain (other operations)
- Challenges: Loading models from different sources, slow Ollama embeddings, Google Generative AI embeddings struggling with large document volumes
- Solution: Hugging Face Langchain library, Sentence Transformers for embedding
- Used Unstructured CSV loader for efficiency
- FAISS vs. Chroma
- Selection: FAISS for dense vector similarity search
- Explored Parent Document Retriever and Ensemble Retriever
- Selection: Standard Retriever for practicality
- Used unlabeled evaluation metrics within Langchain
- Pairwise string and embedding evaluations to identify the best model pair (sketched below)
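A sketch of pairwise evaluation with Langchain's unlabeled evaluators; the compared answers and question are illustrative placeholders, and `base_embeddings`/`mistral_llm` come from the earlier sketches:

```python
from langchain.evaluation import load_evaluator

# Embedding-distance evaluator: smaller distance means more similar answers.
embedding_eval = load_evaluator(
    "pairwise_embedding_distance", embeddings=base_embeddings
)

# LLM-judged pairwise comparison of two candidate answers.
string_eval = load_evaluator("pairwise_string", llm=mistral_llm)

distance = embedding_eval.evaluate_string_pairs(
    prediction="Answer from model A",
    prediction_b="Answer from model B",
)
print(distance["score"])

verdict = string_eval.evaluate_string_pairs(
    prediction="Answer from model A",
    prediction_b="Answer from model B",
    input="What does the transcript say about AGI timelines?",
)
print(verdict)
```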
- To stop the application, close the terminal or command prompt window where the Streamlit app is running.
- To ask a new question, simply enter it in the text box and click the "Submit Question" button again.
By following these steps, you can effectively interact with the Lex Fridman Podcast Transcript RAG ChatBot and obtain answers to your questions based on the provided podcast transcripts.