This repository contains a question-answering system for Lex Fridman Podcast transcripts, built with Retrieval-Augmented Generation (RAG) and Hypothetical Document Embeddings (HyDE).
The goal is an AI-based question-answering system that provides accurate, contextually relevant answers based on the content of the podcast transcripts. The system should:
- Retrieve relevant passages from the podcast transcripts based on the user's questions.
- Generate precise and contextually appropriate answers using the retrieved information.
- Enhance the quality of retrieved passages and generated answers by using HyDE to improve the embedding representations.
- HyDE Involvement: Utilizes a large language model (LLM) to generate a "hypothetical document" that captures the essence of the user's query.
- Embedding and Retrieval: The hypothetical document is encoded into a vector representation and used to search the document embedding space of the podcast transcripts, retrieving passages most similar to the hypothetical document.
- Using Retrieved Passages: The retrieved passages from the transcripts are fed into the Generator component of RAG, another LLM.
- Formulating Answers: The LLM analyzes the retrieved passages and the user's original question to formulate a precise and contextually relevant answer.
- Traditional vs. HyDE Retrieval: Traditional retrieval methods might struggle to capture the full context of the question; HyDE's hypothetical documents provide a more nuanced representation of the needed information (see the sketch below).
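The retrieval flow described above can be sketched as follows. This is a conceptual sketch only; the function and variable names are illustrative, not taken from this repository:

```python
# Conceptual HyDE retrieval sketch (names are illustrative, not from this repo).
def hyde_retrieve(query, llm, embedder, vector_store, k=4):
    # 1. An LLM drafts a hypothetical document that "answers" the query.
    hypothetical_doc = llm.invoke(f"Write a passage that answers: {query}")
    # 2. The hypothetical document is embedded instead of the raw query.
    vector = embedder.embed_query(hypothetical_doc)
    # 3. Transcript passages nearest to that vector are retrieved.
    return vector_store.similarity_search_by_vector(vector, k=k)
```

Because the hypothetical document is phrased like an answer rather than a question, its embedding tends to land closer to relevant transcript passages than the raw query's embedding would.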
Ensure that you have the following installed on your machine:
- Python 3.x
- Git
- Clone the Repository
  ```bash
  git clone https://github.com/Darshanroy/-Lex-Fridman-Podcast-Transcript-RAG-ChatBot.git
  ```
- Navigate to the Project Directory
  ```bash
  cd -Lex-Fridman-Podcast-Transcript-RAG-ChatBot
  ```
- Install the Required Dependencies
  ```bash
  pip install -r requirements.txt
  ```
- Run the Streamlit App
  ```bash
  streamlit run streamlit-app.py
  ```
- Initialize Document Embedding: Click the "Initialize Document Embedding" button. This loads the documents and creates the necessary embeddings; a message indicates when the vector store database is ready.
- Enter Your Question: Type your question in the text box labeled "Enter your question based on the documents", then click the "Submit Question" button.
- Response Time: The time taken to process the question is displayed.
- Answer: The answer generated by the model is shown below the response time.
- Document Similarity Search: Expand the "Document Similarity Search" section to view the content of the documents most relevant to your question.
The necessary libraries are imported: Streamlit, dotenv, and the Langchain components. Environment variables (such as API tokens) are loaded with dotenv, and the Streamlit app's title is set to "Lex Fridman Podcast Transcript question-answering system using RAG and HYDE techniques". A sketch of this setup follows.
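A minimal sketch of the app setup, assuming the tokens live in a local .env file (the exact variable names in the repo are not confirmed):

```python
import streamlit as st
from dotenv import load_dotenv

# Load API tokens (e.g., a Hugging Face token) from a local .env file.
load_dotenv()

st.title(
    "Lex Fridman Podcast Transcript question-answering system "
    "using RAG and HYDE techniques"
)
```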
- `HuggingFaceEndpointEmbeddings` are loaded for sentence embedding using the `sentence-transformers/all-MiniLM-L6-v2` model.
- Two `HuggingFaceEndpoint` instances are created:
  - `mistral_llm`: for text generation using the `mistralai/Mistral-7B-Instruct-v0.3` model.
  - `mistral_hyde_embeddings`: for generating hypothetical document embeddings using `mistral_llm` and the loaded sentence embeddings with the `web_search` prompt.
- `qa_prompt_template` defines the question-answering prompt provided to `mistral_llm`; it includes the retrieved context and the user's question. A sketch of this setup follows.
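A minimal sketch of the model setup, using the standard `langchain_huggingface` and `HypotheticalDocumentEmbedder` APIs; the `temperature` value is an illustrative assumption:

```python
from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEndpointEmbeddings
from langchain.chains import HypotheticalDocumentEmbedder

# Sentence embeddings that back the vector store.
base_embeddings = HuggingFaceEndpointEmbeddings(
    model="sentence-transformers/all-MiniLM-L6-v2"
)

# LLM used both for answer generation and for drafting hypothetical documents.
mistral_llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    temperature=0.5,  # assumed value; the repo may use different settings
)

# HyDE wrapper: the LLM writes a hypothetical document (via the built-in
# "web_search" prompt), which is then embedded instead of the raw query.
mistral_hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    mistral_llm, base_embeddings, "web_search"
)
```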
This function initializes the vector store database:
- Stores the `mistral_hyde_embeddings` in the session state.
- Loads documents using `UnstructuredCSVLoader` from `docs/Lex/podcastdata_dataset.csv`.
- Splits documents using `RecursiveCharacterTextSplitter` for efficient processing.
- Creates a Chroma vector store from the first 10,000 split documents and the `mistral_hyde_embeddings`.

A sketch of this step follows.
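A sketch of the initialization step, assuming illustrative chunk sizes (the repo's actual values are not confirmed) and reusing `mistral_hyde_embeddings` from the setup sketch above:

```python
import streamlit as st
from langchain_community.document_loaders import UnstructuredCSVLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

def initialize_vector_store():
    # Keep the HyDE embedder around for later calls.
    st.session_state.embeddings = mistral_hyde_embeddings
    # Load the transcript rows as documents.
    docs = UnstructuredCSVLoader("docs/Lex/podcastdata_dataset.csv").load()
    # Split into overlapping chunks; sizes here are assumed, not from the repo.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)
    # Index only the first 10,000 chunks, as described above.
    st.session_state.vectors = Chroma.from_documents(
        chunks[:10000], st.session_state.embeddings
    )
```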
- `user_question` is a text input field where the user enters their question.
- `contextualize_q_prompt` reformulates the user's question, given the chat history, into a standalone question that can be understood without that history.
- `get_chat_session_history` retrieves the chat history for the current session (identified by `session_id`); a sketch follows.
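A sketch of the per-session history lookup, assuming an in-memory store kept in Streamlit session state (the repo may store it differently):

```python
import streamlit as st
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

def get_chat_session_history(session_id: str) -> BaseChatMessageHistory:
    # One ChatMessageHistory per session_id, created lazily.
    if "store" not in st.session_state:
        st.session_state.store = {}
    if session_id not in st.session_state.store:
        st.session_state.store[session_id] = ChatMessageHistory()
    return st.session_state.store[session_id]
```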
The loop continues until the user enters 'q' to quit. Inside the loop, if the user enters a question (other than 'q'):
- A question-answer chain (`question_answer_chain`) is created using `create_stuff_documents_chain` with the `mistral_llm` and the `qa_prompt_template`.
- A history-aware retriever (`history_aware_retriever`) is created using `create_history_aware_retriever`. This retriever leverages the `mistral_llm` to contextualize the user's question against the document embeddings.
- A retrieval chain (`retrieval_chain`) is created using `create_retrieval_chain` to combine the history-aware retriever and the question-answer chain.
- A conversational RAG chain (`conversational_rag_chain`) is created using `RunnableWithMessageHistory`. This chain manages the chat history and uses the retrieval chain to answer questions based on the documents.
- Response time is measured using `time.process_time` before invoking the `conversational_rag_chain` with the user's question and a session ID.
- The retrieved answer is displayed along with the response time.
- An expander section allows viewing the document similarity search results (the top retrieved documents for the user's question). A sketch of this chain assembly follows.
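A sketch of the chain assembly, reusing the objects from the earlier sketches; the message keys and session ID are assumptions, not confirmed by the repo:

```python
import time
import streamlit as st
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.runnables.history import RunnableWithMessageHistory

# Retriever over the Chroma store built during initialization.
retriever = st.session_state.vectors.as_retriever()

# Stuff retrieved documents into the QA prompt and answer with the LLM.
question_answer_chain = create_stuff_documents_chain(mistral_llm, qa_prompt_template)

# Rewrite the question into standalone form using the chat history.
history_aware_retriever = create_history_aware_retriever(
    mistral_llm, retriever, contextualize_q_prompt
)

retrieval_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

# Wire in per-session chat history.
conversational_rag_chain = RunnableWithMessageHistory(
    retrieval_chain,
    get_chat_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

start = time.process_time()
response = conversational_rag_chain.invoke(
    {"input": user_question},
    config={"configurable": {"session_id": "default"}},  # session ID is illustrative
)
st.write(f"Response time: {time.process_time() - start:.2f}s")
st.write(response["answer"])
```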
- Dataset: Lex Fridman Podcast Transcript (Kaggle)
- Calculated basic context lengths
- Removed unusual symbols and outliers
- Initial exploration: LlamaIndex (LLM), Hugging Face (LLM), Langchain (other operations)
- Challenges: Loading models from different sources, slow Ollama embeddings, Google Generative AI embeddings struggling with large document volumes
- Solution: Hugging Face Langchain library, Sentence Transformers for embedding
- Used Unstructured CSV loader for efficiency
- FAISS vs. Chroma
- Selection: FAISS for dense vector similarity search
- Explored Parent Document Retriever and Ensemble Retriever
- Selection: Standard Retriever for practicality
- Used unlabeled evaluation metrics within Langchain
- Pairwise string and embedding evaluations to identify the best model pair (sketched below)
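A sketch of pairwise evaluation with Langchain's unlabeled evaluators; the compared answers and question are illustrative placeholders, and `base_embeddings`/`mistral_llm` come from the earlier sketches:

```python
from langchain.evaluation import load_evaluator

# Embedding-distance evaluator: smaller distance means more similar answers.
embedding_eval = load_evaluator(
    "pairwise_embedding_distance", embeddings=base_embeddings
)

# LLM-judged pairwise comparison of two candidate answers.
string_eval = load_evaluator("pairwise_string", llm=mistral_llm)

distance = embedding_eval.evaluate_string_pairs(
    prediction="Answer from model A",
    prediction_b="Answer from model B",
)
print(distance["score"])

verdict = string_eval.evaluate_string_pairs(
    prediction="Answer from model A",
    prediction_b="Answer from model B",
    input="What does the transcript say about AGI timelines?",
)
print(verdict)
```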
- To stop the application, close the terminal or command prompt window where the Streamlit app is running.
- To ask a new question, simply enter it in the text box and click the "Submit Question" button again.
By following these steps, you can effectively interact with the Lex Fridman Podcast Transcript RAG ChatBot and obtain answers to your questions based on the provided podcast transcripts.