In this repository I'm adding the files I used to learn and create a RAG in python. It contains the initial python notebook, the streamlit app and finally the docker file to create an image and run the app as a docker container.

ace97/Extracting-Structured-Data-From-PDFs-using-RAG

Extracting Structured Data From PDFs using RAG

In this short project, I attempt to create a RAG application that can parse a PDF document (in this case, research papers) and extract its main information in a structured, tabular form.
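To make "structured, tabular form" concrete, here is a minimal sketch of turning a model's JSON answer into table rows. It is an illustration under stated assumptions, not the code used in this repository: the `PaperRecord` class, its columns (`title`, `authors`, `year`), and the helpers `parse_record` and `to_csv` are all hypothetical, and it assumes the model has been prompted to reply with a single JSON object.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class PaperRecord:
    # Hypothetical columns; the real app may extract different fields.
    title: str
    authors: str
    year: str


def parse_record(llm_json: str) -> PaperRecord:
    """Parse a JSON object returned by the model into one table row."""
    data = json.loads(llm_json)
    # Missing fields become empty strings instead of raising an error.
    values = {f.name: str(data.get(f.name, "")) for f in fields(PaperRecord)}
    return PaperRecord(**values)


def to_csv(records: list[PaperRecord]) -> str:
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(PaperRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Keeping the schema in one dataclass means the prompt, the parser, and the exported table can all be driven from the same list of field names.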

Project Description

I started off with a simple Jupyter notebook outlining a structured data retrieval system for unstructured sources (documents, PDFs, websites, etc.). I also wanted to test the capabilities and limitations of the newer lightweight LLMs. After going through multiple examples on YouTube, I found Thu Vu's video on creating an LLM-based RAG for structured data, which uses OpenAI's API key and encoders, to be the best suited to my needs. I personally wanted to try Google's Gemini models, so this project uses Gemini 1.5 Flash.

More than the metrics and functionality of the LLM, I mainly wanted to gauge how quickly one can move from an idea to a notebook to an MVP (minimum viable product) from scratch. Apart from understanding Streamlit's session state and how to use it to pass the API key from either user input or the .env file, the Streamlit app was fairly straightforward to code. Although the app could live in a single Python file, I found that keeping the app's frontend separate from the underlying functions was a cleaner and easier way to manage and debug the code.

Requirements

Python 3.11 or higher. You will need Docker to run the app as a container.

Steps to run the app as a container

Go to the app folder in a terminal.

cd ../app

Build the image using the Dockerfile.

docker build -t streamlit-app .

Run the image as a container, passing your API key as an environment variable.

docker run -e API_KEY="your-secret-key" -p 8501:8501 streamlit-app

In streamlit_app.py, the API key is read from the .env file or from user input:

import base64
import os

import streamlit as st
from dotenv import load_dotenv

# Load variables from a local .env file (if present) before reading API_KEY
load_dotenv()
API_KEY = os.getenv("API_KEY")

from functions import *  # helper functions kept separate from the frontend

# Seed the session state with the key from the environment if it is not set yet
if "api_key" not in st.session_state:
    st.session_state.api_key = API_KEY
.....
.....

# Let the user enter or override the key; the widget is bound to st.session_state.api_key
st.text_input('API key', type='password', key='api_key',
              label_visibility="collapsed", disabled=False)

You can also provide the API key at runtime. (I personally found this buggy, so I kept both options, each as a backup for the other.)
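One way to let the two sources back each other up is to resolve the key in a single helper: prefer a non-empty value typed by the user, otherwise fall back to the environment. The sketch below is only an illustration, not the app's actual code. `resolve_api_key` and its parameters are hypothetical names, and it assumes the variable is called `API_KEY`, as in the `docker run` command above.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    user_input: Optional[str],
    env: Mapping[str, str] = os.environ,
    var_name: str = "API_KEY",
) -> Optional[str]:
    """Return the key typed by the user, else the one from the environment.

    Hypothetical helper: blank or whitespace-only values count as missing.
    """
    for candidate in (user_input, env.get(var_name)):
        if candidate and candidate.strip():
            return candidate.strip()
    # Neither source provided a usable key.
    return None
```

In the Streamlit app this could be called as `resolve_api_key(st.session_state.get("api_key"))` just before the model client is created, so an empty text box never overrides a key supplied through `.env` or `docker run -e`.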

License

MIT
