This Python script extracts text and tables from a PDF file.
- Python 3.x
- Place the PDF file you want to process in the specified location: `data/raw`.
- Update the `pdf_path` variable in the script with the path to your PDF file.
- Set the `output_folder` variable to the desired folder to save the extracted CSV files and text file: `data/processed`.
- Run the script: `python src/main.py`
- The script will generate a text file containing the extracted text from the PDF, saved in the specified output folder.
- Separate CSV files will be created for each table found in the PDF, named with the PDF's stem (filename without extension) and table number.
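
The output naming described above can be sketched with the standard library alone. This is a minimal illustration, not the script's actual code: the helper name `save_outputs`, the file name patterns `<stem>.txt` and `<stem>_table_<n>.csv`, and the shape of `tables` (a list of tables, each a list of rows) are all assumptions made for the example.

```python
import csv
from pathlib import Path


def save_outputs(pdf_path, output_folder, text, tables):
    """Write extracted text and tables, naming files after the PDF's stem.

    Hypothetical helper: the exact file name patterns are assumed.
    """
    stem = Path(pdf_path).stem  # "data/raw/report.pdf" -> "report"
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)

    # One text file holding all extracted text.
    text_file = out_dir / f"{stem}.txt"
    text_file.write_text(text, encoding="utf-8")

    # One CSV file per table, numbered from 1.
    csv_files = []
    for number, rows in enumerate(tables, start=1):
        csv_file = out_dir / f"{stem}_table_{number}.csv"
        with csv_file.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        csv_files.append(csv_file)

    return text_file, csv_files
```

For example, calling `save_outputs("data/raw/report.pdf", "data/processed", text, tables)` with two tables would, under these assumptions, write `report.txt`, `report_table_1.csv`, and `report_table_2.csv` into `data/processed`.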