pdf

Here are 2,274 public repositories matching this topic...

microsoft / markitdown

Python tool for converting files and office documents to Markdown.

markdown pdf openai microsoft-office autogen langchain autogen-extension

Updated Jan 16, 2025
Python

opendatalab / MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Jan 17, 2025
Python

paperless-ngx / paperless-ngx

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

pdf machine-learning django angular ocr archiving dms document-management optical-character-recognition document-management-system

Updated Jan 19, 2025
Python

DS4SD / docling

Get your documents ready for gen AI

html markdown pdf ai convert xlsx pdf-converter docx documents pptx pdf-to-text tables document-parser pdf-to-json document-parsing

Updated Jan 17, 2025
Python

Byaidu / PDFMathTranslate

PDF scientific paper translation with preserved formats: AI-based, full-text bilingual translation of PDF documents that fully preserves the layout. Supports services such as Google/DeepL/Ollama/OpenAI and provides CLI/GUI/Docker/Zotero.

python pdf latex translation math japanese english openai translate document chinese edit modify russian korean zotero obsidian pdf2zh

Updated Jan 19, 2025
Python

ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

python pdf ocr image-processing tesseract

Updated Jan 9, 2025
Python

h2oai / h2ogpt

Private chat with a local GPT over documents, images, video, etc. 100% private, Apache 2.0. Supports Ollama, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

pdf ai embeddings private gpt generative llm chatgpt gpt4all vectorstore privategpt llama2 mixtral

Updated Jan 16, 2025
Python

py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

python pdf help-wanted pdf-documents pypdf2 pdf-manipulation pdf-parsing pdf-parser

Updated Jan 15, 2025
Python

getomni-ai / zerox

PDF to Markdown with vision models

pdf ocr

Updated Dec 18, 2024
Python

Kozea / WeasyPrint

The awesome document factory

css python html pdf converter weasyprint

Updated Jan 16, 2025
Python

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

pdf pdf-parsing table-extraction

Updated Jan 1, 2025
Python

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

python pdf font data-science ocr tesseract epub mupdf text-processing pdf-documents extract-data table-extraction text-shaping xps pymupdf

Updated Jan 18, 2025
Python

pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

python pdf parser

Updated Aug 2, 2024
Python

QuivrHQ / MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

pdf parser powerpoint docx llm

Updated Jan 17, 2025
Python

pdfarranger / pdfarranger

Small python-gtk application that helps the user merge or split PDF documents and rotate, crop, and rearrange their pages using an interactive and intuitive graphical interface.

linux pdf gtk python3 gtk3