Use PLAYA instead of pdfminer #1226

Draft · wants to merge 34 commits into base: develop

dhdaines (Contributor) commented Nov 20, 2024

So... I went ahead and rewrote large parts of pdfminer.six, because I kept having nightmares about being back in Software Engineering 101 every time I looked at its code. The result is PLAYA, which does less stuff than pdfminer.six but I believe does it somewhat better (and about 20% faster).

This PR uses it, and also as a consequence fixes a few longstanding issues due to pdfminer's quirks. Some of these quirks have not been fixed yet (e.g. the placement of things relative to the MediaBox, lack of actual support for pattern color spaces) but should be soon.

On the downside, LAParams no longer exists and thus cannot be used. What it actually did was mostly just change the ordering of items in the page, and do some heuristic detection of whitespace in text, replicating things that pdfplumber was already doing. (in general this is true of all the "layout analysis" pdfminer did)
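
To make concrete the kind of heuristic meant here, below is a minimal sketch of gap-based whitespace detection: characters are sorted by their left edge, and a new word starts whenever the horizontal gap to the previous character exceeds a tolerance. The `Char` type, the `group_words` function, and the `x_tolerance` default are all assumptions made up for illustration; this is not code from pdfplumber, pdfminer.six, or PLAYA.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical minimal character box: text plus left/right x-coordinates."""
    text: str
    x0: float
    x1: float


def group_words(chars, x_tolerance=3.0):
    """Split characters into words wherever the horizontal gap exceeds x_tolerance."""
    words = []
    current = []
    prev = None
    # Sorting by the left edge imposes a reading order regardless of input order.
    for ch in sorted(chars, key=lambda c: c.x0):
        if prev is not None and ch.x0 - prev.x1 > x_tolerance:
            # Gap is wide enough to count as whitespace: close the current word.
            words.append("".join(c.text for c in current))
            current = []
        current.append(ch)
        prev = ch
    if current:
        words.append("".join(c.text for c in current))
    return words


chars = [
    Char("i", 16.0, 18.0),  # deliberately out of order
    Char("H", 10.0, 15.0),
    Char("y", 30.0, 35.0),
    Char("o", 36.0, 41.0),
]
print(group_words(chars))  # ['Hi', 'yo']
```

A real implementation would also have to handle multiple lines, rotated text, and varying font sizes, which is where most of the complexity lives.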

I have tried to keep the API reasonable and compact so that it could ultimately be reimplemented on some other PDF parser. Note however that the API is subject to change - this PR is using the "eager" API which is kind of custom made for pdfplumber and also retains some pdfplumber quirks, and thus might not stick around.

Do not merge this, for obvious reasons! It's here in case you or anyone somehow feel the desire to play with it.

jsvine (Owner) commented Nov 20, 2024

Fascinating! Thank you for sharing. An idle thought: What if pdfplumber could allow users to choose their parsing backend? Would require pdfplumber to develop some additional abstractions, but might be a neat way to support more experimentation like this.

dhdaines (Contributor, Author) commented:

> Fascinating! Thank you for sharing. An idle thought: What if pdfplumber could allow users to choose their parsing backend? Would require pdfplumber to develop some additional abstractions, but might be a neat way to support more experimentation like this.

This wouldn't be terribly hard to do - it would be a useful exercise as some of the representations used by pdfplumber are inadvertently specific to pdfminer.six. I think it would be worthwhile for pdfplumber to explicitly define its data models, whether it's with pydantic or something else (you could just make a JSON Schema for instance).
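
As a rough sketch of what an explicit data model plus a swappable backend could look like, here is a stdlib-only version using a dataclass and a `typing.Protocol`. Everything in it (`CharRecord`, `Backend`, `FakeBackend`, `extract_text`, and the choice of fields) is hypothetical and chosen for illustration; it is not pdfplumber's or PLAYA's actual API.

```python
from dataclasses import asdict, dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class CharRecord:
    """Hypothetical backend-neutral character record."""
    text: str
    x0: float
    top: float
    x1: float
    bottom: float
    fontname: str


class Backend(Protocol):
    """Hypothetical interface that any parsing backend would implement."""

    def chars(self, page_number: int) -> Iterator[CharRecord]: ...


class FakeBackend:
    """Stand-in backend with canned data, to show the interface in use."""

    def chars(self, page_number: int) -> Iterator[CharRecord]:
        yield CharRecord("A", 10.0, 20.0, 16.0, 32.0, "Helvetica")


def extract_text(backend: Backend, page_number: int) -> str:
    """Consumer code depends only on the Backend protocol, not on a parser."""
    return "".join(c.text for c in backend.chars(page_number))


backend = FakeBackend()
print(extract_text(backend, 1))
# asdict() gives a plain dict, which is easy to validate against a JSON Schema.
print(asdict(next(backend.chars(1))))
```

The point of the protocol is that the extraction code never imports a specific parser, so a pdfminer.six-backed or PLAYA-backed class could be dropped in as long as it yields the same records.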

The goal of PLAYA is just to be a Pythonic and lazy wrapper around the internals of PDF. Obviously pdfplumber (and Camelot, unstructured.io, and so on) is what you want for actual information extraction.

(I will probably change the recursive acronym to PLAYA is a LAzY Analyzer for PDF 🤣)

dhdaines (Contributor, Author) commented:

I may wish to promote this to a real PR shortly (awaiting a release of PLAYA that will fix a couple of important bugs).

PLAYA is much more robust to broken PDFs than pdfminer.six, supports color spaces and patterns more correctly, and is also significantly faster.

For a 486-page PDF document, running extract_words using pdfminer takes 1:46 minutes on my (old) computer.

With PLAYA it takes 1:16 minutes ... about 28% less running time!
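
For reference, here is the arithmetic behind that 28% figure, assuming the times above are in minutes:seconds. The `to_seconds` helper is just an illustrative name. Note that 28% is the reduction in running time; expressed as a throughput ratio, the same numbers come out higher.

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a 'minutes:seconds' string to whole seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


pdfminer_s = to_seconds("1:46")  # 106 seconds
playa_s = to_seconds("1:16")     # 76 seconds

# Fraction of the original running time that was saved.
time_saved = (pdfminer_s - playa_s) / pdfminer_s
# How many times more work gets done per unit of time.
throughput_ratio = pdfminer_s / playa_s

print(f"{time_saved:.1%} less time, {throughput_ratio:.2f}x throughput")
# 28.3% less time, 1.39x throughput
```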

dhdaines changed the title from "Use PLAYA instead of pdfminer (draft! do not merge!)" to "Use PLAYA instead of pdfminer" on Dec 15, 2024
dhdaines marked this pull request as ready for review December 15, 2024 18:12
dhdaines marked this pull request as draft December 15, 2024 18:41

jsvine (Owner) commented Dec 16, 2024

Really neat to see you developing this so rapidly, and great to hear about that speedup.

dhdaines (Contributor, Author) commented:

> Really neat to see you developing this so rapidly, and great to hear about that speedup.

Thanks! I keep on finding interesting bugs in pdfminer.six, unfortunately... these ones are fixed in PLAYA:

pdfminer/pdfminer.six#1065
pdfminer/pdfminer.six#1067

This one isn't yet (and it's kind of nasty since it causes text extraction to simply fail silently on some files):

pdfminer/pdfminer.six#1072
