Use PLAYA instead of pdfminer #1226

dhdaines · 2024-11-20T05:26:59Z

So... I went ahead and rewrote large parts of pdfminer.six, because I kept having nightmares about being back in Software Engineering 101 every time I looked at its code. The result is PLAYA, which does less stuff than pdfminer.six but I believe does it somewhat better (and about 20% faster).

This PR uses it, and also as a consequence fixes a few longstanding issues due to pdfminer's quirks. Some of these quirks have not been fixed yet (e.g. the placement of things relative to the MediaBox, lack of actual support for pattern color spaces) but should be soon.

On the downside, LAParams no longer exists and thus cannot be used. What it actually did was mostly just change the ordering of items in the page and do some heuristic detection of whitespace in text, replicating things that pdfplumber was already doing. (In general this is true of all the "layout analysis" pdfminer did.)

I have tried to keep the API reasonable and compact so that it could ultimately be reimplemented on some other PDF parser. Note however that the API is subject to change - this PR is using the "eager" API which is kind of custom made for pdfplumber and also retains some pdfplumber quirks, and thus might not stick around.

Do not merge this, for obvious reasons! It's here in case you or anyone somehow feel the desire to play with it.

jsvine · 2024-11-20T12:33:02Z

Fascinating! Thank you for sharing. An idle thought: What if pdfplumber could allow users to choose their parsing backend? Would require pdfplumber to develop some additional abstractions, but might be a neat way to support more experimentation like this.

dhdaines · 2024-11-20T14:18:05Z

> Fascinating! Thank you for sharing. An idle thought: What if pdfplumber could allow users to choose their parsing backend? Would require pdfplumber to develop some additional abstractions, but might be a neat way to support more experimentation like this.

This wouldn't be terribly hard to do - it would be a useful exercise as some of the representations used by pdfplumber are inadvertently specific to pdfminer.six. I think it would be worthwhile for pdfplumber to explicitly define its data models, whether it's with pydantic or something else (you could just make a JSON Schema for instance).
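To make the idea concrete, here is a minimal sketch of what an explicit data model plus a swappable backend interface could look like. It uses only the standard library (a dataclass rather than pydantic). The names `Char`, `ParserBackend`, `FakeBackend` and `iter_chars` are hypothetical and chosen for illustration; none of them are part of pdfplumber, pdfminer.six or PLAYA.

```python
from dataclasses import asdict, dataclass
from typing import Iterator, Protocol, runtime_checkable


@dataclass(frozen=True)
class Char:
    """Hypothetical backend-neutral character record (not pdfplumber's real schema)."""

    text: str
    x0: float
    top: float
    x1: float
    bottom: float
    page_number: int


@runtime_checkable
class ParserBackend(Protocol):
    """Hypothetical interface that an adapter for any PDF parser would implement."""

    def iter_chars(self, path: str) -> Iterator[Char]:
        ...


class FakeBackend:
    """Stand-in backend yielding fixed data, used only to exercise the interface."""

    def iter_chars(self, path: str) -> Iterator[Char]:
        yield Char("H", 10.0, 20.0, 16.0, 32.0, 1)
        yield Char("i", 16.0, 20.0, 19.0, 32.0, 1)


def extract_text(backend: ParserBackend, path: str) -> str:
    """Consumer code depends only on the protocol, never on a specific parser."""
    return "".join(c.text for c in backend.iter_chars(path))


text = extract_text(FakeBackend(), "example.pdf")
first = asdict(next(FakeBackend().iter_chars("example.pdf")))
```

Because `Char` is a plain dataclass, `asdict` gives a dictionary that could be validated against a JSON Schema, which is one way the "explicitly defined data model" could be checked independently of the parser.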

The goal of PLAYA is just to be a Pythonic and lazy wrapper around the internals of PDF. Obviously pdfplumber (and Camelot, and unstructured.io, and so on) is what you want for actual information extraction.

(I will probably change the recursive acronym to PLAYA is a LAzY Analyzer for PDF 🤣)

dhdaines · 2024-12-13T14:24:45Z

I may wish to promote this to a real PR shortly (awaiting a release of PLAYA that will fix a couple important bugs).

PLAYA is much more robust to broken PDFs than pdfminer.six, supports color spaces and patterns more correctly, and is also significantly faster.

For a 486-page PDF document, running extract_words using pdfminer takes 1 min 46 s on my (old) computer.

With PLAYA it takes 1 min 16 s ... a 28% speedup!
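For readers checking the arithmetic, the 28% figure is the reduction in wall-clock time for the two timings quoted above (106 s and 76 s). The short sketch below reproduces it and also shows the equivalent throughput gain. The helper name `to_seconds` is just for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


pdfminer_s = to_seconds("1:46")  # 106 seconds
playa_s = to_seconds("1:16")     # 76 seconds

# Fraction of run time saved: (106 - 76) / 106
reduction = (pdfminer_s - playa_s) / pdfminer_s

# Same measurement expressed as extra pages processed per unit time: 106 / 76 - 1
throughput_gain = pdfminer_s / playa_s - 1

print(f"time reduction: {reduction:.0%}, throughput gain: {throughput_gain:.0%}")
```

Both numbers describe the same measurement. Run time drops by about 28%, which corresponds to roughly 39% more pages per second.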

jsvine · 2024-12-16T03:49:43Z

Really neat to see you developing this so rapidly, and great to hear about that speedup.

dhdaines · 2024-12-16T04:18:25Z

> Really neat to see you developing this so rapidly, and great to hear about that speedup.

Thanks! I keep on finding interesting bugs in pdfminer.six, unfortunately... these ones are fixed in PLAYA:

pdfminer/pdfminer.six#1065
pdfminer/pdfminer.six#1067

This one isn't yet (and it's kind of nasty since it causes text extraction to simply fail silently on some files):

pdfminer/pdfminer.six#1072

dhdaines added 26 commits December 12, 2024 12:42

feat: use playa instead of pdfminer

dfa4b9e

feat: playa does the right thing for mcids

ddd0532

fix: playa exposes ncs/scs

87a0fa3

fix: update to handle parsed pages

fffd551

chore: format, lint

e64d509

fix(deps): switch to unreleased playa

5991da3

feat: playa exposes these now (but... for how long)

f5ac9f8

fix: new API

b05d6e2

feat!: remove custom LAParams (just use pdfminer if you want them)

36e28cb

refactor!: another useless pdfminer API removed

d6b5106

fix: numbertree is just iterable

c0f50c2

refactor!: remove structure as it is in playa

a2aeeb3

fix: minimally support (not quite working) new PLAYA API

ac185d6

fix: some updates for latest playa

8d70e02

fix: serialize namedtuple colors

46a8ba2

fix: adjust a few things for playa

b370322

fix: add page numbers to structure tests

3188fba

fix: updated playa names

be99434

fix: update for PLAYA 0.1

59c255b

fix: lint and format and such

9449b11

fix(deps): playa is on pypi now

07bfadf

fix(tests): PLAYA fixed its colors

4de84f6

fix(deps): messed up playa again...

cdd3895

fix(tests): back to previous way of formatting colors (for now)

d27fcb9

fix: no longer needs repair as mediabox is normalized

7e7d354

fix: remove unused import

b3da221

dhdaines force-pushed the playa branch from 4aeb6de to b3da221 Compare December 12, 2024 17:42

feat: mostly implement using lazy api

b0b7e6c

dhdaines added 3 commits December 13, 2024 00:22

feat: complete the reimplementation using playa lazy api

6de85ed

feat: lightly wrap playa structure

56bcab6

fix: lint

4117380

dhdaines changed the title ~~Use PLAYA instead of pdfminer (draft! do not merge!)~~ Use PLAYA instead of pdfminer Dec 15, 2024

dhdaines marked this pull request as ready for review December 15, 2024 18:12

docs: update README and CHANGELOG

718a558

dhdaines marked this pull request as draft December 15, 2024 18:41

jsvine mentioned this pull request Dec 16, 2024

Filter out invisible text rendered with Tr(3) #1230

Closed

dhdaines added 3 commits December 27, 2024 12:31

feat: expose render_mode (fixes: jsvine#1230)

0e6dc30

fix: correct the "size" of rotated glyphs

5caaefb

fix(tests): new and more correct text objects

c9d3848
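As a rough illustration of what exposing the render mode enables for #1230: in PDF, text rendering mode 3 (`3 Tr`) means the glyphs are neither filled nor stroked, so the text is invisible. The sketch below filters such characters out of a list of plain dictionaries. The `render_mode` key is taken from the commit message above, but the exact shape of the character objects in this branch is an assumption, so this does not call pdfplumber at all.

```python
from typing import Any, Dict, List

# PDF text rendering mode 3: neither fill nor stroke (invisible text).
INVISIBLE_RENDER_MODE = 3


def visible_chars(chars: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only characters whose (assumed) 'render_mode' key is not 3.

    Characters without the key are treated as mode 0 (fill), the PDF default.
    """
    return [c for c in chars if c.get("render_mode", 0) != INVISIBLE_RENDER_MODE]


# Hypothetical character records, standing in for real extraction output.
sample = [
    {"text": "A", "render_mode": 0},
    {"text": "B", "render_mode": 3},  # invisible, e.g. a hidden OCR layer
    {"text": "C"},                    # no key: assumed visible
]

kept = "".join(c["text"] for c in visible_chars(sample))
```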
