Not all tables being extracted in this pdf #858

pdille · 2023-04-10T22:22:27Z

I have a PDF whose final page looks like the attached file: test.pdf

The page contains two tables, neither particularly large. Using extract_tables(), only the second table is extracted; the first appears to be ignored.

Running the debug_tablefinder() method produces the image below, which shows how pdfplumber interpreted the tables on the page. It appears that the horizontal lines of the first table aren't being detected?

I tried some of the table_settings options but saw no difference, though perhaps I wasn't using the right settings. I also tried using Ghostscript to create a repaired file in case something in the original was corrupted. Same result.

Any help getting all the content extracted from this PDF using pdfplumber is greatly appreciated. This library has been my go-to these days for all my automated PDF parsing needs, and it is the only library that has been able to handle files that Tabular couldn't. I hope I can get to the bottom of this roadblock. Thanks!

Here's a simplified version of the code I used:

import pdfplumber

with pdfplumber.open("test.pdf") as pdf:
    page = pdf.pages[0]  # test.pdf holds only the problem page
    tables = page.extract_tables()

jsvine (Maintainer) · 2023-04-11T21:56:55Z

Hi @pdille, and happy to hear that pdfplumber has, generally, been serving you well. In this instance, you've run into a common (and admittedly somewhat user-frustrating) edge case, in which things that look like horizontal lines or very thin black rectangles are actually more complex paths, classified as curve objects by pdfminer.six. Digging a bit deeper, the PDF is drawing each of those horizontal pseudo-lines as a backwards-C-shaped path that looks a bit like this (ignore the small gap on the right side):

──────────────────────┐
──────────────────────┘

... where that vertical distance is very small, just 0.96 units.
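To check whether a page has this problem, it can help to count how its objects were classified and to look for curves with a very small height. The following is a minimal sketch, not part of pdfplumber itself: `thin_curves` and `report` are hypothetical helper names, and the 1.0-unit threshold is an assumption chosen only because the paths here are 0.96 units tall.

```python
def thin_curves(curves, max_height=1.0):
    """Return curve objects whose bounding box is at most max_height tall.

    Expects pdfplumber-style object dicts with "top" and "bottom" keys.
    The threshold is an assumption; tune it for your own files.
    """
    return [c for c in curves if (c["bottom"] - c["top"]) <= max_height]


def report(path, page_index=0):
    """Print per-type object counts for one page (requires pdfplumber)."""
    import pdfplumber  # imported lazily so thin_curves() has no dependencies

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        print("lines:", len(page.lines))
        print("rects:", len(page.rects))
        print("curves:", len(page.curves))
        print("thin curves:", len(thin_curves(page.curves)))
```

Calling `report("test.pdf")` on a file like this one would be expected to show few or no `lines` for the first table and a number of thin `curves` instead, though the exact counts depend on the PDF.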

The good news is that there's a way to handle this: #808 (comment)
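The linked comment isn't reproduced here, so the following is only a sketch of one way to apply that kind of workaround, assuming it amounts to handing the page's curve objects to the table finder as explicit lines. The `explicit_horizontal_lines` and `explicit_vertical_lines` table settings are documented as accepting line, rect, or curve objects; `settings_with_curves` and `extract_all_tables` are hypothetical helper names.

```python
def settings_with_curves(curves):
    """Build table_settings that pass curve objects in as explicit lines.

    Assumption: the curves trace the table's ruling lines. Decorative
    curves elsewhere on the page could introduce spurious cell borders.
    """
    curves = list(curves)
    return {
        "explicit_horizontal_lines": curves,
        "explicit_vertical_lines": curves,
    }


def extract_all_tables(path, page_index=0):
    """Extract tables from one page using the settings above."""
    import pdfplumber  # imported lazily; only needed for real PDFs

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        settings = settings_with_curves(page.curves)
        return page.extract_tables(table_settings=settings)
```

With settings like these the default "lines" strategies still apply, so the explicit lines are used in addition to whatever pdfplumber detects on its own.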

This is becoming a frequent-enough frustration that I think it's worth adjusting pdfplumber's approach to include all strictly-horizontal/vertical curve segments in its table-detection process.

pdille (Author) · 2023-04-12T00:23:27Z

Thanks, @jsvine! The solution in that thread worked great. I'll have to test how well it behaves across the daily PDFs I'm extracting (they all have roughly the same layout), in the hope that this can be a permanent change for this set of inputs and the automation can continue to chug along without too much hand-holding. :-)

I'm all for improving pdfplumber's default table-detection process, so I support the adjustment you suggest. I also see in that thread that some page layouts may trip this solution up. A one-size-fits-all solution for the vast array of PDFs out in the wild is unlikely to be possible, but if these curves can be detected without false positives, then awesome.

Anyway, thanks again for this awesome tool. It's been a life saver for my work.

1 reply

jsvine Apr 13, 2023
Maintainer

As of the newly-released v0.9.0, pdfplumber now uses curve segments (.curve_edges) as part of the default table-detection algorithm. It sticks to 0/90/180/270-degree-oriented segments, which should help reduce false positives. So you should now be able to revert to your original code, although it should also still work with the new code.
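To illustrate the idea of keeping only 0/90/180/270-degree segments, here is a small standalone sketch. It is not pdfplumber's actual implementation: `axis_aligned_segments` is a hypothetical function, and the coordinates are made up to resemble the backwards-C path described earlier in this thread (two long horizontals joined by a 0.96-unit vertical).

```python
def axis_aligned_segments(points):
    """Split a path into segments and keep only horizontal/vertical ones.

    points: list of (x, y) tuples in drawing order.
    Returns a list of (orientation, start, end) tuples, where
    orientation is "h" or "v". Diagonal and zero-length segments
    are dropped.
    """
    segments = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 == y1 and x0 != x1:
            segments.append(("h", (x0, y0), (x1, y1)))
        elif x0 == x1 and y0 != y1:
            segments.append(("v", (x0, y0), (x1, y1)))
    return segments


# Hypothetical coordinates for one backwards-C pseudo-line.
backwards_c = [(50.0, 100.0), (300.0, 100.0), (300.0, 100.96), (50.0, 100.96)]

for orientation, start, end in axis_aligned_segments(backwards_c):
    print(orientation, start, end)
```

For a path like this, the two horizontal segments are what a table finder would want to treat as a ruling line, while a genuinely curved or diagonal path contributes nothing, which is how restricting the orientation helps limit false positives.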

pdille (Author) · 2023-04-13T22:00:11Z

@jsvine Awesome, many thanks for the quick responses and the latest release! I can confirm that v0.9.0 correctly parses the file in question (both the test one posted here and the full version of it) without needing the table settings from the thread you linked to. The default settings work as you described.

Thanks again!
