Not all tables being extracted in this pdf #858
-
Hi @pdille and happy to hear that
... where that vertical distance is very small, just 0.96 units. The good news is that there's a way to handle this: #808 (comment) This is becoming a frequent-enough frustration that I think it's worth adjusting
-
Thanks, @jsvine! The solution in that thread worked great. I'll have to test how well it behaves across the daily PDFs I'm extracting (they all have roughly the same layout), in the hope that this can be a permanent change for this set of inputs and the automation can continue to chug along without too much hand-holding. :-) I'm all for improving pdfplumber's default table detection, so I support the adjustment you suggest. I also see in that thread that some page layouts may trip this solution up. A one-size-fits-all solution for the vast array of PDFs out in the wild is probably not possible, but if these curves can be detected without false positives, then awesome. Anyway, thanks again for this awesome tool. It's been a lifesaver for my work.
-
@jsvine Awesome, many thanks for the quick responses and the latest release! I can confirm that v0.9.0 correctly parses the file in question (both the test file posted here and the full version of it) without the need for the table settings from the thread you linked to. The default settings work as you described. Thanks again!
-
I have a PDF whose final page looks like the one attached: test.pdf
The page contains two tables, neither particularly large. Using extract_tables(), only the second table is extracted; the first appears to be ignored.
Running the debug_tablefinder() method produces the image below, showing how pdfplumber interpreted the tables on the page. It appears that the horizontal lines of the first table aren't being detected?
I tried some of the table_settings options but saw no difference, though perhaps I wasn't using the right settings. I also tried using Ghostscript to create a repaired file in case something was corrupted in the original. Same result.
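For anyone trying to reproduce this kind of comparison, here is a minimal sketch of how a set of table_settings can be checked against the debug image. The helper name debug_tables, the output file name, and the tolerance values are all hypothetical choices for illustration, not settings taken from the linked thread, and nothing here is claimed to fix this particular file.

```python
# Illustrative settings only: the strategy names are pdfplumber's "lines"
# strategy; the tolerance numbers are arbitrary values to experiment with.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 5,
    "join_tolerance": 5,
    "intersection_tolerance": 5,
}


def debug_tables(pdf_path, page_index=-1, table_settings=None, out_path="debug.png"):
    """Save a table-finder debug image and return the bounding boxes of detected tables.

    Hypothetical helper: page_index=-1 selects the final page of the PDF.
    """
    import pdfplumber  # imported lazily so the settings above can be reused without it

    settings = TABLE_SETTINGS if table_settings is None else table_settings
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # Draw the detected lines, intersections, and cells over the page image.
        page.to_image(resolution=150).debug_tablefinder(settings).save(out_path)
        # One bounding box (x0, top, x1, bottom) per table found with these settings.
        return [table.bbox for table in page.find_tables(settings)]
```

Calling debug_tables("test.pdf") with a few different settings and counting the returned boxes is a quick way to see whether a change actually makes the first table appear.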
Any help getting all the content extracted from this PDF using pdfplumber is greatly appreciated. This library has been my go-to these days for all my automated PDF parsing needs, and it is the only library that has been able to handle files that Tabular wasn't able to. I hope I can get to the bottom of this roadblock. Thanks!
Here's a simplified version of the code I used:
import pdfplumber

with pdfplumber.open("test.pdf") as pdf:  # closes the file when done
    page = pdf.pages[0]
    tables = page.extract_tables()
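The reply in this thread mentions a vertical distance of only 0.96 units. As a rough illustration, and assuming that distance refers to the height of thin shapes drawn where the table's horizontal rules should be, the sketch below picks out such objects from a list of dicts shaped like pdfplumber's page objects. The function name, the 1.0-unit threshold, and the sample data are made up for the example; this is not necessarily the approach described in #808.

```python
def thin_object_tops(objects, max_height=1.0):
    """Return sorted, de-duplicated `top` positions of objects no taller than max_height.

    `objects` is assumed to be a list of dicts with numeric "top" and "bottom"
    keys, the shape pdfplumber uses for entries in page.rects and page.curves.
    """
    tops = {
        round(float(obj["top"]), 2)
        for obj in objects
        if abs(float(obj["bottom"]) - float(obj["top"])) <= max_height
    }
    return sorted(tops)


# Made-up sample objects: two thin "rules" (0.96 units tall) and one ordinary box.
sample = [
    {"x0": 50, "x1": 500, "top": 100.0, "bottom": 100.96},
    {"x0": 50, "x1": 500, "top": 120.0, "bottom": 120.96},
    {"x0": 50, "x1": 500, "top": 200.0, "bottom": 260.0},
]
candidate_lines = thin_object_tops(sample)
print(candidate_lines)
```

With a real page, the same function could be applied to page.rects + page.curves and the resulting positions passed as explicit_horizontal_lines in table_settings. Per the later reply in this thread, v0.9.0 parsed this file with the default settings, so a workaround like this would mainly matter on older versions.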