how to extract table with invisible lines #123

Closed
cqluohong opened this issue Jun 25, 2019 · 8 comments

Comments

@cqluohong

cqluohong commented Jun 25, 2019

In the table below there are 4 columns per row, but I can't get the correct results when I use pdfplumber to extract it.

@cqluohong
Author

[Attached image: aa (1)]

@OisinMoran
Contributor

Have you tried cropping to just the table you want to extract before running the extraction?
Another thing to try would be playing around with the parameters (it looks like increasing the snap_tolerance might work here).
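
A minimal sketch of both suggestions, assuming a hypothetical file path and bounding box (neither comes from this thread). `page.crop()` and the `snap_tolerance` table setting belong to pdfplumber; the helper names are made up for illustration.

```python
def table_settings(snap_tolerance=3):
    """Build a table_settings dict (3 is pdfplumber's default snap_tolerance)."""
    return {"snap_tolerance": snap_tolerance}


def extract_cropped_table(pdf_path, page_number, bbox, snap_tolerance=6):
    """Crop one page to bbox = (x0, top, x1, bottom) and extract one table.

    page_number is 1-based; the bbox values are assumptions you must
    measure on your own PDF.
    """
    # Third-party import kept local so table_settings() has no dependency.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is a 0-indexed list
        cropped = page.crop(bbox)
        return cropped.extract_table(table_settings(snap_tolerance))


# Hypothetical usage:
# rows = extract_cropped_table("report.pdf", 47, (40, 120, 560, 700))
```

Raising `snap_tolerance` makes nearly aligned lines snap together, so it is worth trying a few values rather than treating 6 as correct.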

@cqluohong
Author

cqluohong commented Jun 28, 2019

> Have you tried cropping to just the table you want to extract before running the extraction?
> Another thing to try would be playing around with the parameters (it looks like increasing the snap_tolerance might work here).

Yes, I tried that. In the vertical direction, extracting the table works. Does pdfplumber have any OCR capability? I also tried Camelot, which uses it, but it has a bug when handling Chinese text. I am confused!

@OisinMoran
Contributor

So pdfplumber "works best on machine-generated, rather than scanned, PDFs". There is no OCR capability. If you can share the actual PDF you are trying to extract a table from, it would help with debugging the issue.
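
One rough way to tell whether a page is scanned is to count its text characters and images. This is only a heuristic sketch: the function names are hypothetical, and it relies on pdfplumber's `page.chars` and `page.images` lists.

```python
def looks_scanned(char_count, image_count):
    """Heuristic: no text characters plus at least one image suggests a scan."""
    return char_count == 0 and image_count > 0


def scanned_pages(pdf_path):
    """Return the 1-based numbers of pages that look scanned."""
    # Third-party import kept local so looks_scanned() has no dependency.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [
            page.page_number
            for page in pdf.pages
            if looks_scanned(len(page.chars), len(page.images))
        ]
```

Pages flagged this way would need a separate OCR step before any table extraction, since pdfplumber has no text to work with there.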

@cqluohong
Author

> So pdfplumber "works best on machine-generated, rather than scanned, PDFs". There is no OCR capability. If you can share the actual PDF you are trying to extract a table from it can help with debugging the issue.

430027-北科光大-2017年年度报告.pdf
I want to extract the tables on pages 47-57.
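
Restricting extraction to that page range could look like the sketch below. The helper names and the `settings` argument are assumptions for illustration; `pdf.pages` is a 0-indexed list, so pages 47-57 map to indices 46-56.

```python
def page_indices(first_page, last_page):
    """Convert an inclusive 1-based page range to 0-based list indices."""
    return range(first_page - 1, last_page)


def extract_tables_in_range(pdf_path, first_page, last_page, settings=None):
    """Return {page_number: [table, ...]} for the given inclusive page range."""
    # Third-party import kept local so page_indices() has no dependency.
    import pdfplumber

    results = {}
    with pdfplumber.open(pdf_path) as pdf:
        for index in page_indices(first_page, last_page):
            page = pdf.pages[index]
            results[page.page_number] = page.extract_tables(settings or {})
    return results


# Hypothetical usage:
# tables = extract_tables_in_range("report.pdf", 47, 57, {"snap_tolerance": 6})
```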

@luoqygit

luoqygit commented Jul 5, 2019

If you don't mind using an alpha version, you can switch to 0.6-alpha and pass 'snap_y_tolerance' when calling 'page.extract_tables()':
tabs = page.extract_tables({'snap_y_tolerance': 6})
You can replace 6 with whatever value works for your PDF.

Please refer to 0.6-alpha documents and #51 for details.

@jsvine
Owner

jsvine commented Jul 18, 2020

Thanks @luoqygit and @OisinMoran! Cleaning up old issues. Feel free to reopen @cqluohong if you'd like to continue the discussion.

@jsvine jsvine closed this as completed Jul 18, 2020
@clj55

clj55 commented Jun 21, 2024

Hi, I'm also having trouble extracting tables with invisible lines. I can't crop the page to fixed table coordinates because I'm running the program on multiple PDFs where the table can appear in different positions.

Customer Information.pdf

I tried it with these table_settings:

table_settings = {
    "vertical_strategy": "text",
    "text_keep_blank_chars": True,
    "horizontal_strategy": "text",
}

But it recognised the paragraph text as a table too
[Attached image: customer_info]
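
One possible workaround, offered as an untested heuristic rather than a confirmed fix: keep the "text" strategies, then discard candidates that do not look tabular. The thresholds `min_cols` and `min_fill` and both function names are assumptions; `page.find_tables()`, `table.extract()` and `table.bbox` are pdfplumber's own.

```python
def looks_like_table(rows, min_cols=3, min_fill=0.6):
    """Keep a candidate only if it is wide enough and mostly non-empty."""
    if not rows:
        return False
    width = max(len(row) for row in rows)
    if width < min_cols:
        return False
    cells = [cell for row in rows for cell in row]
    # Empty cells come back as None or as empty/whitespace strings.
    filled = sum(1 for cell in cells if cell and cell.strip())
    return filled / len(cells) >= min_fill


def extract_likely_tables(pdf_path, settings):
    """Return (page_number, bbox, rows) for candidates that pass the filter."""
    # Third-party import kept local so looks_like_table() has no dependency.
    import pdfplumber

    kept = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.find_tables(settings):
                rows = table.extract()
                if looks_like_table(rows):
                    kept.append((page.page_number, table.bbox, rows))
    return kept
```

Because each kept entry carries its `bbox`, the table's position is discovered per PDF instead of being hard-coded. Whether paragraph text actually fails this filter depends on the document, so the thresholds will likely need tuning.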
