how to extract table with invisible lines #123

cqluohong · 2019-06-25T12:28:13Z

In the table below there are 4 columns per row, but I can't get the correct result when I use pdfplumber to extract it.

cqluohong · 2019-06-25T12:32:54Z

OisinMoran · 2019-06-27T13:26:04Z

Have you tried cropping to just the table you want to extract before running the extraction?
Another thing to try would be playing around with the parameters (it looks like increasing the snap_tolerance might work here).
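A minimal sketch of both suggestions, assuming a recent pdfplumber release in which `Page.crop()` takes an `(x0, top, x1, bottom)` bounding box and `extract_table()` accepts a `table_settings` dict. The helper names, the `"text"` strategies (chosen here because the table has no drawn lines), and the tolerance value are illustrative assumptions, not something confirmed in this thread; they would need tuning per document.

```python
def extract_cropped_table(page, bbox, snap_tolerance=6):
    """Crop `page` to `bbox` and extract the largest table found there.

    `page` is expected to be a pdfplumber Page and `bbox` an
    (x0, top, x1, bottom) tuple in PDF points. The default tolerance is an
    illustrative guess, not a recommended value.
    """
    cropped = page.crop(bbox)
    settings = {
        # Assumption: infer cell edges from text alignment, since the
        # table in question has no visible ruling lines.
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        # Edges closer together than this many points are snapped together.
        "snap_tolerance": snap_tolerance,
    }
    return cropped.extract_table(table_settings=settings)


def main(path, page_number, bbox):
    # Imported here so the helper above can be exercised without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is 0-indexed
        for row in extract_cropped_table(page, bbox) or []:
            print(row)
```

`extract_table()` returns `None` when nothing is detected, which is why `main` falls back to an empty list.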

cqluohong · 2019-06-28T07:39:16Z

> Have you tried cropping to just the table you want to extract before running the extraction?
> Another thing to try would be playing around with the parameters (it looks like increasing the snap_tolerance might work here).

Yes, I tried that. In the vertical direction, extracting the table works. Is there any OCR capability in pdfplumber? I also tried camelot, which uses it, but there is a bug when it handles Chinese text. I am confused!

OisinMoran · 2019-06-28T08:10:06Z

So pdfplumber "works best on machine-generated, rather than scanned, PDFs". There is no OCR capability. If you can share the actual PDF you are trying to extract a table from it can help with debugging the issue.
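Since there is no OCR, one quick check is whether a page has any extractable characters at all. The sketch below assumes the `page.chars` list that pdfplumber exposes on each page; the function names are hypothetical. A page with no characters is likely a scanned image and would need a separate OCR tool before pdfplumber can help.

```python
def has_text_layer(page, min_chars=1):
    """Return True if the page exposes at least `min_chars` character objects."""
    return len(page.chars) >= min_chars


def pages_without_text(pdf):
    """Return the 1-based numbers of pages that have no extractable characters."""
    return [
        number
        for number, page in enumerate(pdf.pages, start=1)
        if not has_text_layer(page)
    ]
```

With a real file, `pdf` would come from `pdfplumber.open(path)`.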

cqluohong · 2019-07-01T09:21:23Z

> So pdfplumber "works best on machine-generated, rather than scanned, PDFs". There is no OCR capability. If you can share the actual PDF you are trying to extract a table from it can help with debugging the issue.

430027-北科光大-2017年年度报告.pdf
I want to extract the tables on pages 47-57.

luoqygit · 2019-07-05T10:49:10Z

If you don't mind using an alpha version, you can switch to 0.6-alpha and pass 'snap_y_tolerance' when calling 'page.extract_tables()':
tabs = page.extract_tables({'snap_y_tolerance': 6})
You can replace 6 with whatever value works for your document.

Please refer to the 0.6-alpha documentation and #51 for details.
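Applied to the page range requested above (47-57), the call might be wrapped as follows. This is a sketch that assumes a pdfplumber version supporting the `snap_y_tolerance` setting, as described for 0.6-alpha; the helper name and the default value of 6 are illustrative only.

```python
def extract_tables_from_pages(pdf, first_page, last_page, snap_y_tolerance=6):
    """Extract all tables from an inclusive, 1-based range of pages.

    Returns a dict mapping each page number to the list of tables found on
    that page (each table being a list of rows).
    """
    settings = {"snap_y_tolerance": snap_y_tolerance}
    results = {}
    # pdf.pages is 0-indexed, so page 47 lives at index 46.
    for page in pdf.pages[first_page - 1:last_page]:
        results[page.page_number] = page.extract_tables(settings)
    return results
```

For the report discussed here this would be called as `extract_tables_from_pages(pdf, 47, 57)`, with `pdf` obtained from `pdfplumber.open(...)`.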

jsvine · 2020-07-18T16:29:53Z

Thanks @luoqygit and @OisinMoran! Cleaning up old issues. Feel free to reopen @cqluohong if you'd like to continue the discussion.

clj55 · 2024-06-21T04:15:43Z

Hi, I'm also having trouble extracting tables with invisible lines. I can't crop the page to the table's coordinates because I'm running the program on multiple PDFs in which the table can appear in different positions.

Customer Information.pdf

I tried it with these table_settings:

table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "text_keep_blank_chars": True,
}

But it recognised the paragraph text as a table too.
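One possible workaround, not proposed by anyone in this thread, is to post-filter what `extract_tables()` returns and discard results that do not look tabular. The sketch below is a pure-Python heuristic; the function names and thresholds (`min_cols`, `min_rows`, `min_fill`) are assumptions that would need tuning, and it will not catch paragraphs whose words happen to line up into many well-filled columns.

```python
def looks_like_table(rows, min_cols=3, min_rows=2, min_fill=0.5):
    """Heuristically decide whether extracted rows form a real table.

    `rows` is a list of lists of cell strings (or None), as returned for
    one table. A candidate is rejected if it has too few rows or columns,
    or if too small a fraction of its cells contain text.
    """
    if len(rows) < min_rows:
        return False
    width = max(len(row) for row in rows)
    if width < min_cols:
        return False
    # Count cells holding something other than None or whitespace.
    filled = sum(1 for row in rows for cell in row if cell and cell.strip())
    return filled / (width * len(rows)) >= min_fill


def filter_tables(tables, **thresholds):
    """Keep only the candidates that pass `looks_like_table`."""
    return [table for table in tables if looks_like_table(table, **thresholds)]
```

If the surviving candidates still include paragraphs, `page.find_tables()` returns table objects with a `bbox`, which can be used to crop each candidate and inspect it without hard-coding coordinates.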

cqluohong changed the title ~~how~~ how to extract table with invisible lines Jun 25, 2019

jsvine closed this as completed Jul 18, 2020
