This repository has been archived by the owner on Jan 6, 2025. It is now read-only.

Unable to parse tables that use multiple colors for text, background and borders #147

Closed
odedsh opened this issue Oct 15, 2018 · 6 comments

Comments

@odedsh

odedsh commented Oct 15, 2018

The attached PDF file contains a simple example of a table that I could not extract using Camelot:

Neither of these calls works:
tables = camelot.read_pdf('bugExample.pdf')
tables = camelot.read_pdf('bugExample.pdf', process_background=True)

tabula-py can parse the same example:
tables = tabula.read_pdf('bugExample.pdf')

I suspect the issue is caused by the borders not being a strong black, but I am not sure :(

Example PDF uploaded to: http://s000.tinyupload.com/?file_id=72860431408529670599

@vinayak-mehta
Contributor

Thanks for the report @odedsh! This might be due to the differently colored lines (both in foreground and background). Let me have a look into how this can be done using Lattice. We could even add this as an example and a test.

Meanwhile, you should definitely be able to get the table out using the Stream flavor!

@odedsh
Author

odedsh commented Oct 15, 2018

Unfortunately, because of the long titles in this table, reading it with the 'stream' flavor does not identify the cells correctly.

The shape I am getting with 'stream' is: <Table shape=(11, 2)>
which is incorrect; there are 4 columns in the table.

@vinayak-mehta
Contributor

Let me look into it this week.

@vinayak-mehta vinayak-mehta added this to the v0.8.0 milestone Dec 2, 2018
@vinayak-mehta vinayak-mehta removed this from the v0.8.0 milestone Dec 20, 2018
@vinayak-mehta
Contributor

@odedsh stream will work on this now. I'm getting <Table shape=(11, 4)> with the latest version.
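
To confirm the column count on your own copy of the file, a check along these lines might help. It is only a sketch: `column_count_ok` and `stream_shapes` are hypothetical helper names, and the expected count of 4 comes from the report above rather than from Camelot itself.

```python
def column_count_ok(shape, expected_cols=4):
    """Return True when a (rows, cols) shape has the expected column count.

    expected_cols=4 is taken from the report in this thread.
    """
    _rows, cols = shape
    return cols == expected_cols


def stream_shapes(path):
    """Return the shape of every table Camelot's Stream flavor finds in `path`."""
    import camelot  # imported lazily so column_count_ok works without Camelot

    tables = camelot.read_pdf(path, flavor="stream")
    return [table.shape for table in tables]


# Example usage (needs Camelot installed and the PDF on disk):
# for shape in stream_shapes("bugExample.pdf"):
#     print(shape, column_count_ok(shape))
```

With the shapes quoted in this thread, `(11, 2)` would fail the check and `(11, 4)` would pass it.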

@vinayak-mehta vinayak-mehta added this to the v0.7.0 milestone Dec 28, 2018
@vinayak-mehta
Contributor

@odedsh For lattice, the multiple colors in lines don't matter as long as they are in the foreground. To get the table out, some parameters need to be tweaked with some visual debugging. Since the initial output wasn't very clean with lattice, I plotted the lines found on the page.

$ camelot lattice -plot line bugExample.pdf

[Image: line plot produced by the command above]

Since smaller lines weren't being detected, I tweaked the scale.

$ camelot lattice -scale 60 -plot line bugExample.pdf

[Image: line plot produced by the command above]

Since the output was still not very clean, I plotted the joints to see if all line intersections were being captured.

$ camelot lattice -scale 60 -plot joint bugExample.pdf

[Image: joint plot produced by the command above]

They weren't, so I lengthened the detected lines by setting the number of iterations to 1, which makes nearby lines intersect.

$ camelot lattice -I 1 -scale 60 -plot joint bugExample.pdf

[Image: joint plot produced by the command above]

I was able to get the table out with line_size_scaling=60 (CLI: -scale 60) and iterations=1 (CLI: -I 1).
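
For anyone calling Camelot from Python rather than the CLI, the same settings could be passed roughly as follows. This is a minimal sketch, not the maintainer's code: `lattice_kwargs` and `extract_lattice` are hypothetical helpers, and `line_size_scaling` is the parameter name quoted in this thread. Later Camelot releases appear to call the same option `line_scale`, so check the documentation for your installed version and pass `scale_name="line_scale"` if needed.

```python
def lattice_kwargs(scale=60, iterations=1, scale_name="line_size_scaling"):
    """Build keyword arguments for camelot.read_pdf (hypothetical helper).

    scale_name defaults to the name quoted in this thread; newer releases
    are assumed to use "line_scale" for the same option.
    """
    return {"flavor": "lattice", scale_name: scale, "iterations": iterations}


def extract_lattice(path, **overrides):
    """Read tables from `path` using the Lattice settings built above."""
    import camelot  # imported lazily so lattice_kwargs works without Camelot

    return camelot.read_pdf(path, **lattice_kwargs(**overrides))


# Example usage (needs Camelot installed and the PDF on disk):
# tables = extract_lattice("bugExample.pdf")
# print(tables[0].shape)
```

Keeping the option name as an argument is a deliberate choice here, since the spelling seems to differ between releases.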

@odedsh
Author

odedsh commented Jan 2, 2019 via email
