Unable to parse tables that use multiple colors for text, background and borders #147
Comments
Thanks for the report @odedsh! This might be due to the differently colored lines (both in foreground and background). Let me have a look into how this can be done using Lattice. We could even add this as an example and a test. Meanwhile, you should definitely be able to get the table out using the Stream flavor!
Unfortunately, because of the long titles in this table, reading with the 'stream' flavor does not correctly identify the cells. The shape I am getting with 'stream' is:
Let me look into it this week.
@odedsh stream will work on this now. I'm getting
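To compare what each flavor returns, it can help to print the shape of every extracted table. The following is a minimal sketch, assuming camelot is installed and that `bugExample.pdf` is the file from this issue. The helper names `table_shapes` and `read_with_flavor` are hypothetical and not part of camelot.

```python
def table_shapes(tables):
    """Return the (rows, columns) shape of each extracted table."""
    # Each camelot table exposes its data as a pandas DataFrame via `.df`.
    return [table.df.shape for table in tables]


def read_with_flavor(pdf_path, flavor="stream"):
    """Read a PDF with the given camelot flavor ('stream' or 'lattice')."""
    # camelot is third-party; it is imported lazily so that table_shapes
    # above stays usable without it.
    import camelot

    return camelot.read_pdf(pdf_path, flavor=flavor)
```

Calling `table_shapes(read_with_flavor('bugExample.pdf'))` should list one shape per detected table, which makes it easy to see whether the long titles were split into the expected number of columns.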
@odedsh For lattice, the multiple colors in lines don't matter as long as they are in the foreground. To get the table out, some parameters need to be tweaked with some visual debugging.
Since the initial output wasn't very clean with lattice, I plotted the lines found on the page.
$ camelot lattice -plot line bugExample.pdf
[image: https://user-images.githubusercontent.com/4329421/50588323-87ab7980-0ea7-11e9-99d9-f870e1b9b68d.png]
Since smaller lines weren't being detected, I tweaked the scale (https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines).
$ camelot lattice -scale 60 -plot line bugExample.pdf
[image: https://user-images.githubusercontent.com/4329421/50588343-9f82fd80-0ea7-11e9-8f2e-812f18791f37.png]
Since the output was still not very clean, I plotted the joints to see if all line intersections were being captured.
$ camelot lattice -scale 60 -plot joint bugExample.pdf
[image: https://user-images.githubusercontent.com/4329421/50588363-b9244500-0ea7-11e9-85c3-4904f264191c.png]
They weren't, so I increased the line lengths by passing an iteration of 1 to make closer lines intersect.
$ camelot lattice -I 1 -scale 60 -plot joint bugExample.pdf
[image: https://user-images.githubusercontent.com/4329421/50588400-d6f1aa00-0ea7-11e9-974b-16ecfdaf1f8d.png]
I was able to get the table out with line_size_scaling=60 (CLI: -scale 60) and iterations=1 (CLI: -I 1).

I will try these out. Thanks!
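The lattice settings reported in this thread (line_size_scaling=60, iterations=1) can be passed from Python roughly as follows. This is a sketch under stated assumptions: the helper names `lattice_kwargs` and `read_lattice` are hypothetical, and the scaling keyword is kept configurable because later camelot releases, as far as I can tell, call it `line_scale` instead of `line_size_scaling`.

```python
def lattice_kwargs(scale=60, iterations=1, scale_name="line_size_scaling"):
    """Build keyword arguments for camelot.read_pdf with the lattice flavor.

    scale_name is the keyword used for the line scaling factor. It is an
    assumption that newer camelot versions expect "line_scale" here.
    """
    return {"flavor": "lattice", scale_name: scale, "iterations": iterations}


def read_lattice(pdf_path, **overrides):
    """Read a PDF with the lattice flavor and the tuned settings."""
    # camelot is third-party; it is imported lazily so that lattice_kwargs
    # can be inspected without it.
    import camelot

    return camelot.read_pdf(pdf_path, **lattice_kwargs(**overrides))
```

For example, `read_lattice('bugExample.pdf')` uses the values from this thread, while `read_lattice('bugExample.pdf', scale_name='line_scale')` would be the variant to try if the installed version rejects `line_size_scaling`.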
The attached PDF file contains a simple example of a table that I could not extract using camelot.
Neither of these works:
import camelot
tables = camelot.read_pdf('bugExample.pdf')
tables = camelot.read_pdf('bugExample.pdf', process_background=True)
tabula-py can parse the same example:
import tabula
tables = tabula.read_pdf('bugExample.pdf')
I suspect the issue is caused by the borders not being a strong black, but I am not sure :(
Example PDF uploaded to: http://s000.tinyupload.com/?file_id=72860431408529670599