cluster_list may put too many entries in a group #192

dwalton76 · 2020-03-15T11:12:56Z

Today if you give pass cluster_list a list of numbers such as ['0', '1.1', '2.2', '3.3', '4.4', '5.5'] with a tolerance of 2 it will put all of those numbers in one group. This happens because last = x is executed everytime through the main loopl. last = x should only happen when a new group is created.

With the fix in place cluster_list will return [[0, 1.1], [2.2, 3.3], [4.4, 5.5]]

jsvine · 2020-03-30T00:04:24Z

Hi @dwalton76, and thanks for your interesting in PDFPlumber. In this case, the current functionality is the intended functionality — or at least how I intended it, for the specific way it is used in the library. I can see, however, a case for wanting to perform different types of clustering. What's your particular use-case here?

dwalton76 · 2020-03-30T17:55:39Z

Hi @jsvine, I was parsing a PDF where the current behavior causes the words to be extracted incorrectly. I would upload the PDF but it is from a customer so will just describe the behavior as best I can. In a nutshell we ended up with the following very unusual "word" bounding boxes

+-----+ +------------+
|hello| |HomerSimpson|
+-----+ | 12    34   |
        +------------+

Where the 2nd word word was Ho1m2erSim3p4son. This happened because all of the characters in HomerSimpson, 12 and 34 ended up in the same cluster. With the patch in place the characters in HomerSimpson end up in one cluster while the characters in 12 and 34 go into a different cluster from HomerSimpson

jsvine · 2020-03-31T03:07:52Z

Hi @dwalton76, and thanks for the extra info. I totally understand that you're in a difficult place because you can't share the PDF you're working on. Unfortunately, I'm having a bit of a hard time understanding the sketch provided. Is there a de-identified PDF excerpt you could share?

Also (though I may just be misunderstanding the sketch), I wonder if you could achieve the desired results by passing x_tolerance = 0 (or related param, depending on the context).

dwalton76 · 2020-03-31T12:40:27Z

We deal with a pretty large variety of pdfs so while maybe I could do a work-around like x_tolerance = 0 for this one pdf that wouldn't be something I could apply in general. So then I would need to code some logic to know that the work-around needed to be applied...seems better to fix the underlying issue.

Will try to elaborate on my artwork above :) A subset of text on the page looks like this:

hello HomerSimpson
       12    34

we ask plumber to give us a list of all of the words on the page and then we draw a bounding box around each word, that is what is shown here

+-----+ +------------+
|hello| |HomerSimpson|
+-----+ | 12    34   |
        +------------+

The characters in HomerSimpson, 12 and 34 end up in the same cluster so when plumber traverses through the characters in that cluster from left to right it constructs the word Ho1m2erSim3p4son.

The reason the characters in HomerSimpson, 12 and 34 end up in the same cluster is due to the last = x logic in cluster_list. Assume we pass cluster_list a tolerance of 2. There are many other characters of text on the page and if they are spaced out just the right way vertically so that you never have a vertical delta of >2 between them all of those characters will end up in the same cluster. That is what happens here, the 12 and the 34 characters end up in the same cluster as HomerSimpson.

jsvine · 2020-03-31T13:14:02Z

Thanks for the extra info, @dwalton76. I appreciate it.

Just to reiterate: The current logic of cluster_list is the intended one, and changing that logic would likely cause many more edge cases for other users. Looking at your example, though, I wonder if you're possibly misunderstanding the goal of cluster_list, and its role in the issue you're encountering. It's a very low-level function meant to be used by other utilities in the library.

Without the PDF or explicit code, it's hard to debug precisely, but ... :

My intuition is that the characters in 12 and 34 overlap vertically with the characters in HomerSimpson. (Perhaps you could use the visual-debugging tools to show us the bounding boxes of the individual characters, censoring out any sensitive information?)
By design, PDFPlumber interprets vertically-overlapping characters as belonging to one word — a side-effect of wanting to handle the common edge-case of PDFs not always perfectly aligning text in the same word (or using different vertical extents for capital and lowercase characters).
Adjusting the tolerance for those overlaps is what extract_words(...)'s y_tolerance and x_tolerance parameters are meant to help with. (I don't see extract_words mentioned explicitly in your issue, but it seems that's what you're referencing. Please correct me if I'm wrong about that.)

It's entirely possible, of course, that I'm misunderstanding the situation and that you've identified a genuine bug. But to really dig into that, I'd need a PDF sample and code that reproduce the problem.

dwalton76 · 2020-03-31T13:46:37Z

My intuition is that the characters in 12 and 34 overlap vertically with the characters in HomerSimpson. na...there are about 4 blank lines between the HomerSimpson line and the 12 34 line.

Let's do this...let me figure out a way to create a copy of the customer PDF but with all of the text modified to something I can share on github. Do you know of any tools that will let you modify the content of the characters in a PDF?

dwalton76 · 2020-03-31T14:33:50Z

Here is a sample PDF with the problem. It looks like gibberish because I randomly replaced each character using https://github.com/JoshData/pdf-redactor

Without the patch we get the following bounding boxes for the words from extract_words()

With the patch we get:

jsvine · 2020-03-31T14:49:48Z

Thanks! That's very helpful indeed. I'm in the middle of the workday, but can take a closer look in the evening. In the meantime, I took a quick skim at what you sent, and it seems the issue might not be with cluster_list but rather with how extract_words deals with vertical text (and side-effects it may have for the rest of the text on a page). Compare the following results:

Using a section that contains just the structure that you previously identified

p0 = pdf.pages[0]
cropped = p0.crop((280, 100, 580, 300))
cropped.to_image().draw_rects(cropped.extract_words())

Using the same section, but expanded slightly to include vertical text

cropped = p0.crop((200, 100, 650, 300))
cropped.to_image().draw_rects(cropped.extract_words())

I'll dig into this more later.

dwalton76 · 2020-03-31T15:11:50Z

ah that is a good point I bet the top coordinates of the sideways characters are all close enough together to cause all of those characters to go into a single cluster

Previously, utils.extract_text(...) returned incorrect results in certain cases when vertical text was present, as observed in #192. This commit fixes that by first segregating vertical and horizontal text (via "upright" char attribute) before clustering characters. It also adds two parameters, horizontal_ltr and vertical_ttb, to give users control over whethere words are meant to be read left-to-right and/or top-to-bottom vs. their opposites.

jsvine · 2020-04-01T13:33:53Z

Hi @dwalton76. The problem did, indeed, stem from how extract_words handled vertical text. Thanks for presenting this edge case! Fixed via 8a5d858 and available in version 0.5.17. I tested the new code on your sample PDF, and the extraction appears to work as expected, no changes needed to cluster_list. For your use-case, you'll want to pass vertical_ttb = False to extract_text, since the vertical text in your PDF is meant to be read from bottom to top, rather than top to bottom.

dwalton76 · 2020-04-01T14:57:33Z

@jsvine cool thank you for taking a look and fixing this. Am doing some testing with the patch and see one issue. For the vertical text here:

extract_words has the order of the letters reversed

gbabA
baambaaaA
aabbaaabbaaaabA
60bbabA

jsvine · 2020-04-01T15:56:03Z

Did you try this part?:

For your use-case, you'll want to pass vertical_ttb = False to extract_text, since the vertical text in your PDF is meant to be read from bottom to top, rather than top to bottom.

That should resolve your issue, but let me know if not.

dwalton76 · 2020-04-01T18:12:38Z

With vertical_ttb = False it correctly orders the bottom word in each column but the words above that are still reversed reversed:

Ababg bottom word in first column, correct
baambaaaA top word in first column, incorrect
Abaaaabbaaabbaa bottom word in second column, correct
60bbabA top word in second column, incorrect

jsvine · 2020-04-02T03:40:03Z

Ah, yes, thank you for catching that! My mistake; I failed to fully test that logic. Now fixed in 150b5a9 and available in v0.5.18.

cluster_list may put too many entries in a group

4ab626b

jsvine added the pending-discussion Issues and PRs that require further discussion label Mar 30, 2020

jsvine closed this Apr 1, 2020

dwalton76 mentioned this pull request Apr 28, 2020

extract_words logic is flawed #116

Closed

dwalton76 deleted the cluster_list branch April 28, 2020 12:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cluster_list may put too many entries in a group #192

cluster_list may put too many entries in a group #192

dwalton76 commented Mar 15, 2020

jsvine commented Mar 30, 2020

dwalton76 commented Mar 30, 2020

jsvine commented Mar 31, 2020

dwalton76 commented Mar 31, 2020

jsvine commented Mar 31, 2020

dwalton76 commented Mar 31, 2020

dwalton76 commented Mar 31, 2020

jsvine commented Mar 31, 2020

dwalton76 commented Mar 31, 2020 •

edited

Loading

jsvine commented Apr 1, 2020

dwalton76 commented Apr 1, 2020

jsvine commented Apr 1, 2020

dwalton76 commented Apr 1, 2020

jsvine commented Apr 2, 2020

cluster_list may put too many entries in a group #192

cluster_list may put too many entries in a group #192

Conversation

dwalton76 commented Mar 15, 2020

jsvine commented Mar 30, 2020

dwalton76 commented Mar 30, 2020

jsvine commented Mar 31, 2020

dwalton76 commented Mar 31, 2020

jsvine commented Mar 31, 2020

dwalton76 commented Mar 31, 2020

dwalton76 commented Mar 31, 2020

jsvine commented Mar 31, 2020

Using a section that contains just the structure that you previously identified

Using the same section, but expanded slightly to include vertical text

dwalton76 commented Mar 31, 2020 • edited Loading

jsvine commented Apr 1, 2020

dwalton76 commented Apr 1, 2020

jsvine commented Apr 1, 2020

dwalton76 commented Apr 1, 2020

jsvine commented Apr 2, 2020

dwalton76 commented Mar 31, 2020 •

edited

Loading