Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster_list may put too many entries in a group #192

Closed
wants to merge 1 commit into from
Closed

cluster_list may put too many entries in a group #192

wants to merge 1 commit into from

Conversation

dwalton76
Copy link

Today if you give pass cluster_list a list of numbers such as ['0', '1.1', '2.2', '3.3', '4.4', '5.5'] with a tolerance of 2 it will put all of those numbers in one group. This happens because last = x is executed everytime through the main loopl. last = x should only happen when a new group is created.

With the fix in place cluster_list will return [[0, 1.1], [2.2, 3.3], [4.4, 5.5]]

@jsvine
Copy link
Owner

jsvine commented Mar 30, 2020

Hi @dwalton76, and thanks for your interesting in PDFPlumber. In this case, the current functionality is the intended functionality — or at least how I intended it, for the specific way it is used in the library. I can see, however, a case for wanting to perform different types of clustering. What's your particular use-case here?

@jsvine jsvine added the pending-discussion Issues and PRs that require further discussion label Mar 30, 2020
@dwalton76
Copy link
Author

Hi @jsvine, I was parsing a PDF where the current behavior causes the words to be extracted incorrectly. I would upload the PDF but it is from a customer so will just describe the behavior as best I can. In a nutshell we ended up with the following very unusual "word" bounding boxes

+-----+ +------------+
|hello| |HomerSimpson|
+-----+ | 12    34   |
        +------------+

Where the 2nd word word was Ho1m2erSim3p4son. This happened because all of the characters in HomerSimpson, 12 and 34 ended up in the same cluster. With the patch in place the characters in HomerSimpson end up in one cluster while the characters in 12 and 34 go into a different cluster from HomerSimpson

@jsvine
Copy link
Owner

jsvine commented Mar 31, 2020

Hi @dwalton76, and thanks for the extra info. I totally understand that you're in a difficult place because you can't share the PDF you're working on. Unfortunately, I'm having a bit of a hard time understanding the sketch provided. Is there a de-identified PDF excerpt you could share?

Also (though I may just be misunderstanding the sketch), I wonder if you could achieve the desired results by passing x_tolerance = 0 (or related param, depending on the context).

@dwalton76
Copy link
Author

We deal with a pretty large variety of pdfs so while maybe I could do a work-around like x_tolerance = 0 for this one pdf that wouldn't be something I could apply in general. So then I would need to code some logic to know that the work-around needed to be applied...seems better to fix the underlying issue.

Will try to elaborate on my artwork above :) A subset of text on the page looks like this:

hello HomerSimpson
       12    34

we ask plumber to give us a list of all of the words on the page and then we draw a bounding box around each word, that is what is shown here

+-----+ +------------+
|hello| |HomerSimpson|
+-----+ | 12    34   |
        +------------+

The characters in HomerSimpson, 12 and 34 end up in the same cluster so when plumber traverses through the characters in that cluster from left to right it constructs the word Ho1m2erSim3p4son.

The reason the characters in HomerSimpson, 12 and 34 end up in the same cluster is due to the last = x logic in cluster_list. Assume we pass cluster_list a tolerance of 2. There are many other characters of text on the page and if they are spaced out just the right way vertically so that you never have a vertical delta of >2 between them all of those characters will end up in the same cluster. That is what happens here, the 12 and the 34 characters end up in the same cluster as HomerSimpson.

@jsvine
Copy link
Owner

jsvine commented Mar 31, 2020

Thanks for the extra info, @dwalton76. I appreciate it.

Just to reiterate: The current logic of cluster_list is the intended one, and changing that logic would likely cause many more edge cases for other users. Looking at your example, though, I wonder if you're possibly misunderstanding the goal of cluster_list, and its role in the issue you're encountering. It's a very low-level function meant to be used by other utilities in the library.

Without the PDF or explicit code, it's hard to debug precisely, but ... :

  • My intuition is that the characters in 12 and 34 overlap vertically with the characters in HomerSimpson. (Perhaps you could use the visual-debugging tools to show us the bounding boxes of the individual characters, censoring out any sensitive information?)

  • By design, PDFPlumber interprets vertically-overlapping characters as belonging to one word — a side-effect of wanting to handle the common edge-case of PDFs not always perfectly aligning text in the same word (or using different vertical extents for capital and lowercase characters).

  • Adjusting the tolerance for those overlaps is what extract_words(...)'s y_tolerance and x_tolerance parameters are meant to help with. (I don't see extract_words mentioned explicitly in your issue, but it seems that's what you're referencing. Please correct me if I'm wrong about that.)

It's entirely possible, of course, that I'm misunderstanding the situation and that you've identified a genuine bug. But to really dig into that, I'd need a PDF sample and code that reproduce the problem.

@dwalton76
Copy link
Author

My intuition is that the characters in 12 and 34 overlap vertically with the characters in HomerSimpson. na...there are about 4 blank lines between the HomerSimpson line and the 12 34 line.

Let's do this...let me figure out a way to create a copy of the customer PDF but with all of the text modified to something I can share on github. Do you know of any tools that will let you modify the content of the characters in a PDF?

@dwalton76
Copy link
Author

Here is a sample PDF with the problem. It looks like gibberish because I randomly replaced each character using https://github.com/JoshData/pdf-redactor

Without the patch we get the following bounding boxes for the words from extract_words()
without-the-patch

With the patch we get:
with-the-patch

@jsvine
Copy link
Owner

jsvine commented Mar 31, 2020

Thanks! That's very helpful indeed. I'm in the middle of the workday, but can take a closer look in the evening. In the meantime, I took a quick skim at what you sent, and it seems the issue might not be with cluster_list but rather with how extract_words deals with vertical text (and side-effects it may have for the rest of the text on a page). Compare the following results:

Using a section that contains just the structure that you previously identified

p0 = pdf.pages[0]
cropped = p0.crop((280, 100, 580, 300))
cropped.to_image().draw_rects(cropped.extract_words())

image

Using the same section, but expanded slightly to include vertical text

cropped = p0.crop((200, 100, 650, 300))
cropped.to_image().draw_rects(cropped.extract_words())

image

I'll dig into this more later.

@dwalton76
Copy link
Author

dwalton76 commented Mar 31, 2020

ah that is a good point I bet the top coordinates of the sideways characters are all close enough together to cause all of those characters to go into a single cluster

jsvine added a commit that referenced this pull request Apr 1, 2020
Previously, utils.extract_text(...) returned incorrect results in
certain cases when vertical text was present, as observed in
#192. This commit fixes that
by first segregating vertical and horizontal text (via "upright" char
attribute) before clustering characters.

It also adds two parameters, horizontal_ltr and vertical_ttb, to give
users control over whethere words are meant to be read left-to-right
and/or top-to-bottom vs. their opposites.
@jsvine
Copy link
Owner

jsvine commented Apr 1, 2020

Hi @dwalton76. The problem did, indeed, stem from how extract_words handled vertical text. Thanks for presenting this edge case! Fixed via 8a5d858 and available in version 0.5.17. I tested the new code on your sample PDF, and the extraction appears to work as expected, no changes needed to cluster_list. For your use-case, you'll want to pass vertical_ttb = False to extract_text, since the vertical text in your PDF is meant to be read from bottom to top, rather than top to bottom.

@jsvine jsvine closed this Apr 1, 2020
@dwalton76
Copy link
Author

@jsvine cool thank you for taking a look and fixing this. Am doing some testing with the patch and see one issue. For the vertical text here:
cell

extract_words has the order of the letters reversed

  • gbabA
  • baambaaaA
  • aabbaaabbaaaabA
  • 60bbabA

@jsvine
Copy link
Owner

jsvine commented Apr 1, 2020

Did you try this part?:

For your use-case, you'll want to pass vertical_ttb = False to extract_text, since the vertical text in your PDF is meant to be read from bottom to top, rather than top to bottom.

That should resolve your issue, but let me know if not.

@dwalton76
Copy link
Author

With vertical_ttb = False it correctly orders the bottom word in each column but the words above that are still reversed reversed:

  • Ababg bottom word in first column, correct
  • baambaaaA top word in first column, incorrect
  • Abaaaabbaaabbaa bottom word in second column, correct
  • 60bbabA top word in second column, incorrect

@jsvine
Copy link
Owner

jsvine commented Apr 2, 2020

Ah, yes, thank you for catching that! My mistake; I failed to fully test that logic. Now fixed in 150b5a9 and available in v0.5.18.

@dwalton76 dwalton76 deleted the cluster_list branch April 28, 2020 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pending-discussion Issues and PRs that require further discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants