cluster_list may put too many entries in a group #192
Hi @dwalton76, and thanks for your interest in PDFPlumber. In this case, the current functionality is the intended functionality, or at least how I intended it, for the specific way it is used in the library. I can see, however, a case for wanting to perform different types of clustering. What's your particular use-case here? |
Hi @jsvine, I was parsing a PDF where the current behavior causes the words to be extracted incorrectly. I would upload the PDF but it is from a customer so will just describe the behavior as best I can. In a nutshell we ended up with the following very unusual "word" bounding boxes
Where the 2nd word word was |
Hi @dwalton76, and thanks for the extra info. I totally understand that you're in a difficult place because you can't share the PDF you're working on. Unfortunately, I'm having a bit of a hard time understanding the sketch provided. Is there a de-identified PDF excerpt you could share? Also (though I may just be misunderstanding the sketch), I wonder if you could achieve the desired results by passing |
We deal with a pretty large variety of pdfs so while maybe I could do a work-around like Will try to elaborate on my artwork above :) A subset of text on the page looks like this:
we ask plumber to give us a list of all of the words on the page and then we draw a bounding box around each word, that is what is shown here
The characters in The reason the characters in |
Thanks for the extra info, @dwalton76. I appreciate it. Just to reiterate: The current logic of Without the PDF or explicit code, it's hard to debug precisely, but ... :
It's entirely possible, of course, that I'm misunderstanding the situation and that you've identified a genuine bug. But to really dig into that, I'd need a PDF sample and code that reproduce the problem. |
Let's do this...let me figure out a way to create a copy of the customer PDF but with all of the text modified to something I can share on github. Do you know of any tools that will let you modify the content of the characters in a PDF? |
Here is a sample PDF with the problem. It looks like gibberish because I randomly replaced each character using pdf-redactor (https://github.com/JoshData/pdf-redactor). Without the patch we get the following bounding boxes for the words from extract_words() |
Thanks! That's very helpful indeed. I'm in the middle of the workday, but can take a closer look in the evening. In the meantime, I took a quick skim at what you sent, and it seems the issue might not be with
Using a section that contains just the structure that you previously identified:
p0 = pdf.pages[0]
cropped = p0.crop((280, 100, 580, 300))
cropped.to_image().draw_rects(cropped.extract_words())
Using the same section, but expanded slightly to include vertical text:
cropped = p0.crop((200, 100, 650, 300))
cropped.to_image().draw_rects(cropped.extract_words())
I'll dig into this more later. |
ah that is a good point I bet the top coordinates of the sideways characters are all close enough together to cause all of those characters to go into a single cluster |
Previously, utils.extract_text(...) returned incorrect results in certain cases when vertical text was present, as observed in #192. This commit fixes that by first segregating vertical and horizontal text (via the "upright" char attribute) before clustering characters. It also adds two parameters, horizontal_ltr and vertical_ttb, to give users control over whether words are meant to be read left-to-right and/or top-to-bottom vs. their opposites.
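The approach described in that commit message can be illustrated on plain dictionaries. This is only a rough sketch under stated assumptions, not pdfplumber's implementation: the helpers `cluster_by` and `split_and_cluster` are hypothetical, the `tolerance` default is arbitrary, and each char dict carries only the keys `text`, `upright`, `top`, and `x0`. Only the `upright` attribute and the `horizontal_ltr` / `vertical_ttb` parameter names come from the thread.

```python
def cluster_by(chars, key, tolerance):
    """Hypothetical helper: group chars whose `key` values lie within
    `tolerance` of the previous char's value (after sorting by `key`)."""
    groups = []
    for char in sorted(chars, key=lambda c: c[key]):
        if groups and char[key] - groups[-1][-1][key] <= tolerance:
            groups[-1].append(char)
        else:
            groups.append([char])
    return groups


def split_and_cluster(chars, tolerance=3, horizontal_ltr=True, vertical_ttb=True):
    """Hypothetical helper: separate upright from non-upright chars first,
    then cluster each set along the axis that is constant for its lines."""
    upright = [c for c in chars if c["upright"]]
    rotated = [c for c in chars if not c["upright"]]

    lines = []
    # Horizontal text: chars on one line share (roughly) the same `top`.
    for group in cluster_by(upright, "top", tolerance):
        group.sort(key=lambda c: c["x0"], reverse=not horizontal_ltr)
        lines.append("".join(c["text"] for c in group))
    # Vertical text: chars in one column share (roughly) the same `x0`.
    for group in cluster_by(rotated, "x0", tolerance):
        group.sort(key=lambda c: c["top"], reverse=not vertical_ttb)
        lines.append("".join(c["text"] for c in group))
    return lines


# Synthetic data: the rotated chars have `top` values close to the
# horizontal line, so clustering everything by `top` would mix them.
chars = [
    {"text": "a", "upright": True, "top": 100, "x0": 10},
    {"text": "b", "upright": True, "top": 100, "x0": 20},
    {"text": "X", "upright": False, "top": 98, "x0": 5},
    {"text": "Y", "upright": False, "top": 104, "x0": 5},
]
ttb_lines = split_and_cluster(chars)                      # ['ab', 'XY']
btt_lines = split_and_cluster(chars, vertical_ttb=False)  # ['ab', 'YX']
```

Flipping `vertical_ttb` in this sketch only reverses the reading order inside each vertical cluster, which is the kind of control the two new parameters are described as giving.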
Hi @dwalton76. The problem did, indeed, stem from how |
@jsvine cool thank you for taking a look and fixing this. Am doing some testing with the patch and see one issue. For the vertical text here: extract_words has the order of the letters reversed
|
Did you try this part?:
That should resolve your issue, but let me know if not. |
With
|
Ah, yes, thank you for catching that! My mistake; I failed to fully test that logic. Now fixed in 150b5a9 and available in v0.5.18. |
Today, if you pass cluster_list a list of numbers such as
['0', '1.1', '2.2', '3.3', '4.4', '5.5']
with a tolerance of 2
it will put all of those numbers in one group. This happens because
last = x
is executed every time through the main loop.
last = x
should only happen when a new group is created.
With the fix in place, cluster_list will return
[[0, 1.1], [2.2, 3.3], [4.4, 5.5]]
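The difference between the two behaviours can be reproduced with a small standalone sketch. The functions below are not pdfplumber's code; `cluster_chained` and `cluster_anchored` are hypothetical names that only mirror the description above, where the sole difference is whether `last = x` runs on every iteration or only when a new group starts.

```python
def cluster_chained(xs, tolerance=0):
    """Sketch of the behaviour described above: `last` is updated on every
    iteration, so each value is compared with its immediate predecessor."""
    xs = sorted(xs)
    if not xs:
        return []
    groups = [[xs[0]]]
    last = xs[0]
    for x in xs[1:]:
        if x <= last + tolerance:
            groups[-1].append(x)
        else:
            groups.append([x])
        last = x  # runs every time through the loop
    return groups


def cluster_anchored(xs, tolerance=0):
    """Sketch of the proposed change: `last` is updated only when a new
    group is created, so each value is compared with its group's first value."""
    xs = sorted(xs)
    if not xs:
        return []
    groups = [[xs[0]]]
    last = xs[0]
    for x in xs[1:]:
        if x <= last + tolerance:
            groups[-1].append(x)
        else:
            groups.append([x])
            last = x  # runs only when a new group starts
    return groups


values = [0, 1.1, 2.2, 3.3, 4.4, 5.5]
chained = cluster_chained(values, tolerance=2)
# [[0, 1.1, 2.2, 3.3, 4.4, 5.5]]
anchored = cluster_anchored(values, tolerance=2)
# [[0, 1.1], [2.2, 3.3], [4.4, 5.5]]
```

Which variant is preferable depends on the use case: chaining keeps a slowly drifting baseline in one cluster, while anchoring bounds the total spread of each group by the tolerance.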