Wrong extraction of nested cropped page with relative flag #914

Closed
SS-035 opened this issue Jun 27, 2023 · 2 comments

SS-035 commented Jun 27, 2023

Describe the bug

When extracting text from a page that has been cropped more than once, with the later crop using relative=True, the returned string is always empty. In the following snippet, which tries to extract Column 2, text1 returns the correct result, but text2, produced with relative cropping, is empty.

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open('Lorem.pdf')
page = pdf.pages[0]
# Keep only the right half of the page (Column 2)
cropped = page.crop((page.width / 2, 0, page.width, page.height))
# Re-crop using absolute (page-level) coordinates
crop1 = cropped.crop((page.width / 2, 0, page.width, cropped.height), relative=False)
text1 = crop1.extract_text()      # returns correct result
# Re-crop the same region using coordinates relative to `cropped`
crop2 = cropped.crop((0, 0, cropped.width, cropped.height), relative=True)
text2 = crop2.extract_text()      # returns ''

PDF file

Lorem.pdf

Expected behavior

text2 should also give the correctly extracted Column 2 value like text1.

Actual behavior

text2 is an empty string.

Environment

  • pdfplumber version: 0.9.0
  • Python version: 3.10.10
  • OS: Ubuntu 22.04

Additional context

  • Both crop1 and crop2 have the exact same bounding box.
  • This used to work as intended in an older pdfplumber version (tested on v0.5.28).
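
Why the two bounding boxes should coincide can be sketched in plain Python, without pdfplumber. The helper name to_absolute_bbox and the 612 x 792 page size are assumptions made for illustration only; this is not pdfplumber's internal code, just the arithmetic one would expect relative=True to perform: shift the relative box by the parent crop's top-left corner.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def to_absolute_bbox(relative_bbox: BBox, parent_bbox: BBox) -> BBox:
    """Hypothetical helper: shift a bbox expressed relative to the parent's
    top-left corner into page-level coordinates."""
    x0, top, x1, bottom = relative_bbox
    parent_x0, parent_top, _, _ = parent_bbox
    return (x0 + parent_x0, top + parent_top, x1 + parent_x0, bottom + parent_top)


# Assumed page size, for illustration only (the real Lorem.pdf may differ).
page_width, page_height = 612.0, 792.0

# `cropped` in the report: the right half of the page.
parent = (page_width / 2, 0.0, page_width, page_height)

# crop1: the same region given in absolute coordinates.
absolute_crop = (page_width / 2, 0.0, page_width, page_height)

# crop2: (0, 0, cropped.width, cropped.height) relative to `cropped`.
relative_crop = (0.0, 0.0, page_width / 2, page_height)

resolved = to_absolute_bbox(relative_crop, parent)
print(resolved == absolute_crop)
```

Under these assumptions the relative box resolves to the same page-level rectangle as the absolute one, which is why text1 and text2 would be expected to match.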

Thanks for continuously maintaining this wonderful package.

SS-035 added the bug label Jun 27, 2023
jsvine (Owner) commented Jun 30, 2023

Thanks for flagging this, @SS-035. I agree, this seems to be a bug. I’ll look into it and will report back.

jsvine self-assigned this Jun 30, 2023
jsvine added a commit that referenced this issue Jul 16, 2023
When using relative=True for a re-crop, pdfplumber was passing the wrong
bounding box to the cropping function. This commit fixes that bug and
also refactors CroppedPage.__init__(...) for clarity and consistency's
sake.
jsvine mentioned this issue Jul 16, 2023

jsvine (Owner) commented Jul 17, 2023

Fix now available in v0.10.0. Feel free to reopen this issue if the problem persists. Thanks again for flagging, @SS-035 👍

jsvine closed this as completed Jul 17, 2023