Image extraction does not handle the case when colorspace is a PdfObjRef and bmp handling is probably broken #754
Comments
I think I can replicate this with pdf2txt.py. This is a bug that needs fixing.
I am also hitting this problem with multiple PDFs. For now I work around it by keeping a reference list of the images in a structured output (no extraction), but I would love to be able to extract them eventually. @pietermarsman, do you have any pointers on how to go about solving this? Is there anything to take into consideration while poking around?
Are you facing the first part of the problem, related to PdfObjRef, or the second part, related to BMP handling? The first part can probably be worked around with the diff I posted above. I suspect BMP handling is a separate issue altogether.
@TamaraAtanasoska @ramSeraph #773 focused on the issue of BMP handling. I created a pull request with the fix (#784), but no one has reviewed it.
Thank you @ramSeraph and @pettzilla1! I will look into both of your solutions today and see whether they fix my problem or there is more to it 😄 🙏 I hope someone reviews that PR!
It seems that I hit problems with 'DeviceCMYK' as the colorspace right after this (at least I think that is the issue), so it won't be enough to fix my problem. Still, it is awesome to save some time digging into these issues!
Thanks @pettzilla1! Will review it soon.
@MartinThoma Thanks for the heads up. That is probably because the issue was fixed in either #773 or #784. At least I don't get an error anymore.
Bug report
Description of the bug
Image extraction does not handle the case when colorspace is a PdfObjRef
bmp handling might be broken in some cases
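The first point can be illustrated with a minimal, self-contained sketch. It is not the code from pdfminer.six or the diff from this report: `ObjRef`, `resolve`, and `colorspace_name` are hypothetical stand-ins, and the object table is a plain dictionary assumed for illustration. The idea is only that a `ColorSpace` entry may be an indirect reference, so it has to be resolved before it is compared against names such as 'DeviceRGB'. In pdfminer.six itself, the `resolve1` helper in `pdfminer.pdftypes` plays this role.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""
    objid: int


def resolve(value: Any, objects: Dict[int, Any], max_depth: int = 16) -> Any:
    """Follow indirect references until a direct object is reached."""
    for _ in range(max_depth):
        if not isinstance(value, ObjRef):
            break
        value = objects[value.objid]
    if isinstance(value, ObjRef):
        # Still a reference after max_depth hops: treat as circular or too deep.
        raise ValueError("reference chain too deep or circular")
    return value


def colorspace_name(stream_attrs: Dict[str, Any], objects: Dict[int, Any]) -> str:
    """Return the colorspace family name of an image stream as a string."""
    cs = resolve(stream_attrs.get("ColorSpace"), objects)
    # A colorspace can be a bare name or an array whose first item is the
    # family name (for example an ICC-based colorspace).
    if isinstance(cs, (list, tuple)):
        if not cs:
            raise ValueError("empty colorspace array")
        cs = resolve(cs[0], objects)
    if not isinstance(cs, str):
        raise ValueError(f"unsupported colorspace: {cs!r}")
    return cs
```

With this shape, a direct name and a reference to that name yield the same result, so the code that picks an output format never sees the reference object itself.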
Steps to reproduce the bug.
Tried with both the latest PyPI release and the latest GitHub version.
file:
58A_3.pdf
Output from running the command:
Output of running `file` on the created image:
However, it is clear that the PDF contains a proper image, as can be seen by opening it in any PDF viewer.
Adding the following change gets the image out in BMP format:
The output of the fixed script:
Output of running `file` on the created image:
Even though this solves one part of the problem, the resulting BMP file is still not OK.
The actual PDF looks like this:
The BMP created looks like this:
This is either two bugs, or one bug for which I made the wrong fix.
I have decided to give up and report this.