Image extraction does not handle the case when colorspace is a PdfObjRef and bmp handling is probably broken #754

Closed
ramSeraph opened this issue May 4, 2022 · 9 comments

ramSeraph commented May 4, 2022

Bug report

  • Description of the bug
    Image extraction does not handle the case where the colorspace is a PDFObjRef.
    BMP handling might also be broken in some cases.

  • Steps to reproduce the bug

Tried with both the latest PyPI release and the latest GitHub version.

File:

58A_3.pdf

from pdfminer.image import ImageWriter
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTImage

filename = '58A_3.pdf'

def get_images(layout):
    imgs = []
    if isinstance(layout, LTImage):
        imgs.append(layout)

    objs = getattr(layout, '_objs', [])
    for obj in objs:
        imgs.extend(get_images(obj))
    return imgs


with open(filename, "rb") as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    img_writer = ImageWriter('.')
    rsrcmgr = PDFResourceManager(caching=True)
    device = PDFPageAggregator(rsrcmgr, laparams=None)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    page_info = {}
    pno = 0
    for page in PDFPage.create_pages(document):
        if pno > 0:
            raise Exception('only one page expected')
        interpreter.process_page(page)
        layout = device.get_result()
        page_info = {}
        page_info['layout'] = layout
        images = get_images(layout)
        if len(images) > 1:
            raise Exception('Only one image expected')
        image = images[0]
        print(f'{image=}')
        print(f'{image.colorspace=}')
        fname = img_writer.export_image(image)
        print(f'image extracted to {fname}')
        pno += 1

Output from running the script:

image=<LTImage(Im0) 0.000,0.000,2160.000,2160.000 (18000, 18000)>
image.colorspace=[<PDFObjRef:7>]
image extracted to Im0.0.8.18000x18000.img

Output of running the file command on the extracted image:

Im0.0.8.18000x18000.img: data

However, the PDF clearly contains a valid image, as can be seen by opening it in any PDF viewer.

Adding the following change makes the image come out in BMP format:

--- bug_report.py	2022-05-04 13:15:27.000000000 +0530
+++ bug_report_fix.py	2022-05-04 13:22:12.000000000 +0530
@@ -6,7 +6,7 @@
 from pdfminer.pdfinterp import PDFPageInterpreter
 from pdfminer.converter import PDFPageAggregator
 from pdfminer.layout import LTImage
-
+from pdfminer.pdftypes import resolve_all, PDFObjRef


 filename = '58A_3.pdf'
@@ -42,6 +42,13 @@
         if len(images) > 1:
             raise Exception('Only one image expected')
         image = images[0]
+
+        # fix to pdfminer bug
+        if len(image.colorspace) == 1 and isinstance(image.colorspace[0], PDFObjRef):
+            image.colorspace = resolve_all(image.colorspace[0])
+            if not isinstance(image.colorspace, list):
+                image.colorspace = [ image.colorspace ]
+
         print(f'{image=}')
         print(f'{image.colorspace=}')
         fname = img_writer.export_image(image)

Output of the fixed script:

image=<LTImage(Im0) 0.000,0.000,2160.000,2160.000 (18000, 18000)>
image.colorspace=[/'Indexed', /'DeviceRGB', 255, <PDFStream(10): raw=768, {'Length': 768}>]
image extracted to Im0.18000x18000.bmp

Output of running the file command on the extracted image:

Im0.18000x18000.bmp: PC bitmap, Windows 3.x format, 18000 x 18000 x 24, image size 972000000, cbSize 972000054, bits offset 54
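
As a sanity check on the figures reported by file, here is a minimal sketch of the size arithmetic for an uncompressed BMP. The helper name bmp_sizes is hypothetical, and the 54-byte header is an assumption that holds for a 24-bit image without a colour table (14-byte file header plus 40-byte info header).

```python
def bmp_sizes(width: int, height: int, bits_per_pixel: int, header_size: int = 54):
    """Return (pixel_data_size, file_size) in bytes for an uncompressed BMP.

    header_size defaults to 54 bytes, which assumes no colour table.
    """
    # Every pixel row is padded to a multiple of 4 bytes.
    row_bytes = (width * bits_per_pixel + 31) // 32 * 4
    data_size = row_bytes * height
    return data_size, header_size + data_size


data_size, file_size = bmp_sizes(18000, 18000, 24)
print(data_size, file_size)  # 972000000 972000054
```

The result matches "image size 972000000" and "cbSize 972000054" above, so the file was written as three bytes per pixel. If the stream actually holds one 8-bit palette index per pixel, as the Indexed colorspace suggests, the source data would only be 18000 × 18000 = 324000000 bytes.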

Even though this solves one part of the problem, the resulting BMP file is still not correct.

The actual PDF looks like this:

[Screenshot 2022-05-04 at 1 28 00 PM]

The BMP created looks like this:

[Screenshot 2022-05-04 at 1 31 33 PM]

This is either two bugs, or one bug for which I made the wrong fix.

I have decided to give up and report this.
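
One plausible explanation for the distorted BMP, offered as an assumption and not confirmed in this report: the resolved colorspace is Indexed with a DeviceRGB base and a 768-byte lookup table (256 entries × 3 bytes), so each byte of image data would be a palette index, not a colour component. Writing those bytes out directly as 24-bit RGB would scramble the picture. A minimal sketch of the lookup step, using a hypothetical helper expand_indexed_to_rgb and a tiny made-up palette:

```python
def expand_indexed_to_rgb(indices: bytes, palette: bytes) -> bytes:
    """Map 8-bit palette indices to packed RGB triples."""
    if len(palette) % 3 != 0:
        raise ValueError("palette length must be a multiple of 3")
    n_entries = len(palette) // 3
    out = bytearray()
    for index in indices:
        if index >= n_entries:
            raise ValueError(f"index {index} is outside the {n_entries}-entry palette")
        # Copy the 3-byte RGB entry that this index points to.
        out += palette[3 * index : 3 * index + 3]
    return bytes(out)


# Made-up 3-entry palette: black, red, white.
palette = bytes([0, 0, 0, 255, 0, 0, 255, 255, 255])
pixels = expand_indexed_to_rgb(bytes([2, 1, 0, 1]), palette)
print(pixels.hex())  # ffffffff0000000000ff0000
```

For the file in this report, the palette would come from the 768-byte PDFStream shown in the resolved colorspace, and the indices from the image stream itself.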

@ramSeraph changed the title from "Image extraction does not handle the case when colorspace is a PdfObjRef, and bmp handling is probably broken" to "Image extraction does not handle the case when colorspace is a PdfObjRef and bmp handling is probably broken" on May 4, 2022
@pietermarsman
Member

I think I can replicate this with pdf2txt.py. This is a bug that needs fixing.

$ pdf2txt.py --output-dir ~/Desktop ~/Downloads/58A_3.pdf
Traceback (most recent call last):
  File "/home/pieter/.local/bin/pdf2txt.py", line 313, in <module>
    sys.exit(main())
  File "/home/pieter/.local/bin/pdf2txt.py", line 307, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/.local/bin/pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/high_level.py", line 121, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 992, in process_page
    self.device.end_page(page)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 80, in end_page
    self.receive_layout(self.cur_item)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 322, in receive_layout
    render(ltpage)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 311, in render
    render(child)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 311, in render
    render(child)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 318, in render
    self.imagewriter.export_image(item)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/image.py", line 125, in export_image
    name = self._save_bytes(image)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/image.py", line 240, in _save_bytes
    img = Image.frombytes(mode, image.srcsize, image.stream.get_data(), "raw")
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2706, in frombytes
    im = new(mode, size)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2670, in new
    return im._new(core.fill(mode, size, color))
ValueError: unrecognized image mode
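
The traceback ends with Pillow rejecting the mode string passed to Image.frombytes. A minimal sketch of one way to guard against that, assuming a hypothetical helper pick_pil_mode that only returns modes Pillow is known to accept ("1", "L", "RGB", "CMYK") and returns None otherwise. This is an illustration, not pdfminer.six's actual logic.

```python
from typing import Optional

# Pillow modes for common PDF device colorspaces, keyed by
# (colorspace name, bits per component). Anything else is left out on purpose.
_PIL_MODES = {
    ("DeviceGray", 1): "1",
    ("DeviceGray", 8): "L",
    ("DeviceRGB", 8): "RGB",
    ("DeviceCMYK", 8): "CMYK",
}


def pick_pil_mode(colorspace_name: str, bits: int) -> Optional[str]:
    """Return a Pillow mode for the given colorspace, or None if unsupported."""
    return _PIL_MODES.get((colorspace_name, bits))


print(pick_pil_mode("DeviceRGB", 8))  # RGB
print(pick_pil_mode("Indexed", 8))    # None
```

A caller could check for None and fall back to dumping the raw stream bytes, so an unexpected colorspace does not abort the whole extraction.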

@TamaraAtanasoska

I am currently also hitting this problem with multiple PDFs. For now I work around it by keeping a reference list of the images in a structured output (no extraction), but I would love to be able to extract them too eventually. @pietermarsman, do you have any pointers on how to go about solving this? Anything to take into consideration while poking around?

@ramSeraph
Author

Are you facing the first part of the problem, related to PDFObjRef, or the second part, related to BMP handling? The first part can probably be worked around with the diff I posted above. I suspect BMP handling is a separate issue altogether.
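
The workaround from the diff can be packaged as a small helper. The sketch below is duck-typed so it runs without pdfminer.six installed: it assumes a reference is any object with a resolve() method, which is how PDFObjRef behaves. The names normalize_colorspace, _resolve and FakeRef are hypothetical and exist only for this illustration.

```python
from typing import Any, List


def _resolve(obj: Any, max_depth: int = 16) -> Any:
    """Follow indirect references until a direct object is reached."""
    # Assumption: references expose resolve(); direct objects do not.
    for _ in range(max_depth):
        if not hasattr(obj, "resolve"):
            return obj
        obj = obj.resolve()
    raise ValueError("too many nested references")


def normalize_colorspace(colorspace: Any) -> List[Any]:
    """Return the colorspace as a list with indirect references resolved."""
    value = _resolve(colorspace)
    if isinstance(value, list) and len(value) == 1:
        # Handles the case from this report: [<PDFObjRef:7>]
        value = _resolve(value[0])
    if isinstance(value, list):
        return [_resolve(item) for item in value]
    return [value]


class FakeRef:
    """Stand-in for an indirect reference, used only in this demonstration."""

    def __init__(self, target: Any) -> None:
        self.target = target

    def resolve(self) -> Any:
        return self.target


print(normalize_colorspace([FakeRef(["Indexed", "DeviceRGB", 255, b"palette"])]))
# ['Indexed', 'DeviceRGB', 255, b'palette']
```

With pdfminer.six available, resolve1 from pdfminer.pdftypes could take the place of _resolve, and the result would be assigned back to image.colorspace before calling export_image.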

@pettzilla1
Contributor

pettzilla1 commented Jul 28, 2022

@TamaraAtanasoska @ramSeraph #773 focused on the issue of BMP handling. I created a pull request with the fix, but no one has reviewed it yet: #784
You should be able to find the fix for BMP handling in there; after that you should be able to use pdfminer's built-in functions.

@TamaraAtanasoska

Thank you @ramSeraph and @pettzilla1! I will look into both of your solutions today and see whether they fix my problem or whether there is more to it 😄 🙏 I hope someone reviews that PR!

@TamaraAtanasoska

It seems that I've hit problems with 'DeviceCMYK' as the colorspace right after this (at least I think that is the issue), so it won't be enough to fix my problem. Still, it's awesome to save some time digging into these issues at least!
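
For readers who hit the same DeviceCMYK case, here is a minimal sketch of the textbook, profile-free CMYK to RGB conversion for 8-bit components. The helper name cmyk_to_rgb is hypothetical, and this is not what pdfminer.six does. It also ignores ICC profiles and the inverted channel values that some CMYK JPEGs use, so colours will only be approximate.

```python
from typing import Tuple


def cmyk_to_rgb(c: int, m: int, y: int, k: int) -> Tuple[int, int, int]:
    """Naive CMYK -> RGB conversion; all components are in the range 0..255."""

    def channel(ink: int) -> int:
        # More ink or more black both darken the channel.
        return round(255 * (1 - ink / 255) * (1 - k / 255))

    return channel(c), channel(m), channel(y)


print(cmyk_to_rgb(0, 0, 0, 0))    # (255, 255, 255)
print(cmyk_to_rgb(255, 0, 0, 0))  # (0, 255, 255)
print(cmyk_to_rgb(0, 0, 0, 255))  # (0, 0, 0)
```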

@pietermarsman
Member

Thanks @pettzilla1! Will review it soon.

@MartinThoma

As a side-note: pypdf extracts it just fine (e.g.: pdfly extract-images 58A_3.pdf, pdfly uses pypdf). Meaning you might want to have a look at how we do it :-)

@pietermarsman
Member

@MartinThoma Thanks for the heads-up. That is probably because the issue was fixed in either #773 or #784. At least I don't get an error anymore.
