Image extraction does not handle the case when colorspace is a PdfObjRef and bmp handling is probably broken #754

ramSeraph · 2022-05-04T08:06:23Z

Bug report

Description of the bug
Image extraction does not handle the case when colorspace is a PdfObjRef
bmp handling might be broken in some cases
Steps to reproduce the bug.

Tried with both the latest PyPI release and the latest GitHub version.

file:

from pdfminer.image import ImageWriter
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTImage

filename = '58A_3.pdf'

def get_images(layout):
    imgs = []
    if isinstance(layout, LTImage):
        imgs.append(layout)

    objs = getattr(layout, '_objs', [])
    for obj in objs:
        imgs.extend(get_images(obj))
    return imgs


with open(filename, "rb") as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    img_writer = ImageWriter('.')
    rsrcmgr = PDFResourceManager(caching=True)
    device = PDFPageAggregator(rsrcmgr, laparams=None)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    page_info = {}
    pno = 0
    for page in PDFPage.create_pages(document):
        if pno > 0:
            raise Exception('only one page expected')
        interpreter.process_page(page)
        layout = device.get_result()
        page_info = {}
        page_info['layout'] = layout
        images = get_images(layout)
        if len(images) > 1:
            raise Exception('Only one image expected')
        image = images[0]
        print(f'{image=}')
        print(f'{image.colorspace=}')
        fname = img_writer.export_image(image)
        print(f'image extracted to {fname}')
        pno += 1

Output from running the command:

image=<LTImage(Im0) 0.000,0.000,2160.000,2160.000 (18000, 18000)>
image.colorspace=[<PDFObjRef:7>]
image extracted to Im0.0.8.18000x18000.img

Output of running the file command on the created image:

Im0.0.8.18000x18000.img: data

But it is clear that the PDF contains a proper image, as can be seen by opening the PDF in any PDF viewer.

Adding the following change gets the image out in BMP format:

--- bug_report.py	2022-05-04 13:15:27.000000000 +0530
+++ bug_report_fix.py	2022-05-04 13:22:12.000000000 +0530
@@ -6,7 +6,7 @@
 from pdfminer.pdfinterp import PDFPageInterpreter
 from pdfminer.converter import PDFPageAggregator
 from pdfminer.layout import LTImage
-
+from pdfminer.pdftypes import resolve_all, PDFObjRef


 filename = '58A_3.pdf'
@@ -42,6 +42,13 @@
         if len(images) > 1:
             raise Exception('Only one image expected')
         image = images[0]
+
+        # fix to pdfminer bug
+        if len(image.colorspace) == 1 and isinstance(image.colorspace[0], PDFObjRef):
+            image.colorspace = resolve_all(image.colorspace[0])
+            if not isinstance(image.colorspace, list):
+                image.colorspace = [ image.colorspace ]
+
         print(f'{image=}')
         print(f'{image.colorspace=}')
         fname = img_writer.export_image(image)
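
The workaround in the diff above can also be written as a small reusable helper. The following is only a sketch: normalize_colorspace, FakeRef and fake_resolve_all are hypothetical names that are not part of pdfminer, and the stand-in resolver exists only so the example runs without a PDF. It assumes the resolver behaves like the resolve_all used in the diff, that is, it follows references and recurses into lists.

```python
from typing import Any, Callable, List


class FakeRef:
    """Stand-in for an indirect reference (hypothetical, demo only)."""

    def __init__(self, target: Any) -> None:
        self.target = target


def fake_resolve_all(obj: Any) -> Any:
    """Follow FakeRef objects and recurse into lists (demo resolver)."""
    while isinstance(obj, FakeRef):
        obj = obj.target
    if isinstance(obj, list):
        return [fake_resolve_all(item) for item in obj]
    return obj


def normalize_colorspace(colorspace: Any, resolve: Callable[[Any], Any]) -> List[Any]:
    """Return the colorspace as a flat list with references resolved."""
    resolved = resolve(colorspace)
    if not isinstance(resolved, list):
        # A bare name such as DeviceGray becomes a one-element list.
        return [resolved]
    # A reference inside a list resolves to a nested list; unwrap it.
    while len(resolved) == 1 and isinstance(resolved[0], list):
        resolved = resolved[0]
    return resolved


indexed = ["Indexed", "DeviceRGB", 255, b"\x00" * 768]
print(normalize_colorspace([FakeRef(indexed)], fake_resolve_all)[:3])
print(normalize_colorspace(FakeRef("DeviceGray"), fake_resolve_all))
print(normalize_colorspace(["DeviceRGB"], fake_resolve_all))
```

With pdfminer, the resolve argument would presumably be pdfminer.pdftypes.resolve_all, as imported in the diff. Passing the resolver in keeps the helper independent of the library and easy to test.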

The output of the fixed script:

image=<LTImage(Im0) 0.000,0.000,2160.000,2160.000 (18000, 18000)>
image.colorspace=[/'Indexed', /'DeviceRGB', 255, <PDFStream(10): raw=768, {'Length': 768}>]
image extracted to Im0.18000x18000.bmp

Output of running the file command on the created image:

Im0.18000x18000.bmp: PC bitmap, Windows 3.x format, 18000 x 18000 x 24, image size 972000000, cbSize 972000054, bits offset 54

Even though this solves one part of the problem, the resulting bmp file is still not ok.

The actual pdf looks like this:

the bmp created looks like this:

This is either two bugs, or one bug for which I made the wrong fix.

I have decided to give up and report this.
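
One possible reading of the numbers above, offered as a guess rather than a confirmed diagnosis: the resolved colorspace is Indexed with a 768-byte lookup table (256 RGB entries), and the first output file name suggests 8 bits per sample, so the stream would hold one palette index per pixel (18000 x 18000 = 324000000 bytes). The BMP header, however, reports 24 bits per pixel and an image size of 972000000 bytes, which is three times that. If the indices were written as if they were RGB triples, the picture would come out garbled. The sketch below shows the palette expansion that an Indexed image needs before it can be written as 24-bit RGB. The function name expand_indexed_to_rgb is hypothetical, and BMP row order and padding are ignored.

```python
def expand_indexed_to_rgb(indices: bytes, palette: bytes) -> bytes:
    """Map each 8-bit palette index to its 3-byte RGB palette entry."""
    if len(palette) % 3 != 0:
        raise ValueError("palette length must be a multiple of 3")
    entries = len(palette) // 3
    out = bytearray(len(indices) * 3)
    for i, idx in enumerate(indices):
        if idx >= entries:
            raise ValueError(f"index {idx} outside palette of {entries} entries")
        # Copy the RGB triple for this index into the output buffer.
        out[3 * i : 3 * i + 3] = palette[3 * idx : 3 * idx + 3]
    return bytes(out)


# Tiny demonstration: a 3-entry palette (red, green, blue) and four pixels.
palette = bytes([255, 0, 0, 0, 255, 0, 0, 0, 255])
pixels = bytes([2, 0, 1, 1])
rgb = expand_indexed_to_rgb(pixels, palette)
print(rgb.hex())  # 0000ffff000000ff0000ff00
```

The output is always three times the length of the index data, which matches the ratio between the two sizes computed above.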


pietermarsman · 2022-05-06T20:21:24Z

I think I can replicate this with pdf2txt.py. This is a bug that needs fixing.

$ pdf2txt.py --output-dir ~/Desktop ~/Downloads/58A_3.pdf
Traceback (most recent call last):
  File "/home/pieter/.local/bin/pdf2txt.py", line 313, in <module>
    sys.exit(main())
  File "/home/pieter/.local/bin/pdf2txt.py", line 307, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/.local/bin/pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/high_level.py", line 121, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 992, in process_page
    self.device.end_page(page)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 80, in end_page
    self.receive_layout(self.cur_item)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 322, in receive_layout
    render(ltpage)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 311, in render
    render(child)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 311, in render
    render(child)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/converter.py", line 318, in render
    self.imagewriter.export_image(item)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/image.py", line 125, in export_image
    name = self._save_bytes(image)
  File "/home/pieter/.local/lib/python3.10/site-packages/pdfminer/image.py", line 240, in _save_bytes
    img = Image.frombytes(mode, image.srcsize, image.stream.get_data(), "raw")
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2706, in frombytes
    im = new(mode, size)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2670, in new
    return im._new(core.fill(mode, size, color))
ValueError: unrecognized image mode

TamaraAtanasoska · 2022-07-27T20:47:46Z

I am currently also hitting this problem with multiple PDFs. I am working around it for now by keeping a reference list of the images in a structured output (no extraction), but I would love to be able to extract them too eventually. @pietermarsman, do you have some pointers on how to go about solving this? Anything to take into consideration while poking around?

ramSeraph · 2022-07-27T23:56:58Z

Are you facing the first part of the problem, related to PDFObjRef, or the second part, related to BMP handling? The first part can probably be worked around with the diff I posted above. I suspect BMP handling is a separate issue altogether.

pettzilla1 · 2022-07-28T07:16:53Z

@TamaraAtanasoska @ramSeraph #773 focused on the BMP handling issue. I created a pull request with the fix (#784), but no one has reviewed it yet.
You should be able to find the fix for BMP handling there, and then use pdfminer's built-in functions.

TamaraAtanasoska · 2022-07-28T11:19:58Z

Thank you @ramSeraph and @pettzilla1! I will look into both of your solutions today and see if they fix my problem, or if there is more to it 😄 🙏 I hope someone reviews that PR!

TamaraAtanasoska · 2022-07-29T12:31:07Z

It seems that I hit problems with 'DeviceCMYK' as the colorspace right after this (at least I think that is the issue), so it won't be enough to fix my problem. Still, it's awesome to save some time digging into these issues!

pietermarsman · 2022-08-08T20:11:57Z

Thanks @pettzilla1! Will review it soon.

MartinThoma · 2023-12-21T18:41:09Z

As a side-note: pypdf extracts it just fine (e.g.: pdfly extract-images 58A_3.pdf, pdfly uses pypdf). Meaning you might want to have a look at how we do it :-)

pietermarsman · 2023-12-22T20:11:56Z

@MartinThoma Thanks for the heads up. That is probably because the issue was fixed in either #773 or #784. At least I don't get an error anymore.

ramSeraph changed the title ~~Image extraction does not handle the case when colorspace is a PdfObjRef, and bmp handling is probably broken~~ Image extraction does not handle the case when colorspace is a PdfObjRef and bmp handling is probably broken May 4, 2022

pietermarsman added the type: bug label May 6, 2022

pietermarsman added the component: converter and status: accepted labels and removed the component:converter label Aug 8, 2022

MartinThoma mentioned this issue Dec 21, 2023

BUG: Handle IndirectObject as image filter py-pdf/pypdf#2355

Merged

pietermarsman closed this as completed Dec 22, 2023
