ValueError: not enough image data when enumerating images #1957

michelcrypt4d4mus · 2023-07-09T08:06:49Z

Background is the same as #1954 but the error is again slightly different.

The page in question (which, again, I sadly cannot share) definitely has images; in fact, the page is one big image. pypdf has no trouble parsing many very similar pages in the same PDF: pages that are also one big image of scanned text, with almost exactly the same format and layout.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ workspace/clown_sort/clown_sort/files/pdf_file.py:52 in extracted_text            │
│                                                                                                  │
│    49 │   │   │   │                                                                              │
│    50 │   │   │   │   # Extracting images is a bit fraught (lots of PIL and pypdf exceptions h   │
│    51 │   │   │   │   try:                                                                       │
│ ❱  52 │   │   │   │   │   for image_number, image in enumerate(page.images, start=1):            │
│    53 │   │   │   │   │   │   image_name = f"Page {page_number}, Image {image_number}"           │
│    54 │   │   │   │   │   │   self._log_to_stderr(f"   Processing {image_name}...")              │
│    55 │   │   │   │   │   │   page_buffer.print(Panel(image_name, expand=False))                 │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:2596 in __iter__                                                       │
│                                                                                                  │
│   2593 │                                                                                         │
│   2594 │   def __iter__(self) -> Iterator[ImageFile]:                                            │
│   2595 │   │   for i in range(len(self)):                                                        │
│ ❱ 2596 │   │   │   yield self[i]                                                                 │
│   2597 │                                                                                         │
│   2598 │   def __str__(self) -> str:                                                             │
│   2599 │   │   p = [f"Image_{i}={n}" for i, n in enumerate(self.ids_function())]                 │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:2592 in __getitem__                                                    │
│                                                                                                  │
│   2589 │   │   │   index = len_self + index                                                      │
│   2590 │   │   if index < 0 or index >= len_self:                                                │
│   2591 │   │   │   raise IndexError("sequence index out of range")                               │
│ ❱ 2592 │   │   return self.get_function(lst[index])                                              │
│   2593 │                                                                                         │
│   2594 │   def __iter__(self) -> Iterator[ImageFile]:                                            │
│   2595 │   │   for i in range(len(self)):                                                        │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:522 in _get_image                                                      │
│                                                                                                  │
│    519 │   │   │   │   │   raise KeyError("no inline image can be found")                        │
│    520 │   │   │   │   return self.inline_images[id]                                             │
│    521 │   │   │                                                                                 │
│ ❱  522 │   │   │   imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))                      │
│    523 │   │   │   extension, byte_stream = imgd[:2]                                             │
│    524 │   │   │   f = ImageFile(                                                                │
│    525 │   │   │   │   name=f"{id[1:]}{extension}",                                              │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/filters.py:844 in _xobj_to_image                                                │
│                                                                                                  │
│   841 │   filters = x_object_obj.get(SA.FILTER, [None])                                          │
│   842 │   lfilters = filters[-1] if isinstance(filters, list) else filters                       │
│   843 │   if lfilters == FT.FLATE_DECODE:                                                        │
│ ❱ 844 │   │   img, image_format, extension = _handle_flate(                                      │
│   845 │   │   │   size, data, mode, color_space, colors                                          │
│   846 │   │   )                                                                                  │
│   847 │   elif lfilters in (FT.LZW_DECODE, FT.ASCII_85_DECODE, FT.CCITT_FAX_DECODE):             │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/filters.py:729 in _handle_flate                                                 │
│                                                                                                  │
│   726 │   │   │   color_space, base, hival, lookup = (                                           │
│   727 │   │   │   │   value.get_object() for value in color_space                                │
│   728 │   │   │   )                                                                              │
│ ❱ 729 │   │   img = Image.frombytes(mode, size, data)                                            │
│   730 │   │   if color_space == "/Indexed":                                                      │
│   731 │   │   │   from .generic import ByteStringObject                                          │
│   732                                                                                            │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/PIL/Image.py:2970 in frombytes                                                        │
│                                                                                                  │
│   2967 │   │   args = mode                                                                       │
│   2968 │                                                                                         │
│   2969 │   im = new(mode, size)                                                                  │
│ ❱ 2970 │   im.frombytes(data, decoder_name, args)                                                │
│   2971 │   return im                                                                             │
│   2972                                                                                           │
│   2973                                                                                           │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/PIL/Image.py:826 in frombytes                                                         │
│                                                                                                  │
│    823 │   │                                                                                     │
│    824 │   │   if s[0] >= 0:                                                                     │
│    825 │   │   │   msg = "not enough image data"                                                 │
│ ❱  826 │   │   │   raise ValueError(msg)                                                         │
│    827 │   │   if s[1] != 0:                                                                     │
│    828 │   │   │   msg = "cannot decode image data"                                              │
│    829 │   │   │   raise ValueError(msg)                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
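
Because the exception is raised inside `__iter__`, a `try` wrapped around the whole `for` loop loses every image on the page, not just the bad one. The traceback suggests `page.images` supports `len()` and integer indexing, so one possible workaround is to decode the images one at a time. This is only a sketch: `iter_images_safely` is a hypothetical helper name, not part of pypdf, and it assumes nothing beyond `len()` and `[]` on the object passed in.

```python
from typing import Any, Iterator, Optional, Tuple


def iter_images_safely(
    images: Any,
) -> Iterator[Tuple[int, Optional[Any], Optional[Exception]]]:
    """Yield (number, image, error) for each entry, decoding one at a time.

    `images` is any object supporting len() and integer indexing
    (assumed here to be true of page.images, based on the traceback).
    """
    for index in range(len(images)):
        try:
            # Decoding happens on item access, so a failure stays local
            # to this one image.
            image = images[index]
        except Exception as exc:  # e.g. ValueError or OSError from PIL
            yield index + 1, None, exc
            continue
        yield index + 1, image, None
```

With this, the loop in `extracted_text` could iterate over `iter_images_safely(page.images)` and log the error for a failing image while still processing the others.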

pubpub-zz · 2023-07-09T08:25:41Z

same as #1955

michelcrypt4d4mus · 2023-07-11T21:29:24Z

emailed the broken page to @MartinThoma

pubpub-zz · 2023-07-12T21:14:50Z

The issue comes from an image that is coded on 2 bits: the buffer has to be reworked before the internal image is created. We have to be careful, because the data is padded so that each line always starts on bit 0 of a byte.
A fix is in progress.
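
To illustrate the padding point: at 2 bits per sample, four samples share one byte, and each image row is padded so that the next row starts on a fresh byte. A buffer packed this way holds far fewer bytes than an 8-bit `Image.frombytes` call expects, which is presumably where "not enough image data" comes from. Below is a minimal sketch of the unpacking step, assuming most-significant-bit-first packing. `unpack_packed_rows` is a hypothetical helper written for illustration; it is not the code from the actual fix.

```python
def unpack_packed_rows(data: bytes, width: int, height: int, bits: int) -> bytes:
    """Expand 1-, 2- or 4-bit samples into one byte per sample.

    Assumes each row starts on a byte boundary and that samples are
    packed most significant bits first.
    """
    if bits not in (1, 2, 4):
        raise ValueError("bits must be 1, 2 or 4")
    row_bytes = (width * bits + 7) // 8  # row length including padding bits
    if len(data) < row_bytes * height:
        raise ValueError("not enough image data")
    mask = (1 << bits) - 1
    out = bytearray()
    for row in range(height):
        line = data[row * row_bytes:(row + 1) * row_bytes]
        for x in range(width):
            bit_offset = x * bits
            byte = line[bit_offset // 8]
            # Shift the wanted sample down to the low bits, then mask it.
            shift = 8 - bits - (bit_offset % 8)
            out.append((byte >> shift) & mask)
    return bytes(out)
```

For example, a 3x2 image at 2 bits takes one padded byte per row, so `unpack_packed_rows(bytes([0x18, 0xE4]), 3, 2, 2)` returns the six samples 0, 1, 2, 3, 2, 1. The result has exactly `width * height` bytes.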

MartinThoma · 2023-07-12T21:16:31Z

You are amazing 🤗 But it's pretty late. Don't forget to sleep enough 😊

pubpub-zz · 2023-07-18T09:11:00Z

The examples and discussion were mistakenly posted in #1954.
This issue is now closed.

michelcrypt4d4mus changed the title ~~ValueError: not enough image data when extracting images~~ ValueError: not enough image data when enumerating images Jul 9, 2023

pubpub-zz added the needs-pdf label (the issue needs a PDF file to show the problem) Jul 9, 2023

pubpub-zz added the is-bug label (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) and removed the needs-pdf label Jul 17, 2023

This was referenced Jul 18, 2023

OSError: cannot write mode CMYK as PNG when enumerating images #1954

Closed

BUG: Process 2bits and 4bits images #1967

Merged

pubpub-zz closed this as completed Jul 18, 2023
