ValueError: not enough image data when enumerating images #1957

Closed
michelcrypt4d4mus opened this issue Jul 9, 2023 · 5 comments · Fixed by #1967
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@michelcrypt4d4mus

Background is the same as #1954 but the error is again slightly different.

The page in question (which, sadly, I again cannot share) definitely has images: it is in fact one big image. pypdf has no trouble parsing many extremely similar pages in the same PDF, pages that are also one big image of scanned text with almost exactly the same format and layout.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ workspace/clown_sort/clown_sort/files/pdf_file.py:52 in extracted_text            │
│                                                                                                  │
│    49 │   │   │   │                                                                              │
│    50 │   │   │   │   # Extracting images is a bit fraught (lots of PIL and pypdf exceptions h   │
│    51 │   │   │   │   try:                                                                       │
│ ❱  52 │   │   │   │   │   for image_number, image in enumerate(page.images, start=1):            │
│    53 │   │   │   │   │   │   image_name = f"Page {page_number}, Image {image_number}"           │
│    54 │   │   │   │   │   │   self._log_to_stderr(f"   Processing {image_name}...")              │
│    55 │   │   │   │   │   │   page_buffer.print(Panel(image_name, expand=False))                 │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:2596 in __iter__                                                       │
│                                                                                                  │
│   2593 │                                                                                         │
│   2594 │   def __iter__(self) -> Iterator[ImageFile]:                                            │
│   2595 │   │   for i in range(len(self)):                                                        │
│ ❱ 2596 │   │   │   yield self[i]                                                                 │
│   2597 │                                                                                         │
│   2598 │   def __str__(self) -> str:                                                             │
│   2599 │   │   p = [f"Image_{i}={n}" for i, n in enumerate(self.ids_function())]                 │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:2592 in __getitem__                                                    │
│                                                                                                  │
│   2589 │   │   │   index = len_self + index                                                      │
│   2590 │   │   if index < 0 or index >= len_self:                                                │
│   2591 │   │   │   raise IndexError("sequence index out of range")                               │
│ ❱ 2592 │   │   return self.get_function(lst[index])                                              │
│   2593 │                                                                                         │
│   2594 │   def __iter__(self) -> Iterator[ImageFile]:                                            │
│   2595 │   │   for i in range(len(self)):                                                        │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:522 in _get_image                                                      │
│                                                                                                  │
│    519 │   │   │   │   │   raise KeyError("no inline image can be found")                        │
│    520 │   │   │   │   return self.inline_images[id]                                             │
│    521 │   │   │                                                                                 │
│ ❱  522 │   │   │   imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))                      │
│    523 │   │   │   extension, byte_stream = imgd[:2]                                             │
│    524 │   │   │   f = ImageFile(                                                                │
│    525 │   │   │   │   name=f"{id[1:]}{extension}",                                              │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/filters.py:844 in _xobj_to_image                                                │
│                                                                                                  │
│   841 │   filters = x_object_obj.get(SA.FILTER, [None])                                          │
│   842 │   lfilters = filters[-1] if isinstance(filters, list) else filters                       │
│   843 │   if lfilters == FT.FLATE_DECODE:                                                        │
│ ❱ 844 │   │   img, image_format, extension = _handle_flate(                                      │
│   845 │   │   │   size, data, mode, color_space, colors                                          │
│   846 │   │   )                                                                                  │
│   847 │   elif lfilters in (FT.LZW_DECODE, FT.ASCII_85_DECODE, FT.CCITT_FAX_DECODE):             │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/filters.py:729 in _handle_flate                                                 │
│                                                                                                  │
│   726 │   │   │   color_space, base, hival, lookup = (                                           │
│   727 │   │   │   │   value.get_object() for value in color_space                                │
│   728 │   │   │   )                                                                              │
│ ❱ 729 │   │   img = Image.frombytes(mode, size, data)                                            │
│   730 │   │   if color_space == "/Indexed":                                                      │
│   731 │   │   │   from .generic import ByteStringObject                                          │
│   732                                                                                            │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/PIL/Image.py:2970 in frombytes                                                        │
│                                                                                                  │
│   2967 │   │   args = mode                                                                       │
│   2968 │                                                                                         │
│   2969 │   im = new(mode, size)                                                                  │
│ ❱ 2970 │   im.frombytes(data, decoder_name, args)                                                │
│   2971 │   return im                                                                             │
│   2972                                                                                           │
│   2973                                                                                           │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/PIL/Image.py:826 in frombytes                                                         │
│                                                                                                  │
│    823 │   │                                                                                     │
│    824 │   │   if s[0] >= 0:                                                                     │
│    825 │   │   │   msg = "not enough image data"                                                 │
│ ❱  826 │   │   │   raise ValueError(msg)                                                         │
│    827 │   │   if s[1] != 0:                                                                     │
│    828 │   │   │   msg = "cannot decode image data"                                              │
│    829 │   │   │   raise ValueError(msg)                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
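The traceback shows the exception being raised while `page.images` is iterated, so a `try` wrapped around the whole `for` loop drops every remaining image on the page once one fails to decode. A minimal sketch of a per-image guard follows. It assumes only what the traceback shows: that the image collection supports `len()` and integer indexing, and that decoding happens on access. `iter_images_safely` is a hypothetical helper name, not part of pypdf.

```python
from typing import Any, Iterator, Optional, Sequence, Tuple


def iter_images_safely(
    images: Sequence[Any],
) -> Iterator[Tuple[int, Optional[Any], Optional[Exception]]]:
    """Yield (image_number, image, error) for every entry in `images`.

    Each item is fetched by index inside its own try block, so one image
    that fails to decode does not stop the iteration over the others.
    """
    for index in range(len(images)):
        try:
            image = images[index]  # decoding is triggered by this access
        except Exception as error:  # e.g. ValueError("not enough image data")
            yield index + 1, None, error
            continue
        # Yield outside the try block so errors raised by the caller's own
        # processing code are not swallowed here.
        yield index + 1, image, None
```

With this, the calling loop could look like `for image_number, image, error in iter_images_safely(page.images):` and log or skip entries whose `error` is not `None`. This only works around the failure; it does not decode the problematic image.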
@michelcrypt4d4mus michelcrypt4d4mus changed the title ValueError: not enough image data when extracting images ValueError: not enough image data when enumerating images Jul 9, 2023
@pubpub-zz
Collaborator

same as #1955

@pubpub-zz pubpub-zz added the needs-pdf The issue needs a PDF file to show the problem label Jul 9, 2023
@michelcrypt4d4mus
Author

emailed the broken page to @MartinThoma

@pubpub-zz
Collaborator

The issue comes from an image coded on 2 bits: the buffer has to be reworked before the internal image is created. We have to be careful that the sequences are padded so that lines always start on bit 0.
A fix is in progress.
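To illustrate the kind of buffer rework described above, here is a minimal sketch that expands packed 1-, 2- or 4-bit samples to one byte per sample. It is not the code from the actual fix (#1967). It relies on the PDF convention that samples are packed high-order bit first and that each row starts on a byte boundary, with padding bits at the end of a row. `unpack_samples` and its parameters are hypothetical names chosen for this example.

```python
def unpack_samples(data: bytes, width: int, height: int, bits: int) -> bytes:
    """Expand packed `bits`-wide samples to one byte per sample.

    Every row occupies ceil(width * bits / 8) bytes. Unused low-order bits
    in the last byte of a row are padding and are skipped, so each row
    starts again on bit 0 of a fresh byte.
    """
    if bits not in (1, 2, 4):
        raise ValueError("bits must be 1, 2 or 4")
    row_bytes = (width * bits + 7) // 8  # bytes per padded row
    if len(data) < row_bytes * height:
        raise ValueError("not enough image data")
    mask = (1 << bits) - 1
    per_byte = 8 // bits  # samples stored in one byte
    out = bytearray()
    for y in range(height):
        row = data[y * row_bytes:(y + 1) * row_bytes]
        for x in range(width):
            byte = row[x // per_byte]
            # Samples are stored most significant bits first.
            shift = 8 - bits * (x % per_byte + 1)
            out.append((byte >> shift) & mask)
    return bytes(out)
```

The result holds `width * height` bytes, the amount an 8-bit single-channel Pillow image expects, whereas the packed 2-bit stream holds roughly a quarter of that, which is presumably why `Image.frombytes` reported "not enough image data". For a grayscale image the unpacked values (0 to 3 for 2 bits) would still need scaling to 0 to 255, for example `value * 255 // 3`, before being handed to `Image.frombytes("L", (width, height), ...)`.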

@MartinThoma
Member

MartinThoma commented Jul 12, 2023 via email

@pubpub-zz pubpub-zz added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF and removed needs-pdf The issue needs a PDF file to show the problem labels Jul 17, 2023
@pubpub-zz
Collaborator

The examples and discussion were mistakenly posted in #1954.
This is now closed.
