TypeError: unhashable type: 'IndirectObject' when enumerating images #1955

Closed
michelcrypt4d4mus opened this issue Jul 9, 2023 · 11 comments · Fixed by #2007
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@michelcrypt4d4mus

The background is the same as #1954 (an exception while iterating over the images on a page of a PDF), but the exception and traceback are slightly different.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ workspace/clown_sort/clown_sort/files/pdf_file.py:52 in extracted_text            │
│                                                                                                  │
│    49 │   │   │   │                                                                              │
│    50 │   │   │   │   # Extracting images is a bit fraught (lots of PIL and pypdf exceptions h   │
│    51 │   │   │   │   try:                                                                       │
│ ❱  52 │   │   │   │   │   for image_number, image in enumerate(page.images, start=1):            │
│    53 │   │   │   │   │   │   image_name = f"Page {page_number}, Image {image_number}"           │
│    54 │   │   │   │   │   │   self._log_to_stderr(f"   Processing {image_name}...")              │
│    55 │   │   │   │   │   │   page_buffer.print(Panel(image_name, expand=False))                 │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:2596 in __iter__                                                       │
│                                                                                                  │
│   2593 │                                                                                         │
│   2594 │   def __iter__(self) -> Iterator[ImageFile]:                                            │
│   2595 │   │   for i in range(len(self)):                                                        │
│ ❱ 2596 │   │   │   yield self[i]                                                                 │
│   2597 │                                                                                         │
│   2598 │   def __str__(self) -> str:                                                             │
│   2599 │   │   p = [f"Image_{i}={n}" for i, n in enumerate(self.ids_function())]                 │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:2592 in __getitem__                                                    │
│                                                                                                  │
│   2589 │   │   │   index = len_self + index                                                      │
│   2590 │   │   if index < 0 or index >= len_self:                                                │
│   2591 │   │   │   raise IndexError("sequence index out of range")                               │
│ ❱ 2592 │   │   return self.get_function(lst[index])                                              │
│   2593 │                                                                                         │
│   2594 │   def __iter__(self) -> Iterator[ImageFile]:                                            │
│   2595 │   │   for i in range(len(self)):                                                        │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:533 in _get_image                                                      │
│                                                                                                  │
│    530 │   │   │   return f                                                                      │
│    531 │   │   else:  # in a sub object                                                          │
│    532 │   │   │   ids = id[1:]                                                                  │
│ ❱  533 │   │   │   return self._get_image(ids, cast(DictionaryObject, xobjs[id[0]]))             │
│    534 │                                                                                         │
│    535 │   @property                                                                             │
│    536 │   def images(self) -> List[ImageFile]:                                                  │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/_page.py:522 in _get_image                                                      │
│                                                                                                  │
│    519 │   │   │   │   │   raise KeyError("no inline image can be found")                        │
│    520 │   │   │   │   return self.inline_images[id]                                             │
│    521 │   │   │                                                                                 │
│ ❱  522 │   │   │   imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))                      │
│    523 │   │   │   extension, byte_stream = imgd[:2]                                             │
│    524 │   │   │   f = ImageFile(                                                                │
│    525 │   │   │   │   name=f"{id[1:]}{extension}",                                              │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/filters.py:826 in _xobj_to_image                                                │
│                                                                                                  │
│   823 │   if x_object_obj.get("/BitsPerComponent", 8) == 1:                                      │
│   824 │   │   mode = _get_imagemode("1bit", 0, "")                                               │
│   825 │   else:                                                                                  │
│ ❱ 826 │   │   mode = _get_imagemode(                                                             │
│   827 │   │   │   color_space,                                                                   │
│   828 │   │   │   2                                                                              │
│   829 │   │   │   if (                                                                           │
│                                                                                                  │
│ Library/Caches/pypoetry/virtualenvs/clown-sort-zLqmJuxs-py3.11/lib/python3.11/sit │
│ e-packages/pypdf/filters.py:682 in _get_imagemode                                                │
│                                                                                                  │
│   679 │   │   "/DeviceCMYK": "CMYK",                                                             │
│   680 │   }                                                                                      │
│   681 │   mode: mode_str_type = (                                                                │
│ ❱ 682 │   │   mode_map.get(color_space)  # type: ignore                                          │
│   683 │   │   or list(mode_map.values())[color_components]                                       │
│   684 │   │   or prev_mode                                                                       │
│   685 │   )  # type: ignore                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
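
The last frame fails because `dict.get` has to hash its key before it can look it up. Below is a minimal sketch of that failure mode, assuming the color space arrives as an unresolved indirect reference whose class defines `__eq__` without `__hash__`. `FakeIndirectObject`, `MODE_MAP` and `lookup_mode` are hypothetical stand-ins, not pypdf code, and only the `/DeviceCMYK` entry is taken from the traceback.

```python
class FakeIndirectObject:
    """Hypothetical stand-in for an unresolved PDF reference such as '12 0 R'."""

    def __init__(self, idnum: int, generation: int) -> None:
        self.idnum = idnum
        self.generation = generation

    # Defining __eq__ without __hash__ sets __hash__ to None,
    # which makes instances unhashable.
    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, FakeIndirectObject)
            and (self.idnum, self.generation) == (other.idnum, other.generation)
        )


# Assumed mapping; only the "/DeviceCMYK" entry is visible in the traceback.
MODE_MAP = {"/DeviceGray": "L", "/DeviceRGB": "RGB", "/DeviceCMYK": "CMYK"}


def lookup_mode(color_space: object) -> str:
    """Return the mapped mode, or a short description of the hashing failure."""
    try:
        return MODE_MAP.get(color_space) or "unknown"
    except TypeError as exc:
        return f"TypeError: {exc}"


print(lookup_mode("/DeviceCMYK"))
print(lookup_mode(FakeIndirectObject(12, 0)))
```

A plain name such as `/DeviceCMYK` hashes fine, so the lookup only breaks when the key is an object of a class like the stand-in above.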

michelcrypt4d4mus changed the title from "TypeError: unhashable type: 'IndirectObject'" to "TypeError: unhashable type: 'IndirectObject' when enumerating images" on Jul 9, 2023
@pubpub-zz
Collaborator

pubpub-zz commented Jul 9, 2023

@michelcrypt4d4mus, without data this is too difficult to analyze. Would it be possible to share just the failing pages with us privately (by e-mail to @MartinThoma, [email protected])?

pubpub-zz added the needs-pdf label (The issue needs a PDF file to show the problem) on Jul 9, 2023
@MartinThoma
Member

Let's leave this open until 31.07.2023. If somebody has an example PDF that can trigger this, we can work on it. Otherwise we have to close it to focus on stuff that we can work on.

MartinThoma added the is-bug label (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF) on Jul 10, 2023
@michelcrypt4d4mus
Author

> @michelcrypt4d4mus, without data this is too difficult to analyze. Would it be possible to share just the failing pages with us privately (by e-mail to @MartinThoma, [email protected])?

This may be possible. Can pypdf be used to extract a single page without any modifications?

@pubpub-zz
Collaborator

If you use writer.append() to extract the page, the new file should contain the data. You should be able to confirm this by rerunning your test on the newly produced document.
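
A minimal sketch of that extraction, assuming pypdf's `PdfWriter.append()` with its `pages` argument given as a `(start, stop)` tuple of 0-based indices. The helper names `single_page_range` and `extract_page` and any paths passed to them are hypothetical, chosen only for illustration.

```python
from typing import Tuple


def single_page_range(page_number: int) -> Tuple[int, int]:
    """Convert a 1-based page number into a 0-based (start, stop) range."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    return (page_number - 1, page_number)


def extract_page(src_path: str, dst_path: str, page_number: int) -> None:
    """Copy one page of src_path into a new PDF written to dst_path."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    # Append only the requested page from the source document.
    writer.append(src_path, pages=single_page_range(page_number))
    writer.write(dst_path)
```

Rerunning the failing loop over `page.images` on the file written by `extract_page` would then show whether the problem survives the extraction.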

@pubpub-zz
Collaborator

@michelcrypt4d4mus
Have you been able to send a sample?

@MartinThoma
Member

@michelcrypt4d4mus I've forwarded the two other mails, but I think for this one I didn't receive an example

@pubpub-zz
Collaborator

@michelcrypt4d4mus : +1?

@MartinThoma
Member

I'm closing this for the moment. If anybody has a PDF that causes this, we can re-open :-)

@michelcrypt4d4mus
Author

I sent a PDF to @MartinThoma, but it cannot be shared publicly or used in the test suite.

@pubpub-zz
Collaborator

@michelcrypt4d4mus
Thanks, we will of course keep it private as requested.

pubpub-zz reopened this on Jul 22, 2023
pubpub-zz added the Has MCVE label (A minimal, complete and verifiable example helps a lot to debug / understand feature requests) and removed the needs-pdf label (The issue needs a PDF file to show the problem) on Jul 22, 2023
@pubpub-zz
Collaborator

The images in the PDF use a /Separation color space, which also requires color inversion. I've produced a test PDF for this:
TestWithSeparationBlack.pdf
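
For a subtractive /Separation colorant, a tint of 0.0 is the lightest value and 1.0 the darkest, which is the reverse of a grayscale image where 0 is black. A rough sketch of what the inversion means for 8-bit samples follows; `invert_8bit_samples` is a hypothetical helper for illustration, not the change made in pypdf.

```python
def invert_8bit_samples(samples: bytes) -> bytes:
    """Flip each 8-bit sample so the darkest tint (255) becomes black (0)."""
    return bytes(255 - value for value in samples)


# Tint samples: no colorant, a light tint, and full colorant.
tints = bytes([0, 64, 255])
gray = invert_8bit_samples(tints)
print(list(gray))
```

Applying the helper twice returns the original samples, so the same function converts in either direction.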

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 24, 2023
MartinThoma pushed a commit that referenced this issue Jul 29, 2023