BUG: Infinite recursion when using PdfWriter(clone_from=reader) #2264

Alexhuszagh · 2023-10-23T19:34:23Z

Use a visited memo to check if the current object in the clone operation has already been visited, and if so, do not add it to the list of objects. This avoids infinite recursion in case there are references to identical objects inside a PDF.

Unfortunately, since the example PDF contains financial data from a client I'm not free to provide the failing example that caused this PR. If you have a suggestion on how to remove that data while keeping the references identical, I'd be more than happy to contribute that. I hope the implementation is logical enough so this isn't needed.

The code that fails is as simple as:

from pypdf import PdfReader, PdfWriter
reader = PdfReader('/path/to/sensitive/file.pdf')
writer = PdfWriter('myfilename.pdf', clone_from=reader)

Use a visited memo to check if the current object in the clone operation has already been visited, and if so, do not add it to the list of objects. This avoids infinite recursion in case there are links to identical objects inside a PDF.

Alexhuszagh · 2023-10-23T19:42:48Z

I didn't apply the mypy lint to filters.py but would be more than happy to do so. My only major question would be then do I need to check if the PDFs are identical for the id(pdf) inside (so rather than just the generation and idnum, but also the PDF id)?

codecov · 2023-10-23T19:49:39Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (9047079) 94.44% compared to head (e275585) 94.43%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2264      +/-   ##
==========================================
- Coverage   94.44%   94.43%   -0.01%     
==========================================
  Files          43       43              
  Lines        7634     7643       +9     
  Branches     1506     1508       +2     
==========================================
+ Hits         7210     7218       +8     
  Misses        262      262              
- Partials      162      163       +1

Files	Coverage Δ
pypdf/generic/_data_structures.py	`91.90% <92.30%> (-0.04%)`	⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pubpub-zz · 2023-10-23T20:25:34Z

Unfortunately, since the example PDF contains financial data from a client I'm not free to provide the failing example that caused this PR. If you have a suggestion on how to remove that data while keeping the references identical, I'd be more than happy to contribute that. I hope the implementation is logical enough so this isn't needed.

@Alexhuszagh
Do you accept to wipe out text with .remove_text() and then share the file in private by mail with @MartinThoma ?

Alexhuszagh · 2023-10-23T21:03:14Z

Unfortunately, since the example PDF contains financial data from a client I'm not free to provide the failing example that caused this PR. If you have a suggestion on how to remove that data while keeping the references identical, I'd be more than happy to contribute that. I hope the implementation is logical enough so this isn't needed.

@Alexhuszagh Do you accept to wipe out text with .remove_text() and then share the file in private by mail with @MartinThoma ?

I just got approval to send the clean version over from the client and just sent it over.

pypdf/generic/_data_structures.py

MartinThoma · 2023-10-29T04:07:25Z

With the PDF you provided I can confirm that there is an infinite loop causing 100% CPU utilization for the executing core. Hence I added the nf-security Non-functional change: Security label

^CTraceback (most recent call last):
  File "/home/moose/Downloads/pypdf-issue.py", line 4, in <module>
    reader = PdfWriter("issuepypdf.pdf", clone_from="pypdf-clean.pdf")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_writer.py", line 207, in __init__
    self.clone_document_from_reader(clone_from)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_writer.py", line 1191, in clone_document_from_reader
    self.clone_reader_document_root(reader)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/_writer.py", line 1101, in clone_reader_document_root
    self._root_object = cast(DictionaryObject, reader.trailer[TK.ROOT].clone(self))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 197, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 297, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_base.py", line 300, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 197, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 297, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_base.py", line 300, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 197, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 285, in _clone
    c._clone(s, pdf_dest, force_duplicate, ignore_fields)
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_data_structures.py", line 269, in _clone
    cur_obj._reference_clone(
  File "/home/moose/.pyenv/versions/3.11.1/lib/python3.11/site-packages/pypdf/generic/_base.py", line 129, in _reference_clone
    if id(ind.pdf) not in pdf_dest._id_translated:
       ^^^^^^^^^^^
KeyboardInterrupt

Meta-information about the problematic file

It's not completely broken:

$ qpdf --check pypdf-issue-2264.pdf
checking pypdf-issue-2264.pdf
PDF Version: 1.3
File is not encrypted
File is not linearized
No syntax or stream encoding errors found; the file may still contain
errors that qpdf cannot detect

I can also just open it fine in chrome / evince.

pypdf/generic/_data_structures.py

MartinThoma · 2023-10-29T04:15:49Z

Thank you for your contribution 🙏 I will make a release in a few minutes that contains your fix. Additionally, I will also add a security advisory to update.

What a great first PR! If you want, I'll add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@Alexhuszagh

## What's new ### Security (SEC) - Infinite recursion when using PdfWriter(clone_from=reader) (#2264) by @Alexhuszagh ### New Features (ENH) - Add parameter to select images to be removed (#2214) by @pubpub-zz ### Bug Fixes (BUG) - Correctly handle image mode 1 with FlateDecode (#2249) by @stefan6419846 - Error when filling a value with parentheses #2268 (#2269) by @KanorUbu - Handle empty root outline (#2239) by @pubpub-zz ### Documentation (DOC) - Improve merging docs (#2247) by @stefan6419846 ### Developer Experience (DEV) - Test Python 3.7 with cryptopgraphy provider as well (#2276) by @stefan6419846 - Run CI with windows-latest (#2258) by @MartinThoma - Use pytest-xdist (#2254) by @MartinThoma - Attribute correct authors in the release notes (#2246) by @stefan6419846 ### Maintenance (MAINT) - Apply pre-commit hooks (#2277) by @MartinThoma - Update requirements + mypy fixes (#2275) by @MartinThoma - Explicitly provide Any for IO generic argument (#2272) by @nilehmann ### Testing (TST) - Fix test_image_without_pillow in windows environment (#2257) by @pubpub-zz ### Code Style (STY) - Remove unused import by @MartinThoma [Full Changelog](3.16.4...3.17.0)

Alexhuszagh · 2023-10-30T14:08:01Z

Thank you for your contribution 🙏 I will make a release in a few minutes that contains your fix. Additionally, I will also add a security advisory to update.

What a great first PR! If you want, I'll add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

That would be wonderful, thank you.

Fix an issue with infinite recursion.

e275585

Use a visited memo to check if the current object in the clone operation has already been visited, and if so, do not add it to the list of objects. This avoids infinite recursion in case there are links to identical objects inside a PDF.

Alexhuszagh force-pushed the main branch from 71c8140 to e275585 Compare October 23, 2023 19:40

Alexhuszagh commented Oct 24, 2023

View reviewed changes

pypdf/generic/_data_structures.py Show resolved Hide resolved

pypdf/generic/_data_structures.py Show resolved Hide resolved

MartinThoma changed the title ~~Fix an issue with infinite recursion.~~ BUG: infinite recursion Oct 28, 2023

MartinThoma changed the title ~~BUG: infinite recursion~~ BUG: Infinite recursion when using PdfWriter(clone_from=reader) Oct 29, 2023

MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF nf-security Non-functional change: Security labels Oct 29, 2023

MartinThoma reviewed Oct 29, 2023

View reviewed changes

pypdf/generic/_data_structures.py Show resolved Hide resolved

MartinThoma merged commit 9b23ac3 into py-pdf:main Oct 29, 2023
11 of 13 checks passed

bosue mentioned this pull request Nov 1, 2023

build(deps): Update PyPDF to v3.17.0 frappe/frappe#23060

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Infinite recursion when using PdfWriter(clone_from=reader) #2264

BUG: Infinite recursion when using PdfWriter(clone_from=reader) #2264

Alexhuszagh commented Oct 23, 2023 •

edited

Loading

Alexhuszagh commented Oct 23, 2023 •

edited

Loading

codecov bot commented Oct 23, 2023

pubpub-zz commented Oct 23, 2023

Alexhuszagh commented Oct 23, 2023 •

edited

Loading

MartinThoma commented Oct 29, 2023 •

edited

Loading

MartinThoma commented Oct 29, 2023

Alexhuszagh commented Oct 30, 2023

BUG: Infinite recursion when using PdfWriter(clone_from=reader) #2264

BUG: Infinite recursion when using PdfWriter(clone_from=reader) #2264

Conversation

Alexhuszagh commented Oct 23, 2023 • edited Loading

Alexhuszagh commented Oct 23, 2023 • edited Loading

codecov bot commented Oct 23, 2023

Codecov Report

pubpub-zz commented Oct 23, 2023

Alexhuszagh commented Oct 23, 2023 • edited Loading

MartinThoma commented Oct 29, 2023 • edited Loading

Meta-information about the problematic file

MartinThoma commented Oct 29, 2023

Alexhuszagh commented Oct 30, 2023

Alexhuszagh commented Oct 23, 2023 •

edited

Loading

Alexhuszagh commented Oct 23, 2023 •

edited

Loading

Alexhuszagh commented Oct 23, 2023 •

edited

Loading

MartinThoma commented Oct 29, 2023 •

edited

Loading