apply_redactions() does not work as expected #3863

Closed
nsklei opened this issue Sep 15, 2024 · 9 comments
Labels
fix developed (release schedule to be determined), Fixed in next release, upstream bug (bug outside this package)

Comments

@nsklei

nsklei commented Sep 15, 2024

Description of the bug

When calling apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE), I get several "MuPDF error: syntax error: cannot find XObject resource" errors, and some pages come out completely empty, although all pages originally contain images.

How to reproduce the bug

import pymupdf
from io import BytesIO
from pathlib import Path

# Raw strings keep the Windows backslashes literal ("\t" would otherwise be a tab).
file_path = r"path\to\Example_PDF.pdf"
output_path = r"path\to\Example_PDF_redacted.pdf"

new_doc = pymupdf.open(file_path)

for num, page in enumerate(new_doc):
    print(f"Page {num + 1} - {page.rect}:")

    # List every image referenced by the page.
    for image in page.get_images(full=True):
        print(f"  - Image: {image}")

    redact_rect = page.rect

    # On rotated pages, swap width and height to cover the unrotated page.
    if page.rotation in {90, 270}:
        redact_rect = pymupdf.Rect(0, 0, page.rect.height, page.rect.width)

    # Redact the whole page but leave images untouched.
    page.add_redact_annot(redact_rect)
    page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)

# Save to an in-memory buffer, then write the bytes to disk.
byte_stream = BytesIO()
new_doc.save(byte_stream)
byte_stream.seek(0)

Path(output_path).write_bytes(byte_stream.getvalue())

The code above prints the following information:

Page 1 - Rect(0.0, 0.0, 598.3200073242188, 813.5999755859375):
  - Image: (22, 0, 554, 754, 8, 'ICCBased', '', 'Im0', 'DCTDecode', 0)
  - Image: (23, 43, 554, 754, 8, 'ICCBased', '', 'Im1', 'DCTDecode', 0)
Page 2 - Rect(0.0, 0.0, 598.3200073242188, 816.47998046875):
  - Image: (25, 0, 554, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (26, 44, 554, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 3 - Rect(0.0, 0.0, 815.760009765625, 596.8800048828125):
  - Image: (28, 0, 553, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (29, 45, 553, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 4 - Rect(0.0, 0.0, 815.760009765625, 597.5999755859375):
  - Image: (31, 0, 554, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (32, 46, 554, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 5 - Rect(0.0, 0.0, 815.0399780273438, 597.5999755859375):
  - Image: (34, 0, 554, 755, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (35, 47, 554, 755, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 6 - Rect(0.0, 0.0, 806.4000244140625, 598.3200073242188):
  - Image: (37, 0, 554, 747, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (38, 48, 554, 747, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
Page 7 - Rect(0.0, 0.0, 815.0399780273438, 597.5999755859375):
  - Image: (39, 0, 554, 755, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (40, 49, 554, 755, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
MuPDF error: syntax error: cannot find XObject resource 'Im1'

MuPDF error: syntax error: cannot find XObject resource 'Im2'

Page 8 - Rect(0.0, 0.0, 815.760009765625, 596.8800048828125):
  - Image: (41, 0, 553, 756, 8, 'ICCBased', '', 'Im001', 'DCTDecode', 0)
  - Image: (42, 50, 553, 756, 8, 'ICCBased', '', 'Im002', 'DCTDecode', 0)
MuPDF error: syntax error: cannot find XObject resource 'Im1'

MuPDF error: syntax error: cannot find XObject resource 'Im2'

As you can see, each page contains two images. The function should remove all content from the PDF file except the images.
But after saving the byte_stream, some pages are completely empty.

PyMuPDF version

1.24.10

Operating system

Windows

Python version

3.12

@JorjMcKie
Collaborator

This post cannot be accepted as a bug report because no reproducer file is provided.

@JorjMcKie
Collaborator

@JorjMcKie JorjMcKie added the "upstream bug" (bug outside this package) label Sep 15, 2024
@JorjMcKie
Collaborator

@nsklei - You are aware that all pages only contain images - no text, no vector graphics.
So your redactions effectively are no-ops!

@nsklei
Author

nsklei commented Sep 15, 2024

Thank you for reviewing my issue and creating a bug report.
The behaviour described in your bug report is correct. I am aware that all pages contain only images and nothing else, so the redactions should indeed be no-ops in this case.

@JorjMcKie
Collaborator

I found that removing page rotation avoids the problem:

for page in doc:
    # Add the redaction in unrotated coordinates, then drop the page rotation.
    page.add_redact_annot(page.rect * page.derotation_matrix)
    page.remove_rotation()
    page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)

Works without problem.
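
For readers wondering why the rectangle is derotated first: on a page rotated by 90 or 270 degrees, page.rect reports the rotated dimensions, so width and height are swapped relative to the unrotated page. The sketch below shows only that relationship in plain Python, mirroring the rotation check in the reproducer. `unrotated_size` is a hypothetical helper, not part of PyMuPDF, and it assumes the page origin is (0, 0) and the rotation is a multiple of 90.

```python
def unrotated_size(width: float, height: float, rotation: int) -> tuple[float, float]:
    """Return (width, height) of the page as it would be without rotation.

    Assumption: `width` and `height` are the rotated dimensions (as reported
    by page.rect) and `rotation` is a multiple of 90 degrees.
    """
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90")
    # Normalise values such as -90 or 450 before comparing.
    if rotation % 360 in (90, 270):
        return (height, width)
    return (width, height)


# Landscape dimensions combined with a hypothetical rotation of 90 degrees.
print(unrotated_size(815.76, 596.88, 90))
```

This is only the size bookkeeping; in the workaround above, `page.derotation_matrix` does the coordinate transformation on the actual page.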

@nsklei
Author

nsklei commented Sep 17, 2024

Thank you for providing a solution to my problem. I tested your suggestion and it works perfectly :)

@nsklei nsklei closed this as completed Sep 17, 2024
@JorjMcKie
Collaborator

Thanks for the feedback!
I am going to re-open this until the fix itself is publicly available. This is our policy for dealing with issue resolutions.

@JorjMcKie JorjMcKie reopened this Sep 17, 2024
@sebras
Contributor

sebras commented Sep 27, 2024

@JorjMcKie This appears to have been fixed upstream, so can it be marked "fix developed"?

@julian-smith-artifex-com
Collaborator

Fixed in 1.24.11.
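
A script that must run on both older and fixed installations could apply the rotation workaround only when needed. The following is a minimal sketch, assuming the version string has the plain `major.minor.patch` form shown in this issue. `parse_version` and `needs_rotation_workaround` are hypothetical helper names, and treating every version below 1.24.11 as affected is an assumption: the thread only confirms the problem for 1.24.10.

```python
FIXED_IN = (1, 24, 11)  # release named in this thread as containing the fix


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_rotation_workaround(version: str) -> bool:
    """Assumption: any version older than the fixed release may be affected."""
    return parse_version(version) < FIXED_IN


print(needs_rotation_workaround("1.24.10"))  # True
print(needs_rotation_workaround("1.24.11"))  # False
```

The version string would come from the installed package, for example `pymupdf.__version__` if your build exposes it.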
