SystemError: <built-in function TextPage_extractIMGINFO> returned a result with an exception set #2905

Closed
myhloli opened this issue Dec 18, 2023 · 5 comments


myhloli commented Dec 18, 2023

Description of the bug

When I call the API page.get_image_rects, it raises SystemError: <built-in function TextPage_extractIMGINFO> returned a result with an exception set.

How to reproduce the bug

pdf file:
https://drive.google.com/file/d/1yCwGVTOwRSXvHCzNC0C0dsVJLQI-p2ZM/view
page_id: 115 (page_id starts from 0)

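import fitz

pdf_docs = fitz.open("pdf", pdf_bytes)  # pdf_bytes holds the PDF file content as bytes
for page_id, page in enumerate(pdf_docs):
    page_imgs = page.get_images()
    for img in page_imgs:
        # this call raises the SystemError on page_id 115
        recs = page.get_image_rects(img, transform=True)

My code looks like this demo.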

    recs = page.get_image_rects(img, transform=True)
           │    │               └ (2052, 0, 1895, 1248, 8, 'DeviceRGB', '', 'Im0', 'DCTDecode')
           │    └ <function get_image_rects at 0x0000027B025F5E40>
           └ page 115 of <memory, doc# 1>
    infos = page.get_image_info(hashes=True)
            │    └ <function get_image_info at 0x0000027B025F5DA0>
            └ page 115 of <memory, doc# 1>
    imginfo = tp.extractIMGINFO(hashes=hashes)
              │  │                     └ True
              │  └ <function TextPage.extractIMGINFO at 0x0000027B025E3D80>
              └ <fitz.fitz.TextPage; proxy of <Swig Object of type 'TextPage *' at 0x0000027B027579F0> >
    return _fitz.TextPage_extractIMGINFO(self, hashes)
           │     │                       │     └ True
           │     │                       └ <fitz.fitz.TextPage; proxy of <Swig Object of type 'TextPage *' at 0x0000027B027579F0> >
           │     └ <built-in function TextPage_extractIMGINFO>
           └ <module 'fitz._fitz' from 

This is the stack trace I get when the error occurs.
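To narrow a failure like this down to specific pages and images, a small diagnostic loop can help. The sketch below is not part of the original report: find_failing_images is a hypothetical helper name, and it only relies on the two page methods already used above (get_images and get_image_rects). It catches exceptions broadly because it is meant for diagnosis, not for production use.

```python
def find_failing_images(doc):
    """Return (page_id, image_entry, error) for every image whose rect lookup raises.

    `doc` is assumed to be an iterable of page objects that offer
    get_images() and get_image_rects(item, transform=...), as PyMuPDF pages do.
    """
    failures = []
    for page_id, page in enumerate(doc):
        for img in page.get_images():
            try:
                page.get_image_rects(img, transform=True)
            except Exception as exc:  # broad on purpose: diagnostic use only
                failures.append((page_id, img, exc))
    return failures


# Hypothetical usage with an opened document:
#   for page_id, img, exc in find_failing_images(pdf_docs):
#       print(page_id, img[0], repr(exc))
```

Each reported entry carries the page index and the image tuple, whose first item is the image xref, so the offending object can be inspected afterwards.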

PyMuPDF version

1.23.7

Operating system

Windows

Python version

3.11

@JorjMcKie
Collaborator

We are in the process of migrating PyMuPDF to a new architecture. This implementation is already available via the import statement import fitz_new as fitz.
It is a drop-in replacement for the classic implementation.
We plan to roll out the new implementation after the upcoming 1.23.8 release (presumably due this week).
Your issue is fixed in this new implementation, and we do not intend to apply the fix to the current version.

import fitz_new as fitz
doc = fitz.open("test.pdf")
page = doc[115]
for img in page.get_images():
    print(page.get_image_rects(img))

    
[Rect(-9.0, -9.89996337890625, 1270.400146484375, 833.0001220703125)]
[Rect(619.5001220703125, 45.3999137878418, 642.35009765625, 76.84991455078125)]
[Rect(255.75010681152344, 572.2999877929688, 364.2001037597656, 662.0999755859375)]

JorjMcKie added the "wontfix" (no intention to resolve) label Dec 18, 2023
JorjMcKie removed the "wontfix" (no intention to resolve) label Dec 19, 2023
Collaborator

JorjMcKie commented Dec 19, 2023

The classic implementation fails because the colorspace name of the image at xref 2054 on that page contains bytes that are not valid UTF-8. An easy solution is to use an alternative Python method to create the respective string object.
Colorspace info of image 2054:

print(doc.xref_object(4516))
[ /Separation /#BA#DA#C9#AB /DeviceCMYK 4517 0 R ]

The value bytes((0xBA, 0xDA, 0xC9, 0xAB)) is not UTF-8-decodable.
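The decoding problem can be reproduced in plain Python, without PyMuPDF. The sketch below shows that strict UTF-8 decoding of these four bytes raises, and that a tolerant decode still yields a string. The helper name decode_name is hypothetical, and the fallback shown (errors="replace") is only one possible approach; it is an assumption for illustration, not necessarily what the rebased implementation does.

```python
raw = bytes((0xBA, 0xDA, 0xC9, 0xAB))  # colorspace name bytes from the xref object


def decode_name(data: bytes) -> str:
    """Decode a PDF name, tolerating bytes that are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to a lossy decode: undecodable bytes become U+FFFD.
        return data.decode("utf-8", errors="replace")


# Strict decoding fails because 0xBA cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

name = decode_name(raw)  # always a str, possibly containing replacement characters
```

A lossy decode like this keeps the calling code running, at the price of not preserving the original name exactly.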

Author

myhloli commented Dec 20, 2023

Thanks for your reply, but I have a few questions:
What is the difference between fitz_new and fitz?
Can I use fitz_new in place of fitz in my project?
Or should I wait for the PyMuPDF 1.23.8 update?

@JorjMcKie
Collaborator

You can use import fitz_new as fitz immediately. This is the only change your scripts need in order to use the "rebased" version.
There are no changes in behavior.

Early next year (within the first two weeks), the final migration step will take place, so that you can again simply write import fitz.
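For scripts that have to run both before and after that migration step, one option is to try the module names in order and use the first one that imports. This is a minimal sketch under that assumption; import_first_available is a hypothetical helper, not part of PyMuPDF, and it uses only the standard library.

```python
import importlib


def import_first_available(names):
    """Import and return the first module in `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(names)
    )


# Hypothetical usage: prefer the rebased module, fall back to the classic name.
#   fitz = import_first_available(("fitz_new", "fitz"))
```

Once import fitz provides the rebased implementation, the fallback entry is the one that gets used and the script needs no further change.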

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.9, where import fitz gets the rebased implementation.
