Incorrect utf32 text extraction (high & low surrogates are split) #2608
Labels
upstream bug
bug outside this package
Comments
Note that the string produced also cannot be passed to Python's own
produces
It seems that it's uniformly considered invalid.
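As a rough illustration of why such a string is rejected (not necessarily the call used in the comment above), the sketch below builds the reported output, 𝜋 followed by a lone low surrogate, and tries to encode it as UTF-8. The variable names are only illustrative.

```python
# The reported output: U+1D70B (𝜋) followed by an unpaired low surrogate.
broken = "\U0001D70B\udf0e"

error = None
try:
    # Strict UTF-8 encoding refuses unpaired surrogates.
    broken.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    print("strict encoding failed:", exc.reason)

# With errors="replace" the lone surrogate is substituted by "?".
lossy = broken.encode("utf-8", errors="replace")
print(lossy)
```

The lossy variant only hides the problem; the character that should have been extracted is still missing.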
julian-smith-artifex-com
added
upstream bug
bug outside this package
Fixed in next release
labels
Oct 18, 2023
Thanks for the detailed report. This appears to be a bug in MuPDF, which is being looked at now, so it will be fixed in PyMuPDF's next release.
julian-smith-artifex-com
added a commit
to ArtifexSoftware/PyMuPDF-julian
that referenced
this issue
Oct 24, 2023
Only run tests on recent mupdf which have fixed the underlying bug. Test failed in optimised rebased; have fixed.
Fixed in 1.23.6.
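For anyone still on an earlier release, one possible way to flag affected text before handing it to a strict consumer is sketched below. The helper name `lone_surrogate_positions` is hypothetical and not part of PyMuPDF; it only detects the problem and makes no attempt to recover the intended character.

```python
def lone_surrogate_positions(text: str) -> list[int]:
    """Return the indices of surrogate code points in a Python str.

    A Python str holds code points, so any value in U+D800..U+DFFF
    is by definition an unpaired surrogate.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


# Sample modelled on the reported output for "2.2 𝜎. For".
sample = "2.2 \U0001D70B\udf0e. For"
print(lone_surrogate_positions(sample))
```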
When extracting text (e.g. with page.get_text_blocks), some UTF-32 characters (e.g. 𝜎, U+1D70E) seem to confuse the extraction logic. In that case, the extracted text is 𝜋\udf0e, which some software (DOMParser in my case) considers invalid.

I notice that 𝜎 and 𝜋 share the same high surrogate, and \udf0e is the correct low surrogate for 𝜎. I don't know enough about PDF or Unicode to investigate the file itself, but I'm attaching it here (page 5, the final paragraph under the 3.3 H.E.S.S. heading; the entire line is "any variability above 2.2 𝜎. For the total data set of 1.8 h, 95% confi-").

There is a similar issue in the final line of the same paragraph (𝐸th = 120 GeV) and more throughout the document.

I am able to access the same text correctly with Apple's Preview and with Google's Chrome/PDFium.
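The surrogate relationship described above can be checked with a short calculation. This is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic; the function name `utf16_surrogates` is chosen here for illustration only.

```python
def utf16_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


sigma = utf16_surrogates(0x1D70E)  # 𝜎
pi = utf16_surrogates(0x1D70B)     # 𝜋
print([hex(unit) for unit in sigma])
print([hex(unit) for unit in pi])
```

Both characters yield the high surrogate 0xD835, and 𝜎 yields the low surrogate 0xDF0E, which matches the stray \udf0e in the extracted text.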
2201.00069.pdf
To Reproduce (mandatory)
Your configuration (mandatory)