Incorrect utf32 text extraction (high & low surrogates are split) #2608

nikitar · 2023-08-22T03:13:09Z

When extracting text (e.g. with page.get_text_blocks), some characters outside the Basic Multilingual Plane (e.g. 𝜎, U+1D70E) seem to confuse the extraction logic. In that case, the extracted text is 𝜋\udf0e, which some software (DOMParser in my case) rejects as invalid text.

I notice that 𝜎 and 𝜋 share the same high surrogate, and \udf0e is the correct low surrogate for 𝜎. I don't know enough about PDF or Unicode to investigate the file itself, but I'm attaching it here (page 5, the final paragraph under the 3.3 H.E.S.S. heading; the entire line is "any variability above 2.2 𝜎. For the total data set of 1.8 h, 95% confi-").
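To make the surrogate observation concrete, here is a minimal sketch that computes the UTF-16 surrogate pair for each character by hand. The helper name `utf16_surrogates` is made up for illustration, and it assumes the 𝜋 in the extracted text is U+1D70B (mathematical italic small pi).

```python
def utf16_surrogates(ch: str) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogate pair for one non-BMP character."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character is in the BMP and has no surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


sigma = "\U0001D70E"  # 𝜎, the character in the PDF
pi = "\U0001D70B"     # 𝜋, assumed to be the character in the bad output

for name, ch in (("sigma", sigma), ("pi", pi)):
    high, low = utf16_surrogates(ch)
    print(f"{name}: U+{ord(ch):05X} -> high {high:04X}, low {low:04X}")
# sigma: U+1D70E -> high D835, low DF0E
# pi: U+1D70B -> high D835, low DF0B
```

Both characters share the high surrogate D835, and DF0E is the low surrogate of 𝜎, so the output 𝜋\udf0e looks like a valid pair followed by an orphaned low surrogate.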

There is a similar issue in the final line of the same paragraph (𝐸th = 120 GeV) and more throughout the document.

I am able to access the same text correctly with Apple's Preview and with Google's Chrome/PDFium.

2201.00069.pdf

To Reproduce (mandatory)

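    import fitz  # PyMuPDF

    PDF_PATH = "2201.00069.pdf"  # assumed local path to the attached file

    flags = fitz.TEXT_DEHYPHENATE | fitz.TEXT_MEDIABOX_CLIP
    with fitz.open(PDF_PATH) as doc:
        page = doc[4]  # page 5 (zero-based index)
        blocks = page.get_text_blocks(flags=flags)
        print(blocks[10])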

Your configuration (mandatory)

Tested on macOS and Ubuntu
Python 3.11

3.11.3 (v3.11.3:f3909b8bc8, Apr  4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)] 
 darwin 
 
PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.11 on darwin (64-bit).

nikitar · 2023-08-24T00:17:42Z

Note that the string produced also cannot be passed to Python's own encode, e.g.

    "variability above 2.2 𝜋\udf0e. For the total".encode("utf8")

produces

    UnicodeEncodeError: 'utf-8' codec can't encode character '\udf0e' in position 23: surrogates not allowed

It seems that it's uniformly considered invalid.
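Until a fix is available, one possible stopgap is to detect and drop surrogate code points before handing the text to other software. This is only a sketch: the helper names `find_lone_surrogates` and `strip_lone_surrogates` are hypothetical, and stripping does not recover the original 𝜎, it just makes the string encodable.

```python
def find_lone_surrogates(text: str) -> list[int]:
    """Return the indices of surrogate code points in a Python str.

    A well-formed str stores non-BMP characters as single code points,
    so any code point in U+D800..U+DFFF is an unpaired surrogate.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def strip_lone_surrogates(text: str) -> str:
    """Drop unpaired surrogates so the result can be encoded as UTF-8."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


# The broken string from the report, written with escapes.
broken = "variability above 2.2 \U0001D70B\udf0e. For the total"

print(find_lone_surrogates(broken))  # [23], the position named in the error
cleaned = strip_lone_surrogates(broken)
cleaned.encode("utf-8")  # no longer raises UnicodeEncodeError
```

If the damaged character should stay visible rather than vanish, `broken.encode("utf-8", errors="replace")` is an alternative; it substitutes `?` for the surrogate.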

julian-smith-artifex-com · 2023-10-18T08:23:34Z

Thanks for the detailed report.

This appears to be a bug in MuPDF, which is being looked at now, so it should be fixed in PyMuPDF's next release.

julian-smith-artifex-com · 2023-11-06T19:16:31Z

Fixed in 1.23.6.

nikitar mentioned this issue Aug 22, 2023

A document has every space encoded as � #2609

Closed

julian-smith-artifex-com added upstream bug bug outside this package Fixed in next release labels Oct 18, 2023

julian-smith-artifex-com added a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Oct 24, 2023

Added test for issue pymupdf#2608.

ce4db1a

Only run tests on recent mupdf which have fixed the underlying bug. Test failed in optimised rebased; have fixed.

julian-smith-artifex-com mentioned this issue Oct 24, 2023

#2608, TEXT_PRESERVE_SPANS. #2761

Merged

julian-smith-artifex-com added a commit that referenced this issue Oct 24, 2023

Added test for issue #2608.

cb0b741

Only run tests on recent mupdf which have fixed the underlying bug. Test failed in optimised rebased; have fixed.

julian-smith-artifex-com removed the Fixed in next release label Nov 6, 2023

julian-smith-artifex-com closed this as completed Nov 6, 2023
