The final text has many special-character encoding problems, e.g. "fi" is recognized as the single character ﬁ and "fl" as ﬂ #241

zhou995287902 · 2024-07-30T03:54:24Z

Description of the bug | 错误描述

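The final output text has many special-character encoding problems. For example, "fi" is recognized as the single character ﬁ, and "fl" as ﬂ. In addition, a single word gets split into three: the source text says "lanifibranor", but the output text reads "lani ﬁ branor".
I looked at your source code and saw that it reads PDF files with the PyMuPDF library. You could use the unidecode library, which converts Unicode text into text containing only ASCII characters. I tried it and it works, so I hope this problem can be fixed.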
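To make the symptom concrete, here is a minimal sketch using only the Python standard library. It assumes the output string looks like the example above; the variable names are illustrative and not taken from the project. It shows that ﬁ is a single code point (U+FB01) and that Unicode compatibility normalization (NFKC) folds it back to "fi", while the extra spaces remain. In other words, the ligature and the word splitting look like two separate problems.

```python
import unicodedata

# The reported output: "lani", the single-character ligature U+FB01, "branor".
raw = "lani \ufb01 branor"

# List every non-ASCII character with its code point and Unicode name.
non_ascii = [
    (ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
    for ch in raw
    if ord(ch) > 127
]
print(non_ascii)

# NFKC normalization replaces compatibility characters such as ligatures
# with their plain equivalents. It does not remove the spaces.
folded = unicodedata.normalize("NFKC", raw)
print(folded)
```

Normalization alone still leaves "lani fi branor", so the spaces between spans would have to be handled separately.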

How to reproduce the bug | 如何复现

The PDF file I used is attached. Converting this PDF reproduces the problem I described. Thanks.
7061196.pdf
The output files are below:
7061196.md
7061196_content_list.json

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cuda

myhloli · 2024-07-30T06:34:13Z

Thanks for your feedback. Our standard output is UTF-8 encoded. Because text export has to serve not only mainstream languages but also less widely spoken ones, converting all Unicode text to ASCII is not the best choice.
The word splitting comes from the span information in the source file: during extraction, a single word can come back as several spans. When concatenating non-Chinese spans, we always insert a space between adjacent spans to reduce the chance of words running together.
In your example, these words are incorrectly identified as multiple segments during span extraction, which leads to the concatenation errors that follow.
We may be able to address and fix this in a future version.
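The concatenation behaviour described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual merge code: `is_cjk` and `join_spans` are hypothetical helpers, and the CJK check only covers the basic CJK Unified Ideographs range.

```python
def is_cjk(ch: str) -> bool:
    """Rough check for a basic CJK ideograph (hypothetical helper)."""
    return "\u4e00" <= ch <= "\u9fff"


def join_spans(spans: list[str]) -> str:
    """Join extracted spans, adding a space between non-Chinese neighbours."""
    out = ""
    for text in spans:
        if not text:
            continue  # skip empty spans
        # Insert a space unless either side of the boundary is CJK.
        if out and not (is_cjk(out[-1]) or is_cjk(text[0])):
            out += " "
        out += text
    return out


# One word that extraction returned as three spans, the middle one a ligature.
broken = join_spans(["lani", "\ufb01", "branor"])
print(broken)

# Chinese spans are joined without a space.
chinese = join_spans(["文本", "输出"])
print(chinese)
```

Under this rule, a word that arrives as three spans is necessarily emitted as three space-separated tokens, which matches the "lani ﬁ branor" output in the report.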

fix(merge_text): add ligature replacement functionality #305 #241

myhloli · 2024-11-04T10:49:50Z

fix
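For readers wondering what a ligature replacement step might look like, here is a minimal sketch. It is an assumption about the general technique, not the code from the referenced commit: the `LIGATURES` table and `replace_ligatures` function are hypothetical names, and the table lists only the common Latin f-ligatures.

```python
# Hypothetical mapping from Latin ligature code points to plain letters.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
}

# str.translate needs a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Expand the ligatures listed above and leave all other text unchanged."""
    return text.translate(_TABLE)


sample = "\ufb01nal 中文 café \ufb02ow"
cleaned = replace_ligatures(sample)
print(cleaned)
```

A targeted table like this leaves Chinese and accented characters untouched, which is consistent with the earlier objection to converting everything to ASCII. It does not by itself rejoin a word that was split across spans.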

zhou995287902 added the bug label Jul 30, 2024

myhloli mentioned this issue Aug 2, 2024

Too many ligatures in the output #305

Closed

myhloli added a commit that referenced this issue Nov 4, 2024

Merge pull request #857 from myhloli/dev

29c61a9

fix(merge_text): add ligature replacement functionality #305 #241

myhloli mentioned this issue Nov 4, 2024

fix(merge_text): add ligature replacement functionality #305 #241 #857

Merged

6 tasks

myhloli closed this as completed Nov 4, 2024

dt-yy added the enhancement and P1 labels and removed the enhancement label Jan 3, 2025
