
The final text contains many special-character encoding problems, e.g. "fi" is recognized as the single character "fi" and "fl" as "fl" #241

Closed
zhou995287902 opened this issue Jul 30, 2024 · 2 comments
Labels
bug Something isn't working P1 P1 BUG

Comments


zhou995287902 commented Jul 30, 2024

Description of the bug

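The final output text contains many special-character encoding problems. For example, "fi" is recognized as the single ligature character "fi", and "fl" as "fl". In addition, a single word gets split into three: where the source text reads "lanifibranor", the output text reads "lani fi branor".
I looked at your source code and saw that it reads PDF files with the PyMuPDF library. The unidecode library could be used here: it converts Unicode text into text containing only ASCII characters. I tried it and it works, so I hope this problem can be corrected.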
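
To check whether an output file is affected, one can scan it for the Latin ligature code points in the Unicode Alphabetic Presentation Forms block (U+FB00 to U+FB06). The following is a minimal sketch using only the Python standard library; the function name `find_ligatures` is hypothetical and is not part of magic-pdf or PyMuPDF.

```python
import unicodedata

# Latin ligatures occupy U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st).
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def find_ligatures(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, Unicode name) for every Latin ligature in text."""
    return [
        (i, ch, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ord(ch) in LIGATURE_RANGE
    ]


# The split word from this report, with the ligature written as an escape.
sample = "lani \ufb01 branor"
hits = find_ligatures(sample)
for index, ch, name in hits:
    print(index, repr(ch), name)
```

Running this over the generated Markdown file would list every position where a ligature survived extraction.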

How to reproduce the bug

The PDF file I used is attached. Converting this PDF reproduces the problem I described. Thank you.
7061196.pdf
The output files are below:
7061196.md
7061196_content_list.json

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

@zhou995287902 zhou995287902 added the bug Something isn't working label Jul 30, 2024
Collaborator

myhloli commented Jul 30, 2024

Thanks for your feedback. Our standard output is UTF-8 encoded. Text export has to serve not only mainstream languages but also less widely spoken ones, so converting all Unicode text to ASCII is not the best choice.
The word splitting comes from the span information in the source PDF: during extraction, a single word can be returned as several spans. When concatenating non-Chinese spans, we always insert a space between adjacent spans to reduce the chance of separate words running together.
For your example:
[image]
These words were incorrectly extracted as multiple spans, which led to the concatenation errors that followed.
We may be able to address this in a future version.
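
The concatenation rule described in this comment can be illustrated with a small sketch. This is not the magic-pdf implementation: the functions `join_spans` and `contains_cjk` and the span lists are hypothetical, and only the rule "insert a space between non-Chinese spans" is taken from the comment. How mixed Chinese and non-Chinese neighbours are handled here is an assumption.

```python
def contains_cjk(text: str) -> bool:
    """Rough check for CJK Unified Ideographs (U+4E00..U+9FFF); an assumption for this sketch."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def join_spans(spans: list[str]) -> str:
    """Join span texts, inserting a space between adjacent non-Chinese spans."""
    if not spans:
        return ""
    parts = [spans[0]]
    for prev, cur in zip(spans, spans[1:]):
        # Chinese is written without spaces, so only separate non-Chinese neighbours.
        if not (contains_cjk(prev) or contains_cjk(cur)):
            parts.append(" ")
        parts.append(cur)
    return "".join(parts)


# One word extracted as three spans, as in the reported PDF: the spaces appear.
print(join_spans(["lani", "\ufb01", "branor"]))
# The same word extracted as a single span stays intact.
print(join_spans(["lani\ufb01branor"]))
```

The first call shows why the output reads "lani fi branor": the joiner cannot tell that the three spans belong to one word, so the space it adds to keep real words apart ends up inside the word.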

myhloli added a commit that referenced this issue Nov 4, 2024
fix(merge_text): add ligature replacement functionality #305 #241
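
The commit title mentions ligature replacement. A minimal sketch of what such a step could look like follows; the mapping table and the function name `replace_ligatures` are assumptions for illustration and are not copied from the referenced commit. Unlike converting everything to ASCII, a targeted table leaves other non-ASCII text untouched.

```python
# Hypothetical mapping from Latin ligature code points to plain letter sequences.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t, folded to "st" in this sketch
    "\ufb06": "st",
}

# str.maketrans accepts a dict whose keys are single characters.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Expand Latin ligatures into their component letters; leave everything else as-is."""
    return text.translate(_LIGATURE_TABLE)


print(replace_ligatures("lani\ufb01branor"))  # lanifibranor
print(replace_ligatures("\ufb02ow, \u4e2d\u6587, caf\u00e9"))  # flow, 中文, café
```

Note that expanding the ligature does not by itself rejoin a word that was already split into several spans.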
Collaborator

myhloli commented Nov 4, 2024

Fixed.

@myhloli myhloli closed this as completed Nov 4, 2024
@dt-yy dt-yy added enhancement New feature or request P1 P1 BUG and removed enhancement New feature or request labels Jan 3, 2025