Thanks for your feedback. Our standard output is UTF-8 encoded. Text export has to serve less widely spoken languages as well as mainstream ones, so converting all Unicode text to ASCII is not the best choice.
The word splitting comes from the source information carried by spans: during extraction, a single word can be split into several spans. When concatenating non-Chinese spans, a space is always inserted between adjacent spans to reduce the chance of separate words running together.
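To make the effect of that joining rule concrete, here is a minimal sketch. It is not the actual magic-pdf code: `is_cjk` and `join_spans` are hypothetical helpers, and the rule "skip the space only when both sides of the boundary are Chinese characters" is an assumption about how the behaviour described above might look.

```python
def is_cjk(ch: str) -> bool:
    """Rough check for the CJK Unified Ideographs block (assumption: enough for this sketch)."""
    return "\u4e00" <= ch <= "\u9fff"


def join_spans(spans: list[str]) -> str:
    """Hypothetical joiner: insert a space between spans unless both sides are Chinese."""
    out = ""
    for span in spans:
        if not span:
            continue  # ignore empty spans
        if out and not (is_cjk(out[-1]) and is_cjk(span[0])):
            out += " "
        out += span
    return out


# A word that extraction split into three spans comes back with spaces inside it.
print(join_spans(["lani", "fi", "branor"]))  # lani fi branor
# Chinese spans are joined without a separator.
print(join_spans(["中文", "文本"]))  # 中文文本
```

Under this rule the joiner cannot tell a word that was split during extraction from two genuinely separate words, which is why the split shows up as extra spaces in the output.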
For your example:
These words are incorrectly identified as multiple segments during span extraction, leading to subsequent concatenation errors.
We may be able to fix this in a future version.
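Description of the bug
The final output text contains many special-character encoding problems. For example, fi is recognized as the single ligature character ﬁ, and fl as ﬂ. In addition, a single word gets split into three: the PDF text reads lanifibranor, but the output text is lani fi branor.
I saw in your source code that the PDF file is read with the PyMuPDF library. You could use the unidecode library, which converts Unicode text into text containing only ASCII characters. I tried it and it works; I hope this problem can be corrected.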
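As an illustration of the ligature half of the problem only, the sketch below expands the Latin ligature code points with a small translation table from the standard library. This is a swapped-in technique, not unidecode and not anything magic-pdf is known to do. The helper name `expand_ligatures` and the choice of code points (U+FB00 to U+FB04) are assumptions made for the example.

```python
# Hypothetical mapping: Latin ligature code points -> their plain-letter spelling.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text: str) -> str:
    """Replace ligature characters and leave every other character untouched."""
    return text.translate(LIGATURES)


print(expand_ligatures("lani\ufb01branor"))  # lanifibranor
print(expand_ligatures("中文 \ufb02ow"))      # 中文 flow
```

Unlike a full conversion to ASCII, this leaves Chinese and other non-ASCII text as it is. It does not address the word splitting, which happens earlier, at span extraction.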
How to reproduce the bug
The PDF file I used is attached. Converting this PDF reproduces the problem I described. Thank you.
7061196.pdf
The output files are below:
7061196.md
7061196_content_list.json
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.6.x
Device mode
cuda