text': '(cid:0) instead of character #29
Comments
Thanks for flagging, @wanghaisheng! I don't think |
@jsvine It seems MuPDF can handle this file. There is a font folder here, where pdfminer only got cmap resources |
Thanks, and good to know! |
Is this issue fixed? |
No. Some PDFs of this kind only yield garbage text. For those, you should try an OCR approach.
I am using this simple approach in Python to handle such scenarios and avoid any OCR-based extraction.
|
I found I could create a map of
The This method isn't perfect - it conflates all the |
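A map-based substitution of the kind described above might be sketched as follows. This is not the commenter's code, which was not preserved in this thread. The names `CID_PATTERN`, `replace_cids`, and `example_map` are hypothetical, and the mapping entries are made up for illustration; a real map would have to be built by hand for each font.

```python
import re

# Matches pdfminer-style placeholders such as "(cid:0)" or "(cid:123)".
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, cid_map):
    """Replace each "(cid:N)" placeholder using cid_map; leave unknown CIDs as-is."""

    def substitute(match):
        cid = int(match.group(1))
        # Fall back to the original placeholder when the CID is not in the map.
        return cid_map.get(cid, match.group(0))

    return CID_PATTERN.sub(substitute, text)


# Hypothetical mapping, for illustration only (not taken from any real font).
example_map = {1: "A", 2: "B"}
print(replace_cids("(cid:1)(cid:2)(cid:3)", example_map))
```

Leaving unmapped placeholders untouched keeps the gaps visible, so a partial map does not silently drop characters.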
Piggybacking off of @Mahendra114027's solution, we created this function:
|
{'adv': Decimal('15.975'), 'fontname': 'SRPUEP+SimSun', 'doctop': Decimal('8.092'), 'y1': Decimal('411.158'), 'bottom': Decimal('23.061'), 'text': '(cid:0)', 'top': Decimal('8.092'), 'object_type': 'char', 'height': Decimal('14.969'), 'width': Decimal('15.975'), 'page_number': 1, 'upright': True, 'y0': Decimal('396.189'), 'x0': Decimal('147.400'), 'x1': Decimal('163.375'), 'size': Decimal('14.969')}
The PDF is at https://github.com/clear-datacenter/plan/files/524831/1.pdf.zip
You can download it and remove the .zip from the filename to get the PDF file.
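To see which fonts are affected, one could count the `(cid:N)` placeholders per font name. The sketch below assumes each char is a dict with `text` and `fontname` keys, as in the output above. `fonts_with_unmapped_cids` is a hypothetical helper, not part of pdfplumber, and the second sample entry is invented for contrast.

```python
import re
from collections import Counter

# A char whose whole text is a "(cid:N)" placeholder was not mapped to Unicode.
CID_TEXT = re.compile(r"^\(cid:\d+\)$")


def fonts_with_unmapped_cids(chars):
    """Count unmapped "(cid:N)" chars per font name in a list of char dicts."""
    counts = Counter()
    for char in chars:
        if CID_TEXT.match(char.get("text", "")):
            counts[char.get("fontname", "<unknown>")] += 1
    return counts


# First entry mirrors the char reported in this issue; the second is made up.
sample_chars = [
    {"text": "(cid:0)", "fontname": "SRPUEP+SimSun"},
    {"text": "a", "fontname": "Helvetica"},
]
print(fonts_with_unmapped_cids(sample_chars))
```

With pdfplumber, the list of a page's chars (for example `page.chars`) could be passed in place of `sample_chars`.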