text': '(cid:0) instead of character #29

Closed
wanghaisheng opened this issue Mar 20, 2017 · 8 comments

Comments

@wanghaisheng

wanghaisheng commented Mar 20, 2017

>>> import pdfplumber
>>> with pdfplumber.open("/Users/edwin/1.pdf") as pdf:
...     first_page = pdf.pages[0]
...     print(first_page.chars[0])
... 

{'adv': Decimal('15.975'), 'fontname': 'SRPUEP+SimSun', 'doctop': Decimal('8.092'), 'y1': Decimal('411.158'), 'bottom': Decimal('23.061'), 'text': '(cid:0)', 'top': Decimal('8.092'), 'object_type': 'char', 'height': Decimal('14.969'), 'width': Decimal('15.975'), 'page_number': 1, 'upright': True, 'y0': Decimal('396.189'), 'x0': Decimal('147.400'), 'x1': Decimal('163.375'), 'size': Decimal('14.969')}

The PDF is at https://github.com/clear-datacenter/plan/files/524831/1.pdf.zip
Download it and remove the .zip from the filename to get the PDF file.

@jsvine
Owner

jsvine commented Mar 20, 2017

Thanks for flagging, @wanghaisheng! pdfplumber depends on pdfminer/pdfminer.six to get the text for each character. When pdfminer doesn't recognize a character, it produces that (cid:...) string: euske/pdfminer#122

I don't think pdfminer or pdfminer.six has resolved that problem, unfortunately, but I'll put fixing it on my to-do list.
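
To see how much of a page is affected, one option is to count the characters whose text is still a (cid:...) placeholder, grouped by font. This is only a minimal sketch: the helper name count_unmapped_chars and the sample data are made up for illustration, and the only assumption is that each char is a dict with 'text' and 'fontname' keys, as in the output above.

```python
import re
from collections import Counter

# Placeholder emitted when a glyph could not be mapped to Unicode, e.g. "(cid:0)"
CID_TOKEN = re.compile(r"^\(cid:(\d+)\)$")


def count_unmapped_chars(chars):
    """Count chars whose 'text' is a (cid:N) placeholder, grouped by font name."""
    counts = Counter()
    for char in chars:
        if CID_TOKEN.match(char.get("text", "")):
            counts[char.get("fontname", "<unknown>")] += 1
    return counts


# Illustrative dicts shaped like the entries of page.chars (not real data)
sample_chars = [
    {"text": "(cid:0)", "fontname": "SRPUEP+SimSun"},
    {"text": "A", "fontname": "Helvetica"},
    {"text": "(cid:12)", "fontname": "SRPUEP+SimSun"},
]
print(count_unmapped_chars(sample_chars))
```

With a real document you would pass first_page.chars instead of sample_chars.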

@wanghaisheng
Author

@jsvine It seems mupdf can handle this file. It ships a fonts folder, whereas pdfminer only has CMap resources:
https://github.com/muennich/mupdf/tree/master/resources/fonts

@jsvine
Owner

jsvine commented Jun 1, 2017

Thanks, and good to know!

@Ajay-MS

Ajay-MS commented Jul 26, 2018

Is this issue fixed?

@wanghaisheng
Author

No. With some PDFs of this kind the extracted text is garbage, in which case you should try an OCR-based approach.

@Mahendra114027

Mahendra114027 commented Sep 28, 2018

I am using this simple approach in Python to handle such cases and avoid OCR-based extraction.

text_str = '(cid:66)'
if 'cid' in text_str.lower():
    # Strip the parentheses and keep the number after the colon
    cid_num = int(text_str.strip('()').split(':')[-1])
    # Works only if the CID matches the character code, e.g. 66 -> 'B'
    text_val = chr(cid_num)

@drkane

drkane commented Jan 9, 2019

I found I could create a map of (cid:xxxx) to unicode values using something like the following (where pdf is a pdfplumber document):

cid_lookup = {}
for k, font in pdf.device.rsrcmgr._cached_fonts.items():
    if font.unicode_map:
        cid_lookup.update(font.unicode_map.cid2unichr)

The font.unicode_map.cid2unichr dicts have keys with the cid code and value as the equivalent unicode string. You can use these codes to find and replace - I used regex to do this.

This method isn't perfect - it conflates all the cids from various different fonts into one, so there could be collisions. And I found that this replaced some (cid:xxx) values but some weren't in the dictionary.
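
The find-and-replace step could look roughly like this. It is a minimal sketch rather than the exact code used above: the replace_cids helper and the example_lookup values are hypothetical, and it assumes the lookup maps integer cids to unicode strings, as the cid2unichr dicts do. Tokens missing from the lookup are left untouched so they stay visible.

```python
import re

# Matches every (cid:N) token inside a longer string
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, cid_lookup):
    """Replace (cid:N) tokens found in cid_lookup; leave unknown ones as-is."""

    def _sub(match):
        cid = int(match.group(1))
        # Fall back to the original token when the cid is not in the lookup
        return cid_lookup.get(cid, match.group(0))

    return CID_PATTERN.sub(_sub, text)


# Hypothetical lookup standing in for the merged cid2unichr dicts
example_lookup = {66: "B", 1234: "中"}
print(replace_cids("(cid:66)ar (cid:1234) (cid:9999)", example_lookup))
```

Keeping one lookup per font name, instead of merging them all, would be one way to avoid the collisions mentioned above.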

@iateadonut

Piggybacking off of @Mahendra114027's solution, we created this function:

    import re

    def prune_text(text):

        def replace_cid(match):
            ascii_num = int(match.group(1))
            try:
                return chr(ascii_num)
            except (ValueError, OverflowError):
                return ''  # In case of conversion failure, return empty string

        # Regular expression to find all (cid:x) patterns
        cid_pattern = re.compile(r'\(cid:(\d+)\)')
        pruned_text = re.sub(cid_pattern, replace_cid, text)
        return pruned_text
