text': '(cid:0) instead of character #29

wanghaisheng · 2017-03-20T06:31:07Z

>>> with pdfplumber.open("/Users/edwin/1.pdf") as pdf:
...     first_page = pdf.pages[0]
...     print(first_page.chars[0])
...

{'adv': Decimal('15.975'), 'fontname': 'SRPUEP+SimSun', 'doctop': Decimal('8.092'), 'y1': Decimal('411.158'), 'bottom': Decimal('23.061'), 'text': '(cid:0)', 'top': Decimal('8.092'), 'object_type': 'char', 'height': Decimal('14.969'), 'width': Decimal('15.975'), 'page_number': 1, 'upright': True, 'y0': Decimal('396.189'), 'x0': Decimal('147.400'), 'x1': Decimal('163.375'), 'size': Decimal('14.969')}

The PDF is at https://github.com/clear-datacenter/plan/files/524831/1.pdf.zip
Download it and remove the .zip from the filename to get the PDF file.

jsvine · 2017-03-20T14:03:14Z

Thanks for flagging, @wanghaisheng! pdfplumber depends on pdfminer/pdfminer.six to get the text for each character. When pdfminer doesn't recognize a character, it produces that (cid:...) string: euske/pdfminer#122

I don't think pdfminer or pdfminer.six has resolved that problem, unfortunately, but I'll put fixing it on my to-do list.
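To gauge how badly a page is affected, one option is to count the characters whose text is still a (cid:N) placeholder. The sketch below is only an illustration: `unresolved_chars` and `sample_chars` are hypothetical names, and the sample dicts merely imitate the shape of the `first_page.chars` entry shown above.

```python
import re

# Matches the placeholder emitted for an unmapped glyph, e.g. "(cid:0)".
CID_RE = re.compile(r"^\(cid:(\d+)\)$")


def unresolved_chars(chars):
    """Return the char dicts whose 'text' is a (cid:N) placeholder."""
    return [c for c in chars if CID_RE.match(c.get("text", ""))]


# Hypothetical sample shaped like entries of page.chars (most keys omitted).
sample_chars = [
    {"text": "(cid:0)", "fontname": "SRPUEP+SimSun"},
    {"text": "A", "fontname": "Helvetica"},
]

bad = unresolved_chars(sample_chars)
# Report how many characters are unresolved and which fonts they come from.
print(len(bad), {c["fontname"] for c in bad})
```

Grouping the unresolved characters by fontname can help show whether a single embedded font is responsible.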

wanghaisheng · 2017-03-20T15:09:25Z

@jsvine It seems MuPDF can handle this file. It has a fonts folder here, whereas pdfminer only has CMap resources:
https://github.com/muennich/mupdf/tree/master/resources/fonts

jsvine · 2017-06-01T14:33:37Z

Thanks, and good to know!

Ajay-MS · 2018-07-26T07:00:45Z

Is this issue fixed?

wanghaisheng · 2018-07-26T08:11:26Z

No. For some kinds of PDF the extracted text is simply garbage. In that case you should try an OCR approach.

Mahendra114027 · 2018-09-28T10:38:09Z

I am using this simple approach in Python to handle such cases and avoid OCR-based extraction.

text_str = '(cid:66)'
if 'cid' in text_str.lower():
    # Strip the surrounding parentheses, leaving 'cid:66'
    text_str = text_str.strip('(')
    text_str = text_str.strip(')')
    # Take the number after the colon and treat it as a character code
    ascii_num = text_str.split(':')[-1]
    ascii_num = int(ascii_num)
    text_val = chr(ascii_num)  # 66 = 'B' in ASCII

drkane · 2019-01-09T14:27:06Z

I found I could create a map of (cid:xxxx) to Unicode values using something like the following (where pdf is a pdfplumber document):

cid_lookup = {}
for k, font in pdf.device.rsrcmgr._cached_fonts.items():
    if font.unicode_map:
        cid_lookup.update(font.unicode_map.cid2unichr)

The font.unicode_map.cid2unichr dicts have the cid code as key and the equivalent Unicode string as value. You can use these codes to find and replace - I used regex to do this.

This method isn't perfect - it merges the cids from all the different fonts into one dict, so there could be collisions. I also found that it replaced some (cid:xxx) values, but others weren't in the dictionary.
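The find-and-replace step could look roughly like this. It is a sketch, not the code actually used: `replace_cids` and `example_lookup` are hypothetical names, and it assumes the lookup keys are integers. Markers missing from the lookup are left in place so they remain visible.

```python
import re

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, cid_lookup):
    """Replace (cid:N) markers found in cid_lookup; leave the rest untouched."""

    def _sub(match):
        cid = int(match.group(1))
        # Fall back to the original marker when the cid is not in the map.
        return cid_lookup.get(cid, match.group(0))

    return CID_PATTERN.sub(_sub, text)


# Hypothetical lookup; a real one would be built from the fonts' cid maps.
example_lookup = {36: "A", 37: "B"}
result = replace_cids("(cid:36)(cid:37)(cid:999)", example_lookup)
print(result)
```

Keeping one lookup per font, rather than a merged dict, would avoid the collisions mentioned above, at the cost of having to know which font each character came from.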

iateadonut · 2024-02-27T11:23:44Z

Piggybacking off of @Mahendra114027's solution, we created this function:

    import re

    def prune_text(text):
        """Replace every (cid:N) marker in text with chr(N)."""

        def replace_cid(match):
            ascii_num = int(match.group(1))
            try:
                return chr(ascii_num)
            except (ValueError, OverflowError):
                return ''  # N is outside the valid code point range

        # Regular expression to find all (cid:x) patterns
        cid_pattern = re.compile(r'\(cid:(\d+)\)')
        pruned_text = re.sub(cid_pattern, replace_cid, text)
        return pruned_text
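One caveat with the chr() approach: it only gives sensible output when the cid happens to equal the character code, which depends on the font and is not guaranteed. For the (cid:0) in this issue's title, chr(0) yields a NUL character rather than readable text. A more conservative variant, sketched here under the hypothetical name `replace_printable_cids`, converts only printable ASCII codes and leaves every other marker as it is:

```python
import re

CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_printable_cids(text):
    """Convert (cid:N) to chr(N) only when N is printable ASCII (32..126)."""

    def _sub(match):
        code = int(match.group(1))
        if 32 <= code <= 126:
            return chr(code)
        # Keep the marker so unresolved glyphs stay visible downstream.
        return match.group(0)

    return CID_PATTERN.sub(_sub, text)


converted = replace_printable_cids("(cid:72)(cid:105) (cid:0)")
print(converted)
```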

wanghaisheng closed this as completed Jun 1, 2017

samkit-jain mentioned this issue Nov 30, 2019

When I use extract PDF content, it's all CID: XXXX #159

Closed
