TypeError: replace() argument 1 must be str, not bytes #1379

Zeping-Jian · 2022-10-05T04:42:00Z

I just want to convert a PDF file to a txt file, but the run failed.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
root@0cc46add0ae3:/home/learn/IR/irBooks# python -m platform
Linux-5.10.124-linuxkit-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
root@0cc46add0ae3:/home/learn/IR/irBooks# python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2

print(PyPDF2.__version__)

def hanleOnePage(pdfreader, pIdx, outputFile) :
    print("hanle Page:%s" % (pIdx))
    pageobj=pdfreader.getPage(pIdx)

    text=pageobj.extractText()
    # print(text)
    # Append this page's text; the with-block closes the file handle.
    with open(outputFile, "a") as file1:
        file1.write(text)

#create file object variable
#opening method will be rb
# pdffileobj=open('01bool.pdf','rb')
pdffileobj=open('02voc.pdf','rb')
 
#create reader variable that will read the pdffileobj
pdfreader=PyPDF2.PdfFileReader(pdffileobj)
 
#This will store the number of pages of this pdf file
x=pdfreader.numPages
print("PDF:numPages:%s" % (x))

#create a variable that will select the selected number of pages
pIndex = 0
while pIndex < x:
    hanleOnePage(pdfreader, pIndex, "all.txt")
    pIndex += 1

pdffileobj.close()

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
02voc.pdf

Traceback

This is the complete Traceback I see:

root@0cc46add0ae3:/home/learn/IR/irBooks# python tokens-step1.py
2.11.0
PDF:numPages:29
hanle Page:0
hanle Page:1
hanle Page:2
Traceback (most recent call last):
File "/home/learn/IR/irBooks/tokens-step1.py", line 39, in
hanleOnePage(pdfreader, pIndex, "3.txt")
File "/home/learn/IR/irBooks/tokens-step1.py", line 12, in hanleOnePage
text=pageobj.extractText()
File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1865, in extractText
return self.extract_text()
File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1818, in extract_text
return self._extract_text(
File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1323, in _extract_text
cmaps[f] = build_char_map(f, space_width, obj)
File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 27, in build_char_map
map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 193, in parse_to_unicode
cm = prepare_cm(ft)
File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 210, in prepare_cm
.replace(b"beginbfchar", b"\nbeginbfchar\n")
TypeError: replace() argument 1 must be str, not bytes
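The last frame of the traceback can be reproduced without PyPDF2: calling `replace` with bytes arguments on a `str` object raises the same kind of error. The following is a minimal sketch; the sample string and the helper name `try_bytes_replace` are made up for illustration and are not taken from PyPDF2.

```python
# Hypothetical stand-in for CMap data that arrived as str instead of bytes.
cmap_data = "1 beginbfchar <0003> <0020> endbfchar"


def try_bytes_replace(data):
    """Return the TypeError message from a bytes-style replace, or None on success."""
    try:
        data.replace(b"beginbfchar", b"\nbeginbfchar\n")
    except TypeError as exc:
        return str(exc)
    return None


# A str input fails because its replace() expects str arguments.
print(try_bytes_replace(cmap_data))
# A bytes input accepts the same call without error.
print(try_bytes_replace(cmap_data.encode()))
```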

Zeping-Jian · 2022-10-05T04:54:12Z

I checked, and the problem is with this particular PDF file. When I switched to a different one, it ran fine.
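For anyone who hits this on an affected version, one possible workaround is to catch the error per page so that the remaining pages are still extracted. This is only a sketch under assumptions: the function name `extract_text_skipping_errors` and the stand-in page classes are hypothetical, and the only thing assumed about a page is that it has an `extract_text()` method, as the objects in `PyPDF2.PdfReader(path).pages` do.

```python
def extract_text_skipping_errors(pages):
    """Extract text page by page, recording pages that raise TypeError.

    `pages` is any iterable of objects with an extract_text() method,
    for example PyPDF2.PdfReader(path).pages.
    """
    texts, failed = [], []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except TypeError:
            # Affected page: remember its index and keep going.
            failed.append(index)
    return texts, failed


# Stand-in pages so the sketch runs without a PDF; both classes are hypothetical.
class GoodPage:
    def extract_text(self):
        return "some text"


class BrokenPage:
    def extract_text(self):
        raise TypeError("replace() argument 1 must be str, not bytes")


texts, failed = extract_text_skipping_errors([GoodPage(), BrokenPage(), GoodPage()])
print(len(texts), failed)
```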

pubpub-zz · 2022-10-05T18:58:46Z

This is unusual. I've added code to fix that.

pubpub-zz · 2022-10-05T19:54:02Z

PS: In the extraction result, the Arabic characters are replaced with /afiinnnn. This is because the data uses the ISO 10036 standard, and I have not been able to find any free information on how to transcode it.
File: 02voc.pdf
Test code:

import PyPDF2
PyPDF2.PdfReader("e:/02voc.pdf").pages[2].extract_text()

Zeping-Jian · 2022-10-07T08:47:29Z

Thank you for analyzing this issue; it will help other developers who run into it. I am new to PDF and will wait for it to be resolved.
Is this issue being resolved, and do I need to update my code to work around it?

Fixes #1379

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Oct 5, 2022

ROB: cope with str returned from get_data in cmap

f93f087

Fixes py-pdf#1379. I have not been able to identify why str is returned instead of bytes as usual, so I prefer to convert locally.
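The "convert locally" idea from the commit message can be sketched roughly as follows. This is not the code from the commit: the helper names `ensure_bytes` and `split_bfchar_markers` and the choice of UTF-8 as the encoding are assumptions made for illustration.

```python
def ensure_bytes(data, encoding="utf-8"):
    """Return `data` as bytes, encoding it first if it arrived as str.

    The UTF-8 default is an assumption, not something stated in the issue.
    """
    if isinstance(data, str):
        return data.encode(encoding)
    return bytes(data)


def split_bfchar_markers(data):
    """Put the bfchar markers on their own lines, whatever type `data` has."""
    cm = ensure_bytes(data)
    return (
        cm.replace(b"beginbfchar", b"\nbeginbfchar\n")
        .replace(b"endbfchar", b"\nendbfchar\n")
    )


# Both input types now go through the same bytes-based replace calls.
print(split_bfchar_markers("1 beginbfchar <0003> <0020> endbfchar"))
print(split_bfchar_markers(b"1 beginbfchar <0003> <0020> endbfchar"))
```

Normalizing at the point of use keeps the rest of the parsing on a single type, which matches the intent of the commit message even though the cause of the `str` return value was not identified.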

pubpub-zz mentioned this issue Oct 5, 2022

ROB: Cope with str returned from get_data in cmap #1380

Merged

pubpub-zz mentioned this issue Oct 5, 2022

can not decode afii characters (ISO 10036) #1381

Open

MartinThoma closed this as completed in #1380 Oct 8, 2022

MartinThoma pushed a commit that referenced this issue Oct 8, 2022

ROB: Cope with str returned from get_data in cmap (#1380)

92c894d

Fixes #1379
