TypeError: replace() argument 1 must be str, not bytes #1379

Closed
Zeping-Jian opened this issue Oct 5, 2022 · 4 comments · Fixed by #1380

Zeping-Jian commented Oct 5, 2022

I just want to convert a PDF file to a text file, but the run failed.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
root@0cc46add0ae3:/home/learn/IR/irBooks# python -m platform
Linux-5.10.124-linuxkit-x86_64-with-glibc2.31

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
root@0cc46add0ae3:/home/learn/IR/irBooks# python -c "import PyPDF2;print(PyPDF2.__version__)"
2.11.0

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2

print(PyPDF2.__version__)

def hanleOnePage(pdfreader, pIdx, outputFile) :
    print("hanle Page:%s" % (pIdx))
    pageobj=pdfreader.getPage(pIdx)

    text=pageobj.extractText()
    # print(text)
    with open(outputFile, "a") as file1:  # close the file after each page
        file1.write(text)

#create file object variable
#opening method will be rb
# pdffileobj=open('01bool.pdf','rb')
pdffileobj=open('02voc.pdf','rb')
 
#create reader variable that will read the pdffileobj
pdfreader=PyPDF2.PdfFileReader(pdffileobj)
 
#This will store the number of pages of this pdf file
x=pdfreader.numPages
print("PDF:numPages:%s" % (x))

#create a variable that will select the selected number of pages
pIndex = 0
while pIndex < x:
    hanleOnePage(pdfreader, pIndex, "all.txt")
    pIndex += 1

pdffileobj.close()

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
02voc.pdf

Traceback

This is the complete Traceback I see:

root@0cc46add0ae3:/home/learn/IR/irBooks# python tokens-step1.py
2.11.0
PDF:numPages:29
hanle Page:0
hanle Page:1
hanle Page:2
Traceback (most recent call last):
  File "/home/learn/IR/irBooks/tokens-step1.py", line 39, in <module>
    hanleOnePage(pdfreader, pIndex, "3.txt")
  File "/home/learn/IR/irBooks/tokens-step1.py", line 12, in hanleOnePage
    text=pageobj.extractText()
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1865, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1818, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1323, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 27, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 193, in parse_to_unicode
    cm = prepare_cm(ft)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 210, in prepare_cm
    .replace(b"beginbfchar", b"\nbeginbfchar\n")
TypeError: replace() argument 1 must be str, not bytes
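
The last frame passes bytes arguments to `.replace()`, so the message suggests the object being searched was a `str` rather than bytes. The following is a small stand-alone reproduction of that mismatch using only the standard library. The sample string and the helper name `try_replace` are made up for illustration and are not taken from PyPDF2.

```python
# Hypothetical sample standing in for character-map data that arrived as str.
cm_as_str = "1 beginbfchar <01> <0041> endbfchar"


def try_replace(data):
    """Apply the bytes-based replace from the traceback's last frame."""
    try:
        return data.replace(b"beginbfchar", b"\nbeginbfchar\n")
    except TypeError as exc:
        # str.replace() rejects bytes arguments; report the error as text.
        return f"TypeError: {exc}"


# Fails when the data is a str, succeeds when the same data is bytes.
str_result = try_replace(cm_as_str)
bytes_result = try_replace(cm_as_str.encode("latin-1"))

print(str_result)
print(bytes_result)
```

The same call works once the data is bytes, which matches the observation that only this particular PDF triggers the failure.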

@Zeping-Jian
Author

I checked, and the problem is specific to this PDF file. When I switched to a different one, it ran fine.

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Oct 5, 2022
fixes py-pdf#1379
I have not been able to identify why a str is returned instead of the usual bytes,
so I prefer to convert locally
@pubpub-zz
Collaborator

This is unusual. I've added code to fix it.
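
As a rough illustration of what "convert locally" could look like, here is a minimal sketch that normalizes the data to bytes before the bytes-based `replace()` runs. It is not the change merged in #1380. The helper names `to_bytes` and `split_bfchar` are hypothetical, and the `latin-1` encoding is an assumption, chosen only because it maps code points 0-255 one-to-one onto byte values.

```python
def to_bytes(data, encoding="latin-1"):
    """Return data as bytes, encoding it first if it is a str.

    The latin-1 default is an assumption made for this sketch.
    """
    if isinstance(data, str):
        return data.encode(encoding)
    return bytes(data)


def split_bfchar(cm_data):
    """Put the 'beginbfchar' keyword on its own line, for str or bytes input."""
    return to_bytes(cm_data).replace(b"beginbfchar", b"\nbeginbfchar\n")


# Both input types now take the same code path and give the same result.
from_str = split_bfchar("2 beginbfchar <01> <0041>")
from_bytes = split_bfchar(b"2 beginbfchar <01> <0041>")
print(from_str == from_bytes)
```

Converting at the point of use keeps the rest of the parsing code bytes-only, at the cost of not explaining why a `str` showed up in the first place.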

pubpub-zz commented Oct 5, 2022

PS: in the extraction result, the Arabic characters are replaced with /afiinnnn. This is because the data uses the ISO 10036 standard, and I have not been able to find any free information on how to transcode it.
File: 02voc.pdf
Test code:

import PyPDF2
PyPDF2.PdfReader("e:/02voc.pdf").pages[2].extract_text()

@Zeping-Jian
Author

Thank you for analyzing this issue; it will help other developers who run into it. I am new to PDF internals, so I will wait for the fix.
Has this issue been resolved, and do I need to update my code to get the fix?
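
Until a release containing the fix is installed, one possible user-side workaround is to catch the error per page so that a single problematic page does not abort the whole conversion. The sketch below assumes only that each page object has an `extract_text()` method, as in the test code above. The function name `extract_pages` and the `FakePage` stand-in class are hypothetical and exist only to make the example runnable without a PDF.

```python
def extract_pages(pages, output_path):
    """Append each page's text to output_path; return the pages that failed."""
    failed = []
    with open(output_path, "a", encoding="utf-8") as out:
        for index, page in enumerate(pages):
            try:
                text = page.extract_text()
            except TypeError as exc:
                # Record the failing page and carry on with the next one.
                failed.append((index, str(exc)))
                continue
            out.write(text)
    return failed


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, text=None):
        self._text = text

    def extract_text(self):
        if self._text is None:
            raise TypeError("replace() argument 1 must be str, not bytes")
        return self._text
```

With PyPDF2, the `pages` argument would be something like `PyPDF2.PdfReader("02voc.pdf").pages`, and the returned list shows which page indexes were skipped.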
