TypeError: replace() argument 1 must be str, not bytes #1379
Comments
I checked, and the problem is with the PDF file. When I switched to a different one, it ran fine.
Fixes py-pdf#1379. I have not been able to identify why str is returned instead of bytes as usual, so I prefer to convert locally.
This is unusual. I've added code to fix that.
PS: in the extraction result, the Arabic characters are replaced with /afiinnnn names. This is because the data uses the ISO 10036 standard, and I have not been able to find any free information on how to transcode it.
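For readers following along, the local conversion described above could look roughly like the sketch below. It is not the code from the actual fix: `to_bytes` and `mark_cmap_sections` are hypothetical names, and the choice of latin-1 is an assumption (it maps code points 0-255 one-to-one onto byte values, and would raise on characters outside that range). Only the `replace` call is taken from the traceback.

```python
def to_bytes(data):
    """Return data as bytes; encode it first if it arrived as str.

    Hypothetical helper, not part of PyPDF2. latin-1 is an assumed
    encoding: it maps code points 0-255 directly onto byte values.
    """
    if isinstance(data, str):
        return data.encode("latin-1")
    return bytes(data)


def mark_cmap_sections(raw):
    """Put the 'beginbfchar' keyword on its own line, as prepare_cm does."""
    cm = to_bytes(raw)  # normalize first, so bytes.replace gets bytes input
    return cm.replace(b"beginbfchar", b"\nbeginbfchar\n")


# Both input types now take the same path.
from_str = mark_cmap_sections("1 beginbfchar <01>")
from_bytes = mark_cmap_sections(b"1 beginbfchar <01>")
```

Normalizing once at the top keeps every later `replace` call unchanged, which is why converting locally is a small patch even though the root cause of the str value is unknown.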
Thank you for analyzing this issue; it will help other developers who run into it. I am new to PDF, so I will wait for it to be resolved.
I just want to convert a PDF file to a text file, but the run failed.
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!
02voc.pdf
Traceback
This is the complete Traceback I see:
root@0cc46add0ae3:/home/learn/IR/irBooks# python tokens-step1.py
2.11.0
PDF:numPages:29
hanle Page:0
hanle Page:1
hanle Page:2
Traceback (most recent call last):
  File "/home/learn/IR/irBooks/tokens-step1.py", line 39, in <module>
    hanleOnePage(pdfreader, pIndex, "3.txt")
  File "/home/learn/IR/irBooks/tokens-step1.py", line 12, in hanleOnePage
    text=pageobj.extractText()
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1865, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1818, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_page.py", line 1323, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 27, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 193, in parse_to_unicode
    cm = prepare_cm(ft)
  File "/usr/local/lib/python3.10/site-packages/PyPDF2/_cmap.py", line 210, in prepare_cm
    .replace(b"beginbfchar", b"\nbeginbfchar\n")
TypeError: replace() argument 1 must be str, not bytes
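The last frame suggests that `replace` was called on a str object while being given bytes arguments. A minimal sketch that reproduces the same kind of error without any PDF is shown here; the variable names and the sample stream content are made up for illustration.

```python
keyword = b"beginbfchar"

# Assumption: the stream content arrived as str rather than bytes.
stream_as_str = "1 beginbfchar <01> <0041> endbfchar"

try:
    # str.replace only accepts str arguments, so bytes arguments fail.
    stream_as_str.replace(keyword, b"\n" + keyword + b"\n")
    message = None
except TypeError as exc:
    message = str(exc)

# The same call succeeds once the data really is bytes.
stream_as_bytes = stream_as_str.encode("latin-1")
fixed = stream_as_bytes.replace(keyword, b"\n" + keyword + b"\n")
```

This matches the earlier observation that the failure depends on the PDF: only a file whose data reaches `prepare_cm` as str takes the failing path.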