object of type 'PSKeyword' has no len() #617

Open
s-oliver opened this issue May 7, 2021 · 3 comments
Labels: component:characters (anything with encodings, character mappings or CJK languages), type: bug

Comments

s-oliver commented May 7, 2021

Bug report

I'm using paperless-ng, a document archiving system, which uses pdfminer under the hood to extract information from the PDFs added to it. I have already filed a bug with its author (jonaswinkler/paperless-ng#981), but since this is a pdfminer issue, he sent me here.

This is the exception and stack trace:

21:17:32 [Q] ERROR Failed [rechnung_2.pdf] - object of type 'PSKeyword' has no len() : Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 81, in consume_file
    task_id=task_id
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 212, in parse
    text_original = self.extract_text(None, document_path)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 120, in extract_text
    stripped = post_process_text(pdfminer_extract_text(pdf_file))
  File "/usr/local/lib/python3.7/site-packages/pdfminer/high_level.py", line 121, in extract_text
    interpreter.process_page(page)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 933, in execute
    func(*args)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 803, in do_TJ
    self.graphicstate.copy())
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 83, in render_string
    graphicstate)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 96, in render_string_horizontal
    for cid in font.decode(obj):
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdffont.py", line 776, in decode
    return self.cmap.decode(bytes)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/cmapdb.py", line 115, in decode
    n = len(code)//2
TypeError: object of type 'PSKeyword' has no len()

The version of pdfminer.six used in paperless is 20201018.

Unfortunately, the PDFs contain sensitive information, so I'm not comfortable sharing an example publicly. I hope the stack trace is enough to locate the issue and handle the exception gracefully.

@cchristiansen

I have the same issue for some PDFs (which also unfortunately contain sensitive information).
The version of pdfminer.six I'm using is 20220506.

  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/high_level.py", line 200, in extract_pages
    interpreter.process_page(page)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 896, in do_TJ
    self.device.render_string(
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 133, in render_string
    textstate.linematrix = self.render_string_horizontal(
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 170, in render_string_horizontal
    for cid in font.decode(obj):
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdffont.py", line 1174, in decode
    return self.cmap.decode(bytes)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/cmapdb.py", line 134, in decode
    n = len(code) // 2
TypeError: object of type 'PSKeyword' has no len()

For these PDFs, the issue appears to be that some of the obj in seq (of type PDFTextSeq, passed to PDFTextDevice.render_string_horizontal()) are not bytes but PSKeyword. In every case, the problematic obj is a PSKeyword with the name b'\x00'.

A quick and dirty hack that fixes the issue for me is to insert the following two lines into pdfdevice.py at line 170:

                if isinstance(obj, PSKeyword):
                    obj = obj.name

This solution is evidently not ideal, as it does not answer why there is a PSKeyword within the PDFTextSeq in the first place. However, if the maintainers believe this is a suitable solution, I am happy to file a pull request.
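For readers without an affected PDF, here is a minimal sketch of the failure and of the workaround. It does not import pdfminer.six: the `PSKeyword` class, `decode` and `decode_all` below are simplified stand-ins written for illustration (assumptions, not the library's code), modelled only on the `len(code) // 2` line in the traceback and the two-line hack above.

```python
from typing import Iterable, List, Union


class PSKeyword:
    """Hypothetical stand-in for pdfminer's PSKeyword: holds a name, has no __len__."""

    def __init__(self, name: bytes) -> None:
        self.name = name


def decode(code: bytes) -> int:
    """Stand-in for the failing line in cmapdb.py; it calls len(), so it needs bytes."""
    return len(code) // 2


def decode_all(seq: Iterable[Union[bytes, PSKeyword]]) -> List[int]:
    """Decode every item, unwrapping keyword objects first (the workaround)."""
    results = []
    for obj in seq:
        if isinstance(obj, PSKeyword):  # same check as the two-line hack
            obj = obj.name
        results.append(decode(obj))
    return results


if __name__ == "__main__":
    try:
        decode(PSKeyword(b"\x00"))
    except TypeError as exc:
        print("without the workaround:", exc)
    print("with the workaround:", decode_all([b"\x00A", PSKeyword(b"\x00")]))
```

Unwrapping the name only hides the symptom; the stray keyword is still produced by the parser.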

I will endeavour to find out why b'\x00' is of type PSKeyword and not bytes, and attempt to create a PDF which I can share which reproduces this bug. If anybody has any further information or help they can offer, it would be greatly appreciated.

pietermarsman added the type: bug and component:characters labels May 24, 2022
@cchristiansen

A seemingly more proper fix than the hack above is to handle the case c == b"\x00" alongside the parse_number branch in PSBaseParser._parse_main (psparser.py).

Namely, at line 309 of psparser.py,

-        elif c in b"-+" or c.isdigit():
+        elif c in b"-+" or c.isdigit() or c == b"\x00":

I am not familiar enough with the PDF specification to explain why this edge case is required. However, this fixes the issue for me, and seems more proper than the previous fix proposed above.

@cchristiansen

A \0 outside of parentheses () but within square brackets [] in a stream before TJ appears to be the issue. Attached is a mock-up PDF (which is slightly broken) that triggers the exception described.
example-decoded.pdf
Remove the errant \0 and the issue is resolved.
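To check whether a file is affected, something like the sketch below could locate null bytes that sit outside literal strings. `find_stray_nulls` is a hypothetical helper, not part of pdfminer.six. It assumes the content stream has already been decoded (decompressed) into bytes, and it deliberately ignores hex strings `<...>` and `%` comments.

```python
from typing import List


def find_stray_nulls(stream: bytes) -> List[int]:
    """Return offsets of 0x00 bytes that lie outside (...) literal strings."""
    offsets = []
    depth = 0  # nesting depth of literal-string parentheses
    i = 0
    while i < len(stream):
        byte = stream[i]
        if depth > 0 and byte == 0x5C:  # backslash: skip the escaped byte
            i += 2
            continue
        if byte == 0x28:  # "(" opens (or nests) a literal string
            depth += 1
        elif byte == 0x29 and depth > 0:  # ")" closes one nesting level
            depth -= 1
        elif byte == 0x00 and depth == 0:  # null byte outside any string
            offsets.append(i)
        i += 1
    return offsets


if __name__ == "__main__":
    sample = b"[(A) \x00 (B\x00)] TJ"
    print(find_stray_nulls(sample))
```

In the sample, only the null between the two strings is reported; the one inside `(B\x00)` is string data and is left alone.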

pietermarsman added a commit that referenced this issue Jun 26, 2022
* Ignore null characters in PSBaseParser

Beforehand, null characters were encoded as PSKeyword tokens. This caused
issue #617, as pdfdevice.py would attempt to decode the null character
PSKeyword, when it expects a byte string, as opposed to a PSKeyword, causing
pdfminer.six to crash.

As null characters are superfluous within PSBaseParser, ignore them.

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <[email protected]>
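
The idea in the commit message can be illustrated with a toy tokenizer. This is not pdfminer.six's PSBaseParser: the regular expressions and the `tokenize` function are assumptions made for this sketch, and nested parentheses inside strings are not supported. It only contrasts treating a null byte as a keyword-like token with skipping it as a separator.

```python
import re
from typing import List

STRING = rb"\((?:\\.|[^\\()])*\)"  # literal string, no nested parentheses
BRACKET = rb"[\[\]]"
NUMBER = rb"[-+]?\d+(?:\.\d+)?"
KEYWORD_KEEP_NULL = rb"[^ \t\n\r\f\[\]()]+"  # a null byte counts as keyword text
KEYWORD_SKIP_NULL = rb"[^ \t\n\r\f\x00\[\]()]+"  # a null byte acts as a separator


def tokenize(data: bytes, ignore_nulls: bool) -> List[bytes]:
    """Split a content-stream fragment into raw tokens."""
    keyword = KEYWORD_SKIP_NULL if ignore_nulls else KEYWORD_KEEP_NULL
    pattern = re.compile(b"|".join([STRING, BRACKET, NUMBER, keyword]), re.DOTALL)
    return pattern.findall(data)


if __name__ == "__main__":
    fragment = b"[(A) \x00 -120 (B)] TJ"
    print(tokenize(fragment, ignore_nulls=False))
    print(tokenize(fragment, ignore_nulls=True))
```

With `ignore_nulls=False` the array contains an extra `b"\x00"` token, which is the kind of non-string item that later reaches the decode step. With `ignore_nulls=True` the array holds only strings and numbers.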