object of type 'PSKeyword' has no len() #617

s-oliver · 2021-05-07T20:58:13Z

Bug report

I'm using paperless-ng, a document archiving system, which uses pdfminer under the hood to extract information from the PDFs added to it. I have already filed a bug with the author (=> jonaswinkler/paperless-ng#981), but since this is a pdfminer issue, he sent me here.

This is the exception and stack trace:

21:17:32 [Q] ERROR Failed [rechnung_2.pdf] - object of type 'PSKeyword' has no len() : Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
    res = f(*task["args"], **task["kwargs"])
  File "/usr/src/paperless/src/documents/tasks.py", line 81, in consume_file
    task_id=task_id
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 212, in parse
    text_original = self.extract_text(None, document_path)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 120, in extract_text
    stripped = post_process_text(pdfminer_extract_text(pdf_file))
  File "/usr/local/lib/python3.7/site-packages/pdfminer/high_level.py", line 121, in extract_text
    interpreter.process_page(page)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 933, in execute
    func(*args)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 803, in do_TJ
    self.graphicstate.copy())
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 83, in render_string
    graphicstate)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 96, in render_string_horizontal
    for cid in font.decode(obj):
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdffont.py", line 776, in decode
    return self.cmap.decode(bytes)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/cmapdb.py", line 115, in decode
    n = len(code)//2
TypeError: object of type 'PSKeyword' has no len()

The version of pdfminer.six used in paperless is 20201018.

Unfortunately, the PDFs contain sensitive information, so I'm not comfortable sharing an example publicly. I hope the stack trace is enough to locate the issue and handle the exception gracefully.
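
Until the parser itself is fixed, a caller can contain the failure on its side. The following is a minimal sketch under stated assumptions, not code from paperless-ng or pdfminer.six: `extract_text_or_none` and `failing_extract` are hypothetical names, and the extractor is passed in as a plain callable so the example runs without pdfminer installed.

```python
import logging

logger = logging.getLogger(__name__)


def extract_text_or_none(extract, pdf_path):
    """Call extract(pdf_path); return None on the PSKeyword TypeError."""
    try:
        return extract(pdf_path)
    except TypeError as exc:
        # Only swallow the specific failure from this report; re-raise others.
        if "PSKeyword" not in str(exc):
            raise
        logger.warning("pdfminer could not decode %s: %s", pdf_path, exc)
        return None


def failing_extract(path):
    # Stand-in that raises the same message as the traceback above.
    raise TypeError("object of type 'PSKeyword' has no len()")


result = extract_text_or_none(failing_extract, "rechnung_2.pdf")
```

In real use, the callable would be whatever extraction function the application already calls; the caller then decides what to do when `None` comes back (for example, fall back to OCR).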

cchristiansen · 2022-05-24T11:21:13Z

I have the same issue for some PDFs (which also unfortunately contain sensitive information).
The version of pdfminer.six I'm using is 20220506.

  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/high_level.py", line 200, in extract_pages
    interpreter.process_page(page)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 1036, in execute
    func(*args)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfinterp.py", line 896, in do_TJ
    self.device.render_string(
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 133, in render_string
    textstate.linematrix = self.render_string_horizontal(
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdfdevice.py", line 170, in render_string_horizontal
    for cid in font.decode(obj):
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/pdffont.py", line 1174, in decode
    return self.cmap.decode(bytes)
  File "~/.virtualenvs/foo/lib/python3.10/site-packages/pdfminer/cmapdb.py", line 134, in decode
    n = len(code) // 2
TypeError: object of type 'PSKeyword' has no len()

For these PDFs, the issue appears to be that some of the obj items in seq (of type PDFTextSeq, passed to PDFTextDevice.render_string_horizontal()) are not bytes but PSKeyword. In every case, the problematic obj is a PSKeyword with the name b'\x00'.

A quick and dirty hack that fixes the issue for me is to insert the following two lines into pdfdevice.py at line 170:

                if isinstance(obj, PSKeyword):
                    obj = obj.name

This solution is evidently not ideal, as it does not answer why there is a PSKeyword within the PDFTextSeq in the first place. However, if the maintainers believe this is a suitable solution, I am happy to file a pull request.
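
The guard described above can be tried in isolation. This is a minimal sketch, not pdfminer.six code: `PSKeyword` here is a stand-in class that models only the `name` attribute, and `coerce_text_items` is a hypothetical helper that applies the same two-line check to every item of a TJ sequence.

```python
class PSKeyword:
    """Stand-in for pdfminer's PSKeyword; only the .name attribute is modeled."""

    def __init__(self, name: bytes) -> None:
        self.name = name


def coerce_text_items(seq):
    """Return a copy of seq with each PSKeyword replaced by its raw name bytes."""
    out = []
    for obj in seq:
        if isinstance(obj, PSKeyword):
            obj = obj.name
        out.append(obj)
    return out


# A TJ-style sequence: byte strings, a numeric adjustment, and the stray keyword.
seq = [b"Hello", -120, PSKeyword(b"\x00"), b"World"]
cleaned = coerce_text_items(seq)
```

After the coercion, every non-numeric item is a byte string, so a later `len(code)` no longer fails. The stray `b"\x00"` is still decoded as if it were text, which is why this remains a workaround rather than a fix.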

I will endeavour to find out why b'\x00' is of type PSKeyword and not bytes, and attempt to create a PDF which I can share which reproduces this bug. If anybody has any further information or help they can offer, it would be greatly appreciated.

cchristiansen · 2022-06-08T01:09:43Z

A seemingly more proper fix than the hack above is to add the case c == b"\x00" to the branch of PSBaseParser._parse_main (psparser.py) that handles numbers.

Namely, at line 309 of psparser.py,

-        elif c in b"-+" or c.isdigit():
+        elif c in b"-+" or c.isdigit() or c == b"\x00":

I am not familiar enough with the PDF specification to explain why this edge case is required. However, this fixes the issue for me, and seems more proper than the previous fix proposed above.

cchristiansen · 2022-06-08T05:49:05Z

The issue appears to be a \0 that sits outside parentheses () but inside square brackets [] in a stream before TJ. Attached is a mock-up PDF (slightly broken) that triggers the exception described.
example-decoded.pdf
Remove the errant \0 and the issue is resolved.
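
To illustrate why a stray \0 inside the TJ array turns into a keyword, here is a toy tokenizer. It is a sketch only and not PSBaseParser: `tokenize` and its `nul_is_whitespace` flag are hypothetical, and strings are simplified (no escapes or nesting). Treating NUL as ignorable follows the PDF specification, which lists the null character (0x00) among its white-space characters.

```python
import re


def tokenize(data: bytes, nul_is_whitespace: bool):
    """Split a simplified content-stream fragment into (kind, value) tokens."""
    ws = rb"\x00\s" if nul_is_whitespace else rb"\s"
    pattern = re.compile(
        rb"\((?P<string>[^)]*)\)"      # (literal string), no escapes
        rb"|(?P<number>[-+]?\d+)"      # integer operand
        rb"|(?P<delim>[\[\]])"         # array delimiters
        rb"|(?P<keyword>[^" + ws + rb"()\[\]]+)"  # anything else
    )
    tokens = []
    for m in pattern.finditer(data):
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "number":
            value = int(value.decode("ascii"))
        tokens.append((kind, value))
    return tokens


data = b"[(Hello)-120\x00(World)] TJ"
before = tokenize(data, nul_is_whitespace=False)
after = tokenize(data, nul_is_whitespace=True)
```

With `nul_is_whitespace=False`, the array contains a `("keyword", b"\x00")` token between the number and the second string, which mirrors the PSKeyword seen in the traceback. With `nul_is_whitespace=True`, the null byte is skipped and only strings and numbers remain inside the brackets.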

* Ignore null characters in PSBaseParser

  Beforehand, null characters were encoded as PSKeyword tokens. This caused issue #617, as pdfdevice.py would attempt to decode the null character PSKeyword, when it expects a byte string, as opposed to a PSKeyword, causing pdfminer.six to crash. As null characters are superfluous within PSBaseParser, ignore them.

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <[email protected]>

pietermarsman added the "type: bug" and "component:characters" (Anything with encodings, character mappings or CJK languages) labels May 24, 2022

cchristiansen mentioned this issue Jun 9, 2022: Ignore null characters in PSBaseParser #768 (Merged)

Irina-Pavlova mentioned this issue Nov 2, 2024: Object of type 'PSKeyword' has no len() via cmapdb.py #1059 (Open)
