inline image parsing fails when stream data contains "EI\n" #1008

dhdaines · 2024-07-10T18:35:18Z

When an inline image uses the ASCII85Decode filter, its data can (and frequently does) contain the sequence "EI" at the end of a line. This confuses the PDF parser and can lead to a variety of odd symptoms, because the parser then treats the rest of the image data as a continuation of the containing content stream, which it most definitely is not. I happened to notice this because the attached PDF has this problem (the particular symptom is that the parser comes across the sequence "\611", which is not a valid octal escape for a single byte):

222-2008-zonage-annexe-c-carte-25b-innond.pdf

The PDF spec is not tremendously helpful, but once you realize that ASCII85Decode data must end with "~>", it's clear that we should be looking for that marker, and not "EI" followed by whitespace, whenever this encoding is used. This should be as simple as checking whether /A85 is in the image dictionary and then passing "~>" instead of "EI" to PDFContentParser.get_inline_data. I'll make a PR.
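To illustrate the idea, here is a minimal standalone sketch of the two termination rules. It does not use pdfminer.six itself: `split_inline_data` and the sample bytes are hypothetical, made up for this example, and the real change would live in PDFContentParser.get_inline_data.

```python
import re

# Naive rule: inline image data ends at the first "EI" followed by whitespace.
EI_THEN_SPACE = re.compile(rb"EI(?=\s)")


def split_inline_data(buf: bytes, start: int, is_ascii85: bool) -> bytes:
    """Return the inline image data that begins at `start` (hypothetical helper)."""
    if is_ascii85:
        # ASCII85 data carries its own end-of-data marker, so search for that.
        end = buf.find(b"~>", start)
        if end < 0:
            raise ValueError("missing ASCII85 end-of-data marker '~>'")
        return buf[start : end + 2]
    match = EI_THEN_SPACE.search(buf, start)
    if match is None:
        raise ValueError("missing 'EI' operator")
    return buf[start : match.start()]


# Made-up ASCII85-looking data with "EI" at the end of its first line,
# followed by the real "EI" operator and more content stream ("Q").
stream = b"9jqo^EI\nBlbD-~>\nEI\nQ"

naive = split_inline_data(stream, 0, is_ascii85=False)
fixed = split_inline_data(stream, 0, is_ascii85=True)
```

With the naive rule the data is cut off after the first line, and everything that follows would be parsed as content stream operators. Searching for "~>" returns the whole encoded image.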


pietermarsman · 2024-07-15T06:13:46Z

I can reproduce this with:

$ python tools/pdf2txt.py 222-2008-zonage-annexe-c-carte-25b-innond.pdf
Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 318, in <module>
    sys.exit(main())
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 312, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/high_level.py", line 133, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1027, in execute
    (_, obj) = parser.nextobject()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 601, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 518, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 467, in _parse_string_1
    self._curtoken += bytes((int(self.oct, 8),))
ValueError: bytes must be in range(0, 256)
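The final error can be reproduced in isolation. This is only a sketch of the failing expression from the last traceback frame, with the octal digits "611" from the issue description plugged in by hand (the variable names are illustrative, not pdfminer's):

```python
# The three octal digits the parser collected after the backslash.
octal_digits = "611"

# 6*64 + 1*8 + 1 = 393, which does not fit in a single byte.
value = int(octal_digits, 8)

try:
    bytes((value,))
    message = None
except ValueError as exc:
    message = str(exc)
```

Here `value` is 393, so building a one-byte `bytes` object from it raises the same ValueError shown in the traceback.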

dhdaines · 2024-07-15T12:56:21Z

Yes - as you note in the PR, this happens when /A85 is the first filter in the list, because the first filter determines how the raw stream data is encoded. (It is very unlikely that /A85 would not be the first filter, since that makes no sense, but you never know with PDFs...)
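A rough sketch of that check, assuming the filter entry has already been resolved to plain strings (pdfminer.six represents names with its own literal objects, so this is not its actual code, and `starts_with_ascii85` is a hypothetical helper):

```python
# Inline images may use the abbreviated name A85 or the full filter name.
ASCII85_NAMES = ("A85", "ASCII85Decode")


def starts_with_ascii85(filter_value) -> bool:
    """Return True if the first (outermost) filter is ASCII85.

    `filter_value` is either a single filter name or a list of names.
    """
    if isinstance(filter_value, (list, tuple)):
        names = list(filter_value)
    else:
        names = [filter_value]
    return bool(names) and names[0] in ASCII85_NAMES
```

For example, `starts_with_ascii85(["A85", "Fl"])` is true because the raw data is ASCII85 text wrapping a Flate stream, while `starts_with_ascii85(["Fl", "A85"])` is false because the raw data is Flate-compressed bytes.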

dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue Jul 10, 2024

fix: correctly support ASCII85 in inline images (fixes: pdfminer#1008)

a51f033

dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue Jul 10, 2024

fix: correctly support ASCII85 in inline images (fixes: pdfminer#1008)

f39109e

dhdaines mentioned this issue Jul 10, 2024

Do not crash on ASCII85 in inline images and properly support their colorspaces #1010

Merged

pietermarsman added the type: bug, component: interpreter, and status: accepted labels Jul 15, 2024

pietermarsman closed this as completed in #1010 Jul 15, 2024

dhdaines mentioned this issue Nov 25, 2024

Rectangles are misrecognized as curves when they contain a redundant final h operator #1065

Open
