inline image parsing fails when stream data contains "EI\n" #1008

Closed
dhdaines opened this issue Jul 10, 2024 · 2 comments · Fixed by #1010

Comments

@dhdaines
Contributor

dhdaines commented Jul 10, 2024

When an inline image uses the ASCII85Decode filter, its data can (and frequently does) contain the sequence "EI" at the end of a line. This confuses the PDF parser, which takes that "EI" as the end of the image and then tries to parse the remaining image data as if it were the rest of the containing content stream, which it most definitely is not. The symptoms vary. I happened to notice this because the attached PDF has the problem; in its case the parser comes across the sequence "\611", which is an invalid octal escape (octal 611 is 393, too large for a byte):

222-2008-zonage-annexe-c-carte-25b-innond.pdf

The PDF spec is not tremendously helpful, but once you realize that an ASCII85Decode stream must end with "~>" it's obvious that we should be looking for that, and not "EI" followed by whitespace, in the case where this encoding is used. This should be as simple as checking if /A85 is in the image dictionary and then passing "~>" instead of "EI" to PDFContentParser.get_inline_data. I'll make a PR.
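
A rough sketch of the difference between the two end markers, for illustration only. The helpers `find_end_naive` and `find_end_a85` are hypothetical and are not pdfminer functions, and the sample bytes are made up (they are not taken from the attached PDF and are not meant to decode to a real image):

```python
import re

# Made-up inline image payload: ASCII85-looking text in which "EI" happens
# to fall at the end of a line, then the real "~>" end-of-data marker,
# then the actual "EI" operator and the rest of the content stream.
data = b"Gat=*EI\n9jqo^~>\nEI\nQ\n"


def find_end_naive(buf: bytes) -> int:
    """Offset of the first "EI" followed by whitespace, or -1."""
    match = re.search(rb"EI\s", buf)
    return match.start() if match else -1


def find_end_a85(buf: bytes) -> int:
    """Offset just past the ASCII85 end-of-data marker "~>", or -1."""
    index = buf.find(b"~>")
    return index + 2 if index >= 0 else -1


naive_end = find_end_naive(data)
a85_end = find_end_a85(data)

print(data[:naive_end])  # stops inside the image data
print(data[:a85_end])    # runs through the "~>" marker
```

With the naive search, everything after the first "EI" would be handed back to the content stream parser as if it were operators and operands, which is where the weird symptoms come from.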

@pietermarsman
Member

I can reproduce this with:

$ python tools/pdf2txt.py 222-2008-zonage-annexe-c-carte-25b-innond.pdf
Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 318, in <module>
    sys.exit(main())
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 312, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/high_level.py", line 133, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1027, in execute
    (_, obj) = parser.nextobject()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 601, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 518, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 467, in _parse_string_1
    self._curtoken += bytes((int(self.oct, 8),))
ValueError: bytes must be in range(0, 256)
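
The final error can be reproduced in isolation. This is only a sketch of the last step in the traceback, assuming the three octal digits collected by the string parser were "611" as mentioned in the issue description; the variable names are illustrative:

```python
# Octal 611 is 6*64 + 1*8 + 1 = 393, which does not fit in one byte.
value = int("611", 8)
print(value)

error_message = None
try:
    bytes((value,))
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```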

@dhdaines
Contributor Author

Yes. As you note in the PR, this applies when /A85 is the first filter in the list, since the first filter is the one whose encoding appears in the raw stream data. It is very unlikely that it wouldn't be the first filter, since that would make no sense, but you never know with PDFs...
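
The "first filter" condition could be sketched roughly as follows. This is not the code from the PR: `first_filter_is_a85` is a hypothetical helper, and filter names are represented as plain strings rather than pdfminer's own name objects. It assumes the filter entry is either a single name or a list of names, and accepts both the abbreviated and the full filter name:

```python
from typing import Optional, Sequence, Union

# Abbreviated and full names of the ASCII85 filter.
A85_NAMES = {"A85", "ASCII85Decode"}


def first_filter_is_a85(filters: Optional[Union[str, Sequence[str]]]) -> bool:
    """True if the raw image data is ASCII85 text, i.e. A85 is applied first."""
    if filters is None:
        return False
    if isinstance(filters, str):
        # A single filter name rather than a list.
        return filters in A85_NAMES
    return len(filters) > 0 and filters[0] in A85_NAMES


print(first_filter_is_a85("A85"))            # single filter
print(first_filter_is_a85(["A85", "Fl"]))    # A85 first, then another filter
print(first_filter_is_a85(["Fl", "A85"]))    # A85 not first: raw data is not ASCII85
```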
