inline image parsing fails when stream data contains "EI\n" #1008
dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue on Jul 10, 2024
I can reproduce this with:
pietermarsman added the "type: bug", "component: interpreter" (Related to PDFInterpreter), and "status: accepted" labels on Jul 15, 2024
Yes - as you note in the PR, it is when /A85 is the first filter in the list, for obvious reasons (it is very unlikely that it wouldn't be the first filter, since that makes no sense, but you never know with PDFs...)
When an inline image uses the ASCII85Decode filter, its data can (and frequently does) contain the sequence "EI" at the end of a line. This confuses the PDF parser and can lead to a variety of weird symptoms, because the parser takes "EI" followed by whitespace as the end of the image and then attempts to parse the rest of the image data as the rest of the containing content stream, which it most definitely is not. I just happened to notice this because the attached PDF has this problem (the symptom in particular is that the parser comes across the sequence "\611", which is a very invalid octal escape): 222-2008-zonage-annexe-c-carte-25b-innond.pdf
The PDF spec is not tremendously helpful, but once you realize that an ASCII85Decode stream must end with "~>", it's obvious that we should be looking for that, and not "EI" followed by whitespace, in the case where this encoding is used. This should be as simple as checking whether /A85 is in the image dictionary and then passing "~>" instead of "EI" to PDFContentParser.get_inline_data. I'll make a PR.
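
To make the idea concrete, here is a minimal standalone sketch of the proposed check. It does not use pdfminer.six and is not the code from the PR: split_inline_image, A85_NAMES, and EI_MARKER are hypothetical names chosen for illustration. It assumes the buffer starts right after the "ID" operator and that the filter names have already been read from the image dictionary.

```python
from __future__ import annotations

import re

# Hypothetical names for illustration only; not pdfminer.six's API.
A85_NAMES = {"A85", "ASCII85Decode"}
# "EI" followed by PDF whitespace (or the end of the buffer).
EI_MARKER = re.compile(rb"EI(?=[\x00\t\n\x0c\r ]|$)")


def split_inline_image(buf: bytes, filters: list[str]) -> tuple[bytes, bytes]:
    """Split the bytes following "ID" into (image data, rest of content stream)."""
    start = 0
    if filters and filters[0] in A85_NAMES:
        # ASCII85 data always ends with "~>", so skip past it before
        # looking for the EI operator.
        eod = buf.find(b"~>")
        if eod == -1:
            raise ValueError("ASCII85 data has no ~> end-of-data marker")
        start = eod + 2
    match = EI_MARKER.search(buf, start)
    if match is None:
        raise ValueError("no EI operator found")
    # For ASCII85, the data ends at "~>"; otherwise it ends where "EI" begins.
    data = buf[:start] if start else buf[: match.start()]
    return data, buf[match.end():]


# Made-up ASCII85-looking data with "EI" at the end of its first line.
stream = b'Gb"0EI\nabc~>\nEI\nQ\n'

# Looking only for "EI" + whitespace cuts the image short.
naive_data, naive_rest = split_inline_image(stream, [])
# Knowing the first filter is /A85 finds the real end of the data.
a85_data, a85_rest = split_inline_image(stream, ["A85"])
```

With the filter list empty, the split happens at the first "EI" inside the data, and the remainder of the image is handed back as if it were content stream operators, which is the failure described above. With "A85" as the first filter, the data runs through "~>" and only the real "EI" is consumed.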