Decrypting pdf owner password fails - treats /O value as unicode #557

Closed
brianm78 opened this issue Jun 1, 2020 · 3 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF PdfReader The PdfReader component is affected workflow-encryption From a users perspective, encryption is the affected feature/workflow


brianm78 commented Jun 1, 2020

This came up from a question on learnpython and looks to be due to a bug in PyPDF2 on Python 3. When decrypting with an owner password, RC4_encrypt is passed the value of encrypt["/O"] as a TextStringObject, so it treats the value as unicode text rather than raw bytes. It can then end up handling multibyte characters as if they were single bytes, resulting in a UnicodeEncodeError.

User passwords seem to work fine, presumably because that path uses the .original_bytes value of the object. Changing the code to use this instead (i.e. real_O = encrypt["/O"].getObject().original_bytes) resolves it for the owner case too.

Steps to reproduce:

Created a rev3 encrypted PDF with:

qpdf --encrypt "" "ownerpw" 128 -- input.pdf output.pdf

Then tried to decrypt with:

import PyPDF2

with open("output.pdf", "rb") as f:  # the file written by the qpdf command above
    p = PyPDF2.PdfFileReader(f)
    p.decrypt("any_password")

Results in:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0193' in position 0: ordinal not in range(256)

(Edit): Actually, looking further, it seems this bug is only triggered when the "/O" value is incorrectly interpreted as unicode, which depends on the password being used. This seems to be down to createStringObject trying to interpret the value as text whenever it possibly can. If the password just happens to generate a block with no sub-24 control codes (which for a 32-byte string will happen around 4% of the time), it is interpreted as unicode and a TextStringObject is returned instead of a ByteStringObject.
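
The "around 4%" figure can be checked with a short calculation. This is only a sketch under two assumptions: that the bytes of the /O entry are uniformly random and independent, and that exactly the 24 byte values below 24 prevent the value from being read as text, as described above.

```python
# Rough estimate of how often a 32-byte /O value contains no byte below 24.
# Assumptions (not taken from the PyPDF2 source): bytes are uniformly random,
# and exactly the 24 values 0..23 stop the value from being sniffed as text.
UNDECODABLE_VALUES = 24
O_ENTRY_LENGTH = 32

p_single_byte_ok = 1 - UNDECODABLE_VALUES / 256
p_sniffed_as_text = p_single_byte_ok ** O_ENTRY_LENGTH

print(f"{p_sniffed_as_text:.3f}")  # prints 0.043
```

So roughly one owner password in 23 would produce an /O value that hits this code path.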

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF workflow-encryption From a users perspective, encryption is the affected feature/workflow PdfReader The PdfReader component is affected labels Apr 7, 2022
@MartinThoma
Member

I tried to reproduce it with the following, but that worked fine. Is the issue gone?

Create PDF

qpdf --encrypt "" "ownerpw" 128 -- crazyones.pdf output.pdf

Code

from PyPDF2 import PdfReader
with open("output.pdf", "rb") as f:
    reader = PdfReader(f)
    reader.decrypt("any_password")

@brianm78
Author

Tried it here, and it does now seem to run without reporting an error.

However, I was a bit suspicious, since on debugging through, the real_O object it gets from the /O entry still seems to be a TextStringObject rather than a bytes object for the same test case.

Debugging down to _security.py::RC4_encrypt, the plaintext parameter here (typed as bytes) is now that TextStringObject. It ends up doing b_(chr(ord(plaintext[x]) ^ t)) for each character, i.e. XORing it with a sequence of bytes derived from the key and then reconstituting the result into a bytes object. However, the process looks like it ends up corrupting the result because of the wrong type.

E.g. in my test case, plaintext ended up here as

  "Z\x0c\x0bUÞÚr=jˆwÊ\x02\x0e\x0bù\x12sÐ'3Ät„p\rž¹¦Ö−\t"

(Note: the type is a TextStringObject, holding a unicode string, rather than a ByteStringObject with bytes data, as is the case when the value is not wrongly detected.)

This has codepoints above 255 (e.g. plaintext[9] is 710). Thus for this character, the b_() function tries to encode the XORed result "ʩ" as latin-1, which obviously fails, then tries again encoding it as utf-8, ending up with a 2-byte sequence which it returns.

That seems like it's definitely going to mangle the decryption: it has only XORed the low byte of a unicode codepoint we've incorrectly content-sniffed from some binary data, then translated that to utf-8. The way it processes things has changed so that it no longer gives an error, but I think it's actually worse now in that it's silently mangling data and reporting success.
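
A minimal sketch of the mangling for that one character. The helper below is a hypothetical stand-in for b_(), written only from the behaviour described above (latin-1 first, utf-8 as a fallback), and the keystream byte 111 is a made-up value chosen so that the result is the "ʩ" mentioned above.

```python
def encode_like_b_(ch: str) -> bytes:
    """Hypothetical stand-in for b_(): try latin-1, fall back to UTF-8."""
    try:
        return ch.encode("latin-1")
    except UnicodeEncodeError:
        return ch.encode("utf-8")


code_point = 710  # plaintext[9] from the report (U+02C6)
key_byte = 111    # made-up keystream byte, for illustration only

xored = chr(code_point ^ key_byte)  # XOR only changes the low byte
encoded = encode_like_b_(xored)

print(hex(ord(xored)))  # prints 0x2a9
print(encoded)          # prints b'\xca\xa9'
print(len(encoded))     # prints 2
```

A correct RC4 step turns one input byte into exactly one output byte, so getting two bytes back for a single position is already enough to show the output no longer lines up with the input.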

That said, I'm not too familiar with owner passwords or how you'd even test that they're correctly decrypted (I think they may just be client-enforced anyway). And from what I can see, PyPDF2 doesn't actually seem to be validating the password, as you can give it the completely wrong password and decrypt will still not raise an error, so it may be a bit of a moot point unless that's actually checked at some point. (By comparison, qpdf will fail with anything except the correct password.)

(For reference, I also tried saving the decrypted file with PdfWriter and comparing against qpdf --decrypt. The file does end up different in size etc. from qpdf's version, but that may not be unexpected given differences in implementation. It does appear valid and readable, and "qpdf --check" reports it's valid and unencrypted. However, this is also true when you give completely the wrong password, so potentially it's just stripping out the encrypted section and everything it does inside decrypt is completely irrelevant for owner passwords anyway.)

In any case, I still think the simplest fix is something like changing _reader.py:1672 from:

real_O = cast(bytes, encrypt["/O"].get_object())

To:

real_O = cast(bytes, encrypt["/O"].get_object().original_bytes)

To cover the cases where encrypt["/O"] randomly gets wrongly autodetected as a unicode string, instead of binary bytes.
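
A rough sketch of why working from the raw bytes matters. Everything here is illustrative: rc4 is a plain textbook RC4, not PyPDF2's RC4_encrypt; sniff_as_text is a hypothetical, heavily simplified stand-in for the content sniffing (it only maps byte 0x1A to U+02C6, as PDFDocEncoding does, and reads everything else as latin-1); the key and the 32-byte value are made up. In the proposed fix, .original_bytes would play the role of stored_o.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for byte in data:  # keystream generation and XOR
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def sniff_as_text(raw: bytes) -> str:
    """Hypothetical simplification: 0x1A becomes U+02C6, the rest is latin-1."""
    return "".join("\u02c6" if b == 0x1A else chr(b) for b in raw)


key = b"illustrative-key"                                      # made-up key
stored_o = bytes([0x5A, 0x1A, 0x72, 0x3D]) + bytes(range(0x40, 0x5C))  # 32 bytes

# Bytes path: one output byte per input byte, and RC4 round-trips.
assert len(rc4(key, stored_o)) == 32
assert rc4(key, rc4(key, stored_o)) == stored_o

# Text path: once sniffed as text, turning it back into bytes changes the data.
via_text = sniff_as_text(stored_o).encode("utf-8")
print(len(stored_o), len(via_text))  # prints 32 33
print(rc4(key, via_text) == rc4(key, stored_o))  # prints False
```

The point is only that the value stops being the same 32 bytes once it has passed through a text representation, so anything computed from it afterwards cannot be trusted.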

But given nothing done here seems to matter to the end result, it may not be too relevant unless that changes.

@MartinThoma MartinThoma changed the title from "Decrypting pdf owner password fails on python 3 - treats /O value as unicode" to "Decrypting pdf owner password fails - treats /O value as unicode" Jul 9, 2022
@exiledkingcc
Contributor

@MartinThoma this was fixed by #749
