-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add decrypt support for V5 and AES-128, AES-256 (R5 only) #749
Conversation
local tox test is OK with py38. |
CI (Github Actions) is defined here. It uses requirements/ci.txt - and manually entered packages for Python 2.7 |
As this PR is rather big, I would add it in the 2.0 release or later. Simply to ensure we can leave the 1.X parts soon-ish. Then you also don't have to care about the 2.7 support as it will be dropped with PyPDF2 version 2.0 |
NOTICE: 1. Python3 only 2. need PyCryptodome for AES
encryption R=6 is used by PDF 2.0. |
|
||
|
||
try: | ||
from Crypto.Cipher import ARC4, AES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation to have it include both attempt to use https://github.com/Legrandin/pycryptodome and also a fallback to an embedded method? Feels like should either add dependencies or have decryption be included, but having both seems like added maintenance for little gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at first, I used PyCryptodome, then I found PyPDF2 has no dependencies at all, so I try not to introduce a dependency by adding AES code in PyPDF2.
encryption/decryption is not always needed for PDF, it's better to make PyCryptodome optional for users who do not need it.
for some other users, they may prefer PyCryptodome for better performance.
so I provide both.
maybe there should be a discussion for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For PDF encryption / decryption in the wild, it's really just using AES out in the wild, right? So it's not like if to have in-house cryptography around this, the code for that wouldn't be too large, and should be relatively stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PDF specification also defined Public Key algorithms for PDF encryption, but I never see such PDF files.
most encrypted PDF files use only RC4 and AES, and some hash functions.
AES code can be very stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it, I really would like to avoid having to maintain security-critical parts. I worry more about the encryption part than the decryption part.
If we mess up encryption, users might end up less secure then they expect. If we mess up decryption, users will just see an error.
Maybe we can have "inline imports" (importing Crypto within the encrypt / decrypt function). So we could make PyCryptodome an optional dependency that people would need to install if they want to use encryption / decryption. Maybe we could also get rid of a part of the current codebase this way?
To clarify: I don't see that I will properly maintain the crypto parts. This is the reason why I have this tendency. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline imports is a good choice, actually it's what I did at first.
when I added AES code, I try to make the lib independent, but ignored the burden of maintaining.
I agree with that it's better to leave the AES to PyCryptodome when users need it.
I will remove _aes.py later.
Very nice, thank you 🤗 |
One adjustment is still necessary: In
So people can install it with Is there a minimum version of PyCryptodome this PR needs? |
good idea!
no |
Nice! From my perspective this looks ready to be merged. However, due to the missing 2.7 support, it will take a while (until the PyPDF2 2.0 release). I hope I can make that release on 1st of July. I really want to get this done: #752 If I don't get any comments on what should be part of PyPDF2 1.x?, I will start working on 2.0 on 1st of May. |
I think I'll add a "no-python27" pytest marker that CI uses ... on the other hand, if nobody reacts to #753 I will start working on the PyPDF2 2.0 release from 1st of May 😄 |
I still want to go through #817 and merge
Plus maybe add a contributors.md and maybe add some magic methods / camel_case method names + deprecation warnings for snakeCase method names: #751 |
it seems OK for me, no change is needed. |
except DependencyError as e: | ||
# make dependency error clear to users | ||
raise e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that part necessary? What does it do? It seems to me that you just catch an exception and raise exactly the same exception again.... hence removing this block would not change the behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if your PDF need AES algorithm to decrypt, but you don't install pycrytodome, without this two line,you will just got "the pdf is not decrypted", with this, you will know that you just need install the dependcy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, interesting! Took me a minute to understand - thank you for pointing out that I need to look at the following two lines!
The highlight of this release is improved support for file encryption (AES-128 and AES-256, R5 only). See #749 for the amazing work of @exiledkingcc 🎊 Thank you 🤗 Deprecations (DEP): - Rename names to be PEP8-compliant (#967) - `PdfWriter.get_page`: the pageNumber parameter is renamed to page_number - `PyPDF2.filters`: * For all classes, a parameter rename: decodeParms ➔ decode_parms * decodeStreamData ➔ decode_stream_data - `PyPDF2.xmp`: * XmpInformation.rdfRoot ➔ XmpInformation.rdf_root * XmpInformation.xmp_createDate ➔ XmpInformation.xmp_create_date * XmpInformation.xmp_creatorTool ➔ XmpInformation.xmp_creator_tool * XmpInformation.xmp_metadataDate ➔ XmpInformation.xmp_metadata_date * XmpInformation.xmp_modifyDate ➔ XmpInformation.xmp_modify_date * XmpInformation.xmpMetadata ➔ XmpInformation.xmp_metadata * XmpInformation.xmpmm_documentId ➔ XmpInformation.xmpmm_document_id * XmpInformation.xmpmm_instanceId ➔ XmpInformation.xmpmm_instance_id - `PyPDF2.generic`: * readHexStringFromStream ➔ read_hex_string_from_stream * initializeFromDictionary ➔ initialize_from_dictionary * createStringObject ➔ create_string_object * TreeObject.hasChildren ➔ TreeObject.has_children * TreeObject.emptyTree ➔ TreeObject.empty_tree New Features (ENH): - Add decrypt support for V5 and AES-128, AES-256 (R5 only) (#749) Robustness (ROB): - Fix corrupted (wrongly) linear PDF (#1008) Maintenance (MAINT): - Move PDF_Samples folder into ressources - Fix typos (#1007) Testing (TST): - Improve encryption/decryption test (#1009) - Add merger test cases with real PDFs (#1006) - Add mutmut config Code Style (STY): - Put pure data mappings in separate files (#1005) - Make encryption module private, apply pre-commit (#1010) Full Changelog: 2.2.1...2.3.0
Is there any encrypted PDF file available that I could use for testing this PR? I mean some file that I couldn't decrypt before and now it's possible. |
@xilopaint, test samples are included in this PR for autotest |
@xilopaint You can also easily create them with qpdf:
and then run from PyPDF2 import PdfReader
reader = PdfReader("crazyones-256.pdf", password="foo")
print(reader.pages[0].extract_text()) If you checkout
|
@MartinThoma the v2.3.0 is broken for me. Also, for using the new enhanced decryption I had to install |
@xilopaint Thank you for pointing it out - I just fixed it and released
|
Yeah, I had tried What's the reason for |
Yes exactly! Quite a lot of users don't need any encryption / decryption capabilities. Additionally, pycryptodome is not a pure-python dependency. That means if it was a non-optional dependency, it might make it for some cases way harder / impossible to install / use |
@MartinThoma my tests are failing in GitHub Actions since I've installed https://github.com/xilopaint/alfred-pdf-tools/runs/6956399382?check_suite_focus=true Although they're passing locally with no errors. Could you help me out? |
I just had a quick glance, but it seems like a Linux kernel module is missing. That is exactly the kind of reason why I wanted this dependency to be optional. If you don't install the crypto part, the pure-python implementation is used. That still offers encryption / decryption, but only older algorithms |
Hey @MartinThoma, what does R5 mean in the context of cryptography? |
Thanks! |
rewrite the encryption part to support V4 and AES-128 encryption (ONLY decrypt for now).
i like the idea of PyPDF2 cleanup, so this is Python3 ONLY.
this commit needs PyCryptodome for AES operations.
the encrypt part will be added some time later, and maybe AES-256 support will be added too, if it's not difficult.