-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: consider images inside PDF made with onlyoffice #2637
base: main
Are you sure you want to change the base?
ENH: consider images inside PDF made with onlyoffice #2637
Conversation
I made a PR to add the sample files : Instead of putting the files directly into this repository, should I host the files on github or somewhere else so they get downloaded ? |
In general, files should go into the sample files if you own all copyrights and are okay with the license terms there. Unfortunately, merging permissions are more restricted over there, thus I am not able to merge anything there myself. The usual approach we take apart from this is to upload to an issue or PR and download the corresponding files from the GitHub URLs. Unfortunately, this will not work for image files any more as far as I know as GitHub started to add JWTs to their URLs which expire accordingly and are user-specific. |
Can you rename the title of your PR : this is not a a STYle but an ENHancement. you should be able to "factorize" your code to use the code for image extraction. the current code looking for "/Resources/Xobjects/'Forms'" should be very similar to "/Resources/Patterns/" adjusting x_object You should also confirm your code works with patterns within forms Also, It might be interesting to extract the thumbnail and confirm that no other images are ignored |
I reused images that were already present in the repo so it should be good in term of rights. The issue is that only office convert images to jpeg when included inside a document, so I had to re-upload them once converted to pass content verification in the test unit. Let's wait to see if my PR on the other repo get merged then, thanks. |
I was not sure which one to use, thanks.
You're right, I didn't thought about it, let's do that.
It will probably not in the current state, I'll check what are forms, thank you.
Yes I agree. I'll need to read a bit more the PDF standard, currently I've only read the part about patterns and image objects so I'm lacking a lot about how images object are used/located. Maybe it would be nicer to have an issue for each case where images are missing and have their matching PR ? |
I have tested onlyoffice forms and images inside it are not taken into account. Images inside PDF from libreoffice without and with forms are taken into account, as well as images from word forms. |
@0xNath |
I'll add the code for extracting the images from the form PDF and the corresponding test. I'm going to follow the same logic as for the patterns, but with "/Annots/.../..." as image identifiers iss2613-onlyoffice-form.pdf |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2637 +/- ##
==========================================
- Coverage 95.12% 95.10% -0.03%
==========================================
Files 51 51
Lines 8557 8593 +36
Branches 1707 1725 +18
==========================================
+ Hits 8140 8172 +32
Misses 263 263
- Partials 154 158 +4 ☔ View full report in Codecov by Sentry. |
Added code to detect patterns in "_get_ids_image". To avoid any conflicts with images that could be located directly in a page or images using the same ID in differents patterns, images ids under patterns are returned in this form : "/Pattern/patternNameHere/imageNameHere" Added code to deal with Pattern images in "_get_image".
fixed code style
fixed code style
fcb3b1f
to
d524672
Compare
commit d0493ae Author: Nathanaël Renaud <[email protected]> Date: Fri May 17 23:25:13 2024 +0200 Modified _get_ids_image and _get_image so they work with onlyoffice images commit 53a3781 Author: Nathanaël Renaud <[email protected]> Date: Fri May 17 23:22:27 2024 +0200 Added tests units about images extractions from PDF generated using onlyoffice
@0xNath I don't see a PR in https://github.com/py-pdf/sample-files |
@stefan6419846 I'll give you the permissions in a second :-) |
Hello, |
I am currently trying to understand what the current state of this PR this:
|
Hey
It's done.
I don't think PageObject.images and decode_as_image have the same purpose, and consequent changes would need to be done in the first one to implement the second one in it, with no real advantages in my opinion. decode_as_image is great when an image is not covered by PageObject.images. PageObject.images should be expended to improve cases were images are not detected so that user don't have to rely on decode_as_image.
This is done.
This as already been done in commit 0e81eee if I'm not mistaken.
This has been implemented as well with the test that goes with it.
In my opinion thumbnails should be extracted by another method than PageObject.images, like PageObject.thumbnail.
I had tried other softwares like Word and Libreoffice, and they were fine, now OnlyOffice should be fine too. Maybe there are other special case of images that we still don't cover but that's gonna be it for my PR. If i ever encounter those other case I will probably submit issue and PR if it's at my reach. |
if PG.ANNOTS in obj: | ||
for annot in cast(DictionaryObject, obj[PG.ANNOTS]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would more likey propose
if PG.ANNOTS in obj: | |
for annot in cast(DictionaryObject, obj[PG.ANNOTS]): | |
for annot_idx,annot in enumerate(cast(DictionaryObject, obj.get(PG.ANNOTS, {}))): |
annot_idx should be use to name of the image instead of ["/T"]
Also I would prefer to parse annotation after processing the page original resources in order to keeps "direct" images and inlines before annots'.
|
||
if PG.ANNOTS in obj: | ||
for annot in cast(DictionaryObject, obj[PG.ANNOTS]): | ||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add:
annot = annot.get_object()
and simplify your code below (specially the .keys()
should not be required
I agree here the wallrus operator would ease readability 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
walrus can now be used as python 3.7 is over. if it helps you
DictionaryObject, annot["/AP"]["/N"][PG.RESOURCES][RES.XOBJECT] | ||
) | ||
): | ||
frame = annot["/AP"]["/N"][PG.RESOURCES][RES.XOBJECT]["/FRM"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand the ["\FRM"]
I've produced a test with a stamp and the image is called ["\Icon"]:
tttt.pdf
) | ||
): | ||
frame = annot["/AP"]["/N"][PG.RESOURCES][RES.XOBJECT]["/FRM"] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best would be to call recursively _get_image_ids()
lst.extend( | ||
[ | ||
( | ||
f"{PG.ANNOTS}/{annot['/T']}{o}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment about naming
@@ -112,5 +112,7 @@ | |||
url: https://github.com/py-pdf/pypdf/files/12050253/tt.pdf | |||
- local_filename: Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf | |||
url: https://www.joinville.sc.gov.br/wp-content/uploads/2023/11/Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf | |||
- local_filename: iss2138.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why these lines dissaperared?
@@ -214,6 +214,66 @@ def test_image_extraction(src, page_index, image_key, expected): | |||
assert image_similarity(BytesIO(actual_image.data), expected) >= 0.99 | |||
|
|||
|
|||
@pytest.mark.enable_socket() | |||
def test_onlyoffice_standard_images_extraction(): | |||
reader = PdfReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a reference to the issue numbers would be apreciated to ease later analysis
str(reader.pages[0].images) | ||
== "[Image_0=/Pattern/P1/X1, Image_1=/Pattern/P2/X1, Image_2=/Pattern/P3/X1]" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str(reader.pages[0].images) | |
== "[Image_0=/Pattern/P1/X1, Image_1=/Pattern/P2/X1, Image_2=/Pattern/P3/X1]" | |
) | |
reader.pages[0].images | |
== ["Image_0":"/Pattern/P1/X1", "Image_1":"/Pattern/P2/X1", "Image_2"="/Pattern/P3/X1"] | |
) |
@pytest.mark.enable_socket() | ||
def test_onlyoffice_standard_images_extraction(): | ||
reader = PdfReader( | ||
BytesIO(get_data_from_url(name="iss2613-onlyoffice-standardImages.pdf")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as you are having url/name defined locally I recommend to do the same for the pdf and not change the yaml file above
|
||
@pytest.mark.enable_socket() | ||
def test_onlyoffice_form_images_extraction(): | ||
reader = PdfReader(BytesIO(get_data_from_url(name="iss2613-onlyoffice-form.pdf"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
@0xNath do you know when you will be able to complete the PR ? |
Hey, my apologies for not replying, thank you a lot for all those advices I will tackle this down this weekend. |
Sorry, I have not found the motivation to work on it this weekend... |
@0xNath |
I have the time but I lack motivation. I have started to get back on the little project I have with this lib so it will get back eventually |
@0xNath take courage😉, I will be happy to see you as a contributor |
closes #2613
Added code to detect patterns in "_get_ids_image".
To avoid any conflicts with images that could be located directly in a page or images using the same ID in different patterns, images ids under patterns are returned in this form : "/Pattern/patternNameHere/imageNameHere"
Added code to deal with Pattern images in "_get_image".
Added a new test to verify ids and image data returned related to the modifications done above.