[chore]: Pypdfium2 compatibility fix #1239

felixT2K · 2023-07-04T13:48:10Z

This PR:

applies changes to be compatible with pypdfium2 >= 4.0.0

felixdittrich92 · 2023-07-04T13:51:59Z

@mara004 temp fix for compatibility

codecov · 2023-07-04T14:05:59Z

Codecov Report

Merging #1239 (ac5edc2) into main (e04e183) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head ac5edc2 differs from pull request most recent head b70de51. Consider uploading reports for the commit b70de51 to get more accurate results

@@            Coverage Diff             @@
##             main    #1239      +/-   ##
==========================================
- Coverage   95.66%   95.66%   -0.01%     
==========================================
  Files         154      154              
  Lines        6877     6873       -4     
==========================================
- Hits         6579     6575       -4     
  Misses        298      298

Flag	Coverage Δ
unittests	`95.66% <100.00%> (-0.01%)`	⬇️


Impacted Files	Coverage Δ
doctr/io/pdf.py	`100.00% <100.00%> (ø)`

... and 1 file with indirect coverage changes

mara004 · 2023-07-04T17:34:37Z

doctr/io/pdf.py

-    renderer = pdf.render_to(pdfium.BitmapConv.numpy_ndarray, scale=scale, rev_byteorder=rgb_mode, **kwargs)
-    return [img for img, _ in renderer]
+    pdf = pdfium.PdfDocument(file, password=password, autoclose=True)
+    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]


No, please don't do this. This will result in a major slowdown compared to the multi-page renderer, as mentioned in the other thread.

😅 But what would you suggest? As mentioned, using the renderer from PdfDocument instead of the one from PdfPage seems to be broken for bytes input... which is a bit confusing 😁

What we need for the moment is a workaround to be compatible with your current and future versions... the io module refactoring to generators is another topic :)

Just write a tempfile, as I already said, no?

Ah, sorry, I missed this part in your comment 🙈 Do you know how big the impact is? I think there shouldn't be much difference from current usage!?
I would really prefer not to start with tricks like this just to ensure compatibility if we will probably rewrite the io module after the next release one way or another.

No problem. The performance advantage depends on the use case and the host system (or config): if len(pdf) >= n_processes, the multi-page renderer will be faster by a factor of n_processes (well, minus process pool setup time).

If you only use it with single-page PDFs, then the multi-page renderer is actually slower ATM. (It should really handle that as special case and render the one page directly instead of setting up the pool. I wanted to change that a while ago but got distracted by other tasks...)
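
The special case described above could look roughly like the following sketch. It is not pypdfium2's implementation: `render_all`, `render_one` and `executor_cls` are hypothetical names, and `render_one` stands in for whatever renders a single page given its index.

```python
from concurrent.futures import ProcessPoolExecutor


def render_all(page_indices, render_one, n_processes=4, executor_cls=ProcessPoolExecutor):
    """Render every page with `render_one`, using a pool only when it can help.

    `render_one` is a hypothetical callable that takes a page index. With a
    process pool it has to be picklable (e.g. a module-level function).
    """
    page_indices = list(page_indices)
    if len(page_indices) <= 1 or n_processes <= 1:
        # Single page (or no parallelism): pool setup would only add overhead.
        return [render_one(i) for i in page_indices]
    workers = min(n_processes, len(page_indices))
    with executor_cls(max_workers=workers) as pool:
        # Executor.map() returns results in input order.
        return list(pool.map(render_one, page_indices))
```

With a single-page input the pool is never created, so the linear path is taken; for longer documents the work is spread over at most `n_processes` workers.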

Hey @mara004 👋,
Thanks for sharing your findings.
OK, that was just a guess on my part, as I noticed similar behavior in another context before. 😅
But it's nice if you can take some insight from the test for your library as well. 👍

I think so too. Until we go ahead with generators, we can keep this as a robust solution for the moment. :)

Simply using linear rendering should be OK for doctr for now.

However, it turns out the situation is more complicated for pypdfium2's CLI, which I use for testing.
I figured the problem was that we had a fairly expensive PIL save call in the main process, not in the workers, which meant the process pool queued up results in memory without limit while the generator in the main process lagged behind. (That may also be something to take into account for the upcoming refactoring of doctr...)
Now, with the save call included in the workers, parallelization is still a major advantage for our CLI.
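
For the refactoring note above, one generic way to keep a pool from queuing results without limit is to cap the number of jobs in flight. The sketch below is only an illustration of that idea, not what pypdfium2's CLI does (the fix described above moves the save call into the workers); `bounded_map` and `max_pending` are hypothetical names.

```python
from collections import deque


def bounded_map(executor, fn, items, max_pending=4):
    """Yield fn(item) in input order with at most `max_pending` jobs in flight.

    `executor` is any concurrent.futures executor. New jobs are submitted only
    when the consumer asks for the next result, so a slow consumer holds back
    the producers instead of letting finished results pile up in memory.
    """
    pending = deque()
    for item in items:
        pending.append(executor.submit(fn, item))
        if len(pending) >= max_pending:
            # Hand out the oldest result before submitting anything else.
            yield pending.popleft().result()
    # Drain whatever is still in flight once the input is exhausted.
    while pending:
        yield pending.popleft().result()
```

Because this is a generator, a consumer that does expensive work per result (such as saving an image) never has more than `max_pending` results waiting for it.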

I tried to catch up on the discussion; I'm fine with staying away from tempfile hacks. Also, bumping the pypdfium2 version is fine with me.

Thanks for the green light @odulcy-mindee.

(Not that it matters now, but as already mentioned, I wouldn't call tempfiles a hack in that case. With multiprocessing, data transfer is just a necessity, and writing a tempfile once is much better than piping a large bytes object with each job.)
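
A minimal sketch of the tempfile approach mentioned above, using only the standard library. `bytes_as_tempfile` is a hypothetical helper, not part of doctr or pypdfium2; the point is that the bytes are written once and each job then receives only a short path string.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def bytes_as_tempfile(data, suffix=".pdf"):
    """Write `data` to a temporary file once and yield its path.

    The file is removed again when the `with` block exits.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        yield path
    finally:
        os.remove(path)
```

Inside the `with` block the path can be passed to every worker, which then opens the document from disk instead of unpickling the full bytes object for each job.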

odulcy-mindee · 2023-07-10T08:42:25Z

doctr/io/pdf.py

-    renderer = pdf.render_to(pdfium.BitmapConv.numpy_ndarray, scale=scale, rev_byteorder=rgb_mode, **kwargs)
-    return [img for img, _ in renderer]
+    pdf = pdfium.PdfDocument(file, password=password, autoclose=True)
+    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]


I tried to catch up on the discussion; I'm fine with staying away from tempfile hacks. Also, bumping the pypdfium2 version is fine with me.

felixT2K added 2 commits July 4, 2023 15:44

provide compatibility with pypdfium2 >= 4.0.0

09231ad

change to pdfpage

ac5edc2

felixdittrich92 requested a review from odulcy-mindee July 4, 2023 13:49

felixdittrich92 self-assigned this Jul 4, 2023

felixdittrich92 added the labels topic: build (Related to dependencies and build), module: io (Related to doctr.io), type: misc (Miscellaneous) Jul 4, 2023

felixdittrich92 added this to the 0.6.1 milestone Jul 4, 2023

mara004 reviewed Jul 4, 2023


felixdittrich92 marked this pull request as draft July 5, 2023 03:32

felixdittrich92 changed the title ~~[chore]: Pypdfium2 compatibility fix~~ Draft: [chore]: Pypdfium2 compatibility fix Jul 5, 2023

felixdittrich92 changed the title ~~Draft: [chore]: Pypdfium2 compatibility fix~~ [chore]: Pypdfium2 compatibility fix Jul 5, 2023

felixdittrich92 marked this pull request as ready for review July 5, 2023 20:27

add numpy deprecation fix

b70de51

odulcy-mindee approved these changes Jul 10, 2023


felixdittrich92 merged commit 4e1985f into mindee:main Jul 10, 2023

felixT2K deleted the pypdfium2-temp-compatibility-fix branch July 10, 2023 08:45
