This repository has been archived by the owner on Jan 6, 2025. It is now read-only.

Remove dependency on distro specific packages #96

Closed
vinayak-mehta opened this issue Sep 6, 2018 · 6 comments

Comments

@vinayak-mehta
Contributor

vinayak-mehta commented Sep 6, 2018

Something to think about for the future:

  • OpenCV: maybe implement the morph transform within the library itself, or vendor the code (not sure about the dependency on C extensions)?
  • tk: Required for matplotlib.
  • ghostscript: maybe use some Python library to convert PDF to image (with the same quality as ghostscript).
@vinayak-mehta changed the title from "Remove dependency on apt install" to "Remove dependency on distro specific packages" on Sep 6, 2018
@vinayak-mehta
Contributor Author

Some questions:
[1] Can pdftoppm be an alternative to ghostscript?
[2] Is poppler-utils more widely available (pre-installed) than ghostscript?

@tkelman

tkelman commented Oct 9, 2018

Could the matplotlib dependency be made optional? The plotting features here don't look like a lot of code, and it's a pretty complicated dependency to pull in.

Similarly, might pillow be a viable, smaller alternative to the use of opencv here?

@vinayak-mehta
Contributor Author

vinayak-mehta commented Oct 9, 2018

Hello @tkelman! I think making matplotlib optional makes sense. Let me look into it as I go on to add more tests for the plotting code in #127.

Camelot uses adaptive thresholding and morphological transformations from opencv. I haven't worked with pillow in the past, but a quick Google search got me this morph transform equivalent in pillow. I think removing opencv as a dependency would mean replacing the current image processing code with a combination of pillow + adaptive threshold / morph transform implementations. Let me explore this a bit further. Meanwhile, if you have any other alternatives or suggestions on how we could do this, I would love it if you could share them on this thread!
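
To make the pillow idea above a bit more concrete, here is a minimal sketch of what a pillow-only adaptive threshold and morphological opening could look like. It is an assumption-laden illustration, not Camelot's code: the function names and default parameters are hypothetical, the threshold is a simple mean-based variant (not a drop-in for opencv's Gaussian adaptive threshold), and pillow's `MinFilter`/`MaxFilter` only support square, odd-sized windows, so line-shaped kernels would need something else.

```python
from PIL import Image, ImageChops, ImageFilter


def adaptive_mean_threshold(gray, radius=7, offset=5):
    """Binarize an "L" image against its local mean (hypothetical helper).

    A pixel becomes 0 if it is darker than the mean of its
    (2 * radius + 1)-wide neighbourhood by at least `offset`, else 255.
    """
    local_mean = gray.filter(ImageFilter.BoxBlur(radius))
    # subtract() clips negative results to 0, so this is max(mean - pixel, 0).
    darker_by = ImageChops.subtract(local_mean, gray)
    return darker_by.point(lambda v: 0 if v >= offset else 255)


def opening(binary, size=3):
    """Morphological opening: erosion (min) followed by dilation (max).

    `size` must be odd; the structuring element is a size x size square.
    """
    eroded = binary.filter(ImageFilter.MinFilter(size))
    return eroded.filter(ImageFilter.MaxFilter(size))
```

Opening removes white specks smaller than the window while leaving larger white regions in place; swapping the two filters gives a closing instead.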

@vinayak-mehta
Contributor Author

matplotlib is now an optional requirement!
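
For readers wondering what an optional plotting dependency can look like in practice, one common pattern is to import it lazily and fail with a clear message. This is a generic sketch, not necessarily how Camelot does it; `require_optional`, `plot_points`, and the `plot` extra name are hypothetical.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install the package with the '{extra}' extra"
        ) from exc


def plot_points(xs, ys):
    # matplotlib is only imported when plotting is actually requested,
    # so the rest of the library works without it.
    plt = require_optional("matplotlib.pyplot", extra="plot")
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    return fig
```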

@sweco-sekrsv

I'm not exactly sure what you are using Ghostscript for, but I switched to pdftoppm for rasterizing PDFs to images. I'm using the CLI tool and calling it from Python.
For my scenarios, it's stable and generates images quicker than Ghostscript. I have had better success with fonts using pdftoppm as well.
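
Calling the CLI from Python can be sketched roughly as below. This is an illustration under assumptions rather than the exact setup described above: the helper names, the 300 dpi default, and the file paths are hypothetical; `-png`, `-r`, `-f`, and `-l` are standard pdftoppm options.

```python
import shutil
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=300, first_page=None, last_page=None):
    """Build the argument list for rasterizing a PDF to PNG files."""
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # pdftoppm writes one image per page, named from out_prefix plus a page number.
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def rasterize(pdf_path, out_prefix, **kwargs):
    """Run pdftoppm, failing early if the binary is not on PATH."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found on PATH (install poppler-utils)")
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, **kwargs), check=True)
```

Keeping the command builder separate from the `subprocess.run` call makes the arguments easy to inspect and test without poppler installed.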

I'm on Windows and am using the latest binaries from here:
http://blog.alivate.com.au/poppler-windows

On a side note, it can also fix "broken" PDF files, such as the ones in this ticket:
#306
Resaving them with pdftocairo from the poppler tools makes the files load OK with pdf-miner.
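
The resaving step might look something like this when driven from Python. It is a sketch under assumptions: the function names and paths are hypothetical, and `-pdf` is the pdftocairo option that selects PDF output.

```python
import subprocess


def pdftocairo_resave_command(src_pdf, dst_pdf):
    """Build the argument list for rewriting a PDF through pdftocairo."""
    return ["pdftocairo", "-pdf", str(src_pdf), str(dst_pdf)]


def resave_pdf(src_pdf, dst_pdf):
    """Rewrite src_pdf to dst_pdf; raises CalledProcessError on failure."""
    subprocess.run(pdftocairo_resave_command(src_pdf, dst_pdf), check=True)
```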

On another side note, I tried running Ghostscript with multiprocessing (to speed things up), but that did not seem to work very well. I'm not sure Ghostscript is designed to run using several threads.

@vinayak-mehta
Contributor Author

Moved to #13.
