This repository has been archived by the owner on Jan 6, 2025. It is now read-only.

read_pdf from a URL #91

Closed
vinayak-mehta opened this issue Sep 2, 2018 · 11 comments

@vinayak-mehta
Contributor

No description provided.

@pecey
Contributor

pecey commented Oct 3, 2018

Is this still something that needs to be done?

@vinayak-mehta
Contributor Author

vinayak-mehta commented Oct 3, 2018

Hey @pecey, it would be good to have this as a feature. A good starting point would be pandas' read_html. Do post here about how you're planning to do this if you take this up 😄

EDIT: You can check out read_csv too which says "Valid URL schemes include http, ftp, s3, and file."

@rra94

rra94 commented Oct 14, 2018

Hey, I'm interested in this issue. Can you explain a little more?

@vinayak-mehta
Contributor Author

Hey @rra94, you can check out the links in my comment above. Just like pandas.read_html reads HTML from a URL, camelot.read_pdf would read a PDF from a URL.

@rra94

rra94 commented Oct 15, 2018

Hey @vinayak-mehta, I can take this. I have two ideas:

  1. Use urllib2 to fetch the PDF as a StringIO object and pass it to the PdfFileReader function in the PDF handler.
  2. Download a local copy using urllib, then pass that local copy to PdfFileReader.

What do you recommend?
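The first idea can be sketched roughly as follows. This is a minimal sketch, not the project's implementation: it assumes Python 3, where `urllib.request.urlopen` and `io.BytesIO` stand in for the `urllib2` and `StringIO` named above, and `fetch_pdf_to_memory` is a hypothetical helper name. Whether the PDF handler can accept a file-like object in place of a path is also an assumption.

```python
import io
from urllib.request import urlopen


def fetch_pdf_to_memory(url, timeout=30):
    """Download the resource at `url` and return it as a seekable in-memory buffer.

    Hypothetical helper: the name and the `timeout` default are illustrative.
    """
    with urlopen(url, timeout=timeout) as response:
        data = response.read()  # the whole file is held in memory
    # BytesIO rather than StringIO, because PDF content is binary.
    return io.BytesIO(data)
```

The returned buffer could then be handed to any reader that accepts a file-like object. Nothing is written to disk, so there is nothing to clean up, at the cost of holding the full file in memory.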

@vinayak-mehta
Contributor Author

Option 1 sounds better than option 2, since we don't have to worry about cleaning up the downloaded file afterwards. It will fill up memory for very large files, though we could warn about that. Did you take a look at pandas.read_html? How is this implemented there?

We might also need to think about differentiating between filepaths and URLs, maybe using regexes.

@rra94

rra94 commented Oct 15, 2018

In pandas it's the same thing, AFAIK:

    if _is_url(obj):
        with urlopen(obj) as url:
            text = url.read()

We can use a regex, or simply check whether the string starts with ftp or http.
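A regex-based check along these lines might look like the sketch below. The pattern, the set of schemes, and the name `is_url` are assumptions for illustration; this is not pandas' `_is_url` and not camelot code.

```python
import re

# Hypothetical pattern: only http, https, and ftp are treated as URL schemes.
_URL_PATTERN = re.compile(r"^(https?|ftp)://", re.IGNORECASE)


def is_url(source):
    """Return True if `source` is a string that starts with a supported URL scheme."""
    return isinstance(source, str) and _URL_PATTERN.match(source) is not None
```

Requiring the `://` separator is what distinguishes this from a plain prefix check, which would also match a local file named something like `http_report.pdf`.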

@vinayak-mehta
Contributor Author

Regex sounds good, please open a PR, we can continue the discussion there! Do check out the contribution guidelines here https://camelot-py.readthedocs.io/en/master/dev/contributing.html#pull-requests.

@pecey
Contributor

pecey commented Oct 29, 2018

We can also download the file to tmp. Then we would be able to reuse the read_pdf method in io.py. Cleanup shouldn't be much of a task. On many systems tmp is cleaned up pretty frequently, so unless someone is downloading huge files back-to-back, this shouldn't cause an issue.

@vinayak-mehta
Contributor Author

vinayak-mehta commented Oct 31, 2018

"Then we would be able to reuse the read_pdf method in io.py."

read_pdf won't be reused since it's the top-level interface. A step could be added to the lower-level PDFHandler to determine whether the input is a URL, a file-like object, or a filepath, and then download into tmp, like it already does with the TemporaryDirectory context manager.
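A rough sketch of such a step is below. It is only an illustration under stated assumptions: `materialize_pdf`, `URL_SCHEMES`, and the fixed `input.pdf` filename are hypothetical, and nothing here reflects how PDFHandler is actually structured. The `file` scheme is included because the read_csv documentation quoted earlier lists it as valid.

```python
import os
import shutil
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: these schemes are treated as URLs to download.
URL_SCHEMES = ("http", "https", "ftp", "file")


def materialize_pdf(source, tmpdir):
    """Return a local file path for `source`, writing into `tmpdir` when needed."""
    target = os.path.join(tmpdir, "input.pdf")
    if hasattr(source, "read"):
        # File-like object (assumed to be opened in binary mode): copy its bytes.
        with open(target, "wb") as out:
            shutil.copyfileobj(source, out)
        return target
    if urlparse(str(source)).scheme.lower() in URL_SCHEMES:
        # URL: stream the response to disk instead of holding it in memory.
        with urlopen(source) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        return target
    # Anything else is treated as a filepath and returned unchanged.
    return os.fspath(source)


if __name__ == "__main__":
    # The temporary directory, and any downloaded copy, is removed on exit.
    with tempfile.TemporaryDirectory() as tmpdir:
        print(materialize_pdf("report.pdf", tmpdir))
```

Because the download lands inside a `TemporaryDirectory`, cleanup comes for free when the context manager exits, which addresses the earlier concern about leftover files.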

vinayak-mehta added this to the v0.6.0 milestone Dec 2, 2018
kirbs- pushed a commit to kirbs-/camelot that referenced this issue Jul 31, 2020