read_pdf from a URL #91
Comments
Is this still something that needs to be done?
Hey I'm interested in this issue. Can you explain a little more?
Hey @rra94, you can check out the links in my comment above. Just like pandas.read_html reads HTML from a URL, camelot.read_pdf would read a PDF from a URL.
Hey @vinayak-mehta I can take this. I have two ideas:
What do you recommend?
1 sounds better than 2 since we don't have to worry about cleaning up the downloaded file afterwards. (Though it'll fill up memory for very large files; we could warn about this. Did you take a look at pandas.read_html? How is this implemented there?) We might also need to differentiate between filepaths and URLs, maybe using regexes.
In pandas it's the same thing AFAIK, via a check like if _is_url(obj). We can use a regex, or simply check whether the first four characters equal ftp or http.
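The URL check discussed above could be sketched roughly like this. It is only an illustration of the regex idea: the name is_url and the accepted schemes (http, https, ftp) are assumptions made for this example, not Camelot's or pandas' actual implementation.

```python
import re

# Assumed set of schemes for illustration: http, https and ftp.
_URL_RE = re.compile(r"^(?:https?|ftp)://", re.IGNORECASE)


def is_url(obj):
    """Return True if obj is a string that starts with a supported URL scheme.

    Hypothetical helper: anything that is not a string (file-like
    objects, None, ...) is treated as "not a URL".
    """
    if not isinstance(obj, str):
        return False
    return _URL_RE.match(obj) is not None
```

Matching the scheme together with :// is a bit stricter than comparing the first four characters: a local file named httpfile.pdf starts with http but would not be treated as a URL here.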
Regex sounds good, please open a PR, we can continue the discussion there! Do check out the contribution guidelines here: https://camelot-py.readthedocs.io/en/master/dev/contributing.html#pull-requests
We can also download the file in |
read_pdf won't be reused since it's the top-level interface. A step could be added to the lower-level PDFHandler which would differentiate if the input is a URL or file-like object or filepath, and then download into |
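A rough sketch of the dispatch step described above, telling apart a URL, a file-like object and a filepath. Everything here is hypothetical: open_pdf_source, _fetch and the fetch parameter are names invented for this example and are not part of PDFHandler. Where the download should end up is not settled in the comment above; this sketch simply assumes an in-memory buffer, which has the large-file memory cost mentioned earlier in the thread.

```python
import io
import os
import re
from urllib.request import urlopen

# Assumed schemes for illustration only.
_URL_RE = re.compile(r"^(?:https?|ftp)://", re.IGNORECASE)


def _fetch(url):
    """Download the whole resource into memory and return its bytes."""
    with urlopen(url) as response:
        return response.read()


def open_pdf_source(source, fetch=_fetch):
    """Return a binary file-like object for a URL, file-like object or filepath.

    Hypothetical helper, not Camelot's API. `fetch` is injectable so the
    download step can be replaced, for example in tests.
    """
    # File-like objects are passed through untouched.
    if hasattr(source, "read"):
        return source
    if isinstance(source, (str, os.PathLike)):
        path = os.fspath(source)
        # URLs are downloaded into an in-memory buffer (an assumption).
        if isinstance(path, str) and _URL_RE.match(path):
            return io.BytesIO(fetch(path))
        # Anything else is treated as a local filepath.
        return open(path, "rb")
    raise TypeError(f"unsupported PDF source: {type(source).__name__}")
```

The caller is responsible for closing the returned object. Keeping the download behind a replaceable fetch function also leaves room to swap in a different HTTP client or a temporary-file strategy later.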
Leaving this here to check out later: https://stackoverflow.com/questions/22800100/parsing-a-pdf-via-url-with-python-using-pdfminer
[MRG] Update how-it-works.rst