Skip to content

Commit

Permalink
Merge remote-tracking branch 'remotes/upstream/master'
Browse files Browse the repository at this point in the history
  • Loading branch information
interpolatio committed Jul 21, 2019
2 parents 63b5f17 + 0f17449 commit ebeab51
Show file tree
Hide file tree
Showing 8 changed files with 379 additions and 58 deletions.
141 changes: 108 additions & 33 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,25 @@ smart_open — utils for streaming large files in Python
What?
=====

``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to S3, HDFS, WebHDFS, HTTP, or local storage. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storages such as S3, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.

``smart_open`` is well-tested, well-documented, and has a simple, Pythonic API:

Why?
====

Working with large remote files, for example using Amazon's `boto <http://docs.pythonboto.org/en/latest/>`_ and `boto3 <https://boto3.readthedocs.io/en/latest/>`_ Python library, is a pain.
``boto``'s ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files, because they're loaded fully into RAM, no streaming.
There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.

``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.


How?
=====

``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:


.. _doctools_before_examples:
Expand Down Expand Up @@ -80,6 +94,28 @@ Other examples of URLs that ``smart_open`` accepts::

.. _doctools_after_examples:


Documentation
=============

Installation
------------
::

pip install smart_open

Or, if you prefer to install from the `source tar.gz <http://pypi.python.org/pypi/smart_open>`_::

python setup.py test # run unit tests
python setup.py install

To run the unit tests (optional), you'll also need to install `mock <https://pypi.python.org/pypi/mock>`_ , `moto <https://github.com/spulec/moto>`_ and `responses <https://github.com/getsentry/responses>`_ (``pip install mock moto responses``).
The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.


Built-in help
-------------

For detailed API info, see the online help:

.. code-block:: python
Expand All @@ -88,7 +124,8 @@ For detailed API info, see the online help:
or click `here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.

More examples:
More examples
-------------

.. code-block:: python
Expand Down Expand Up @@ -134,29 +171,6 @@ More examples:
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
fout.write(b'here we stand')
Why?
----

Working with large S3 files using Amazon's default Python library, `boto <http://docs.pythonboto.org/en/latest/>`_ and `boto3 <https://boto3.readthedocs.io/en/latest/>`_, is a pain.
Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.

``smart_open`` shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.

Installation
------------
::

pip install smart_open

Or, if you prefer to install from the `source tar.gz <http://pypi.python.org/pypi/smart_open>`_::

python setup.py test # run unit tests
python setup.py install

To run the unit tests (optional), you'll also need to install `mock <https://pypi.python.org/pypi/mock>`_ , `moto <https://github.com/spulec/moto>`_ and `responses <https://github.com/getsentry/responses>`_ (``pip install mock moto responses``).
The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.

Supported Compression Formats
-----------------------------

Expand Down Expand Up @@ -185,6 +199,7 @@ For 2.7, use `backports.lzma`_.

.. _backports.lzma: https://pypi.org/project/backports.lzma/


Transport-specific Options
--------------------------

Expand Down Expand Up @@ -279,6 +294,62 @@ The ''open'' function has the parameter version_id, which allows you to get the
b's'
Specific S3 object version
--------------------------

The ``version_id`` transport parameter enables you to get the desired version of the object from an S3 bucket.

.. Important::
S3 disables version control by default.
Before using the ``version_id`` parameter, you must explicitly enable version control for your S3 bucket.
Read https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html for details.

.. code-block:: python
>>> # Read previous versions of an object in a versioned bucket
>>> bucket, key = 'smart-open-versioned', 'demo.txt'
>>> versions = [v.id for v in boto3.resource('s3').Bucket(bucket).object_versions.filter(Prefix=key)]
>>> for v in versions:
... with open('s3://%s/%s' % (bucket, key), transport_params={'version_id': v}) as fin:
... print(v, repr(fin.read()))
KiQpZPsKI5Dm2oJZy_RzskTOtl2snjBg 'second version\n'
N0GJcE3TQCKtkaS.gF.MUBZS85Gs3hzn 'first version\n'
>>> # If you don't specify a version, smart_open will read the most recent one
>>> with open('s3://%s/%s' % (bucket, key)) as fin:
... print(repr(fin.read()))
'second version\n'
File-like Binary Streams
------------------------

The ``open`` function also accepts file-like objects.
This is useful when you already have a `binary file <https://docs.python.org/3/glossary.html#term-binary-file>`_ open, and would like to wrap it with transparent decompression:


.. code-block:: python
>>> import io, gzip
>>>
>>> # Prepare some gzipped binary data in memory, as an example.
>>> # Any binary file will do; we're using BytesIO here for simplicity.
>>> buf = io.BytesIO()
>>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
... _ = fout.write(b'this is a bytestring')
>>> _ = buf.seek(0)
>>>
>>> # Use case starts here.
>>> buf.name = 'file.gz' # add a .name attribute so smart_open knows what compressor to use
>>> import smart_open
>>> smart_open.open(buf, 'rb').read() # will gzip-decompress transparently!
b'this is a bytestring'
In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream <https://docs.python.org/3/library/io.html#binary-i-o>`_ ``buf`` object to determine which decompressor to use.
If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
Otherwise, the transparent decompression will not occur.


Migrating to the new ``open`` function
--------------------------------------
Expand Down Expand Up @@ -306,21 +377,25 @@ code already uses ``open`` for local file I/O, then it will continue to work.
If you want to continue using the built-in ``open`` function for e.g. debugging,
then you can ``import smart_open`` and use ``smart_open.open``.

**The default read mode is now "r" (read text) by default.**
**The default read mode is now "r" (read text).**
If your code was implicitly relying on the default mode being "rb" (read
binary), then you'll need to update it and pass "r" explicitly.
binary), you'll need to update it and pass "rb" explicitly.

Before:

.. code-block:: python
>>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32) # 'rb' used to be default
>>> import smart_open
>>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32) # 'rb' used to be the default
b'User-Agent: *\nDisallow: /'
After:

.. code-block:: python
>>> import smart_open
>>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)
b'User-Agent: *\nDisallow: /'
The ``ignore_extension`` keyword parameter is now called ``ignore_ext``.
It behaves identically otherwise.
Expand All @@ -332,7 +407,7 @@ transport layer, e.g. HTTP, S3, etc. The old function accepted these directly:
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> session = boto3.Session(profile_name='smart_open')
>>> smart_open(url, 'r', session=session).read(32)
>>> smart_open.smart_open(url, 'r', session=session).read(32)
'first line\nsecond line\nthird lin'
The new function accepts a ``transport_params`` keyword argument. It's a dict.
Expand All @@ -355,14 +430,14 @@ Removed parameters:
- ``profile_name``

**The profile_name parameter has been removed.**
Pass an entire boto3.Session object instead.
Pass an entire ``boto3.Session`` object instead.

Before:

.. code-block:: python
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> smart_open(url, 'r', profile_name='smart_open').read(32)
>>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)
'first line\nsecond line\nthird lin'
After:
Expand All @@ -381,7 +456,7 @@ If you pass an invalid parameter name, the ``smart_open.open`` function will war
Keep an eye on your logs for WARNING messages from ``smart_open``.

Comments, bug reports
---------------------
=====================

``smart_open`` lives on `Github <https://github.com/RaRe-Technologies/smart_open>`_. You can file
issues or pull requests there. Suggestions, pull requests and improvements welcome!
Expand Down
6 changes: 6 additions & 0 deletions help.txt
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,8 @@ FUNCTIONS
multipart_upload_kwargs: dict, optional
Additional parameters to pass to boto3's initiate_multipart_upload function.
For writing only.
version_id: str, optional
Version of the object, used when reading object. If None, will fetch the most recent version.

HTTP (for details, see :mod:`smart_open.http` and :func:`smart_open.http.open`):

Expand All @@ -108,6 +110,10 @@ FUNCTIONS
The username for authenticating over HTTP
password: str, optional
The password for authenticating over HTTP
headers: dict, optional
Any headers to send in the request. If ``None``, the default headers are sent:
``{'Accept-Encoding': 'identity'}``. To use no headers at all,
set this variable to an empty dict, ``{}``.

WebHDFS (for details, see :mod:`smart_open.webhdfs` and :func:`smart_open.webhdfs.open`):

Expand Down
41 changes: 41 additions & 0 deletions integration-tests/test_version_id.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
"""Tests the version_id transport parameter for S3 against real S3."""

import boto3
from smart_open import open

BUCKET, KEY = 'smart-open-versioned', 'demo.txt'
"""Our have a public-readable bucket with a versioned object."""

URL = 's3://%s/%s' % (BUCKET, KEY)


def assert_equal(a, b):
assert a == b, '%r != %r' % (a, b)


def main():
versions = [
v.id for v in boto3.resource('s3').Bucket(BUCKET).object_versions.filter(Prefix=KEY)
]
expected_versions = [
'KiQpZPsKI5Dm2oJZy_RzskTOtl2snjBg',
'N0GJcE3TQCKtkaS.gF.MUBZS85Gs3hzn',
]
assert_equal(versions, expected_versions)

contents = [
open(URL, transport_params={'version_id': v}).read()
for v in versions
]
expected_contents = ['second version\n', 'first version\n']
assert_equal(contents, expected_contents)

with open(URL) as fin:
most_recent_contents = fin.read()
assert_equal(most_recent_contents, expected_contents[0])

print('OK')


if __name__ == '__main__':
main()
37 changes: 27 additions & 10 deletions smart_open/http.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
"""


def open(uri, mode, kerberos=False, user=None, password=None):
def open(uri, mode, kerberos=False, user=None, password=None, headers=None):
"""Implement streamed reader from a web site.
Supports Kerberos and Basic HTTP authentication.
Expand All @@ -39,20 +39,29 @@ def open(uri, mode, kerberos=False, user=None, password=None):
The username for authenticating over HTTP
password: str, optional
The password for authenticating over HTTP
headers: dict, optional
Any headers to send in the request. If ``None``, the default headers are sent:
``{'Accept-Encoding': 'identity'}``. To use no headers at all,
set this variable to an empty dict, ``{}``.
Note
----
If neither kerberos or (user, password) are set, will connect unauthenticated.
If neither kerberos or (user, password) are set, will connect
unauthenticated, unless set separately in headers.
"""
if mode == 'rb':
return BufferedInputBase(uri, mode, kerberos=kerberos, user=user, password=password)
return BufferedInputBase(
uri, mode, kerberos=kerberos,
user=user, password=password, headers=headers
)
else:
raise NotImplementedError('http support for mode %r not implemented' % mode)


class BufferedInputBase(io.BufferedIOBase):
def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE, kerberos=False, user=None, password=None):
def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE,
kerberos=False, user=None, password=None, headers=None):
if kerberos:
import requests_kerberos
auth = requests_kerberos.HTTPKerberosAuth()
Expand All @@ -64,7 +73,12 @@ def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE, kerberos=Fals
self.buffer_size = buffer_size
self.mode = mode

self.response = requests.get(url, auth=auth, stream=True, headers=_HEADERS)
if headers is None:
self.headers = _HEADERS.copy()
else:
self.headers = headers

self.response = requests.get(url, auth=auth, stream=True, headers=self.headers)

if not self.response.ok:
self.response.raise_for_status()
Expand Down Expand Up @@ -154,7 +168,7 @@ class SeekableBufferedInputBase(BufferedInputBase):
"""

def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE,
kerberos=False, user=None, password=None):
kerberos=False, user=None, password=None, headers=None):
"""
If Kerberos is True, will attempt to use the local Kerberos credentials.
Otherwise, will try to use "basic" HTTP authentication via username/password.
Expand All @@ -171,6 +185,11 @@ def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE,
else:
self.auth = None

if headers is None:
self.headers = _HEADERS.copy()
else:
self.headers = headers

self.buffer_size = buffer_size
self.mode = mode
self.response = self._partial_request()
Expand Down Expand Up @@ -253,10 +272,8 @@ def truncate(self, size=None):
raise io.UnsupportedOperation

def _partial_request(self, start_pos=None):
headers = _HEADERS.copy()

if start_pos is not None:
headers.update({"range": s3.make_range_string(start_pos)})
self.headers.update({"range": s3.make_range_string(start_pos)})

response = requests.get(self.url, auth=self.auth, stream=True, headers=headers)
response = requests.get(self.url, auth=self.auth, stream=True, headers=self.headers)
return response
Loading

0 comments on commit ebeab51

Please sign in to comment.