Merge remote-tracking branch 'remotes/upstream/master'

piskvorky · Jul 21, 2019 · ebeab51 · ebeab51
2 parents 63b5f17 + 0f17449
commit ebeab51
Show file tree

Hide file tree

Showing 8 changed files with 379 additions and 58 deletions.
diff --git a/README.rst b/README.rst
@@ -12,11 +12,25 @@ smart_open — utils for streaming large files in Python
 What?
 =====
 
-``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to S3, HDFS, WebHDFS, HTTP, or local storage. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
+``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storages such as S3, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
 
 ``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.
 
-``smart_open`` is well-tested, well-documented, and has a simple, Pythonic API:
+
+Why?
+====
+
+Working with large remote files, for example using Amazon's  `boto <http://docs.pythonboto.org/en/latest/>`_ and `boto3 <https://boto3.readthedocs.io/en/latest/>`_ Python library, is a pain.
+``boto``'s ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files, because they're loaded fully into RAM, no streaming.
+There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.
+
+``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.
+
+
+How?
+=====
+
+``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:
 
 
 .. _doctools_before_examples:
@@ -80,6 +94,28 @@ Other examples of URLs that ``smart_open`` accepts::
 
 .. _doctools_after_examples:
 
+
+Documentation
+=============
+
+Installation
+------------
+::
+
+    pip install smart_open
+
+Or, if you prefer to install from the `source tar.gz <http://pypi.python.org/pypi/smart_open>`_::
+
+    python setup.py test  # run unit tests
+    python setup.py install
+
+To run the unit tests (optional), you'll also need to install `mock <https://pypi.python.org/pypi/mock>`_ , `moto <https://github.com/spulec/moto>`_ and `responses <https://github.com/getsentry/responses>`_ (``pip install mock moto responses``).
+The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.
+
+
+Built-in help
+-------------
+
 For detailed API info, see the online help:
 
 .. code-block:: python
@@ -88,7 +124,8 @@ For detailed API info, see the online help:
 
 or click `here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.
 
-More examples:
+More examples
+-------------
 
 .. code-block:: python
 
@@ -134,29 +171,6 @@ More examples:
     with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
         fout.write(b'here we stand')
 
-Why?
-----
-
-Working with large S3 files using Amazon's default Python library, `boto <http://docs.pythonboto.org/en/latest/>`_ and `boto3 <https://boto3.readthedocs.io/en/latest/>`_, is a pain.
-Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
-There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.
-
-``smart_open`` shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.
-
-Installation
-------------
-::
-
-    pip install smart_open
-
-Or, if you prefer to install from the `source tar.gz <http://pypi.python.org/pypi/smart_open>`_::
-
-    python setup.py test  # run unit tests
-    python setup.py install
-
-To run the unit tests (optional), you'll also need to install `mock <https://pypi.python.org/pypi/mock>`_ , `moto <https://github.com/spulec/moto>`_ and `responses <https://github.com/getsentry/responses>`_ (``pip install mock moto responses``).
-The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.
-
 Supported Compression Formats
 -----------------------------
 
@@ -185,6 +199,7 @@ For 2.7, use `backports.lzma`_.
 
 .. _backports.lzma: https://pypi.org/project/backports.lzma/
 
+
 Transport-specific Options
 --------------------------
 
@@ -279,6 +294,62 @@ The ''open'' function has the parameter version_id, which allows you to get the
   b's'
   
 
+Specific S3 object version
+--------------------------
+
+The ``version_id`` transport parameter enables you to get the desired version of the object from an S3 bucket.
+
+.. Important::
+    S3 disables version control by default.
+    Before using the ``version_id`` parameter, you must explicitly enable version control for your S3 bucket.
+    Read https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html for details.
+
+.. code-block:: python
+
+  >>> # Read previous versions of an object in a versioned bucket
+  >>> bucket, key = 'smart-open-versioned', 'demo.txt'
+  >>> versions = [v.id for v in boto3.resource('s3').Bucket(bucket).object_versions.filter(Prefix=key)]
+  >>> for v in versions:
+  ...     with open('s3://%s/%s' % (bucket, key), transport_params={'version_id': v}) as fin:
+  ...         print(v, repr(fin.read()))
+  KiQpZPsKI5Dm2oJZy_RzskTOtl2snjBg 'second version\n'
+  N0GJcE3TQCKtkaS.gF.MUBZS85Gs3hzn 'first version\n'
+
+  >>> # If you don't specify a version, smart_open will read the most recent one
+  >>> with open('s3://%s/%s' % (bucket, key)) as fin:
+  ...     print(repr(fin.read()))
+  'second version\n'
+
+File-like Binary Streams
+------------------------
+
+The ``open`` function also accepts file-like objects.
+This is useful when you already have a `binary file <https://docs.python.org/3/glossary.html#term-binary-file>`_ open, and would like to wrap it with transparent decompression:
+
+
+.. code-block:: python
+
+    >>> import io, gzip
+    >>>
+    >>> # Prepare some gzipped binary data in memory, as an example.
+    >>> # Any binary file will do; we're using BytesIO here for simplicity.
+    >>> buf = io.BytesIO()
+    >>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
+    ...     _ = fout.write(b'this is a bytestring')
+    >>> _ = buf.seek(0)
+    >>>
+    >>> # Use case starts here.
+    >>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows what compressor to use
+    >>> import smart_open
+    >>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
+    b'this is a bytestring'
+
+
+In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream <https://docs.python.org/3/library/io.html#binary-i-o>`_ ``buf`` object to determine which decompressor to use.
+If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
+Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
+Otherwise, the transparent decompression will not occur.
+
 
 Migrating to the new ``open`` function
 --------------------------------------
@@ -306,21 +377,25 @@ code already uses ``open`` for local file I/O, then it will continue to work.
 If you want to continue using the built-in ``open`` function for e.g. debugging,
 then you can ``import smart_open`` and use ``smart_open.open``.
 
-**The default read mode is now "r" (read text) by default.**
+**The default read mode is now "r" (read text).**
 If your code was implicitly relying on the default mode being "rb" (read
-binary), then you'll need to update it and pass "r" explicitly.
+binary), you'll need to update it and pass "rb" explicitly.
 
 Before:
 
 .. code-block:: python
 
-  >>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32)  # 'rb' used to be default
+  >>> import smart_open
+  >>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32)  # 'rb' used to be the default
+  b'User-Agent: *\nDisallow: /'
 
 After:
 
 .. code-block:: python
 
+  >>> import smart_open
   >>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)
+  b'User-Agent: *\nDisallow: /'
 
 The ``ignore_extension`` keyword parameter is now called ``ignore_ext``.
 It behaves identically otherwise.
@@ -332,7 +407,7 @@ transport layer, e.g. HTTP, S3, etc. The old function accepted these directly:
 
   >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
   >>> session = boto3.Session(profile_name='smart_open')
-  >>> smart_open(url, 'r', session=session).read(32)
+  >>> smart_open.smart_open(url, 'r', session=session).read(32)
   'first line\nsecond line\nthird lin'
 
 The new function accepts a ``transport_params`` keyword argument.  It's a dict.
@@ -355,14 +430,14 @@ Removed parameters:
 - ``profile_name``
 
 **The profile_name parameter has been removed.**
-Pass an entire boto3.Session object instead.
+Pass an entire ``boto3.Session`` object instead.
 
 Before:
 
 .. code-block:: python
 
   >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
-  >>> smart_open(url, 'r', profile_name='smart_open').read(32)
+  >>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)
   'first line\nsecond line\nthird lin'
 
 After:
@@ -381,7 +456,7 @@ If you pass an invalid parameter name, the ``smart_open.open`` function will war
 Keep an eye on your logs for WARNING messages from ``smart_open``.
 
 Comments, bug reports
----------------------
+=====================
 
 ``smart_open`` lives on `Github <https://github.com/RaRe-Technologies/smart_open>`_. You can file
 issues or pull requests there. Suggestions, pull requests and improvements welcome!

diff --git a/help.txt b/help.txt
@@ -99,6 +99,8 @@ FUNCTIONS
         multipart_upload_kwargs: dict, optional
             Additional parameters to pass to boto3's initiate_multipart_upload function.
             For writing only.
+        version_id: str, optional
+            Version of the object, used when reading object. If None, will fetch the most recent version.
 
         HTTP (for details, see :mod:`smart_open.http` and :func:`smart_open.http.open`):
 
@@ -108,6 +110,10 @@ FUNCTIONS
             The username for authenticating over HTTP
         password: str, optional
             The password for authenticating over HTTP
+        headers: dict, optional
+            Any headers to send in the request. If ``None``, the default headers are sent:
+            ``{'Accept-Encoding': 'identity'}``. To use no headers at all,
+            set this variable to an empty dict, ``{}``.
 
         WebHDFS (for details, see :mod:`smart_open.webhdfs` and :func:`smart_open.webhdfs.open`):
 

diff --git a/integration-tests/test_version_id.py b/integration-tests/test_version_id.py
@@ -0,0 +1,41 @@
+"""Tests the version_id transport parameter for S3 against real S3."""
+
+import boto3
+from smart_open import open
+
+BUCKET, KEY = 'smart-open-versioned', 'demo.txt'
+"""Our have a public-readable bucket with a versioned object."""
+
+URL = 's3://%s/%s' % (BUCKET, KEY)
+
+
+def assert_equal(a, b):
+    assert a == b, '%r != %r' % (a, b)
+
+
+def main():
+    versions = [
+        v.id for v in boto3.resource('s3').Bucket(BUCKET).object_versions.filter(Prefix=KEY)
+    ]
+    expected_versions = [
+        'KiQpZPsKI5Dm2oJZy_RzskTOtl2snjBg',
+        'N0GJcE3TQCKtkaS.gF.MUBZS85Gs3hzn',
+    ]
+    assert_equal(versions, expected_versions)
+
+    contents = [
+        open(URL, transport_params={'version_id': v}).read()
+        for v in versions
+    ]
+    expected_contents = ['second version\n', 'first version\n']
+    assert_equal(contents, expected_contents)
+
+    with open(URL) as fin:
+        most_recent_contents = fin.read()
+    assert_equal(most_recent_contents, expected_contents[0])
+
+    print('OK')
+
+
+if __name__ == '__main__':
+    main()
diff --git a/smart_open/http.py b/smart_open/http.py
@@ -22,7 +22,7 @@
 """
 
 
-def open(uri, mode, kerberos=False, user=None, password=None):
+def open(uri, mode, kerberos=False, user=None, password=None, headers=None):
     """Implement streamed reader from a web site.
 
     Supports Kerberos and Basic HTTP authentication.
@@ -39,20 +39,29 @@ def open(uri, mode, kerberos=False, user=None, password=None):
         The username for authenticating over HTTP
     password: str, optional
         The password for authenticating over HTTP
+    headers: dict, optional
+        Any headers to send in the request. If ``None``, the default headers are sent:
+        ``{'Accept-Encoding': 'identity'}``. To use no headers at all,
+        set this variable to an empty dict, ``{}``.
 
     Note
     ----
-    If neither kerberos or (user, password) are set, will connect unauthenticated.
+    If neither kerberos or (user, password) are set, will connect
+    unauthenticated, unless set separately in headers.
 
     """
     if mode == 'rb':
-        return BufferedInputBase(uri, mode, kerberos=kerberos, user=user, password=password)
+        return BufferedInputBase(
+            uri, mode, kerberos=kerberos,
+            user=user, password=password, headers=headers
+        )
     else:
         raise NotImplementedError('http support for mode %r not implemented' % mode)
 
 
 class BufferedInputBase(io.BufferedIOBase):
-    def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE, kerberos=False, user=None, password=None):
+    def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE,
+                 kerberos=False, user=None, password=None, headers=None):
         if kerberos:
             import requests_kerberos
             auth = requests_kerberos.HTTPKerberosAuth()
@@ -64,7 +73,12 @@ def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE, kerberos=Fals
         self.buffer_size = buffer_size
         self.mode = mode
 
-        self.response = requests.get(url, auth=auth, stream=True, headers=_HEADERS)
+        if headers is None:
+            self.headers = _HEADERS.copy()
+        else:
+            self.headers = headers
+
+        self.response = requests.get(url, auth=auth, stream=True, headers=self.headers)
 
         if not self.response.ok:
             self.response.raise_for_status()
@@ -154,7 +168,7 @@ class SeekableBufferedInputBase(BufferedInputBase):
     """
 
     def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE,
-                 kerberos=False, user=None, password=None):
+                 kerberos=False, user=None, password=None, headers=None):
         """
         If Kerberos is True, will attempt to use the local Kerberos credentials.
         Otherwise, will try to use "basic" HTTP authentication via username/password.
@@ -171,6 +185,11 @@ def __init__(self, url, mode='r', buffer_size=DEFAULT_BUFFER_SIZE,
         else:
             self.auth = None
 
+        if headers is None:
+            self.headers = _HEADERS.copy()
+        else:
+            self.headers = headers
+
         self.buffer_size = buffer_size
         self.mode = mode
         self.response = self._partial_request()
@@ -253,10 +272,8 @@ def truncate(self, size=None):
         raise io.UnsupportedOperation
 
     def _partial_request(self, start_pos=None):
-        headers = _HEADERS.copy()
-
         if start_pos is not None:
-            headers.update({"range": s3.make_range_string(start_pos)})
+            self.headers.update({"range": s3.make_range_string(start_pos)})
 
-        response = requests.get(self.url, auth=self.auth, stream=True, headers=headers)
+        response = requests.get(self.url, auth=self.auth, stream=True, headers=self.headers)
         return response