Refactoring smart_open to share compression and encoding functionality #185

Merged
merged 14 commits into from
Apr 15, 2018
61 changes: 61 additions & 0 deletions integration-tests/test_http.py
@@ -0,0 +1,61 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
from __future__ import unicode_literals

import logging
import subprocess
import unittest

import six

import smart_open

PORT = 8008
GZIP_MAGIC = b'\x1f\x8b'


def startup_server(port=PORT):
    command = ['python', '-m', 'SimpleHTTPServer', str(port)]
    return subprocess.Popen(command)


class ReadTest(unittest.TestCase):
    def setUp(self):
        self.sub = startup_server()

    def tearDown(self):
        self.sub.terminate()

    def test_read_text(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt' % PORT
        with smart_open.smart_open(url, encoding='utf-8') as fin:
            text = fin.read()
        self.assertTrue(text.startswith('В начале июля, в чрезвычайно жаркое время,'))
        self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'))

    def test_read_binary(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt' % PORT
        with smart_open.smart_open(url, 'rb') as fin:
            text = fin.read()
        self.assertTrue(text.startswith('В начале июля, в чрезвычайно'.encode('utf-8')))
        self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'.encode('utf-8')))

    @unittest.skipIf(six.PY2, 'gzip support does not work on Py2')
Owner

Is this true? I'm pretty sure I've been opening .gz files with smart_open in Python 2.

Collaborator Author

@mpenkov mpenkov Apr 14, 2018

Yes, it's true. The master branch of smart_open currently has limited support for gzip: it works for local files and S3 only, regardless of which Python version you have installed. To the best of my understanding, on-the-fly gzip decompression never worked for HTTP, WebHDFS and HDFS. You can confirm this by running these same integration tests against master. You'll get an error similar to the following:

======================================================================
ERROR: test_read_gzip_text (__main__.ReadTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "integration-tests/test_http_copy.py", line 47, in test_read_gzip_text
    text = fin.read()
  File "/Users/misha/envs/smartopen2/lib/python2.7/codecs.py", line 486, in read
    newdata = self.stream.read()
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 295, in _read
    pos = self.fileobj.tell()   # Save current position
UnsupportedOperation: seek

Basically, Py2.7 gzip expects the underlying file object to be seekable: as the traceback shows, it calls .tell() on it, which fails with UnsupportedOperation: seek. Until someone explicitly implements seeking for HTTP, we won't be able to use Py2.7 gzip.

@menshikh-iv Can you please double-check and correct me if I'm wrong?
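
To illustrate the seek problem described above, here is a minimal sketch that is independent of smart_open's internals. `NonSeekableReader` and `iter_gunzip` are hypothetical names invented for this example; the reader only stands in for an HTTP response body. The sketch assumes Python 3 and a single-member gzip payload. It shows that `tell()` on a non-seekable stream raises `UnsupportedOperation`, and that `zlib.decompressobj` can decode gzip data incrementally using nothing but `read()`.

```python
import gzip
import io
import zlib


class NonSeekableReader(io.RawIOBase):
    """Hypothetical stand-in for an HTTP response body: read-only, no seek()."""

    def __init__(self, payload):
        self._buf = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # RawIOBase.read() is built on top of readinto().
        data = self._buf.read(len(b))
        b[:len(data)] = data
        return len(data)


def iter_gunzip(stream, chunk_size=16 * 1024):
    """Yield decompressed chunks using only stream.read(); never seeks.

    Assumption: the payload is a single gzip member.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


original = b'some text to compress\n' * 5000
stream = NonSeekableReader(gzip.compress(original))

try:
    stream.tell()  # the default tell() is implemented in terms of seek()
except io.UnsupportedOperation as err:
    print('tell() failed:', err)

restored = b''.join(iter_gunzip(stream, chunk_size=1024))
print(restored == original)
```

Only one compressed chunk and its decompressed output are held at a time, so memory use is bounded by `chunk_size` and the compression ratio rather than by the file size.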

Collaborator Author

There's some code here (https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/smart_open_lib.py#L756) to address the seek issue, but it doesn't seem to be helping, because the integration test above is failing.

Contributor

@menshikh-iv menshikh-iv Apr 14, 2018

@mpenkov I checked: it works fine with Py2 and smart_open==1.5.7, using this code:

import subprocess
import smart_open
import time

port = 8008


command = ['python', '-m', 'SimpleHTTPServer', str(port)]
s = subprocess.Popen(command)
time.sleep(1)

url = 'http://localhost:%d/smart_open/tests/test_data/crlf_at_1k_boundary.warc.gz' % port
with smart_open.smart_open(url, encoding='utf-8') as fin:
    text = fin.read()

print(text)
s.terminate()
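
The snippet above relies on `time.sleep(1)` to give the server time to start. As a hedged alternative, the sketch below polls the port until it accepts connections. `wait_for_port` is a hypothetical helper written for this example; it is not part of smart_open or of the test file in this PR.

```python
import socket
import time


def wait_for_port(port, host='localhost', timeout=5.0, interval=0.05):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            sock = socket.create_connection((host, port), timeout=interval)
        except socket.error:
            # Server is not accepting connections yet; retry shortly.
            time.sleep(interval)
        else:
            sock.close()
            return True
    return False
```

Under these assumptions, calling `wait_for_port(port)` right after `subprocess.Popen(command)` would replace the fixed one-second sleep, and a `False` return value would signal that the server never came up.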

Collaborator Author

@piskvorky @menshikh-iv Thanks for checking! I can confirm your code works. I will investigate and fix.

Collaborator Author

I had a closer look at why gzip was working in Py2 despite the lack of seek. Unfortunately, it seems to work at the expense of streaming functionality: this line reads the entire file into memory before gzip-decompressing. We could reimplement the same thing in the refactored branch, but is it worth it? We'd basically be surrendering the benefit of streaming without the user knowing it, which could cause out-of-memory situations on the user's side if the file is sufficiently large.

@piskvorky @menshikh-iv How do you think it is best to proceed?
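
A minimal sketch of the buffer-then-decompress pattern described above, to make the memory cost concrete. This is not the smart_open code that "this line" refers to; `CountingReader` and `gunzip_buffered` are hypothetical names used only for illustration, and the example assumes Python 3.

```python
import gzip
import io


class CountingReader(object):
    """Hypothetical wrapper that records the largest single read() result."""

    def __init__(self, payload):
        self._buf = io.BytesIO(payload)
        self.largest_read = 0

    def read(self, size=-1):
        data = self._buf.read(size)
        self.largest_read = max(self.largest_read, len(data))
        return data


def gunzip_buffered(stream):
    # Pull the whole compressed payload into memory first, so that gzip
    # gets a seekable BytesIO regardless of what `stream` supports.
    buffered = io.BytesIO(stream.read())
    with gzip.GzipFile(fileobj=buffered, mode='rb') as fin:
        return fin.read()


payload = b'crime and punishment\n' * 10000
compressed = gzip.compress(payload)
reader = CountingReader(compressed)

result = gunzip_buffered(reader)
print(result == payload)
# A single read() call pulled in the entire compressed payload.
print(reader.largest_read == len(compressed))
```

The pattern sidesteps the seek requirement, but the compressed payload must fit in memory in full before the first decompressed byte becomes available, which is the loss of streaming raised in the comment.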

Owner

@piskvorky piskvorky Apr 14, 2018

No -- a lack of streaming is definitely a bug. Can you open an issue for it?

Thanks for investigating @mpenkov! It's a pleasure to work with such knowledgeable and dedicated people.

Collaborator Author

@piskvorky Thank you! I've opened #189.

Sorry for misleading you earlier; my first investigation overlooked this buffering detail.

    def test_read_gzip_text(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt.gz' % PORT
        with smart_open.smart_open(url, encoding='utf-8') as fin:
            text = fin.read()
        self.assertTrue(text.startswith('В начале июля, в чрезвычайно жаркое время,'))
        self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'))

    def test_read_gzip_binary(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt.gz' % PORT
        with smart_open.smart_open(url, 'rb', ignore_extension=True) as fin:
            binary = fin.read()
        self.assertTrue(binary.startswith(GZIP_MAGIC))


if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
    unittest.main()