Refactoring smart_open to share compression and encoding functionality #185
Merged
Changes from 1 commit
14 commits
3c1eddf Refactoring smart_open to share compression and encoding functionality (mpenkov)
b9c5ebc fixup in webhdfs.py: detect Py2 correctly (mpenkov)
1ceaab2 Disable gzip tests for Py2 (mpenkov)
9ed0aa5 fix unit tests by explicitly specifying encoding (mpenkov)
945e763 adding HTTP integration tests (mpenkov)
0ce0651 when seek is missing but required, buffer in memory (mpenkov)
65d05c4 work around missing .seekable in py2 (mpenkov)
df5893c fixup in webhdfs.py: expect response.content to be bytes (mpenkov)
edd291f added some sample code for WebHDFS/HDFS integration tests (mpenkov)
ec944cf specify working directory for HTTP server (mpenkov)
de7be43 include http tests in travis.yml (mpenkov)
92e4c5f sleep for 1s to avoid race condition (mpenkov)
54b78cf fixup in http integration tests (mpenkov)
4418b38 set Accept-Encoding header, point tests at github.com (mpenkov)
@@ -0,0 +1,61 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
from __future__ import unicode_literals

import logging
import subprocess
import unittest

import six

import smart_open

PORT = 8008
GZIP_MAGIC = b'\x1f\x8b'


def startup_server(port=PORT):
    command = ['python', '-m', 'SimpleHTTPServer', str(port)]
    return subprocess.Popen(command)


class ReadTest(unittest.TestCase):
    def setUp(self):
        self.sub = startup_server()

    def tearDown(self):
        self.sub.terminate()

    def test_read_text(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt' % PORT
        with smart_open.smart_open(url, encoding='utf-8') as fin:
            text = fin.read()
        self.assertTrue(text.startswith('В начале июля, в чрезвычайно жаркое время,'))
        self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'))

    def test_read_binary(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt' % PORT
        with smart_open.smart_open(url, 'rb') as fin:
            text = fin.read()
        self.assertTrue(text.startswith('В начале июля, в чрезвычайно'.encode('utf-8')))
        self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'.encode('utf-8')))

    @unittest.skipIf(six.PY2, 'gzip support does not work on Py2')
    def test_read_gzip_text(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt.gz' % PORT
        with smart_open.smart_open(url, encoding='utf-8') as fin:
            text = fin.read()
        self.assertTrue(text.startswith('В начале июля, в чрезвычайно жаркое время,'))
        self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'))

    def test_read_gzip_binary(self):
        url = 'http://localhost:%d/smart_open/tests/test_data/crime-and-punishment.txt.gz' % PORT
        with smart_open.smart_open(url, 'rb', ignore_extension=True) as fin:
            binary = fin.read()
        self.assertTrue(binary.startswith(GZIP_MAGIC))


if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
    unittest.main()
Is this true? I'm pretty sure I've been opening .gz files with smart_open in Python 2.
Yes, it's true. The master branch of smart_open currently has limited support for gzip: it works for local files and S3 only, regardless of which Python version you have installed. To the best of my understanding, on-the-fly gzip decompression never worked for HTTP, WebHDFS and HDFS. You can confirm this by running these same integration tests against master. You'll get an error similar to the following:
Basically, Py2.7 gzip expects a .seek() operation to be implemented on the file object. Until someone explicitly implements seeking for HTTP, we won't be able to use Py2.7 gzip.
@menshikh-iv Can you please double-check and correct me if I'm wrong?
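For illustration only, here is a minimal sketch of the seek problem described above and of the "buffer in memory" workaround. It is not smart_open's code: `NonSeekableStream` and `ensure_seekable` are hypothetical names, and the sketch targets Python 3 (`gzip.compress` is not available on Py2.7).

```python
import gzip
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for an HTTP response body: readable, not seekable."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        # RawIOBase builds read() and readall() on top of readinto().
        return self._buffer.readinto(b)


def ensure_seekable(stream):
    """Return the stream unchanged if it can seek; otherwise buffer it fully in memory."""
    seekable = getattr(stream, 'seekable', None)  # some file-likes lack .seekable
    if seekable is not None and seekable():
        return stream
    # Assumption: the whole payload fits in memory. This gives up streaming.
    return io.BytesIO(stream.read())


payload = gzip.compress(b'hello world\n' * 3)
raw = NonSeekableStream(payload)
with gzip.GzipFile(fileobj=ensure_seekable(raw), mode='rb') as fin:
    text = fin.read()
```

The cost of this approach is memory proportional to the size of the compressed payload, so it only suits small files.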
There's some code here (https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/smart_open_lib.py#L756) to address the seek issue, but it doesn't seem to be helping, because the integration test above is failing.
@mpenkov I checked, works fine with py2 and smart_open==1.5.7, using this code
@piskvorky @menshikh-iv Thanks for checking! I can confirm your code works. I will investigate and fix.
I had a closer look at why gzip was working in Py2 despite the lack of seek. Unfortunately, it seems like it works at the expense of streaming functionality: this line reads the entire file into memory before gzip-decompressing. We could reimplement the same thing in the refactored branch, but is it worth it? We're basically surrendering the benefit of streaming without the user knowing it - it could cause out-of-memory situations on the user side if the file is sufficiently large.
@piskvorky @menshikh-iv How do you think it is best to proceed?
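As a point of comparison, one way to keep streaming without `.seek()` is to decompress incrementally with `zlib`. This is only a sketch under assumptions, not what either branch does: `iter_gunzip` is a hypothetical helper, it handles a single-member gzip stream only, and the driver code at the bottom is Python 3.

```python
import gzip
import io
import zlib


def iter_gunzip(stream, chunk_size=16 * 1024):
    """Yield decompressed chunks from a gzip stream using only sequential read()."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still held in the decompressor's internal buffer.
    tail = decompressor.flush()
    if tail:
        yield tail


original = 'В начале июля'.encode('utf-8') * 1000
compressed = io.BytesIO(gzip.compress(original))
restored = b''.join(iter_gunzip(compressed, chunk_size=64))
```

Memory use here is bounded by `chunk_size` plus the decompressor's window rather than by the file size, which is the property the in-memory buffering gives up.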
No -- a lack of streaming is definitely a bug. Can you open an issue for it?
Thanks for investigating @mpenkov! It's a pleasure to work with such knowledgeable and dedicated people.
@piskvorky Thank you! I've opened #189.
Sorry for misleading you earlier, my first investigation overlooked this buffering detail.