Reading S3 files becomes slow after 1.5.4 #152
I'm looking into this now.
I've reproduced the issue, but the performance gap between 1.5.3 and 1.5.4 wasn't as large as originally reported. @appierys, could you tell us the extension of the S3 file you were trying to open? The extension influences how smart_open actually opens the file. Here are my efforts to reproduce the bug:
1.5.4 appears to be around 6 times slower than 1.5.3. reproduce.py:

```python
import sys
from tqdm import tqdm
from smart_open import smart_open

for i, _ in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
    if i > 1e3:
        break
```

Some more measurements:
Baseline using AWS CLI tools:
smart_open takes significantly longer to read the entire file (even with the faster 1.5.3):
Some basic profiling:
Examining the profiles:
Interestingly, 1.5.4 calls functions like `_SSLSocket.read` and `_FileIO.read` more often than 1.5.3. 1.5.4 also appears to be opening two SSL connections (connect, handshake), and it spends the majority of its time in a `read()` method inside s3.py. The number of calls is quite large: perhaps the buffer size is not being set correctly?
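For reference, profiles like the ones described above can be collected with the standard-library profiler. A minimal sketch, assuming smart_open is installed; the S3 URL and the output file name are placeholders:

```python
# Profile the read loop with cProfile and print the hottest functions by
# cumulative time. The S3 URL and the output file name are placeholders.
import cProfile
import pstats

from smart_open import smart_open


def read_some_lines(url, limit=1000):
    for i, _ in enumerate(smart_open(url, 'rb')):
        if i > limit:
            break


cProfile.runctx(
    "read_some_lines('s3://my-bucket/test.tgz')",
    globals(), locals(),
    'profile.out',
)
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(20)
```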
Hi, here is my investigation. Confirming the 1.5.3 and 1.5.4 performance:

```
tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master)
$ virtualenv venv
tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master)
$ . ./venv/bin/activate
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master)
$ pip install -e . tqdm
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master)
$ git checkout 1.5.3
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3))
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
620it [00:01, 500.94it/s]
real 0m2.447s
user 0m0.156s
sys 0m0.016s
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3))
$ git checkout 1.5.4
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4))
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
976it [00:07, 135.19it/s]
real 0m8.429s
user 0m0.892s
sys 0m0.064s
```

After increasing the buffer size by a factor of 10, the performance looks better:

```
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4))
$ git diff
diff --git a/smart_open/s3.py b/smart_open/s3.py
index 420afbd..2b0fcae 100644
--- a/smart_open/s3.py
+++ b/smart_open/s3.py
@@ -196,7 +196,7 @@ class BufferedInputBase(io.BufferedIOBase):
#
# logger.debug('filling %r byte-long buffer up to %r bytes', len(self._buffer), size)
while len(self._buffer) < size and not self._eof:
- raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE)
+ raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE*10)
if len(raw):
self._buffer += raw
else:
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
976it [00:03, 287.71it/s]
real 0m4.676s
user 0m1.260s
sys 0m0.040s
```

However, when I increase the buffer size by a factor of 100, the performance gets worse:

```
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4))
$ git diff
diff --git a/smart_open/s3.py b/smart_open/s3.py
index 420afbd..d87881e 100644
--- a/smart_open/s3.py
+++ b/smart_open/s3.py
@@ -196,7 +196,7 @@ class BufferedInputBase(io.BufferedIOBase):
#
# logger.debug('filling %r byte-long buffer up to %r bytes', len(self._buffer), size)
while len(self._buffer) < size and not self._eof:
- raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE)
+ raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE*100)
if len(raw):
self._buffer += raw
else:
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4))
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
997it [00:17, 56.48it/s]
real 0m18.878s
user 0m16.272s
sys 0m0.072s
```

I assumed this was because byte-string slicing had become the bottleneck, so I refactored the code to avoid slicing (otamachan@6989112). The result:

```
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4))
$ git remote add otamachan https://github.com/otamachan/smart_open.git
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4))
$ git fetch otamachan
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4))
$ git checkout improve-read-performance
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance)
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
997it [00:02, 419.11it/s]
real 0m3.613s
user 0m0.596s
sys 0m0.092s
```

This looks better, but it is still worse than 1.5.3. Hope this helps.
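For what it's worth, here is a stand-alone sketch (not smart_open's actual code) of why trimming a large `bytes` buffer by slicing can dominate: every `buf = buf[n:]` copies the remaining tail, so draining the buffer chunk by chunk is quadratic overall, while a `memoryview` (or an explicit read offset) avoids the copies.

```python
# Stand-alone illustration (not smart_open code): draining a bytes buffer by
# slicing copies the tail on every step (O(n^2) overall), while slicing a
# memoryview over the same data is zero-copy.
import time

data = b'x' * (20 * 1024 * 1024)   # 20 MiB of pretend downloaded data
CHUNK = 8192


def drain_with_slicing(buf):
    while buf:
        chunk, buf = buf[:CHUNK], buf[CHUNK:]   # copies the whole remaining tail


def drain_with_memoryview(buf):
    view, pos = memoryview(buf), 0
    while pos < len(buf):
        chunk = view[pos:pos + CHUNK]           # no copy, just a new view
        pos += CHUNK


for fn in (drain_with_slicing, drain_with_memoryview):
    start = time.time()
    fn(data)
    print(fn.__name__, round(time.time() - start, 2), 's')
```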
Hi @mpenkov, perhaps the gap is larger for me because I am accessing the bucket from a different region, while the bucket is in us-east-1. The file extension is
Some statistics when 600K lines are read. Results from 1.5.3:
For 1.5.4, it takes much more time to complete the test:
@appierys thank you for providing more details. I identified the problem as a buffering issue. It’s addressed in this branch: https://github.com/mpenkov/smart_open Could you try it and let me know if it resolves your issue?
Hi @mpenkov, thanks. The results from your master branch are below: it's faster than before, but still slower than 1.5.3 when 600K lines are read:
@appierys OK, I can see that it is still several times slower. This matches the results on my side as well. Will continue looking into it. Thank you.
Hi, I also implemented a version that reads ahead in a background thread, and compared it with 1.5.3 on a larger file:

```
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master)
$ git checkout 1.5.3
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3))
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:32, 18355.81it/s]
real 0m33.760s
user 0m2.904s
sys 0m0.444s
598857it [00:24, 24435.69it/s]
real 0m25.990s
user 0m2.724s
sys 0m0.332s
598857it [00:27, 22121.49it/s]
real 0m28.339s
user 0m2.944s
sys 0m0.276s
598857it [00:25, 23430.73it/s]
real 0m26.758s
user 0m2.684s
sys 0m0.392s
598857it [00:34, 17357.63it/s]
real 0m35.697s
user 0m2.848s
sys 0m0.364s
$ git checkout improve-read-performance
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance)
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:35, 16928.79it/s]
real 0m36.654s
user 0m2.252s
sys 0m0.604s
598857it [00:36, 16548.59it/s]
real 0m37.339s
user 0m2.488s
sys 0m0.732s
598857it [00:32, 18385.05it/s]
real 0m33.761s
user 0m2.244s
sys 0m0.580s
598857it [00:28, 20821.97it/s]
real 0m29.974s
user 0m2.216s
sys 0m0.716s
598857it [00:24, 24546.06it/s]
real 0m25.625s
user 0m2.184s
sys 0m0.616s
```

Thanks.
What's behind this decrease in performance, despite buffering and threading? What changed so fundamentally? |
I assume that 1.5.3 (which used boto) streamed the object body with a single request, so sequential reads were cheap. 1.5.4 (based on boto3), on the other hand, issues a ranged GET request on every read, which adds per-request overhead.
@piskvorky Tamaki is right. I implemented the new seekable S3 reader based on boto3’s ranged GET queries, as we discussed in Issue #42. It’s possible that boto3 introduced some overhead, but I’m surprised it’s this much. I’m still hunting around with a profiler.
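For context, a ranged GET through boto3 looks roughly like this; the bucket and key names are placeholders, and credentials are assumed to be configured in the environment.

```python
# Rough sketch of a ranged GET via boto3: each call fetches only the requested
# byte range of the object. Bucket and key names below are placeholders.
import boto3

obj = boto3.resource('s3').Object('my-bucket', 'path/to/key')

# Fetch the first 8 KiB of the object (the byte range is inclusive).
response = obj.get(Range='bytes=0-8191')
chunk = response['Body'].read()
print(len(chunk))
```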
@otamachan Threading is a good idea for some applications, but I don't think introducing it here is appropriate, because it limits our (and our users') options with regard to multiprocessing. More specifically, once you start up threads, you can't safely fork your process, because the fork copies synchronization primitives like mutexes to the child process. This means people who already use smart_open in a multiprocessing-based application would experience problems.
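To illustrate the concern, here is a stand-alone, POSIX-only sketch (not smart_open code): if a worker thread holds a lock when the process forks, the child inherits a copy of the locked lock but not the thread that would release it, so the child can block indefinitely.

```python
# Stand-alone, POSIX-only illustration of the fork-after-threads hazard:
# the child process inherits a copy of the locked lock, but the worker
# thread that would release it does not exist in the child.
import os
import threading
import time

lock = threading.Lock()


def worker():
    with lock:
        time.sleep(5)   # hold the lock long enough for the fork to happen


threading.Thread(target=worker, daemon=True).start()
time.sleep(0.5)         # give the worker time to acquire the lock

pid = os.fork()
if pid == 0:
    # Child: try to take the lock; this would block forever without a timeout.
    acquired = lock.acquire(timeout=3)
    print('child acquired lock:', acquired)   # prints False
    os._exit(0)
else:
    os.waitpid(pid, 0)
```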
@mpenkov Thanks for your comment. I agree with you. I just wanted to confirm whether or not threading could improve the performance.
Here are some more profiling results. They were obtained using the 6c22ed8 commit. I used this reproduce.py:

```python
import sys
import logging
from tqdm import tqdm
from smart_open import smart_open
logging.basicConfig(level=logging.CRITICAL)
length_sum = 0
limit = int(sys.argv[2])
for i, line in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
    if i > limit:
        break
    length_sum += len(line)
print(length_sum / i)
```

I used different input files from the same bucket, because I noticed caching behavior that confounds the tests. So, for timing, it makes sense to change the input file (but keep the same number of iterations) between tests. I found that for both 1.5.3 and the current master branch (mine), the majority of the time is spent in SSLSocket.read (top: 1.5.3, bottom: master):
1.5.3 calls the method 2679 times, taking 0.027s per call; master calls it 1428 times, taking 0.077s per call. This is rather odd, given that they're reading approximately the same amount of data. Perhaps master has a buffer somewhere that is twice the size of 1.5.3's counterpart? That would explain why master makes fewer calls, but not why each call takes so much longer (2679/1428 ≈ 2, but 0.077/0.027 ≈ 3). I tried adjusting the buffer in the s3 module, but it appears to have no effect.
Hi, here is a comparison of 1.5.3 against the improve-read-performance2 branch:

```
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance2)
$ git checkout 1.5.3
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3))
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:45, 13091.32it/s]
real 0m46.958s
user 0m3.056s
sys 0m0.376s
598857it [00:26, 22613.33it/s]
real 0m27.647s
user 0m2.720s
sys 0m0.364s
598857it [00:25, 23576.58it/s]
real 0m26.655s
user 0m2.800s
sys 0m0.340s
598857it [00:21, 27806.55it/s]
real 0m22.771s
user 0m2.848s
sys 0m0.288s
598857it [00:35, 16639.95it/s]
real 0m37.176s
user 0m2.952s
sys 0m0.352s
$ git checkout improve-read-performance2
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance2)
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:31, 19086.47it/s]
real 0m34.027s
user 0m9.868s
sys 0m0.184s
598857it [00:26, 22517.10it/s]
real 0m29.215s
user 0m10.084s
sys 0m0.184s
598857it [00:26, 22240.85it/s]
real 0m29.382s
user 0m9.848s
sys 0m0.224s
598857it [00:37, 15814.44it/s]
real 0m40.821s
user 0m10.348s
sys 0m0.244s
598857it [00:26, 22711.12it/s]
real 0m28.834s
user 0m9.936s
sys 0m0.224s
```
@otamachan Thank you for your suggestion! I had a closer look at your implementation and finally realized what the remaining problem was. It isn't necessary to go back to boto to achieve the same performance: we can do the same thing with the newer boto3. Thank you for pointing me in the right direction. Thank you very much!
* Resolve #152: perform a .get only when necessary. The previous implementation performed a .get every time it read into the buffer. This was wasteful and unnecessary: a single .get is sufficient to read from the current position until the end of the file, and another .get is only needed when seeking.
* Decrease the default buffer size to 128 kB. This matches the behavior of 1.5.3 more closely, as observed during live tests against S3.
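In other words, the reader can keep a single streaming body open and read from it sequentially, issuing a new request only when a seek invalidates the stream. A rough sketch of that idea (illustrative only, not the actual smart_open implementation; bucket and key names are placeholders):

```python
# Illustrative sketch of the idea behind the fix (not smart_open's code):
# one GET streams everything from the current position onward; only a seek
# discards the stream and triggers another ranged GET.
import boto3


class LazySeekableReader(object):
    def __init__(self, bucket, key):
        self._object = boto3.resource('s3').Object(bucket, key)
        self._position = 0
        self._body = None                      # opened lazily on first read

    def _open_body(self):
        # A single request streams from the current position to the end.
        range_header = 'bytes=%d-' % self._position
        self._body = self._object.get(Range=range_header)['Body']

    def read(self, size=-1):
        if self._body is None:
            self._open_body()
        data = self._body.read() if size < 0 else self._body.read(size)
        self._position += len(data)
        return data

    def seek(self, position):
        # Seeking is the only operation that forces a new GET.
        self._position = position
        self._body = None
        return position
```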
@mpenkov I'm happy that the performance is back thanks to your corrected implementation, and that I could be of help. I really appreciate your great effort to improve this wonderful library! Thank you, too!
Not sure if it's just me, but I tested 1.5.5 and obtained the following results:
@appierys This was fixed by fbc82cc and the commit is only in the master branch. Please try the master branch. |
@otamachan Thanks~ It works. |
@otamachan I'll release 1.5.6, but slightly later (after setting up the integration testing contour).
As mentioned earlier in #74, it appears that the reading speed is very slow after 1.5.4.
```
$ pyvenv-3.4 env
$ source env/bin/activate
$ pip install smart_open==1.5.3 tqdm ipython
$ ipython
2868923it [00:53, 53888.94it/s]
```
```
$ pyvenv-3.4 env
$ source env/bin/activate
$ pip install smart_open==1.5.4 tqdm ipython
$ ipython
8401it [00:18, 442.64it/s]
```

(Too slow, so I could not wait for it to finish.)