Reading S3 files becomes slow after 1.5.4 #152

Closed
appierys opened this issue Dec 1, 2017 · 26 comments · Fixed by #157

Comments

@appierys

appierys commented Dec 1, 2017

As mentioned earlier in #74, it appears that the reading speed is very slow after 1.5.4.

$ pyvenv-3.4 env
$ source env/bin/activate
$ pip install smart_open==1.5.3 tqdm ipython
$ ipython
from tqdm import tqdm
from smart_open import smart_open
for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
    pass

2868923it [00:53, 53888.94it/s]

$ pyvenv-3.4 env
$ source env/bin/activate
$ pip install smart_open==1.5.4 tqdm ipython
$ ipython
from tqdm import tqdm
from smart_open import smart_open
for _ in tqdm(smart_open('s3://xxxxx', 'rb')):
    pass

8401it [00:18, 442.64it/s] (too slow, so I could not wait for it to finish)

@menshikh-iv
Contributor

Thanks for the report @appierys, that's really suspicious; we'll investigate it. Related issue: #151.

@mpenkov
Collaborator

mpenkov commented Dec 1, 2017

I'm looking into this now.

@mpenkov
Collaborator

mpenkov commented Dec 1, 2017

I've reproduced the issue, but the performance gap between 1.5.3 and 1.5.4 wasn't as large as originally reported. @appierys, could you tell us the extension of the S3 file you were trying to open? The extension influences how smart_open actually opens the file.
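For example (just an illustration with a placeholder bucket and key, not taken from the report), the same call behaves differently depending on the suffix:

from smart_open import smart_open

# Plain object: bytes are streamed from S3 as-is.
plain = smart_open('s3://some-bucket/data.json', 'rb')

# A .gz key is transparently decompressed, so iteration speed depends on
# the decompressor as well as on S3 throughput.
gzipped = smart_open('s3://some-bucket/data.json.gz', 'rb')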

Here are my efforts to reproduce the bug:

(so1.5.3)sergeyich:issue152 misha$ time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx
705it [00:01, 383.09it/s]

real    0m3.395s
user    0m0.233s
sys     0m0.044s
(so1.5.3)sergeyich:issue152 misha$ deactivate
sergeyich:issue152 misha$ source so1.5.4/bin/activate
(so1.5.4)sergeyich:issue152 misha$ time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx
977it [00:15, 64.74it/s]
real    0m18.065s
user    0m0.813s
sys     0m0.164s
(so1.5.4)sergeyich:issue152 misha$

1.5.4 appears to be around 6 times slower than 1.5.3.

reproduce.py:

import sys

from tqdm import tqdm
from smart_open import smart_open
for i, _ in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
    if i > 1e3:
        break

Some more measurements:

sergeyich:issue152 misha$ source so1.5.4/bin/activate
(so1.5.4)sergeyich:issue152 misha$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx; done
977it [00:11, 87.34it/s]
real    0m14.850s
user    0m0.877s
sys     0m0.153s
977it [00:23, 41.57it/s]
real    0m28.734s
user    0m0.696s
sys     0m0.080s
977it [00:10, 96.27it/s]
real    0m14.238s
user    0m0.761s
sys     0m0.090s
977it [00:10, 95.36it/s]
real    0m13.201s
user    0m0.876s
sys     0m0.153s
977it [00:11, 82.13it/s]
real    0m14.708s
user    0m0.696s
sys     0m0.084s
(so1.5.4)sergeyich:issue152 misha$ source so1.5.3/bin/activate
(so1.5.3)sergeyich:issue152 misha$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx; done
977it [00:07, 135.18it/s]

real    0m9.209s
user    0m0.370s
sys     0m0.096s
977it [00:05, 163.32it/s]

real    0m7.956s
user    0m0.234s
sys     0m0.051s
977it [00:06, 162.83it/s]

real    0m7.706s
user    0m0.248s
sys     0m0.048s
977it [00:09, 102.62it/s]

real    0m12.362s
user    0m0.243s
sys     0m0.048s
834it [00:03, 277.33it/s]

real    0m6.153s
user    0m0.242s
sys     0m0.047s

Baseline using AWS CLI tools:

(so1.5.3)sergeyich:issue152 misha$ time !!
time aws s3 cp s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx .
download: s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx to ./cluster.idx

real    4m1.325s
user    0m2.154s
sys     0m1.587s

smart_open takes significantly longer to read the entire file (even with the faster 1.5.3):

(so1.5.3)sergeyich:issue152 misha$ time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx
598857it [22:41, 439.92it/s]

real    22m43.424s
user    0m3.633s
sys     0m1.095s

@mpenkov
Collaborator

mpenkov commented Dec 1, 2017

Some basic profiling:

time python -m cProfile -o 154.out reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx
time python -m cProfile -o 153.out reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx

Examining the profiles:

(so1.5.3)sergeyich:issue152 misha$ python -m pstats 153.out
Welcome to the profile statistics browser.
153.out% sort time
153.out% stats 10
Sat Dec  2 00:14:37 2017    153.out

         160797 function calls (156627 primitive calls) in 7.134 seconds

   Ordered by: internal time
   List reduced from 2097 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       32    5.513    0.172    5.513    0.172 {method 'read' of '_ssl._SSLSocket' objects}
        1    0.921    0.921    0.921    0.921 {method 'do_handshake' of '_ssl._SSLSocket' objects}
        1    0.250    0.250    0.250    0.250 {method 'connect' of '_socket.socket' objects}
      250    0.057    0.000    0.057    0.000 {method 'read' of '_io.FileIO' objects}
      250    0.041    0.000    0.041    0.000 {built-in method marshal.loads}
        1    0.037    0.037    0.037    0.037 {built-in method _socket.getaddrinfo}
     1464    0.029    0.000    0.029    0.000 {built-in method posix.stat}
  846/845    0.024    0.000    0.045    0.000 {built-in method builtins.__build_class__}
      250    0.012    0.000    0.069    0.000 <frozen importlib._bootstrap_external>:830(get_data)
       25    0.011    0.000    0.011    0.000 {built-in method _imp.create_dynamic}


153.out%
Goodbye.
(so1.5.3)sergeyich:issue152 misha$ python -m pstats 154.out
Welcome to the profile statistics browser.
154.out% sort time
154.out% stats 10
Sat Dec  2 00:14:06 2017    154.out

         777545 function calls (770324 primitive calls) in 9.819 seconds

   Ordered by: internal time
   List reduced from 3531 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       47    7.066    0.150    7.066    0.150 {method 'read' of '_ssl._SSLSocket' objects}
        2    0.932    0.466    0.932    0.466 {method 'do_handshake' of '_ssl._SSLSocket' objects}
        2    0.403    0.202    0.403    0.202 {method 'connect' of '_socket.socket' objects}
      415    0.158    0.000    0.158    0.000 {method 'read' of '_io.FileIO' objects}
   118917    0.135    0.000    0.145    0.000 /Users/misha/git/smart_open/issue152/so1.5.4/lib/python3.6/site-packages/smart_open/s3.py:228(_read_from_buffer)
   118917    0.082    0.000    6.312    0.000 /Users/misha/git/smart_open/issue152/so1.5.4/lib/python3.6/site-packages/smart_open/s3.py:172(read)
      415    0.081    0.000    0.081    0.000 {built-in method marshal.loads}
     2828    0.058    0.000    0.058    0.000 {built-in method posix.stat}
     1003    0.055    0.000    6.369    0.006 /Users/misha/git/smart_open/issue152/so1.5.4/lib/python3.6/site-packages/tqdm/_tqdm.py:923(__iter__)
1412/1408    0.053    0.000    0.086    0.000 {built-in method builtins.__build_class__}

Interestingly, 1.5.4 calls functions like _SSLSocket.read and _FileIO.read more often than 1.5.3, and it also appears to be opening two SSL connections (two connect and do_handshake calls).

1.5.4 spends the majority of the time in a read() method inside s3.py. The number of calls is quite large: perhaps the buffer size is not being set correctly?
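One way to check this hypothesis is to drill into the saved profile with pstats, e.g. (a sketch, assuming the 154.out dump above is in the working directory):

import pstats

stats = pstats.Stats('154.out')
stats.sort_stats('tottime')

# Who calls the functions defined in s3.py, and how many times?
stats.print_callers('s3.py')

# What do the read() implementations themselves spend time in?
stats.print_callees('read')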

@mpenkov
Collaborator

mpenkov commented Dec 1, 2017

153.out.gz
154.out.gz

@otamachan

Hi,
I also investigated this issue, as I encountered the performance problem too.

Here is my investigation.

I used the reproduce.py written by @mpenkov.

Confirming 1.5.3 and 1.5.4 performance:

tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master) 
$ virtualenv venv
tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master) 
$ . ./venv/bin/activate
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master) 
$ pip install -e . tqdm
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master) 
$ git checkout 1.5.3
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3)) 
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
620it [00:01, 500.94it/s]

real    0m2.447s
user    0m0.156s
sys     0m0.016s
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3)) 
$ git checkout 1.5.4
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4)) 
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
976it [00:07, 135.19it/s]

real    0m8.429s
user    0m0.892s
sys     0m0.064s

After increasing the buffer size by a factor of 10, the performance looks better.

(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4)) 
$ git diff
diff --git a/smart_open/s3.py b/smart_open/s3.py
index 420afbd..2b0fcae 100644
--- a/smart_open/s3.py
+++ b/smart_open/s3.py
@@ -196,7 +196,7 @@ class BufferedInputBase(io.BufferedIOBase):
         #
         # logger.debug('filling %r byte-long buffer up to %r bytes', len(self._buffer), size)
         while len(self._buffer) < size and not self._eof:
-            raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE)
+            raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE*10)
             if len(raw):
                 self._buffer += raw
             else:
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
976it [00:03, 287.71it/s]

real    0m4.676s
user    0m1.260s
sys     0m0.040s

However, when I increase the buffer size by a factor of 100, the performance gets worse.
(For a larger file, such as a 1 GB file, I think the buffer size should be increased to dozens of megabytes.)

(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4)) 
$ git diff
diff --git a/smart_open/s3.py b/smart_open/s3.py
index 420afbd..d87881e 100644
--- a/smart_open/s3.py
+++ b/smart_open/s3.py
@@ -196,7 +196,7 @@ class BufferedInputBase(io.BufferedIOBase):
         #
         # logger.debug('filling %r byte-long buffer up to %r bytes', len(self._buffer), size)
         while len(self._buffer) < size and not self._eof:
-            raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE)
+            raw = self._raw_reader.read(size=io.DEFAULT_BUFFER_SIZE*100)
             if len(raw):
                 self._buffer += raw
             else:
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4)) 
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
997it [00:17, 56.48it/s]
real    0m18.878s
user    0m16.272s
sys     0m0.072s

I assumed this is because byte-string slicing becomes the bottleneck, so I refactored the code to avoid slicing (otamachan@6989112); a simplified sketch of the idea follows.
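The difference, roughly, is the one sketched below (a simplified illustration, not the actual patch): repeatedly slicing a growing bytes object copies the unread tail on every read, while a cursor-based buffer only advances a position.

import io

class SliceBuffer:
    """Keeps the buffer as a single bytes object; every read copies the unread tail."""
    def __init__(self):
        self._data = b''

    def fill(self, chunk):
        self._data += chunk

    def read(self, size):
        out, self._data = self._data[:size], self._data[size:]  # copies the remainder
        return out

class CursorBuffer:
    """Keeps a read cursor in a BytesIO; reads only advance the cursor."""
    def __init__(self):
        self._buf = io.BytesIO()

    def fill(self, chunk):
        read_pos = self._buf.tell()
        self._buf.seek(0, io.SEEK_END)   # append at the end...
        self._buf.write(chunk)
        self._buf.seek(read_pos)         # ...then restore the read position

    def read(self, size):
        return self._buf.read(size)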

The result:

(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4)) 
$ git remote add otamachan https://github.com/otamachan/smart_open.git
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4)) 
$ git fetch otamachan
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.4)) 
$ git checkout improve-read-performance 
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance) 
$ time python reproduce.py s3://otamachan-test-us-east-1/test.tgz
997it [00:02, 419.11it/s]

real    0m3.613s
user    0m0.596s
sys     0m0.092s

Looks better, but still worse than 1.5.3.

Hope this helps.

@appierys
Author

appierys commented Dec 4, 2017

Hi @mpenkov, perhaps the gap is larger for me because I am accessing the bucket from a different region, while the bucket is in us-east-1. In addition, 1.5.3 is initially slow but gets faster over time, so if you only test 1K lines the difference would be smaller.

The file extension is .json, but the file actually contains many lines of JSON (one object per line), and the total file size is 920 MB.

@appierys
Author

appierys commented Dec 4, 2017

Some statistics when 1e3 is changed to 1e6:

Results from 1.5.3:

(1.5.3) $ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:43, 13639.56it/s]

real    0m45.683s
user    0m4.916s
sys     0m0.876s
598857it [00:42, 14215.39it/s]

real    0m43.882s
user    0m4.688s
sys     0m0.784s
598857it [00:35, 16742.60it/s]

real    0m37.726s
user    0m4.744s
sys     0m0.836s
598857it [01:35, 6293.18it/s]

real    1m36.958s
user    0m5.036s
sys     0m0.784s
598857it [00:50, 11935.76it/s]

real    0m52.096s
user    0m5.020s
sys     0m0.772s

For 1.5.4, it takes much longer to complete the test:

(1.5.4) $ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [43:17, 230.51it/s]

real    43m20.637s
user    3m0.732s
sys     0m2.332s
598857it [43:08, 231.39it/s]

real    43m10.560s
user    2m40.092s
sys     0m2.764s
598857it [45:03, 221.52it/s]

real    45m5.965s
user    3m16.744s
sys     0m11.140s
598857it [44:55, 222.17it/s]

real    44m59.066s
user    3m34.840s
sys     0m2.552s
598857it [43:40, 228.52it/s]

real    43m43.102s
user    2m56.752s
sys     0m2.072s

@mpenkov
Collaborator

mpenkov commented Dec 4, 2017

@appierys thank you for providing more details.

I identified the problem as a buffering issue. It’s addressed in this branch: https://github.com/mpenkov/smart_open

Could you try it and let me know if it resolves your issue?

@appierys
Author

appierys commented Dec 4, 2017

Hi @mpenkov,

Thanks. The results from your master branch are below; it's faster than before, but still slower than 1.5.3 when 600K lines are read:

(master) $ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [02:00, 4958.64it/s]

real    2m3.100s
user    0m16.044s
sys     0m0.760s
598857it [03:02, 3284.05it/s]

real    3m4.428s
user    0m16.248s
sys     0m0.764s
598857it [03:15, 3064.41it/s]

real    3m17.652s
user    0m16.532s
sys     0m0.532s
598857it [02:56, 3389.60it/s]

real    2m58.812s
user    0m16.764s
sys     0m0.696s
598857it [02:40, 3727.32it/s]

real    2m43.201s
user    0m14.988s
sys     0m0.844s

@mpenkov
Collaborator

mpenkov commented Dec 4, 2017

@appierys OK, I can see that it is still several times slower. This matches the results on my side as well. Will continue looking into it. Thank you.

@otamachan

Hi,

I also implemented the readline method and made the raw reader threaded (otamachan@569169d).
When I increase the buffer size to 20 MB, it looks better, but it is still worse than 1.5.3.

(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (master) 
$ git checkout 1.5.3
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3)) 
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:32, 18355.81it/s]

real    0m33.760s
user    0m2.904s
sys     0m0.444s
598857it [00:24, 24435.69it/s]

real    0m25.990s
user    0m2.724s
sys     0m0.332s
598857it [00:27, 22121.49it/s]

real    0m28.339s
user    0m2.944s
sys     0m0.276s
598857it [00:25, 23430.73it/s]

real    0m26.758s
user    0m2.684s
sys     0m0.392s
598857it [00:34, 17357.63it/s]

real    0m35.697s
user    0m2.848s
sys     0m0.364s
$ git checkout improve-read-performance 
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance) 
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:35, 16928.79it/s]

real    0m36.654s
user    0m2.252s
sys     0m0.604s
598857it [00:36, 16548.59it/s]

real    0m37.339s
user    0m2.488s
sys     0m0.732s
598857it [00:32, 18385.05it/s]

real    0m33.761s
user    0m2.244s
sys     0m0.580s
598857it [00:28, 20821.97it/s]

real    0m29.974s
user    0m2.216s
sys     0m0.716s
598857it [00:24, 24546.06it/s]

real    0m25.625s
user    0m2.184s
sys     0m0.616s

Thanks.

@piskvorky
Owner

piskvorky commented Dec 5, 2017

What's behind this decrease in performance, despite buffering and threading? What changed so fundamentally?

@otamachan

I assume the following:
1.5.3 uses https://github.com/boto/boto/blob/develop/boto/s3/key.py#L308.
This issues one GET request and reads the body asynchronously, but it is not seekable.

On the other hand,
1.5.4 uses https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/s3.py#L89.
This splits the GET request into several parts and waits until each read is done. This decreases the throughput, but is seekable. A rough sketch of the two access patterns is below.
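Roughly (a sketch with a placeholder bucket and key, not the actual smart_open code; 1.5.3 actually uses boto rather than boto3, but the shape of the traffic is the same):

import boto3

s3 = boto3.client('s3')
BUCKET, KEY = 'example-bucket', 'example.json'   # placeholders

# 1.5.3-style: a single GET, then stream the body sequentially.
body = s3.get_object(Bucket=BUCKET, Key=KEY)['Body']
while True:
    chunk = body.read(128 * 1024)
    if not chunk:
        break

# 1.5.4-style: a separate ranged GET per buffer fill (seekable, but every
# fill pays the round-trip latency of a new request).
size = s3.head_object(Bucket=BUCKET, Key=KEY)['ContentLength']
pos, step = 0, 128 * 1024
while pos < size:
    byte_range = 'bytes=%d-%d' % (pos, min(pos + step, size) - 1)
    chunk = s3.get_object(Bucket=BUCKET, Key=KEY, Range=byte_range)['Body'].read()
    pos += len(chunk)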

@mpenkov
Collaborator

mpenkov commented Dec 6, 2017

@piskvorky Tamaki is right. I implemented the new seekable S3 reader based on boto3’s ranged GET queries, as we discussed in Issue #42. It’s possible that boto3 introduced some overhead, but I’m surprised it’s this much.

I’m still hunting around with a profiler.

@mpenkov
Collaborator

mpenkov commented Dec 6, 2017

@otamachan Threading is a good idea for some applications, but I don't think introducing it here is appropriate, because it limits our (and our users') options with regard to multiprocessing. More specifically, once you start up threads, you can't fork your process (because the fork will copy synchronization primitives like mutexes to the child process).

This means people who already use smart_open in a multiprocessing-based application will experience problems.
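The classic failure mode is easy to demonstrate on a POSIX system (a contrived sketch, not smart_open code): a lock held by a background thread at fork time is copied into the child in its locked state, and the child blocks forever.

import multiprocessing
import threading
import time

lock = threading.Lock()

def hold_lock():
    lock.acquire()        # background thread takes the lock and never releases it
    time.sleep(60)

def child():
    with lock:            # the child inherits the lock already "held" and blocks
        print('child acquired the lock')

if __name__ == '__main__':
    threading.Thread(target=hold_lock, daemon=True).start()
    time.sleep(0.1)                                   # let the thread grab the lock
    p = multiprocessing.get_context('fork').Process(target=child)
    p.start()
    p.join(timeout=5)
    print('child still blocked:', p.is_alive())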

@otamachan

otamachan commented Dec 6, 2017

@mpenkov Thanks for your comment. I agree with you. I just wanted to confirm whether threading could improve the performance or not. For the reproduce.py test, the performance did not improve much even when threading was introduced. As you said, it should be avoided for multiprocessing applications.

@mpenkov
Collaborator

mpenkov commented Dec 6, 2017

Here are some more profiling results. They were obtained by:

$ time python -m cProfile -o 153.out reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2017-43/indexes/cluster.idx 100000
$ time python -m cProfile -o master.out reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2017-39/indexes/cluster.idx 100000

using the 6c22ed8 commit.

I used this reproduce.py:

import sys
import logging

from tqdm import tqdm
from smart_open import smart_open

logging.basicConfig(level=logging.CRITICAL)
length_sum = 0
limit = int(sys.argv[2])
for i, line in enumerate(tqdm(smart_open(sys.argv[1], 'rb'))):
    if i > limit:
        break
    length_sum += len(line)
print(length_sum / i)

I used different input files from the same bucket because I noticed there is caching behavior that confounds tests. So, for timing, it makes sense to change the input file (but keep the same number of iterations) between tests.

I found that for both 1.5.3 and the current master branch (mine), the majority of the time is being spent in SSLSocket.read. (top: 1.5.3, bottom: master)

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  2679   71.599    0.027   71.599    0.027 {method 'read' of '_ssl._SSLSocket' objects}

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1428  110.426    0.077  110.426    0.077 {method 'read' of '_ssl._SSLSocket' objects}

1.5.3 calls the method 2679 times, taking 0.027 s per call. master calls the method 1428 times, taking 0.077 s per call. This is rather odd, given that they're reading approximately the same amount of data.

Perhaps master has a buffer that is twice the size of 1.5.3's counterpart somewhere? That would explain the difference in the number of calls, but not why each call takes so long (2679/1428 ~= 2, but 0.077/0.027 ~= 3). I tried adjusting the buffer in the s3 module, but it appears to have no effect.

153.out.gz
master.out.gz

@menshikh-iv
Contributor

We need to continue the investigation here (but @mpenkov fixed it partially in #155).

menshikh-iv reopened this Dec 6, 2017
@otamachan

Hi,
I finally found a way to get not only seekability but also almost the same performance as 1.5.3: otamachan@8089460.
Does this work?

(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance2) 
$ git checkout 1.5.3
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open ((1.5.3)) 
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:45, 13091.32it/s]

real    0m46.958s
user    0m3.056s
sys     0m0.376s
598857it [00:26, 22613.33it/s]

real    0m27.647s
user    0m2.720s
sys     0m0.364s
598857it [00:25, 23576.58it/s]

real    0m26.655s
user    0m2.800s
sys     0m0.340s
598857it [00:21, 27806.55it/s]

real    0m22.771s
user    0m2.848s
sys     0m0.288s
598857it [00:35, 16639.95it/s]

real    0m37.176s
user    0m2.952s
sys     0m0.352s
$ git checkout improve-read-performance2
(venv) tamaki@tamaki-ThinkPad-X201 ~/proj/smart_open (improve-read-performance2) 
$ for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [00:31, 19086.47it/s]

real    0m34.027s
user    0m9.868s
sys     0m0.184s
598857it [00:26, 22517.10it/s]

real    0m29.215s
user    0m10.084s
sys     0m0.184s
598857it [00:26, 22240.85it/s]

real    0m29.382s
user    0m9.848s
sys     0m0.224s
598857it [00:37, 15814.44it/s]

real    0m40.821s
user    0m10.348s
sys     0m0.244s
598857it [00:26, 22711.12it/s]

real    0m28.834s
user    0m9.936s
sys     0m0.224s

@mpenkov
Collaborator

mpenkov commented Dec 6, 2017

@otamachan Thank you for your suggestion! I had a closer look at your implementation and finally realized what the remaining problem was. It isn't necessary to go back to boto to achieve the same performance: we can do the same thing with the newer boto3.

#157

Thank you for pointing me in the right direction. Thank you very much!

menshikh-iv pushed a commit that referenced this issue Dec 6, 2017
* Resolve #152: perform a .get only when necessary

The previous implementation performed a .get every time it read into the
buffer. This was wasteful and unnecessary. A single .get is sufficient to
read from the current position until the end of the file; another .get is
only necessary when seeking.

* decrease default buffer size to 128kB

This matches the behavior of 1.5.3 more closely, as observed during live
tests against S3.
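In outline, the approach described in the commit message looks something like the sketch below (hypothetical names, not the actual smart_open implementation): keep a single streaming ranged GET open from the current position, and only issue a new GET after a seek.

import io
import boto3

class LazyRangeReader(io.RawIOBase):
    """Sketch of "perform a .get only when necessary" (names are hypothetical)."""

    def __init__(self, bucket, key):
        self._s3 = boto3.client('s3')
        self._bucket, self._key = bucket, key
        self._position = 0
        self._body = None                      # opened lazily, on first read

    def _open_body(self):
        # One ranged GET that streams from the current position to the end.
        response = self._s3.get_object(
            Bucket=self._bucket, Key=self._key,
            Range='bytes=%d-' % self._position)
        self._body = response['Body']

    def read(self, size=-1):
        if self._body is None:
            self._open_body()
        data = self._body.read(None if size < 0 else size)
        self._position += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        # Seeking only invalidates the stream; the next read() issues a fresh GET.
        if whence == io.SEEK_SET:
            self._position = offset
        elif whence == io.SEEK_CUR:
            self._position += offset
        self._body = None
        return self._position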
@otamachan

@mpenkov I'm happy that the performance is now back thanks to your proper implementation, and that I could be of help. I really appreciate your great effort to improve this wonderful library!

Thank you, too!

@appierys
Author

appierys commented Dec 11, 2017

Not sure if it's just me, but I tested 1.5.5 and obtained the following results:

for i in 1 2 3 4 5; do time python reproduce.py s3://commoncrawl/cc-index/collections/CC-MAIN-2013-20/indexes/cluster.idx ; done
598857it [03:12, 3103.18it/s]

real    3m15.232s
user    0m17.400s
sys     0m0.800s
598857it [03:14, 3081.01it/s]

real    3m16.483s
user    0m16.724s
sys     0m0.812s
598857it [03:12, 3115.35it/s]
real    4m4.408s
user    0m17.616s
sys     0m0.848s

@otamachan

@appierys This was fixed by fbc82cc, and that commit is only in the master branch. Please try the master branch.
@menshikh-iv I would appreciate it if you would release 1.5.6, because the performance issue is not fully resolved in 1.5.5. Thanks in advance.

@appierys
Author

@otamachan Thanks~ It works.

@menshikh-iv
Contributor

@otamachan I'll release 1.5.6, but slightly later (after setting up the integration-testing pipeline).
