
GCP download very slow for slightly large files #555

Open
arunmk opened this issue Nov 6, 2020 · 14 comments

Comments


arunmk commented Nov 6, 2020

Problem description

I am trying to download a slightly large file (1.1GB), and the attached code using smart_open takes a long time (15m40s), while gsutil cp takes about 25s. Google's storage.Blob API is also quite fast (comparable to gsutil).

Steps/code to reproduce the problem

Code used:

import time
import sys
from smart_open import open as cloud_open

gcs_uri = "<redacted file name>"
dl_path = "./test.pkl"

current_secs_func = lambda: int(round(time.time()))

chunk_size = 256 * 1024 * 1024  # 256M
count = 0
with cloud_open(gcs_uri, mode="rb") as cloud_fd:  # Same slowness even with `transport_params={'min_part_size': chunk_size}`
    with open(dl_path, mode="wb+") as local_fd:
        print("Start time: ", current_secs_func())
        sys.stdout.flush()
        while True:
            current = current_secs_func()
            data = cloud_fd.read(chunk_size)
            print("Read chunk [{}] of at most size [{}] from [{}] to [{}] in [{}] secs".format(count, chunk_size, gcs_uri, dl_path, current_secs_func() - current))
            sys.stdout.flush()
            if not data:
                break
            count += 1
            current = current_secs_func()
            local_fd.write(data)
            print("Wrote chunk [{}] of at most size [{}] from [{}] to [{}] in [{}] secs".format(count, chunk_size, gcs_uri, dl_path, current_secs_func() - current))
            sys.stdout.flush()

Nearly every chunk read above takes close to 230s (writing to the output file on the local FS has sub-second latency).
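For reference, the google.cloud.storage comparison mentioned above looked roughly like this (a sketch; the bucket and object names are placeholders, not the real file):

import google.cloud.storage

client = google.cloud.storage.Client()
blob = client.bucket("my-bucket").get_blob("path/to/large-file.pkl")  # placeholder names
blob.download_to_filename("./test.pkl")  # finishes in roughly the same time as gsutil cp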

Versions

Please provide the output of:

Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.4.0-1011-gcp-x86_64-with-Ubuntu-20.04-focal
>>> print("Python", sys.version)
Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 3.0.0

arunmk commented Nov 6, 2020

I have tried various options, including passing transport params and reading the entire 1.1GB at once without chunking. They are all in a similar ballpark and very slow compared to gsutil. I also initially tried v2.1.0, which took similar times.
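The variations looked roughly like this (a sketch; the URI is a placeholder, and min_part_size was the only transport param I passed at the time):

from smart_open import open as cloud_open

gcs_uri = "gs://my-bucket/large-file.pkl"  # placeholder

# Variant 1: read the whole 1.1GB in one call, no chunk loop.
with cloud_open(gcs_uri, mode="rb") as f:
    data = f.read()

# Variant 2: pass a larger part size via transport_params (same slowness).
with cloud_open(gcs_uri, mode="rb",
                transport_params={"min_part_size": 256 * 1024 * 1024}) as f:
    data = f.read()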


mpenkov commented Nov 6, 2020

@petedannemann Are you able to investigate?

@petedannemann

pytest integration-tests/test_gcs.py::test_gcs_performance

--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance     2.1291  2.2363  2.1769  0.0431  2.1742  0.0688       2;0  0.4594       5           1
-------------------------------------------------------------------------------------------------------------

Yep, this is much slower than it should be. I remember running initial benchmarks during development and seeing numbers much lower than this. I'm not sure if something has changed in the code or if my memory is failing me / I ran improper benchmarks, but these numbers are definitely unacceptable. I can try to do some profiling soon to figure out where the bottlenecks are.


petedannemann commented Nov 6, 2020

pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

ncalls  tottime percall cumtime percall filename:lineno(function)
930     1.3052  0.0014  1.3052  0.0014  ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
5       0.3016  0.0603  0.3016  0.0603  ~:0(<method 'do_handshake' of '_ssl._SSLSocket' objects>)
5       0.2555  0.0511  0.2555  0.0511  ~:0(<method 'connect' of '_socket.socket' objects>)
12      0.1437  0.0120  0.1437  0.0120  ~:0(<method 'write' of '_ssl._SSLSocket' objects>)
5       0.0599  0.0120  0.0599  0.0120  ~:0(<method 'load_verify_locations' of '_ssl._SSLContext' objects>)

It seems like reading is the bottleneck and writing is performing fine. Since there are so many calls to read from sockets, it seems like buffering for reads is probably broken.
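For anyone who wants to reproduce this outside of pytest-benchmark, a minimal cProfile harness along these lines should give a similar picture (the URI is a placeholder):

import cProfile

from smart_open import open as cloud_open


def read_all(uri):
    # Read the whole object through smart_open's buffered reader.
    with cloud_open(uri, "rb") as f:
        return len(f.read())


cProfile.run('read_all("gs://my-bucket/some-object")', sort="tottime")  # placeholder URI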

@tylerkohn

Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks!

@petedannemann

Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks!

I don't have time right now to work on this. Feel free to look at it yourself. If you want an alternative, I'd suggest gcsfs.
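For reference, the gcsfs equivalent is roughly this (a sketch; the bucket and key are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem()
with fs.open("my-bucket/path/to/large-file.pkl", "rb") as f:  # placeholder path
    data = f.read()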


petedannemann commented Jan 14, 2021

Following up on this: buffering works as intended. I tested this by adding some logging to smart_open.gcs._RawReader:

def _download_blob_chunk(self, size):
    start = position = self._position
    if position == self._size:
        #
        # When reading, we can't seek to the first byte of an empty file.
        # Similarly, we can't seek past the last byte.  Do nothing here.
        #
        binary = b''
    elif size == -1:
        logger.info("downloaded")
        binary = self._blob.download_as_bytes(start=start)
    else:
        end = position + size
        logger.info(f"downloaded from bytes {start} to {end}")
        binary = self._blob.download_as_bytes(start=start, end=end)
    return binary

When reading a large file I see:

INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 0 to 262144
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 262144 to 524288
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 524288 to 786432
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 786432 to 1048576
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 1048576 to 1310720
...

Note that @arunmk seemed to be using min_part_size instead of buffer_size, so I'm not sure his benchmarking actually changed the read buffering at all. @arunmk, can you try again using the buffer_size param?
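Concretely, something like this (a sketch; the URI is a placeholder, and as far as I can tell min_part_size is only used on the write path):

from smart_open import open as cloud_open

chunk_size = 256 * 1024 * 1024  # 256M, matching the original repro

with cloud_open("gs://my-bucket/large-file.pkl", mode="rb",  # placeholder URI
                transport_params={"buffer_size": chunk_size}) as cloud_fd:
    data = cloud_fd.read(chunk_size)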


petedannemann commented Jan 14, 2021

Note that comparing speeds to gsutil is not an apples-to-apples comparison: gsutil is almost certainly using gRPC to download the file over a streaming connection.

For next steps I will profile how long google.cloud.storage.Blob.upload_from_string and google.cloud.storage.Blob.download_as_bytes take, to understand what overhead smart-open adds.


petedannemann commented Jan 14, 2021

Benchmark of smart_open reads (I commented out the file writing in read_write):

pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

-------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in ms)             Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------
test_gcs_performance     762.5399  942.5238  849.3104  65.7851  835.9554  76.6498       2;0  1.1774       5           1
-----------------------------------------------------------------------------------------------------------------------

Benchmark of google.cloud.storage.Blob.download_as_bytes (performed on the same file as above):

import google.cloud.storage


def download_blob(bucket, key):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()


def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    actual = benchmark(download_blob, bucket, key)

$ pytest --benchmark-cprofile=tottime benchmark_gcs_download.py

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance      802.8116  886.9596  834.0021  32.4173  823.4367  37.8678       1;0  1.1990       5           1
--------------------------------------------------------------------------------------------------------------------

Benchmark of smart-open writes (I commented out the file reading in read_write and added a large min_part_size):

$ pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance     1.0568  1.7717  1.2996  0.2871  1.2318  0.3631       1;0  0.7695       5           1
-------------------------------------------------------------------------------------------------------------

Benchmark of google.cloud.storage.Blob.upload_from_string (performed on the same file as above):

import google.cloud.storage


def upload_blob(bucket, key, data):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).blob(key)
    blob.upload_from_string(data)


def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()
    actual = benchmark(upload_blob, bucket, key, data)

$ pytest --benchmark-cprofile=tottime benchmark_gcs_upload.py

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance      712.8647  746.4870  731.4032  15.3004  732.1812  28.5586       1;0  1.3672       5           1
--------------------------------------------------------------------------------------------------------------------

Read speed for smart-open is basically the same as using the google.cloud.storage library. Increasing buffer_size should substantially speed up downloading larger files. Writing via smart-open is slower.
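If anyone wants to verify the buffer_size effect locally, a rough timing sketch like this should show it (the URI is a placeholder; it reads in small slices so the read buffer actually gets exercised):

import time

from smart_open import open as cloud_open

uri = "gs://my-bucket/large-file.pkl"  # placeholder

for params in (None, {"buffer_size": 64 * 1024 * 1024}):  # default buffer vs. 64 MiB
    start = time.time()
    total = 0
    with cloud_open(uri, "rb", transport_params=params) as f:
        while True:
            chunk = f.read(4 * 1024 * 1024)  # small slices force multiple buffered reads
            if not chunk:
                break
            total += len(chunk)
    print(params, total, "bytes in", round(time.time() - start, 1), "s")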


piskvorky commented Jan 14, 2021

Thanks for looking into this, @petedannemann.

Writing via smart-open is slower.

Do you know the cause?


arunmk commented Jan 14, 2021

Note that @arunmk seemed to be using min_part_size instead of buffer_size, so I'm not sure his benchmarking actually changed the read buffering at all. @arunmk, can you try again using the buffer_size param?

@petedannemann I initially used the API calls without the buffer_size parameter. Because they were extremely slow, I added the buffer_size parameter to check if it would help. I have also tried large buffer sizes (1GiB, etc.) and they did not help in any noticeable way.


arunmk commented Jan 15, 2021

@petedannemann could you also mention the version of the library you used? I am not able to get to that codebase anymore, but I remember you were also able to replicate the slowness, as per the initial comment.


tylerkohn commented Jan 15, 2021

Thanks @arunmk for following up on this. As Arun noted initially, the version we tested was 3.0.0, which is an older one now!

@q-aaronzolnailucas

Does anyone know if #599 and its resolution #729 solve the slow reading issue? I'm trying to weigh up the benefits of upgrading our infrastructure to use google-cloud-storage >= 2.6.0 for use with smart-open.
