
GCP download very slow for slightly large files #555

Open
arunmk opened this issue Nov 6, 2020 · 14 comments

Comments


arunmk commented Nov 6, 2020

Problem description

I am trying to download a slightly large file (1.1GB), and the attached code using smart_open takes a long time (15m40s), while gsutil cp takes about 25s. Google's storage.Blob API is also quite fast (comparable to gsutil).

Steps/code to reproduce the problem

Code used:

import time
import sys
from smart_open import open as cloud_open

gcs_uri = "<redacted file name>"
dl_path = "./test.pkl"

current_secs_func = lambda: int(round(time.time()))

chunk_size = 256 * 1024 * 1024  # 256M
count = 0
with cloud_open(gcs_uri, mode="rb") as cloud_fd:  # Same slowness even with `transport_params={'min_part_size': chunk_size}`
    with open(dl_path, mode="wb+") as local_fd:
        print("Start time: ", current_secs_func())
        sys.stdout.flush()
        while True:
            current = current_secs_func()
            data = cloud_fd.read(chunk_size)
            print("Read chunk [{}] of at most size [{}] from [{}] to [{}] in [{}] secs".format(count, chunk_size, gcs_uri, dl_path, current_secs_func() - current))
            sys.stdout.flush()
            if not data:
                break
            count += 1
            current = current_secs_func()
            local_fd.write(data)
            print("Wrote chunk [{}] of at most size [{}] from [{}] to [{}] in [{}] secs".format(count, chunk_size, gcs_uri, dl_path, current_secs_func() - current))
            sys.stdout.flush()

Nearly every chunk read above takes close to 230s (writing to the output file on the local FS has sub-second latency).
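For reference, the google.cloud.storage comparison mentioned above looked roughly like this (a sketch; the bucket and object names are placeholders, not the real file):

import google.cloud.storage

client = google.cloud.storage.Client()
blob = client.bucket("my-bucket").get_blob("path/to/large-file.pkl")  # placeholder names
blob.download_to_filename("./test.pkl")  # finishes in roughly the same time as gsutil cp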

Versions

Please provide the output of:

Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.4.0-1011-gcp-x86_64-with-Ubuntu-20.04-focal
>>> print("Python", sys.version)
Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 3.0.0

arunmk commented Nov 6, 2020

I have tried various options, including passing transport params and reading the entire 1.1GB at once without chunking. They are all in a similar ballpark and very slow compared to gsutil. I also initially tried v2.1.0, which took similar times.
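The variations looked roughly like this (a sketch; the URI is a placeholder, and min_part_size was the only transport param I passed at the time):

from smart_open import open as cloud_open

gcs_uri = "gs://my-bucket/large-file.pkl"  # placeholder

# Variant 1: read the whole 1.1GB in one call, no chunk loop.
with cloud_open(gcs_uri, mode="rb") as f:
    data = f.read()

# Variant 2: pass a larger part size via transport_params (same slowness).
with cloud_open(gcs_uri, mode="rb",
                transport_params={"min_part_size": 256 * 1024 * 1024}) as f:
    data = f.read()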


mpenkov commented Nov 6, 2020

@petedannemann Are you able to investigate?

@petedannemann

pytest integration-tests/test_gcs.py::test_gcs_performance

--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance     2.1291  2.2363  2.1769  0.0431  2.1742  0.0688       2;0  0.4594       5           1
-------------------------------------------------------------------------------------------------------------

Yep, this is much slower than it should be. I remember running initial benchmarks during development and seeing numbers much lower than this. I'm not sure if something has changed in the code or if my memory is failing me / I ran improper benchmarks, but these numbers are definitely unacceptable. I can try to do some profiling soon to figure out where the bottlenecks are.


petedannemann commented Nov 6, 2020

pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

ncalls  tottime percall cumtime percall filename:lineno(function)
930     1.3052  0.0014  1.3052  0.0014  ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
5       0.3016  0.0603  0.3016  0.0603  ~:0(<method 'do_handshake' of '_ssl._SSLSocket' objects>)
5       0.2555  0.0511  0.2555  0.0511  ~:0(<method 'connect' of '_socket.socket' objects>)
12      0.1437  0.0120  0.1437  0.0120  ~:0(<method 'write' of '_ssl._SSLSocket' objects>)
5       0.0599  0.0120  0.0599  0.0120  ~:0(<method 'load_verify_locations' of '_ssl._SSLContext' objects>)

It seems like reading is the bottleneck and writing is performing fine. Since there are so many calls to read from sockets, it seems like buffering for reads is probably broken.
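For anyone who wants to reproduce this outside of pytest-benchmark, a minimal cProfile harness along these lines should give a similar picture (the URI is a placeholder):

import cProfile

from smart_open import open as cloud_open


def read_all(uri):
    # Read the whole object through smart_open's buffered reader.
    with cloud_open(uri, "rb") as f:
        return len(f.read())


cProfile.run('read_all("gs://my-bucket/some-object")', sort="tottime")  # placeholder URI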

@tylerkohn

Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks!

@petedannemann

Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks!

I don't have time right now to work on this. Feel free to look at it yourself. If you want an alternative, I'd suggest gcsfs.
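For reference, the gcsfs equivalent is roughly this (a sketch; the bucket and key are placeholders):

import gcsfs

fs = gcsfs.GCSFileSystem()
with fs.open("my-bucket/path/to/large-file.pkl", "rb") as f:  # placeholder path
    data = f.read()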


petedannemann commented Jan 14, 2021

Following up on this: buffering works as intended. I tested this by adding some logging to smart_open.gcs._RawReader:

def _download_blob_chunk(self, size):
    start = position = self._position
    if position == self._size:
        #
        # When reading, we can't seek to the first byte of an empty file.
        # Similarly, we can't seek past the last byte.  Do nothing here.
        #
        binary = b''
    elif size == -1:
        logger.info("downloaded")
        binary = self._blob.download_as_bytes(start=start)
    else:
        end = position + size
        logger.info(f"downloaded from bytes {start} to {end}")
        binary = self._blob.download_as_bytes(start=start, end=end)
    return binary

When reading a large file I see:

INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 0 to 262144
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 262144 to 524288
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 524288 to 786432
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 786432 to 1048576
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 1048576 to 1310720
...

Note that @arunmk seemed to be using min_part_size instead of buffer_size, so I'm not sure his benchmarking actually changed the read buffering at all. @arunmk, can you try again using the buffer_size param?
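Concretely, something like this (a sketch; the URI is a placeholder, and as far as I can tell min_part_size is only used on the write path):

from smart_open import open as cloud_open

chunk_size = 256 * 1024 * 1024  # 256M, matching the original repro

with cloud_open("gs://my-bucket/large-file.pkl", mode="rb",  # placeholder URI
                transport_params={"buffer_size": chunk_size}) as cloud_fd:
    data = cloud_fd.read(chunk_size)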


petedannemann commented Jan 14, 2021

Note that comparing speeds to gsutil is not an apples-to-apples comparison: gsutil is almost certainly using gRPC to download the file over a streaming connection.

For next steps I will profile how long google.cloud.storage.Blob.upload_from_string and google.cloud.storage.Blob.download_as_bytes take, to understand what overhead smart-open adds.


petedannemann commented Jan 14, 2021

Benchmark of smart_open reads (I commented out the file writing in read_write):

pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

-------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in ms)             Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------
test_gcs_performance     762.5399  942.5238  849.3104  65.7851  835.9554  76.6498       2;0  1.1774       5           1
-----------------------------------------------------------------------------------------------------------------------

Benchmark of google.cloud.storage.Blob.download_as_bytes (performed on the same file as above):

import google.cloud.storage


def download_blob(bucket, key):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()


def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    actual = benchmark(download_blob, bucket, key)

$ pytest --benchmark-cprofile=tottime benchmark_gcs_download.py

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance      802.8116  886.9596  834.0021  32.4173  823.4367  37.8678       1;0  1.1990       5           1
--------------------------------------------------------------------------------------------------------------------

Benchmark of smart-open writes (I commented out the file reading in read_write and added a large min_part_size):

$ pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance     1.0568  1.7717  1.2996  0.2871  1.2318  0.3631       1;0  0.7695       5           1
-------------------------------------------------------------------------------------------------------------

Benchmark of google.cloud.storage.Blob.upload_from_string (performed on the same file as above):

import google.cloud.storage


def upload_blob(bucket, key, data):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).blob(key)
    blob.upload_from_string(data)


def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()
    actual = benchmark(upload_blob, bucket, key, data)

$ pytest --benchmark-cprofile=tottime benchmark_gcs_upload.py

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance      712.8647  746.4870  731.4032  15.3004  732.1812  28.5586       1;0  1.3672       5           1
--------------------------------------------------------------------------------------------------------------------

Read speed for smart-open is basically the same as using the google.cloud.storage library. Increasing buffer_size should substantially speed up downloading larger files. Writing via smart-open is slower.
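If anyone wants to verify the buffer_size effect locally, a rough timing sketch like this should show it (the URI is a placeholder; it reads in small slices so the read buffer actually gets exercised):

import time

from smart_open import open as cloud_open

uri = "gs://my-bucket/large-file.pkl"  # placeholder

for params in (None, {"buffer_size": 64 * 1024 * 1024}):  # default buffer vs. 64 MiB
    start = time.time()
    total = 0
    with cloud_open(uri, "rb", transport_params=params) as f:
        while True:
            chunk = f.read(4 * 1024 * 1024)  # small slices force multiple buffered reads
            if not chunk:
                break
            total += len(chunk)
    print(params, total, "bytes in", round(time.time() - start, 1), "s")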


piskvorky commented Jan 14, 2021

Thanks for looking into this, @petedannemann.

Writing via smart-open is slower.

Do you know the cause?


arunmk commented Jan 14, 2021

Note that @arunmk seemed to be using min_part_size instead of buffer_size, so I'm not sure his benchmarking actually changed the read buffering at all. @arunmk, can you try again using the buffer_size param?

@petedannemann I initially used the API calls without the buffer_size parameter. Because they were extremely slow, I added the buffer_size parameter to check if it would help. I have also tried large buffer sizes (1GiB, etc.) and they did not help in any noticeable way.


arunmk commented Jan 15, 2021

@petedannemann could you also mention the version of the library you used? I am not able to get to that codebase anymore, but I remember you were also able to replicate the slowness, as per the initial comment.


tylerkohn commented Jan 15, 2021

Thanks @arunmk for following up on this. As Arun noted initially, the version we tested was 3.0.0, which is an older one now!

@q-aaronzolnailucas

Does anyone know if #599 and its resolution #729 solve the slow reading issue? I'm trying to weigh up the benefits of upgrading our infrastructure to use google-cloud-storage >= 2.6.0 for use with smart-open.
