GCP download very slow for slightly large files #555
Comments
I have tried various options, including using a transport and reading all 1.1 GB at once without chunking. They are all in a similar ballpark and very slow compared to gsutil. I also initially tried v2.1.0 and it took similar times. |
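For reference, the attempts described above correspond roughly to the following sketch; the bucket, key, and chunk size are placeholders, and the exact transport_params accepted vary by smart_open version:

```python
import google.cloud.storage
import smart_open

# Placeholder URI; assumes GCS credentials are already configured.
uri = 'gs://my-bucket/large-file.bin'
tp = {'client': google.cloud.storage.Client()}

# Chunked reads: each read() pulls the next slice of the object.
with smart_open.open(uri, 'rb', transport_params=tp) as fin:
    while fin.read(64 * 1024 * 1024):
        pass

# Reading the whole object in a single call was reported to be just as slow.
with smart_open.open(uri, 'rb', transport_params=tp) as fin:
    data = fin.read()
```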
@petedannemann Are you able to investigate? |
pytest integration-tests/test_gcs.py::test_gcs_performance
--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance 2.1291 2.2363 2.1769 0.0431 2.1742 0.0688 2;0 0.4594 5 1
-------------------------------------------------------------------------------------------------------------

Yep, this is much slower than it should be. I remember running initial benchmarks during development and seeing numbers much lower than this. I'm not sure if something has changed in the code or if my memory is failing me / I ran improper benchmarks, but these numbers are definitely unacceptable. I can try to do some profiling soon to figure out where the bottlenecks are. |
pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance
ncalls tottime percall cumtime percall filename:lineno(function)
930 1.3052 0.0014 1.3052 0.0014 ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
5 0.3016 0.0603 0.3016 0.0603 ~:0(<method 'do_handshake' of '_ssl._SSLSocket' objects>)
5 0.2555 0.0511 0.2555 0.0511 ~:0(<method 'connect' of '_socket.socket' objects>)
12 0.1437 0.0120 0.1437 0.0120 ~:0(<method 'write' of '_ssl._SSLSocket' objects>)
5 0.0599 0.0120 0.0599 0.0120 ~:0(<method 'load_verify_locations' of '_ssl._SSLContext' objects>)

Seems like reading is the bottleneck and writing is performing fine. Since we have so many calls to read from sockets, it seems like buffering for reads is probably broken. |
Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks! |
I don't have time right now to work on this. Feel free to look at it yourself. If you want an alternative, I'd suggest gcsfs |
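For anyone exploring that alternative, a minimal gcsfs read might look like the sketch below; the bucket and key are placeholders, and default Google credentials are assumed:

```python
import gcsfs

# gcsfs accepts 'bucket/key' or 'gs://bucket/key' style paths.
fs = gcsfs.GCSFileSystem()
with fs.open('my-bucket/large-file.bin', 'rb') as f:
    data = f.read()
```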
Follow up on this. Buffering works as intended. I tested this by adding some logging to _download_blob_chunk:

```python
def _download_blob_chunk(self, size):
    start = position = self._position
    if position == self._size:
        #
        # When reading, we can't seek to the first byte of an empty file.
        # Similarly, we can't seek past the last byte. Do nothing here.
        #
        binary = b''
    elif size == -1:
        logger.info("downloaded")
        binary = self._blob.download_as_bytes(start=start)
    else:
        end = position + size
        logger.info(f"downloaded from bytes {start} to {end}")
        binary = self._blob.download_as_bytes(start=start, end=end)
    return binary
```

When reading a large file I see:
Note that @arunmk seemed to be using |
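The log output itself is not preserved above, but the experiment can be reproduced roughly as follows; the URI and chunk size are placeholders:

```python
import logging
import smart_open

# Surface the logger.info() calls shown in _download_blob_chunk above
# while streaming a large object through smart_open.
logging.basicConfig(level=logging.INFO)

with smart_open.open('gs://my-bucket/large-file.bin', 'rb') as fin:
    while fin.read(1024 * 1024):  # read 1 MiB slices until EOF
        pass
```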
Note that comparing speeds to gsutil is not an apples-to-apples comparison. gsutil is almost certainly using gRPC to download the file over a streaming connection. For next steps I will profile how long |
benchmark of smart_open reads (I commented out the file writing in
benchmark of google.cloud.storage reads:

```python
import google.cloud.storage

def download_blob(bucket, key):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()

def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    actual = benchmark(download_blob, bucket, key)
```
benchmark of smart-open writes (I commented out the file reading in
benchmark of google.cloud.storage writes:

```python
import google.cloud.storage

def upload_blob(bucket, key, data):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).blob(key)
    blob.upload_from_string(data)

def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()
    actual = benchmark(upload_blob, bucket, key, data)
```
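(Both snippets use the benchmark fixture from the pytest-benchmark plugin, so they are run through pytest, as in the commands shown earlier in the thread.)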
Read speed for smart-open is basically the same as using the google.cloud.storage library. Increasing buffer_size should substantially speed up downloading larger files. Writing via smart-open is slower. |
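As a sketch of that buffer_size suggestion; the URI is a placeholder, and this assumes a smart_open version whose GCS transport still accepts buffer_size, as discussed elsewhere in this thread:

```python
import smart_open

# A larger buffer means fewer ranged downloads per object.
with smart_open.open(
        'gs://my-bucket/large-file.bin',
        'rb',
        transport_params={'buffer_size': 128 * 1024 * 1024},  # 128 MiB chunks
) as fin:
    data = fin.read()
```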
Thanks for looking into this @petedannemann.
Do you know the cause? |
@petedannemann I initially used the API calls without the buffer_size parameters. Because they were extremely slow, I added the buffer_size parameters to check if they would help. I have also tried large buffer sizes (1 GiB, etc.) and they did not help in any noticeable manner. |
@petedannemann could you also mention the version of the library used? I am not able to get to that codebase anymore, but I remember you also being able to replicate the slowness as per the initial comment. |
Thanks @arunmk for following up on this. As arun noted initially, the version we tested was 3.0.0, which is an older one now! |
Problem description
I am trying to download a slightly large file (1.1 GB) and the attached code with smart_open takes a long time (15m40s), while a gsutil cp takes about 25s. The storage.blob API of Google is also quite fast (and comparable to gsutil).

Steps/code to reproduce the problem
Code used:
Nearly each chunk read above takes close to 230s. (Write to output file on local FS has sub-second latency).
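As a rough illustration of the pattern described above (not the original attached code): chunked smart_open reads written to a local file, with placeholder bucket, key, paths, and chunk size:

```python
import smart_open

# Each fin.read() below corresponds to one "chunk read" in the timings above.
with smart_open.open('gs://my-bucket/large-file.bin', 'rb') as fin, \
        open('/tmp/large-file.bin', 'wb') as fout:
    while True:
        chunk = fin.read(64 * 1024 * 1024)
        if not chunk:
            break
        fout.write(chunk)
```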
Versions
Please provide the output of: