Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance due to lack of buffering for GzipFile.write #760

Open
svisser opened this issue Feb 17, 2023 · 6 comments
Open

Slow performance due to lack of buffering for GzipFile.write #760

svisser opened this issue Feb 17, 2023 · 6 comments

Comments

@svisser
Copy link

svisser commented Feb 17, 2023

Problem description

When writing a .csv.gz file to S3, performance was slow as it's calling GzipFile.write() for every single line in the CSV file. This is not the issue itself but there is also a lack of buffering in GzipFile.write() (see: python/cpython#89550) and I was able to improve performance by implementing a solution similar to python/cpython#101251 (i.e., register a custom compression handler for the '.gz' file extension in which GzipFile.write does have buffering).

I'm opening this issue for the smart_open library to discuss:

  • Should the smart_open library make it easier to enable and/or configure buffering for GzipFile.write() for Python versions that don't yet have the above fix?
  • If not, should we document this or make it clear to users of the smart_open library that performance can be improved by adding buffering yourself?

Steps/code to reproduce the problem

Use smart_open.open in write mode for a .csv.gz file to S3:

import csv

import smart_open

column_names = ("a", "b")
my_data = [.......]
with smart_open.open("s3://.../example.csv.gz", "wb", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=column_names)
    writer.writerows(my_data)

(writerows, in C, calls writerow for each line, which in turn call Gzip.write() for each line)

Versions

  • Python 3.8
  • smart_open 6.2.0
@rhpvorderman
Copy link

rhpvorderman commented May 17, 2023

What we do in xopen is simply put all GzipFile stuff in a io.BufferedWriter and return that to the user. Should be a one or two-liner fix for the smart_open people.

I'd like to plug xopen here, it has a much less broad focus than smart_open, but as a result it is very fast, especially for any gzip related work. I notice that the smart_open library does not use any accelerated libraries such as python-isal. Those can have great impact on performance.

@mpenkov
Copy link
Collaborator

mpenkov commented May 18, 2023

@svisser Are you interested in making a PR to implement @rhpvorderman 's suggestion?

@piskvorky
Copy link
Owner

piskvorky commented May 18, 2023

Is smart_open core the right place for such buffering? IMO that should either go directly in Python (as your CPython ticket suggests), or live in user-land as a custom registered extension handle (as you say).

But I can also see how that's a pretty common use-case, and smart_open does focus on performance. So it's not a bad fit. So I'm on the fence here. How far do we want to take this? Implement buffering / paralellized .bz2, .xz too? Support custom random-access seeking via igzip? Then we may end up maintaining a lot of compatibility / 3rd party API stuff.

Not an issue with just buffered write to .gz. Just thinking ahead about the core mission, scope.

@mpenkov
Copy link
Collaborator

mpenkov commented May 18, 2023

I think just do the bare minimum to address the issue: add buffering to gzip as an edge case. Once the CPython guys fix this on their side we can remove the buffering on our side.

@rhpvorderman
Copy link

@mpenkov The lines in xopen are here https://github.com/pycompression/xopen/blob/cc75cbcd17130a4c004cf44c46121da51be582d8/src/xopen/__init__.py#LL1297C1-L1309C23

@rbramley
Copy link

rbramley commented Dec 1, 2023

This fix could be really important for my similar use case, streaming large gzipped tar files from an HTTPS endpoint to an S3 bucket. The throughput I've been seeing with smart_open (6.4.0 with Python 3.10) is inadequate, taking around 130 seconds per ~50MB part, meaning over 8 hours to transfer an 11GB file!

Previously we'd streamed files to temporary storage before performing a multipart upload to S3. This was very quick, given sufficient disk space, as there is a lot less overhead than with the current smart_open implementation. Tackling a new larger dataset prompted the move to chaining the streams as the cloud service has disk space constraints that are not sufficient for the largest files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants