
Potential memory leak in MultipartWriter._upload_next_part #511

Open

SchopenhauerZhang opened this issue Jun 16, 2020 · 6 comments

SchopenhauerZhang commented Jun 16, 2020

MultipartWriter._upload_next_part memory leak

The MultipartWriter._upload_next_part function uploads the BytesIO buffer from memory to S3 (as one part of the multipart upload started by CreateMultipartUpload), but it does not release the BytesIO buffer in memory.
I use smart_open.open() to upload a .gz file to S3, and memory keeps increasing. I checked the code and focused on MultipartWriter._upload_next_part():

        self._buf.seek(0)
        part = self._mp.Part(part_num)
        upload = _retry_if_failed(functools.partial(part.upload, Body=self._buf))
        self._parts.append({'ETag': upload['ETag'], 'PartNumber': part_num})
        logger.debug("upload of part #%i finished" % part_num)

        self._total_parts += 1
        self._buf = io.BytesIO()

After the BytesIO buffer is uploaded, its memory is not released.

Proposed fix:

        self._buf.seek(0)
        part = self._mp.Part(part_num)
        upload = _retry_if_failed(functools.partial(part.upload, Body=self._buf))
        self._parts.append({'ETag': upload['ETag'], 'PartNumber': part_num})
        logger.debug("upload of part #%i finished" % part_num)

        self._buf.truncate()  # release the buffer's contents to avoid the memory leak
        self._total_parts += 1
        self._buf = io.BytesIO()
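As an aside (not part of the original report): sys.getsizeof on a BytesIO includes its internal buffer, so it offers a quick way to check whether a given change actually releases the memory. A minimal sketch:

    import io
    import sys

    buf = io.BytesIO()
    buf.write(b'x' * (10 * 1024 * 1024))   # simulate one buffered part (~10 MB)
    print(sys.getsizeof(buf))              # roughly 10 MB: the internal buffer is counted

    buf.seek(0)
    buf.truncate()                         # drop the contents but keep the object
    print(sys.getsizeof(buf))              # back down to a small constant

    buf = io.BytesIO()                     # rebinding also frees the old buffer in CPython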

Versions

Python 3.7
smart_open 2.0.0

Checklist

https://docs.python.org/zh-cn/3/library/io.html

mpenkov (Collaborator) commented Jun 16, 2020

Thank you for reporting this.

self._buf.truncate()  # (1)
self._buf = io.BytesIO()  # (2)

Interesting, wouldn't the garbage collector take care of this? We're throwing away the old value of self._buf entirely in (2). Perhaps something else is holding on to a reference to the old self._buf somewhere...

I will investigate. Thanks again.
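One quick way to test the "something else is holding a reference" hypothesis, sketched here purely for illustration:

    import gc
    import io
    import sys

    buf = io.BytesIO(b'x' * 1024)   # stand-in for the old self._buf
    # getrefcount counts its own argument too, so 2 means only `buf` holds it.
    print(sys.getrefcount(buf))
    # Everything still pointing at the buffer shows up here (for a module-level
    # variable that includes the module's globals dict); a lingering boto3/botocore
    # object would be visible the same way.
    for holder in gc.get_referrers(buf):
        print(type(holder))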

mpenkov added the bug label Jun 16, 2020
mpenkov self-assigned this Jun 16, 2020
SchopenhauerZhang (Author) commented:

wouldn't the garbage collector take care of this?

Hmm, I use smart_open in a high-concurrency, high-throughput distributed system. Waiting for Python's garbage collector to deal with it is not a good idea; I would rather reclaim the memory explicitly. I will try that and post the result here later. Thanks!

mpenkov changed the title from "BUG: MultipartWriter._upload_next_part Memory leak" to "Potential memory leak in MultipartWriter._upload_next_part" Jun 16, 2020
piskvorky (Owner) commented Jun 16, 2020

Waiting for Python's garbage collector to deal with it is not a good idea.

If you use CPython (the standard Python), there is no "wait": memory is reclaimed as soon as it is no longer needed, i.e. as soon as its reference count drops to zero.

There are some exceptions involving reference cycles, but those delay collection rather than leak. I'd expect some other code (boto?) is holding on to the reference longer than it should.
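To make that concrete (an illustrative sketch, not from the thread): rebinding the only reference to a BytesIO frees it immediately in CPython, which a weak reference makes easy to observe.

    import io
    import weakref

    buf = io.BytesIO(b'x' * (50 * 1024 * 1024))   # roughly one 50 MB part
    probe = weakref.ref(buf)                      # lets us observe when the object dies

    buf = io.BytesIO()   # drop the only strong reference, as _upload_next_part does
                         # when it reassigns self._buf
    print(probe())       # None: the old buffer was freed immediately by reference
                         # counting, without waiting for a cyclic GC pass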

cnmoise commented Jan 30, 2021

Hello! Any updates on this? I'm trying to stream large files using this library but it seems that the whole file is being kept in memory before it can be written to S3, which negates my reason for wanting to use this library in the first place.

I did a memory benchmark for a 140.6 MB file using memory_profiler, and it seems that the whole file was held in memory for the duration:
[screenshot: memory_profiler output for the smart_open run]

I tried to implement the fix mentioned by OP, but that doesn't seem to have prevented the memory leak.

By comparison, if I use the vanilla Python open, I use much less memory:
[screenshot: memory_profiler output for the vanilla open() run]
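For anyone who wants to reproduce this kind of measurement, the setup can look roughly like the sketch below; the file name and bucket are placeholders, and this is only an approximation of the benchmark described above, not the exact script.

    # Run as a normal script (with memory_profiler installed); the @profile
    # decorator prints a line-by-line memory report when upload() returns.
    from memory_profiler import profile
    import smart_open

    @profile
    def upload(local_path, s3_url):
        # Stream the local file to S3 in 1 MB chunks instead of reading it whole.
        with open(local_path, 'rb') as fin, smart_open.open(s3_url, 'wb') as fout:
            for chunk in iter(lambda: fin.read(1024 * 1024), b''):
                fout.write(chunk)

    if __name__ == '__main__':
        upload('large_file.bin', 's3://my-bucket/large_file.bin')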

mpenkov (Collaborator) commented Jan 31, 2021

it seems that the whole file is being kept in memory before it can be written to S3

I'd be surprised if that's really the case. I think it's more likely the entire part (a single part of the multipart upload) is being kept in memory. The default part size is 50MB. You can tweak it by passing min_part_size as a transport_param (please see the README).
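For illustration, lowering the part size looks roughly like this (exact parameter handling has varied a little between smart_open versions, so treat it as a sketch; the bucket and key are placeholders):

    import smart_open

    # 5 MB is the smallest part size S3 accepts for multipart uploads.
    tp = {'min_part_size': 5 * 1024 ** 2}

    with smart_open.open('s3://my-bucket/big-file.gz', 'wb', transport_params=tp) as fout:
        fout.write(b'...')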

Also, check out #547 - it has the option of buffering to disk instead of in memory. It's not ready to merge yet, unfortunately.

mpenkov (Collaborator) commented Oct 10, 2021

@cnmoise Can you reproduce the problem with the newest version of smart_open? We've reworked the buffering in the S3 submodule significantly, and it's likely the problem is no longer there.
