
Potential memory leak in MultipartWriter._upload_next_part #511

Open

SchopenhauerZhang opened this issue Jun 16, 2020 · 6 comments

SchopenhauerZhang commented Jun 16, 2020

MultipartWriter._upload_next_part memory leak

The MultipartWriter._upload_next_part function uploads the BytesIO buffer from memory to S3 (as one part of the multipart upload started by CreateMultipartUpload), but it does not release the BytesIO buffer in memory.
I use smart_open.open() to upload a .gz file to S3, and memory keeps increasing. I checked the code and focused on MultipartWriter._upload_next_part():

        self._buf.seek(0)
        part = self._mp.Part(part_num)
        upload = _retry_if_failed(functools.partial(part.upload, Body=self._buf))
        self._parts.append({'ETag': upload['ETag'], 'PartNumber': part_num})
        logger.debug("upload of part #%i finished" % part_num)

        self._total_parts += 1
        self._buf = io.BytesIO()

After the BytesIO buffer is uploaded, its memory is not released.

Proposed fix:

        self._buf.seek(0)
        part = self._mp.Part(part_num)
        upload = _retry_if_failed(functools.partial(part.upload, Body=self._buf))
        self._parts.append({'ETag': upload['ETag'], 'PartNumber': part_num})
        logger.debug("upload of part #%i finished" % part_num)

        self._buf.truncate()  # release the buffer's contents to avoid the memory leak
        self._total_parts += 1
        self._buf = io.BytesIO()
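As an aside (not part of the original report): sys.getsizeof on a BytesIO includes its internal buffer, so it offers a quick way to check whether a given change actually releases the memory. A minimal sketch:

    import io
    import sys

    buf = io.BytesIO()
    buf.write(b'x' * (10 * 1024 * 1024))   # simulate one buffered part (~10 MB)
    print(sys.getsizeof(buf))              # roughly 10 MB: the internal buffer is counted

    buf.seek(0)
    buf.truncate()                         # drop the contents but keep the object
    print(sys.getsizeof(buf))              # back down to a small constant

    buf = io.BytesIO()                     # rebinding also frees the old buffer in CPython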

Versions

Python 3.7
smart_open 2.0.0

Checklist

https://docs.python.org/zh-cn/3/library/io.html

mpenkov (Collaborator) commented Jun 16, 2020

Thank you for reporting this.

self._buf.truncate()  # (1)
self._buf = io.BytesIO()  # (2)

Interesting, wouldn't the garbage collector take care of this? We're throwing away the old value of self._buf entirely in (2). Perhaps something else is holding on to a reference to the old self._buf somewhere...

I will investigate. Thanks again.
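One quick way to test the "something else is holding a reference" hypothesis, sketched here purely for illustration:

    import gc
    import io
    import sys

    buf = io.BytesIO(b'x' * 1024)   # stand-in for the old self._buf
    # getrefcount counts its own argument too, so 2 means only `buf` holds it.
    print(sys.getrefcount(buf))
    # Everything still pointing at the buffer shows up here (for a module-level
    # variable that includes the module's globals dict); a lingering boto3/botocore
    # object would be visible the same way.
    for holder in gc.get_referrers(buf):
        print(type(holder))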

mpenkov added the bug label Jun 16, 2020
mpenkov self-assigned this Jun 16, 2020
SchopenhauerZhang (Author) commented:

wouldn't the garbage collector take care of this?

Hmm, I use smart_open in a high-concurrency, high-throughput distributed system. Waiting for Python's garbage collector to deal with it is not a good idea; I would rather reclaim the memory explicitly. I will try that and post the result here later. Thanks!

mpenkov changed the title from "BUG: MultipartWriter._upload_next_part Memory leak" to "Potential memory leak in MultipartWriter._upload_next_part" Jun 16, 2020
piskvorky (Owner) commented Jun 16, 2020

Waiting for Python's garbage collector to deal with it is not a good idea.

If you use CPython (the standard Python), there is no "wait": memory is reclaimed as soon as it is no longer needed, i.e. as soon as its reference count drops to zero.

There are some exceptions involving reference cycles, but those delay collection rather than leak. I'd expect some other code (boto?) is holding on to the reference longer than it should.
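To make that concrete (an illustrative sketch, not from the thread): rebinding the only reference to a BytesIO frees it immediately in CPython, which a weak reference makes easy to observe.

    import io
    import weakref

    buf = io.BytesIO(b'x' * (50 * 1024 * 1024))   # roughly one 50 MB part
    probe = weakref.ref(buf)                      # lets us observe when the object dies

    buf = io.BytesIO()   # drop the only strong reference, as _upload_next_part does
                         # when it reassigns self._buf
    print(probe())       # None: the old buffer was freed immediately by reference
                         # counting, without waiting for a cyclic GC pass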

cnmoise commented Jan 30, 2021

Hello! Any updates on this? I'm trying to stream large files using this library but it seems that the whole file is being kept in memory before it can be written to S3, which negates my reason for wanting to use this library in the first place.

I did a memory benchmark for a 140.6 MB file using memory_profiler, and it seems that the whole file was held in memory for the duration:
[screenshot: memory_profiler output for the smart_open run]

I tried to implement the fix mentioned by OP, but that doesn't seem to have prevented the memory leak.

By comparison, if I use the vanilla Python open, I use much less memory:
[screenshot: memory_profiler output for the vanilla open() run]
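For anyone who wants to reproduce this kind of measurement, the setup can look roughly like the sketch below; the file name and bucket are placeholders, and this is only an approximation of the benchmark described above, not the exact script.

    # Run as a normal script (with memory_profiler installed); the @profile
    # decorator prints a line-by-line memory report when upload() returns.
    from memory_profiler import profile
    import smart_open

    @profile
    def upload(local_path, s3_url):
        # Stream the local file to S3 in 1 MB chunks instead of reading it whole.
        with open(local_path, 'rb') as fin, smart_open.open(s3_url, 'wb') as fout:
            for chunk in iter(lambda: fin.read(1024 * 1024), b''):
                fout.write(chunk)

    if __name__ == '__main__':
        upload('large_file.bin', 's3://my-bucket/large_file.bin')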

mpenkov (Collaborator) commented Jan 31, 2021

it seems that the whole file is being kept in memory before it can be written to S3

I'd be surprised if that's really the case. I think it's more likely the entire part (a single part of the multipart upload) is being kept in memory. The default part size is 50MB. You can tweak it by passing min_part_size as a transport_param (please see the README).
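For illustration, lowering the part size looks roughly like this (exact parameter handling has varied a little between smart_open versions, so treat it as a sketch; the bucket and key are placeholders):

    import smart_open

    # 5 MB is the smallest part size S3 accepts for multipart uploads.
    tp = {'min_part_size': 5 * 1024 ** 2}

    with smart_open.open('s3://my-bucket/big-file.gz', 'wb', transport_params=tp) as fout:
        fout.write(b'...')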

Also, check out #547 - it has the option of buffering to disk instead of in memory. It's not ready to merge yet, unfortunately.

mpenkov (Collaborator) commented Oct 10, 2021

@cnmoise Can you reproduce the problem with the newest version of smart_open? We've reworked the buffering in the S3 submodule significantly, and it's likely the problem is no longer there.
