-
Notifications
You must be signed in to change notification settings - Fork 657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimalPartInfo() default partSize results in running out of memory #730
Comments
(In my test doing minPartSize chunks also resulted in greater speed, so I'm not sure why larger sizes are considered more optimal?) |
You should always make sure that the size of the stream is known for AWS S3 APIs to work in optimal manner. If not we fallback to a buffering technique where maximum object chosen is 5TB (lack of knowing file size) - If you do know the file size wrap it around into a struct with So this is an expected behavior. Also the buffering technique only uses 576MB re-usable buffer are you saying you don't have 1GB > memory on the system where you are using this? |
@harshavardhana I don't know the size of the file beforehand; the use case here is creating data that is too big to fit on the local hard drive and instead directly streaming it to S3. From my brief reading of the code a new ~600MB buffer is created for every call to PutObjectStreaming(), so if someone wanted to stream multiple files simultaneously they'd run out of memory even on large memory systems. But I'm running out of memory streaming just 1 file on a 4GB system. According to
Now, during the stream of a 1GB file I go from 68MB used to 1.3GB used (2.3GB available) during the stream, and don't run out of memory. (If it makes any difference, I'm using |
It is incorrect to assume a lower size just to make sure that it fits into your memory. I agree that the crash is incorrect and at max we should only be using 600MB max. Unfortunately on AWS S3 it is not possible to upload a file for which content length is not known. So we optimize based on available information to provide a best effort mechanism.
I see now you are right and it should be avoided, there should be a fixed buffer which should be re-used. The problem is caused by usage of |
This is not fixed. It actually got worse (my workaround of calling debug.FreeOSMemory() before doing the next thing no longer works because now it fails directly in the minio-go function call). Tested with latest commit 1a09415 .
Where mycode line 290 is:
|
The memory is protected by pooling the buffers you can't use I would suggest an architecture change on your end from what you are doing right now and not provide minio-go a stream with no length. If you do provide final length then we do know how much memory to allocate optimally. You can implement a custom |
I'll look in to exactly what minio-go is doing tomorrow, but what stops you from uploading in user-defined or small chunks? If you only had a few 64MB buffers, I wouldn't run out of memory. Is there a provable benefit to having buffers larger than this size? |
APIs need to be changed to address this and this would come across as a low-level API. Now if you do know that 64MB is good enough per chunk - then you are sort of aware of the final length of the object that you are going to upload. So i would suggest wrap the reader and implement Size() function which would then be used internally to calculate the optimal chunk. If you wish to use low level APIs then you can simply use http://godoc.org/github.com/minio/minio-go#Core - here chunking and memory management is all yours - you can decide when to do multipart etc.
The API needs to cater for all use cases, if it supports no-length io.Reader then it should support uploading till 5TB (maximum allowable object size for AWS S3) in such a case 500MB+ per part is the most optimal number based on the 10,000 part limit of AWS S3. |
I have no idea about the file size; files are being streamed to me externally. In terms of catering for all use cases, how about limiting the size and number of buffers in the pool based on available memory? I feel like it's more useful to avoid crashing than it is to be able to upload a 5TB file. After all, you can't upload a 5TB file if you crash. The problem I was having earlier was when I only had a few hundred MB of memory free on my machine. The very first attempt to create the ~600MB buffer failed. But on a fresh 4GB machine with no other user processes running:
And having altered minio-go like this:
Trying to upload 2 (~10 byte) files (sequentially) normally results in:
But sometimes, just running the same code over and over, I get:
I don't really understand where the kill signal is coming from, nor why a second buffer gets created before I see the log call for trying to stream the second file. And, since I'm uploading files sequentially, why doesn't the first buffer get reused? Isn't that the point of having a pool? Why do I need to spend 1.2GB of memory to upload ~20bytes of data? |
Here's code that triggers the issue:
If I run
But sometimes get:
So it gets killed 100% of the time, but sometimes creates the second buffer early. Why is it creating a second buffer? |
(It seems to always work fine without |
It looks like the first and single call to From the docs for sync.Pool:
So maybe minio-go's behaviour isn't so surprising. As such I'd suggest that sync.Pool is not the ideal way to implement this. It would be great if instead your pool guaranteed to return any previously created buffer if one is available, and didn't create more buffers than can fit in memory. |
Always return buffers to the pool.
I was wrong about a single Fixing this doesn't solve my issue though, since sync.Pool still creates unnecessary buffers. Please see my pull request for a pool implementation that works with the above code. |
Always return buffers to the pool.
Always return buffers to the pool.
Always return buffers to the pool.
Always return buffers to the pool. New ReleaseBuffers() method to empty the pool and trigger garbage collection and release of memory to the OS.
Always return buffers to the pool. New ReleaseBuffers() method to empty the pool and trigger garbage collection and release of memory to the OS.
runtime/debug.FreeOSMemory() is used to ensure that after a write, the buffer is immediately released so that subsequent operations do not run out of memory.
Like #404, I'm streaming data which could be over 1GB and trying to use PutObjectStreaming(), but this results in optimalPartInfo(-1) being called and a partSize of 603979776 being returned.
The problem I face is that this partSize is used as, essentially, the size of the read buffer, and I'm running out of memory (I end up with errors like
fork/exec ...: cannot allocate memory
after the stream completes).I've tried setting partSize = minPartSize in optimalPartInfo() when the input size is -1, and this solves my memory issue.
Regardless of exactly how it's done, can we get (or choose) a reasonable sized read buffer during PutObjectStreaming() so that we can stream in constant small amount of memory?
The text was updated successfully, but these errors were encountered: