Optimise memory allocation for s3 sending #1098

Closed
lstrzebinczyk opened this issue Feb 20, 2016 · 5 comments
Labels
feature-request A feature should be added or improved.

Comments

@lstrzebinczyk

Hello. One of the processes in my application sends a lot of data to S3 using this gem, and I noticed that two methods allocate a significant amount of memory:

https://github.com/aws/aws-sdk-ruby/blob/master/aws-sdk-core/lib/aws-sdk-core/plugins/s3_md5s.rb#L36
https://github.com/aws/aws-sdk-ruby/blob/master/aws-sdk-core/lib/aws-sdk-core/signers/v4.rb#L219

I experimented with them and came up with the following implementation:

  # SHA256 hex digest: let Digest stream files from disk instead of
  # reading the whole body into memory.
  def hexdigest(value)
    if value.is_a?(File)
      Digest::SHA256.file(value).hexdigest
    elsif value.respond_to?(:read)
      Digest::SHA256.hexdigest(value.read)
    else
      Digest::SHA256.hexdigest(value)
    end
  end

  # Base64-encoded MD5 checksum, streaming files the same way.
  def md5(body)
    if body.is_a?(File)
      Base64.encode64(Digest::MD5.file(body).digest).strip
    else
      Base64.encode64(Digest::MD5.digest(body.read)).strip
    end
  end

This reduces the memory footprint significantly. I wrote a small script to gather RSS, and prepared this graph:

[graph: process RSS over time — orange: aws-sdk 2.2.16, red: patched]

Both lines represent memory usage over the lifetime of the process. The orange line represents a run with version 2.2.16; the red one represents the same process with my changes. Notice the significant reduction in memory usage at the end, which is where the S3 sending takes place. I also checked this with ruby-prof and memory_profiler, and the reduction is confirmed by all of those tools. I was going to send a PR, but some tests are not passing, and I also realize this is probably not a change that can be made in a single file, since OpenSSL::Digest seems to be used globally. However, the memory problem is real. How can we solve this?
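
(The RSS-gathering script isn't shown in the issue; a minimal sketch of one way to sample a process's RSS once per second, with the PID source and interval as assumptions, would be:)

# Hypothetical sampler, not the author's actual script: polls the
# resident set size (in KB) of a process via ps(1) once per second.
pid = Integer(ARGV[0] || Process.pid)
loop do
  rss_kb = `ps -o rss= -p #{pid}`.strip.to_i
  puts "#{Time.now.to_i},#{rss_kb}"
  sleep 1
end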

EDIT:

Most of the failing tests check that input is processed in one-megabyte chunks. I think I can provide a PR with backwards-compatible changes, where only files are processed more efficiently (a sketch of the idea follows below).
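
A rough sketch of what such a backwards-compatible hexdigest could look like (an assumption, not the actual PR): File objects are streamed via Digest::SHA256.file, while other IO-like objects keep chunked reads into a single reused 1 MB buffer:

require 'digest'

def hexdigest(value)
  if value.is_a?(File)
    # Stream the file from disk; no whole-body Ruby string is allocated.
    Digest::SHA256.file(value.path).hexdigest
  elsif value.respond_to?(:read)
    digest = Digest::SHA256.new
    buffer = String.new
    # Reuse one buffer for each 1 MB read instead of allocating a new
    # string per chunk.
    digest.update(buffer) while value.read(1024 * 1024, buffer)
    value.rewind if value.respond_to?(:rewind)
    digest.hexdigest
  else
    Digest::SHA256.hexdigest(value)
  end
end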

@awood45 (Member) commented Feb 29, 2016

If you'd like to open a PR, happy to look at it. I think that if the unit tests are based on an assumption that would be changing, it would be okay to modify them (and then we can look at the change as a whole).

awood45 closed this as completed Feb 29, 2016
awood45 reopened this Feb 29, 2016
@awood45 (Member) commented Feb 29, 2016

(Accidental close.)

@georgepalmer

+1 for this. We've been seeing big memory spikes since upgrading a load of gems last week (including the AWS SDK to v2), and I've spent most of the day tracking down the culprit!

@georgepalmer

We've been playing with this today and found that a small correction to the above code is required:

module Aws
  module Signers
    class V4
      def hexdigest(value)
        if value.is_a?(File)
          Digest::SHA256.file(value).hexdigest
        elsif value.respond_to?(:read)
          str = value.read
          # Rewind so the body can be read again when the request is sent.
          value.rewind
          Digest::SHA256.hexdigest(str)
        else
          Digest::SHA256.hexdigest(value)
        end
      end
    end
  end
end
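
The rewind matters because the body is read a second time when the request is actually transmitted; without it, the second read returns nothing. A quick illustration with StringIO:

require 'stringio'

io = StringIO.new("payload")
io.read   # => "payload"  (consumed while computing the digest)
io.read   # => ""         (what would be sent as the request body)
io.rewind
io.read   # => "payload"  (available again after rewinding)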

@trevorrowe (Member)

I went ahead and added a pair of utility methods to the SDK:
https://github.com/aws/aws-sdk-ruby/blob/master/aws-sdk-core/lib/aws-sdk-core/checksums.rb

The sigv4 signer and S3 MD5 plugin both use these methods. This should improve memory usage in these scenarios. There are a couple of outstanding issues:

  • Aws::Glacier::Client - Glacier requires the payload of an uploaded archive to be sent with a tree-hash that is computed from the archive in rolling 1MB chunks. Unless we can get OpenSSL::Digest to compute a digest of a file part, this will have to be done in Ruby. We could read each 1MB chunk in smaller pieces to reduce memory usage, though.
  • Aws::S3::Object#upload_file - Uses a wrapper around a file object to send to Aws::S3::Client#upload_part. These file part wrappers are not File or Tempfile objects themselves, as they represent a slice of the entire file. I'm not sure how to get OpenSSL::Digest to compute a digest of a file part; if I can find a way, we can improve the memory usage here as well. Alternatively, we can reduce the chunk size that we read and pass to the digest (see the sketch after this list).
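
For the second point, a hypothetical helper (names and part boundaries assumed, not SDK code) that digests a file slice in 1MB chunks, so only one chunk-sized string is live at a time:

require 'openssl'

def part_sha256(path, offset, length)
  digest = OpenSSL::Digest.new('SHA256')
  File.open(path, 'rb') do |file|
    file.seek(offset)
    buffer = String.new
    remaining = length
    # Walk the part in 1MB (or smaller) reads, reusing a single buffer.
    while remaining > 0 && file.read([remaining, 1024 * 1024].min, buffer)
      digest.update(buffer)
      remaining -= buffer.bytesize
    end
  end
  digest.hexdigest
end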

Please feel free to re-open this issue to discuss alternatives, or send a PR.

awood45 added a commit that referenced this issue Jun 21, 2016
diehlaws added the feature-request label and removed the enhancement label Jan 4, 2019