Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON::Util::URI.parse memory leak #329

Closed
skippy opened this issue Jun 3, 2016 · 7 comments
Closed

JSON::Util::URI.parse memory leak #329

skippy opened this issue Jun 3, 2016 · 7 comments

Comments

@skippy
Copy link

skippy commented Jun 3, 2016

hey folks,

I benchmarked my app, which does JSON Schema validation in a tight loop (10ks of validations per minute per thread, 10m of validations per run), and I noticed the following:

  • the single largest class of objects, by far, and across my entire code base, is Addressable::URI
  • the Addressable::URI objects are not being GC'ed

Using the (damn wonderful) stackprof gem, almost all the Addressable URI objects are created by this helper: https://github.com/ruby-json-schema/json-schema/blob/master/lib/json-schema/util/uri.rb#L26. 500m+ Addressable URI objects are created per run, as my json-schemas use $refs quite a bit.

But what is odd is none of these objects are being GC'ed. At a cursory glance of the code, it seems that they should be as #dup is used. However, if I remove the class-level cache and the use of #dup, like so:

module JSON
  module Util
    module URI

      def self.parse(uri)
        # do NOT parse a Addressable::URI; behind the scenes it creates a dup
        # which leaves a reference to the parent object (as it is not duping the 
        # string objects), which causes a memory explosion
        # 
        return uri if uri.is_a?(Addressable::URI)
        Addressable::URI.parse(uri)
      rescue Addressable::URI::InvalidURIError => e
        raise JSON::Schema::UriError.new(e.message)
      end

    end
  end
end

the memory leak is gone. My process went from growing to ~ 1.5GB per run to staying at a consistent ~ 250MB.

I suspect the issue is the @parse_cache combined with the dup. The #dup creates a new object, and within Addressable::URI it calls #dup on the strings. However my understanding is that this is a copy-on-write process (see http://www.justskins.com/forums/memory-behavior-of-string-8195.html), which means that a reference will be kept around to the shared parent object until it is modified. All it would take is one string reference from the parent object (which is never deleted as it is stored within @parse_cache) to stick around to prevent any of the dupped object from being GC'ed. I haven't dived into this a ton or into the C code (outside of benchmarking and patching) but this is what I suspect is going on.

The above patch is marginally slower as it has to parse the strings with regexs (which is slow, plus the regexs used by Addressable::URI are not class variables but created every time, which is slower and less efficient). BUT I think the slight speed hit (which I only benchmarked within my app, so while it wasn't in isolation, within my app it caused less than a 0.01% decrease in speed) is more than acceptable in exchange for plugging a memory leak.

Thoughts?

@RST-J
Copy link
Contributor

RST-J commented Jul 11, 2016

Nice work, thanks a lot! I agree with your comment wrt the performance vs. memory tradeoff.
@iainbeeston Did you also work on that caching stuff? I know some contributer started it and we changed something in the meantime.
What I want to say is, it would be good if someone who knows more about that code would create a PR (or if you'd be up to, you can also open a PR @skippy).

@iainbeeston
Copy link
Contributor

iainbeeston commented Jul 11, 2016 via email

@iainbeeston
Copy link
Contributor

iainbeeston commented Jul 11, 2016 via email

@skippy
Copy link
Author

skippy commented Jul 11, 2016

@iainbeeston In my case the cache was just a couple of dozen keys; the problem seemed to be the use of Addressable::URI#dup, which caused each of those keys to hold onto millions of objects. While clearing out the cache after every parse would be nice, it looks like caching here is of minimal benefit from a CPU perspective.

iainbeeston added a commit to iainbeeston/json-schema that referenced this issue Oct 1, 2016
There are two main caches in json-schema. One is for schemas, one is for
parsed uris. Previously, setting `:clear_cache` as a validation option
would clear the schema cache but not the uri cache. This has been
reported to cause a memory leak (see issue voxpupuli#329) as over time the uri
cache gets bigger and bigger and is never cleared.

This change makes sure that whenever the schema cache is cleared, the
uri cache is also cleared.

Incidentally this change is not thread safe, and multithreaded
applications that clear the cache may see  issues. But until we can
refactor to either disable caching or get rid of class variables,
I don't believe there's much we can (realistically) do about that. This
is still an improvement on what's gone before.
@iainbeeston
Copy link
Contributor

I've made a potential fix for this in #361 - it's not a long-term solution but it solves the immediate issue.

iainbeeston added a commit to iainbeeston/json-schema that referenced this issue Oct 1, 2016
There are two main caches in json-schema. One is for schemas, one is for
parsed uris. Previously, setting `:clear_cache` as a validation option
would clear the schema cache but not the uri cache. This has been
reported to cause a memory leak (see issue voxpupuli#329) as over time the uri
cache gets bigger and bigger and is never cleared.

This change makes sure that whenever the schema cache is cleared, the
uri cache is also cleared.

Incidentally this change is not thread safe, and multithreaded
applications that clear the cache may see  issues. But until we can
refactor to either disable caching or get rid of class variables,
I don't believe there's much we can (realistically) do about that. This
is still an improvement on what's gone before.
iainbeeston added a commit to iainbeeston/json-schema that referenced this issue Oct 2, 2016
There are two main caches in json-schema. One is for schemas, one is for
parsed uris. Previously, setting `:clear_cache` as a validation option
would clear the schema cache but not the uri cache. This has been
reported to cause a memory leak (see issue voxpupuli#329) as over time the uri
cache gets bigger and bigger and is never cleared.

This change makes sure that whenever the schema cache is cleared, the
uri cache is also cleared.

Incidentally this change is not thread safe, and multithreaded
applications that clear the cache may see  issues. But until we can
refactor to either disable caching or get rid of class variables,
I don't believe there's much we can (realistically) do about that. This
is still an improvement on what's gone before.
@iainbeeston
Copy link
Contributor

I'm closing this now #361 has been merged. Should be available in the next release

@skippy
Copy link
Author

skippy commented Oct 5, 2016

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants