[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper #15843

techaddict · 2016-11-10T16:24:15Z

What changes were proposed in this pull request?

In JavaWrapper's destructor, make the Java Gateway dereference the object, using SparkContext._active_spark_context._gateway.detach
Fix the parameter-copying bug by moving the copy method from JavaModel to JavaParams

How was this patch tested?

import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)

Before: a strong reference to each StringIndexer was kept, causing GC issues, and the run halted midway
After: garbage collection works because the object is dereferenced, and the computation completes
Memory footprint tested using a profiler
Added a parameter-copy test that failed before this change.
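
The destructor change described above can be pictured with a small stand-in that runs without Spark. This is a minimal sketch, not the PySpark code: `FakeGateway` and `Wrapper` are hypothetical names, and only the `detach` call mirrors the role of the gateway method named in the description.

```python
import gc


class FakeGateway:
    """Hypothetical stand-in for a Py4J gateway; tracks live JVM-side handles."""

    def __init__(self):
        self.live = set()

    def new_handle(self, name):
        self.live.add(name)
        return name

    def detach(self, handle):
        # Plays the role of the gateway's detach: drop the JVM-side reference.
        self.live.discard(handle)


class Wrapper:
    """Hypothetical Python-side wrapper around a gateway handle."""

    def __init__(self, gateway, handle):
        self._gateway = gateway
        self._java_obj = handle

    def __del__(self):
        # Release the handle when the Python object is collected.
        self._gateway.detach(self._java_obj)


gateway = FakeGateway()
for i in range(50):
    # Rebinding w drops the previous wrapper, which triggers its __del__.
    w = Wrapper(gateway, gateway.new_handle("indexer_%d" % i))
del w
gc.collect()
print(len(gateway.live))  # 0 on CPython, where reference counting runs __del__ promptly
```

Without the `__del__` method, all 50 handles would stay in `gateway.live`, which is the shape of the leak the loop in the test above was meant to expose.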

SparkQA · 2016-11-10T16:48:28Z

Test build #68478 has finished for PR 15843 at commit a493c19.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-10T17:08:24Z

Test build #68482 has finished for PR 15843 at commit f25b099.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

techaddict · 2016-11-11T02:15:49Z

cc: @jkbradley @davies @holdenk

jkbradley · 2016-11-11T06:49:32Z

Thanks a lot for finding & reporting this! The fix should probably go in JavaWrapper, not JavaModel, right?

I tested this manually (in JavaWrapper), and it seems to fix the problematic case with StringIndexer.

techaddict · 2016-11-11T06:51:03Z

@jkbradley yes, I did it for JavaWrapper first, but running the tests with it gives https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68478/consoleFull

jkbradley · 2016-11-11T07:48:45Z

You're right! It's another bug: copy should be implemented in JavaParams, not JavaModel. I'm sending this PR to fix that: techaddict#1

Can you please check it out and merge it into your PR if it looks OK to you? All pyspark.ml tests ran successfully with it.

* moved copy from JavaModel to JavaParams. mv del from JavaModel to JavaWrapper
* added test which fails before this fix
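
One way to see why copy and the destructor interact is the sketch below. It is an illustration under assumptions rather than the PySpark implementation: `FakeGateway`, `Wrapper`, `shallow_copy`, and `deep_copy` are hypothetical names. The point is that once a wrapper releases its handle on deletion, a copy that shares the handle is left pointing at a released object, while a copy that owns its own handle is not.

```python
import copy
import gc


class FakeGateway:
    """Hypothetical gateway stand-in; handles are plain integers."""

    def __init__(self):
        self.live = set()
        self._next = 0

    def new_handle(self):
        self._next += 1
        self.live.add(self._next)
        return self._next

    def detach(self, handle):
        self.live.discard(handle)


class Wrapper:
    """Hypothetical wrapper that releases its handle when collected."""

    def __init__(self, gateway, handle):
        self._gateway = gateway
        self._java_obj = handle

    def __del__(self):
        self._gateway.detach(self._java_obj)

    def is_valid(self):
        return self._java_obj in self._gateway.live

    def shallow_copy(self):
        # Shares the same handle as the original.
        return copy.copy(self)

    def deep_copy(self):
        # Gets its own handle; new_handle() stands in for copying the JVM object.
        that = copy.copy(self)
        that._java_obj = self._gateway.new_handle()
        return that


gateway = FakeGateway()

original = Wrapper(gateway, gateway.new_handle())
shared = original.shallow_copy()
del original
gc.collect()
shallow_ok = shared.is_valid()  # the shared handle was released with the original

original = Wrapper(gateway, gateway.new_handle())
owned = original.deep_copy()
del original
gc.collect()
deep_ok = owned.is_valid()  # the copy owns a separate handle

print(shallow_ok, deep_ok)
```

In this sketch the shallow copy ends up invalid and the deep copy stays valid, which is consistent with putting copy on the same class that carries the destructor.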

techaddict · 2016-11-11T07:56:22Z

@jkbradley looks good, merged 👍

causes this error while quitting pyspark:

Exception ignored in: <bound method JavaWrapper.__del__ of StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
  File "/Users/pichu/Project/Spark/python/pyspark/ml/wrapper.py", line 37, in __del__
    SparkContext._active_spark_context._gateway.detach(self._java_obj)
AttributeError: 'NoneType' object has no attribute '_gateway'
Exception ignored in: <bound method JavaWrapper.__del__ of StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
  File "/Users/pichu/Project/Spark/python/pyspark/ml/wrapper.py", line 37, in __del__
AttributeError: 'NoneType' object has no attribute '_gateway'

techaddict · 2016-11-11T08:06:56Z

python/pyspark/ml/wrapper.py

@@ -33,6 +33,10 @@ def __init__(self, java_obj=None):
        super(JavaWrapper, self).__init__()
        self._java_obj = java_obj

+    def __del__(self):
+        if SparkContext._active_spark_context:


Checking whether there is an active Spark context; without the check I got this error after quit() in pyspark:

Exception ignored in: <bound method JavaWrapper.__del__ of StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
  File "/Users/xx/Project/Spark/python/pyspark/ml/wrapper.py", line 37, in __del__
    SparkContext._active_spark_context._gateway.detach(self._java_obj)
AttributeError: 'NoneType' object has no attribute '_gateway'
Exception ignored in: <bound method JavaWrapper.__del__ of StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
  File "/Users/xx/Project/Spark/python/pyspark/ml/wrapper.py", line 37, in __del__
AttributeError: 'NoneType' object has no attribute '_gateway'
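
The guard in the diff above can be exercised in isolation. The following is a minimal sketch with hypothetical stand-ins (`FakeSparkContext`, `FakeGateway`, `safe_detach`, `Wrapper`); it only assumes what the traceback shows, namely that the active-context slot can already be `None` when `__del__` runs.

```python
class FakeGateway:
    """Hypothetical gateway stand-in that records detached handles."""

    def __init__(self):
        self.detached = []

    def detach(self, handle):
        self.detached.append(handle)


class FakeSparkContext:
    """Hypothetical stand-in; the class attribute plays the active-context slot."""

    _active_spark_context = None

    def __init__(self, gateway):
        self._gateway = gateway


def safe_detach(java_obj):
    """Detach only if a context is still active; return whether it happened."""
    ctx = FakeSparkContext._active_spark_context
    if ctx is None:
        return False
    ctx._gateway.detach(java_obj)
    return True


class Wrapper:
    def __init__(self, java_obj):
        self._java_obj = java_obj

    def __del__(self):
        safe_detach(self._java_obj)


gateway = FakeGateway()

# Context already torn down, as after quit(): __del__ must do nothing.
FakeSparkContext._active_spark_context = FakeSparkContext(gateway)
w = Wrapper("indexer")
FakeSparkContext._active_spark_context = None
del w
detached_after_stop = list(gateway.detached)

# Context still active: the handle is detached as intended.
FakeSparkContext._active_spark_context = FakeSparkContext(gateway)
w = Wrapper("model")
del w
detached_while_active = list(gateway.detached)
FakeSparkContext._active_spark_context = None

print(detached_after_stop, detached_while_active)
```

Reading the slot into a local variable before using it keeps the check and the call on the same object, which is slightly more defensive than testing the attribute and then dereferencing it again.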

SparkQA · 2016-11-11T08:18:05Z

Test build #68513 has finished for PR 15843 at commit 3d858a2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-11T08:27:55Z

Test build #68514 has finished for PR 15843 at commit dc5aee3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- final class ParquetLogRedirector implements Serializable
- case class OutputSpec(

viirya · 2016-11-11T08:55:32Z

python/pyspark/ml/wrapper.py

+        """
+        Creates a copy of this instance with the same uid and some
+        extra params. This implementation first calls Params.copy and
+        then make a copy of the companion Java model with extra params.


nit: Java model -> Java pipeline component.

viirya · 2016-11-11T09:00:49Z

python/pyspark/ml/wrapper.py

+        Creates a copy of this instance with the same uid and some
+        extra params. This implementation first calls Params.copy and
+        then make a copy of the companion Java pipeline component with
+        extra params. So both the Python wrapper and the Java model get


nit: There is another "Java model" here too.

viirya · 2016-11-11T09:01:47Z

LGTM with minor doc comment.

SparkQA · 2016-11-11T09:22:59Z

Test build #68515 has finished for PR 15843 at commit 01a80b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-11T09:28:00Z

Test build #68516 has finished for PR 15843 at commit a76a1fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-11-11T19:43:50Z

So this change looks good to me, but with @jkbradley's change integrated it seems to fix more than just the bug described in the JIRA & PR description (namely, the param copy issue we have). For people looking at what changed between versions, it might make sense to explain the copy-related fix in the PR description as well, since that is what is used in the commit log.

techaddict · 2016-11-11T19:53:15Z

@holdenk updated the description.

holdenk · 2016-11-11T21:13:48Z

LGTM thanks for fixing this @techaddict :D :)

holdenk · 2016-11-11T21:20:15Z

ping @davies if you have time for final review/merge?

jkbradley · 2016-11-11T21:39:50Z

Good point @holdenk --- @techaddict could you also please update the PR title to say "JavaWrapper" instead of "StringIndexer"?

jkbradley · 2016-11-11T23:50:40Z

I'm now wondering if __del__ should be in JavaParams instead of JavaWrapper, since JavaWrapper does not override copy. Do y'all agree it will be safer if it's moved there?

viirya · 2016-11-12T01:46:45Z

@jkbradley That makes more sense.

holdenk · 2016-11-12T17:50:00Z

I'm not so sure about that; we would still want to clean up the underlying Java reference object on delete if it isn't needed anymore. I think the question is: do we want to support shallow copies of JavaWrapper objects?

viirya · 2016-11-13T00:10:52Z

Oh. I thought JavaWrapper was only used by JavaParams. But there are also others, like LogisticRegressionSummary, which inherit directly from JavaWrapper. Looks like we should still put __del__ in JavaWrapper?

jkbradley · 2016-11-14T19:28:27Z

I don't see a need to do deep copies of model summaries, but I agree I don't like how JavaWrapper is ambiguous about whether it does shallow or deep copies.

I'd say the confusion comes from us having a mix of immutable Java types (like model summaries) and mutable Java types (like Params subclasses). What do you think of these 2 options?

1. Distinguish mutability within Python wrappers: JavaWrapper is usable for immutable types. JavaParams (or other subtypes, if needed) is usable for mutable types. I.e., __del__ and copy go in JavaParams.
2. Distinguish mutability within Java only: Use the same wrapper types for both in Python, and Java copy methods can do deep or shallow copies. I.e., in JavaWrapper, implement copy() which copies the Java instance, and implement __del__ to release that instance's handle.

I don't think either option does much for enforcing these semantics. Barring GC issues, I'd pick option 1 since it's simpler. But if option 2 is better for GC issues, then I'd vote for it.

Thoughts?
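
As a rough picture of option 1, the sketch below splits the two roles across a hypothetical class pair. `FakeGateway`, `Wrapper`, and `ParamsWrapper` are illustrative names only, and `new_handle()` stands in for copying the JVM-side object; none of this is the PySpark code.

```python
import gc


class FakeGateway:
    """Hypothetical gateway stand-in; handles are plain integers."""

    def __init__(self):
        self.live = set()
        self._next = 0

    def new_handle(self):
        self._next += 1
        self.live.add(self._next)
        return self._next

    def detach(self, handle):
        self.live.discard(handle)


class Wrapper:
    """Option 1: plain wrapper for immutable JVM objects; no copy, no __del__."""

    def __init__(self, gateway, handle):
        self._gateway = gateway
        self._java_obj = handle


class ParamsWrapper(Wrapper):
    """Option 1: mutable types own their handle, so copy and __del__ live here."""

    def copy(self):
        # The copy gets a fresh handle instead of sharing this one.
        return ParamsWrapper(self._gateway, self._gateway.new_handle())

    def __del__(self):
        self._gateway.detach(self._java_obj)


gateway = FakeGateway()
summary = Wrapper(gateway, gateway.new_handle())       # handle 1
params = ParamsWrapper(gateway, gateway.new_handle())  # handle 2
clone = params.copy()                                  # handle 3

del summary, params
gc.collect()
remaining = sorted(gateway.live)
print(remaining)  # [1, 3]: the plain wrapper's handle is never released
```

In this sketch the copy stays valid after the original is deleted, while the plain wrapper's handle is never released; that unreleased handle is the cost option 1 accepts in exchange for simplicity.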

viirya · 2016-11-15T04:30:57Z

I'd prefer option 2 for safety, since the model summaries could also be an issue for GC. Also, it looks like the Java model summaries don't have a copy method.

holdenk · 2016-11-15T16:07:02Z

From the Py4J documentation it seems like we could be leaking memory with the first option, although perhaps not a lot of memory, but if it was being used in an iterative Python algorithm for training many models it could start to have some impact. I'd be in favor of option 2, but that could be done as a follow up issue if the required copy methods aren't generally available.

viirya · 2016-11-21T15:32:59Z

@jkbradley @holdenk @techaddict Do we want to implement copy() in JavaWrapper in this PR too? Or split it into follow-up PRs that add the required copy methods on the JVM side?

jkbradley · 2016-11-28T20:08:57Z

Sorry for the slow response on this. Given the time pressure for 2.1, let's go with option 1 for now with a follow-up task to implement option 2. It would be great to include this fix in 2.1.

@techaddict will you have time to update your PR quickly? Thank you!

holdenk · 2016-11-29T18:58:49Z

I agree, for a follow up (so we don't lose track of it) - I've created SPARK-18630 but option 1 for now is a strict improvement over the current situation. Thanks for all of your work on this @techaddict

SparkQA · 2016-12-01T14:14:54Z

Test build #69476 has finished for PR 15843 at commit 37e83e8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

techaddict · 2016-12-01T14:27:48Z

@jkbradley @holdenk @viirya PR updated

viirya · 2016-12-01T14:37:23Z

LGTM

jkbradley · 2016-12-01T21:22:22Z

LGTM too
Thanks a lot!
Merging with master, branch-2.1, branch-2.0

Has anyone heard of complaints of this in current use cases of earlier branches? If not, I won't backport it further than 2.0.

## What changes were proposed in this pull request?

In `JavaWrapper`'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`

## How was this patch tested?

```python
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```

* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
  After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.

Author: Sandeep Singh
Author: jkbradley

Closes #15843 from techaddict/SPARK-18274.

(cherry picked from commit 78bb7f8)
Signed-off-by: Joseph K. Bradley

techaddict added 2 commits November 10, 2016 21:46

[SPARK-18274] Memory leak in PySpark StringIndexer

a493c19

change

f25b099

techaddict changed the title ~~[SPARK-18274] Memory leak in PySpark StringIndexer~~ [SPARK-18274][ML][PYSPARK] Memory leak in PySpark StringIndexer Nov 11, 2016

Fixing copy bug (#1)

3d858a2

* moved copy from JavaModel to JavaParams. mv del from JavaModel to JavaWrapper
* added test which fails before this fix

techaddict added 2 commits November 11, 2016 13:34

Merge branch 'master' into SPARK-18274

dc5aee3

viirya reviewed Nov 11, 2016

nit: doc fix

01a80b9

viirya reviewed Nov 11, 2016

doc

a76a1fb

Merge branch 'master' into SPARK-18274

d7848c8

techaddict changed the title ~~[SPARK-18274][ML][PYSPARK] Memory leak in PySpark StringIndexer~~ [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper Nov 11, 2016

techaddict added 2 commits December 1, 2016 18:50

Merge branch 'master' into SPARK-18274

ddad40b

move del to javaparams

37e83e8

asfgit closed this in 78bb7f8 Dec 1, 2016
