-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper #15843
Conversation
Test build #68478 has finished for PR 15843 at commit
|
Test build #68482 has finished for PR 15843 at commit
|
Thanks a lot for finding & reporting this! The fix should probably go in JavaWrapper, not JavaModel, right? I tested this manually (in JavaWrapper), and it seems to fix the problematic case with StringIndexer. |
@jkbradley yes I did it for |
You're right! It's another bug: copy should be implemented in JavaParams, not JavaModel. I'm sending this PR to fix that: techaddict#1 Can you please check it out and merge it into your PR if it looks OK to you? All pyspark.ml tests ran successfully with it. |
* moved copy from JavaModel to JavaParams. mv del from JavaModel to JavaWrapper * added test which fails before this fix
@jkbradley looks good, merged 👍 |
causes this error while quitting pyspark: Exception ignored in: <bound method JavaWrapper.__del__ of StringIndexer_4a75b9e8c92f56703aff> Traceback (most recent call last): File "/Users/pichu/Project/Spark/python/pyspark/ml/wrapper.py", line 37, in __del__ SparkContext._active_spark_context._gateway.detach(self._java_obj) AttributeError: 'NoneType' object has no attribute '_gateway' Exception ignored in: <bound method JavaWrapper.__del__ of StringIndexer_4a75b9e8c92f56703aff> Traceback (most recent call last): File "/Users/pichu/Project/Spark/python/pyspark/ml/wrapper.py", line 37, in __del__ AttributeError: 'NoneType' object has no attribute '_gateway'
@@ -33,6 +33,10 @@ def __init__(self, java_obj=None): | |||
super(JavaWrapper, self).__init__() | |||
self._java_obj = java_obj | |||
|
|||
def __del__(self): | |||
if SparkContext._active_spark_context: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking if there is active spark context, got this error after quit()
in pyspark
Exception ignored in: <bound method JavaWrapper.__del__ of
StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
File "/Users/xx/Project/Spark/python/pyspark/ml/wrapper.py", line
37, in __del__
SparkContext._active_spark_context._gateway.detach(self._java_obj)
AttributeError: 'NoneType' object has no attribute '_gateway'
Exception ignored in: <bound method JavaWrapper.__del__ of
StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
File "/Users/xx/Project/Spark/python/pyspark/ml/wrapper.py", line
37, in __del__
AttributeError: 'NoneType' object has no attribute '_gateway'
Test build #68513 has finished for PR 15843 at commit
|
Test build #68514 has finished for PR 15843 at commit
|
""" | ||
Creates a copy of this instance with the same uid and some | ||
extra params. This implementation first calls Params.copy and | ||
then make a copy of the companion Java model with extra params. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Java model -> Java pipeline component.
Creates a copy of this instance with the same uid and some | ||
extra params. This implementation first calls Params.copy and | ||
then make a copy of the companion Java pipeline component with | ||
extra params. So both the Python wrapper and the Java model get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: There is another "Java model" here too.
LGTM with minor doc comment. |
Test build #68515 has finished for PR 15843 at commit
|
Test build #68516 has finished for PR 15843 at commit
|
So this change looks good to me, but it seems like it fixes more than just the bug described in the JIRA & PR description with @jkbradley's change integrated (namely the issue with param copy which we have). For people who are looking for what's changed between versions it might make sense to explain the copy related fix the PR description as well since that is what is used in the commit log. |
@holdenk updated the description. |
LGTM thanks for fixing this @techaddict :D :) |
ping @davies if you have time for final review/merge? |
Good point @holdenk --- @techaddict could you also please update the PR title to say "JavaWrapper" instead of "StringIndexer"? |
I'm now wondering if |
@jkbradley Sounds making sense more. |
I'm not so sure about that, we still would want to cleanup the underlying Java reference object on delete if it isn't needed anymore. I think the question is do we want to support shallow copy of javawrapper objects? |
Oh. I thought JavaWrapper is only used on JavaParams. But there are also others like LogisticRegressionSummary which directly inherits JavaWrapper. Looks like we should still put |
I don't see a need to do deep copies of model summaries, but I agree I don't like how JavaWrapper is ambiguous about whether it does shallow or deep copies. I'd say the confusion comes from us having a mix of immutable Java types (like model summaries) and mutable Java types (like Params subclasses). What do you think of these 2 options?
I don't think either option does much for enforcing these semantics. Barring GC issues, I'd pick option 1 since it's simpler. But if option 2 is better for GC issues, then I'd vote for it. Thoughts? |
I'd prefer option2 for safety since the model summaries should be an issue for GC. And looks like Java model summaries don't have copy method. |
From the Py4J documentation it seems like we could be leaking memory with the first option, although perhaps not a lot of memory, but if it was being used in an iterative Python algorithm for training many models it could start to have some impact. I'd be in favor of option 2, but that could be done as a follow up issue if the required copy methods aren't generally available. |
@jkbradley @holdenk @techaddict Do we want to implement copy() in JavaWrapper in this PR too? Or separate it to follow ones with the required copy methods on JVM side? |
Sorry for the slow response on this. Given the time pressure for 2.1, let's go with option 1 for now with a follow-up task to implement option 2. It would be great to include this fix in 2.1. @techaddict will you have time to update your PR quickly? Thank you! |
I agree, for a follow up (so we don't lose track of it) - I've created SPARK-18630 but option 1 for now is a strict improvement over the current situation. Thanks for all of your work on this @techaddict |
Test build #69476 has finished for PR 15843 at commit
|
@jkbradley @holdenk @viirya PR updated |
LGTM |
LGTM too Has anyone heard of complaints of this in current use cases of earlier branches? If not, I won't backport it further than 2.0. |
## What changes were proposed in this pull request? In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach` Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams` ## How was this patch tested? ```scala import random, string from pyspark.ml.feature import StringIndexer l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))] # 700000 random strings of 10 characters df = spark.createDataFrame(l, ['string']) for i in range(50): indexer = StringIndexer(inputCol='string', outputCol='index') indexer.fit(df) ``` * Before: would keep StringIndexer strong reference, causing GC issues and is halted midway After: garbage collection works as the object is dereferenced, and computation completes * Mem footprint tested using profiler * Added a parameter copy related test which was failing before. Author: Sandeep Singh <[email protected]> Author: jkbradley <[email protected]> Closes #15843 from techaddict/SPARK-18274. (cherry picked from commit 78bb7f8) Signed-off-by: Joseph K. Bradley <[email protected]>
## What changes were proposed in this pull request? In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach` Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams` ## How was this patch tested? ```scala import random, string from pyspark.ml.feature import StringIndexer l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))] # 700000 random strings of 10 characters df = spark.createDataFrame(l, ['string']) for i in range(50): indexer = StringIndexer(inputCol='string', outputCol='index') indexer.fit(df) ``` * Before: would keep StringIndexer strong reference, causing GC issues and is halted midway After: garbage collection works as the object is dereferenced, and computation completes * Mem footprint tested using profiler * Added a parameter copy related test which was failing before. Author: Sandeep Singh <[email protected]> Author: jkbradley <[email protected]> Closes #15843 from techaddict/SPARK-18274. (cherry picked from commit 78bb7f8) Signed-off-by: Joseph K. Bradley <[email protected]>
## What changes were proposed in this pull request? In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach` Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams` ## How was this patch tested? ```scala import random, string from pyspark.ml.feature import StringIndexer l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))] # 700000 random strings of 10 characters df = spark.createDataFrame(l, ['string']) for i in range(50): indexer = StringIndexer(inputCol='string', outputCol='index') indexer.fit(df) ``` * Before: would keep StringIndexer strong reference, causing GC issues and is halted midway After: garbage collection works as the object is dereferenced, and computation completes * Mem footprint tested using profiler * Added a parameter copy related test which was failing before. Author: Sandeep Singh <[email protected]> Author: jkbradley <[email protected]> Closes apache#15843 from techaddict/SPARK-18274.
## What changes were proposed in this pull request? In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach` Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams` ## How was this patch tested? ```scala import random, string from pyspark.ml.feature import StringIndexer l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))] # 700000 random strings of 10 characters df = spark.createDataFrame(l, ['string']) for i in range(50): indexer = StringIndexer(inputCol='string', outputCol='index') indexer.fit(df) ``` * Before: would keep StringIndexer strong reference, causing GC issues and is halted midway After: garbage collection works as the object is dereferenced, and computation completes * Mem footprint tested using profiler * Added a parameter copy related test which was failing before. Author: Sandeep Singh <[email protected]> Author: jkbradley <[email protected]> Closes apache#15843 from techaddict/SPARK-18274.
What changes were proposed in this pull request?
In
JavaWrapper
's destructor make Java Gateway dereference object in destructor, usingSparkContext._active_spark_context._gateway.detach
Fixing the copying parameter bug, by moving the
copy
method fromJavaModel
toJavaParams
How was this patch tested?
After: garbage collection works as the object is dereferenced, and computation completes