Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper #15843

Closed
wants to merge 10 commits into from

Conversation

techaddict
Copy link
Contributor

@techaddict techaddict commented Nov 10, 2016

What changes were proposed in this pull request?

InJavaWrapper 's destructor make Java Gateway dereference object in destructor, using SparkContext._active_spark_context._gateway.detach
Fixing the copying parameter bug, by moving the copy method from JavaModel to JavaParams

How was this patch tested?

import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
  • Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
    After: garbage collection works as the object is dereferenced, and computation completes
  • Mem footprint tested using profiler
  • Added a parameter copy related test which was failing before.

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68478 has finished for PR 15843 at commit a493c19.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68482 has finished for PR 15843 at commit f25b099.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@techaddict techaddict changed the title [SPARK-18274] Memory leak in PySpark StringIndexer [SPARK-18274][ML][PYSPARK] Memory leak in PySpark StringIndexer Nov 11, 2016
@techaddict
Copy link
Contributor Author

cc: @jkbradley @davies @holdenk

@jkbradley
Copy link
Member

Thanks a lot for finding & reporting this! The fix should probably go in JavaWrapper, not JavaModel, right?

I tested this manually (in JavaWrapper), and it seems to fix the problematic case with StringIndexer.

@techaddict
Copy link
Contributor Author

@jkbradley yes I did it for JavaWrapper first, but try running tests with it gives https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68478/consoleFull

@jkbradley
Copy link
Member

You're right! It's another bug: copy should be implemented in JavaParams, not JavaModel. I'm sending this PR to fix that: techaddict#1

Can you please check it out and merge it into your PR if it looks OK to you? All pyspark.ml tests ran successfully with it.

* moved copy from JavaModel to JavaParams. mv del from JavaModel to JavaWrapper

* added test which fails before this fix
@techaddict
Copy link
Contributor Author

@jkbradley looks good, merged 👍

causes this error while quitting pyspark:

Exception ignored in: <bound method JavaWrapper.__del__ of
StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
  File "/Users/pichu/Project/Spark/python/pyspark/ml/wrapper.py", line
37, in __del__
    SparkContext._active_spark_context._gateway.detach(self._java_obj)
AttributeError: 'NoneType' object has no attribute '_gateway'
Exception ignored in: <bound method JavaWrapper.__del__ of
StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
  File "/Users/pichu/Project/Spark/python/pyspark/ml/wrapper.py", line
37, in __del__
AttributeError: 'NoneType' object has no attribute '_gateway'
@@ -33,6 +33,10 @@ def __init__(self, java_obj=None):
super(JavaWrapper, self).__init__()
self._java_obj = java_obj

def __del__(self):
if SparkContext._active_spark_context:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking if there is active spark context, got this error after quit() in pyspark

Exception ignored in: <bound method JavaWrapper.__del__ of
StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
File "/Users/xx/Project/Spark/python/pyspark/ml/wrapper.py", line
37, in __del__
SparkContext._active_spark_context._gateway.detach(self._java_obj)
AttributeError: 'NoneType' object has no attribute '_gateway'
Exception ignored in: <bound method JavaWrapper.__del__ of
StringIndexer_4a75b9e8c92f56703aff>
Traceback (most recent call last):
File "/Users/xx/Project/Spark/python/pyspark/ml/wrapper.py", line
37, in __del__
AttributeError: 'NoneType' object has no attribute '_gateway'

@SparkQA
Copy link

SparkQA commented Nov 11, 2016

Test build #68513 has finished for PR 15843 at commit 3d858a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2016

Test build #68514 has finished for PR 15843 at commit dc5aee3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class ParquetLogRedirector implements Serializable
    • case class OutputSpec(

"""
Creates a copy of this instance with the same uid and some
extra params. This implementation first calls Params.copy and
then make a copy of the companion Java model with extra params.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Java model -> Java pipeline component.

Creates a copy of this instance with the same uid and some
extra params. This implementation first calls Params.copy and
then make a copy of the companion Java pipeline component with
extra params. So both the Python wrapper and the Java model get
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: There is another "Java model" here too.

@viirya
Copy link
Member

viirya commented Nov 11, 2016

LGTM with minor doc comment.

@SparkQA
Copy link

SparkQA commented Nov 11, 2016

Test build #68515 has finished for PR 15843 at commit 01a80b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2016

Test build #68516 has finished for PR 15843 at commit a76a1fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Nov 11, 2016

So this change looks good to me, but it seems like it fixes more than just the bug described in the JIRA & PR description with @jkbradley's change integrated (namely the issue with param copy which we have). For people who are looking for what's changed between versions it might make sense to explain the copy related fix the PR description as well since that is what is used in the commit log.

@techaddict
Copy link
Contributor Author

@holdenk updated the description.

@holdenk
Copy link
Contributor

holdenk commented Nov 11, 2016

LGTM thanks for fixing this @techaddict :D :)

@holdenk
Copy link
Contributor

holdenk commented Nov 11, 2016

ping @davies if you have time for final review/merge?

@jkbradley
Copy link
Member

Good point @holdenk --- @techaddict could you also please update the PR title to say "JavaWrapper" instead of "StringIndexer"?

@techaddict techaddict changed the title [SPARK-18274][ML][PYSPARK] Memory leak in PySpark StringIndexer [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper Nov 11, 2016
@jkbradley
Copy link
Member

I'm now wondering if __del___ should be in JavaParams instead of JavaWrapper since JavaWrapper does not override copy. Do yall agree it will be safer if it's moved there?

@viirya
Copy link
Member

viirya commented Nov 12, 2016

@jkbradley Sounds making sense more.

@holdenk
Copy link
Contributor

holdenk commented Nov 12, 2016

I'm not so sure about that, we still would want to cleanup the underlying Java reference object on delete if it isn't needed anymore. I think the question is do we want to support shallow copy of javawrapper objects?

@viirya
Copy link
Member

viirya commented Nov 13, 2016

Oh. I thought JavaWrapper is only used on JavaParams. But there are also others like LogisticRegressionSummary which directly inherits JavaWrapper. Looks like we should still put __del__ in JavaWrapper?

@jkbradley
Copy link
Member

I don't see a need to do deep copies of model summaries, but I agree I don't like how JavaWrapper is ambiguous about whether it does shallow or deep copies.

I'd say the confusion comes from us having a mix of immutable Java types (like model summaries) and mutable Java types (like Params subclasses). What do you think of these 2 options?

  • Distinguish mutability within Python wrappers: JavaWrapper is usable for immutable types. JavaParams (or other subtypes, if needed) is usable for mutable types. I.e., __del__ and copy go in JavaParams.
  • Distinguish mutability within Java only: Use the same wrapper types for both in Python, and Java copy methods can do deep or shallow copies. I.e., in JavaWrapper, implement copy() which copies the Java instance, and implement __del__ to release that instance's handle.

I don't think either option does much for enforcing these semantics. Barring GC issues, I'd pick option 1 since it's simpler. But if option 2 is better for GC issues, then I'd vote for it.

Thoughts?

@viirya
Copy link
Member

viirya commented Nov 15, 2016

I'd prefer option2 for safety since the model summaries should be an issue for GC. And looks like Java model summaries don't have copy method.

@holdenk
Copy link
Contributor

holdenk commented Nov 15, 2016

From the Py4J documentation it seems like we could be leaking memory with the first option, although perhaps not a lot of memory, but if it was being used in an iterative Python algorithm for training many models it could start to have some impact. I'd be in favor of option 2, but that could be done as a follow up issue if the required copy methods aren't generally available.

@viirya
Copy link
Member

viirya commented Nov 21, 2016

@jkbradley @holdenk @techaddict Do we want to implement copy() in JavaWrapper in this PR too? Or separate it to follow ones with the required copy methods on JVM side?

@jkbradley
Copy link
Member

Sorry for the slow response on this. Given the time pressure for 2.1, let's go with option 1 for now with a follow-up task to implement option 2. It would be great to include this fix in 2.1.

@techaddict will you have time to update your PR quickly? Thank you!

@holdenk
Copy link
Contributor

holdenk commented Nov 29, 2016

I agree, for a follow up (so we don't lose track of it) - I've created SPARK-18630 but option 1 for now is a strict improvement over the current situation. Thanks for all of your work on this @techaddict

@SparkQA
Copy link

SparkQA commented Dec 1, 2016

Test build #69476 has finished for PR 15843 at commit 37e83e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@techaddict
Copy link
Contributor Author

@jkbradley @holdenk @viirya PR updated

@viirya
Copy link
Member

viirya commented Dec 1, 2016

LGTM

@jkbradley
Copy link
Member

LGTM too
Thanks a lot!
Merging with master, branch-2.1, branch-2.0

Has anyone heard of complaints of this in current use cases of earlier branches? If not, I won't backport it further than 2.0.

asfgit pushed a commit that referenced this pull request Dec 1, 2016
## What changes were proposed in this pull request?
In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`

## How was this patch tested?
```scala
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.

Author: Sandeep Singh <[email protected]>
Author: jkbradley <[email protected]>

Closes #15843 from techaddict/SPARK-18274.

(cherry picked from commit 78bb7f8)
Signed-off-by: Joseph K. Bradley <[email protected]>
asfgit pushed a commit that referenced this pull request Dec 1, 2016
## What changes were proposed in this pull request?
In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`

## How was this patch tested?
```scala
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.

Author: Sandeep Singh <[email protected]>
Author: jkbradley <[email protected]>

Closes #15843 from techaddict/SPARK-18274.

(cherry picked from commit 78bb7f8)
Signed-off-by: Joseph K. Bradley <[email protected]>
@asfgit asfgit closed this in 78bb7f8 Dec 1, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
## What changes were proposed in this pull request?
In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`

## How was this patch tested?
```scala
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.

Author: Sandeep Singh <[email protected]>
Author: jkbradley <[email protected]>

Closes apache#15843 from techaddict/SPARK-18274.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`

## How was this patch tested?
```scala
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.

Author: Sandeep Singh <[email protected]>
Author: jkbradley <[email protected]>

Closes apache#15843 from techaddict/SPARK-18274.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants