
[SPARK-23234][ML][PYSPARK] Remove setting defaults on Java params #20410

Closed
wants to merge 1 commit

Conversation

mgaido91
Contributor

What changes were proposed in this pull request?

SPARK-22799 and SPARK-22797 together cause valid Python test failures. The reason is that the Python API sets its default params via `set`, so an `outputCol` value is always marked as set by the Python API for `Bucketizer`.

How was this patch tested?

The previously failing UTs now pass.
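The Defined-vs-Set distinction at the heart of this issue can be illustrated with a minimal, self-contained sketch. This mocks the behavior of `pyspark.ml.param.Params` rather than using the real class, and the `outputCol` default below is illustrative only:

```python
# Minimal mock (not the actual pyspark.ml code) of the "default" vs "set"
# distinction: a param with a default is isDefined() but not isSet();
# calling set() makes it isSet().

class MockParams:
    def __init__(self):
        self._paramMap = {}         # values explicitly set by the user
        self._defaultParamMap = {}  # default values

    def _setDefault(self, **kwargs):
        self._defaultParamMap.update(kwargs)

    def set(self, name, value):
        self._paramMap[name] = value

    def isSet(self, name):
        return name in self._paramMap

    def isDefined(self, name):
        return name in self._paramMap or name in self._defaultParamMap

    def getOrDefault(self, name):
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]

bucketizer = MockParams()
bucketizer._setDefault(outputCol="output")  # hypothetical default value

# A default alone leaves the param Defined but not Set.
assert bucketizer.isDefined("outputCol")
assert not bucketizer.isSet("outputCol")

# The bug described above: initializing defaults via set() makes every
# default look user-set, indistinguishable from explicit user input.
bucketizer.set("outputCol", "output")
assert bucketizer.isSet("outputCol")
```

Because the JVM side checks `isSet` to decide behavior (e.g. whether both single- and multi-column params were set), a default that arrives as "set" can trigger spurious validation failures.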

@SparkQA

SparkQA commented Jan 26, 2018

Test build #86711 has finished for PR 20410 at commit 13e4731.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 mgaido91 changed the title [SPARK-23234][ML][PYSPARK] Remove default outputCol in python [SPARK-23234][ML][PYSPARK] Remove setting defaults on Java params Jan 26, 2018
@SparkQA

SparkQA commented Jan 26, 2018

Test build #86717 has finished for PR 20410 at commit 7abf62f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 26, 2018

CC @zhengruifeng @MLnick

@MLnick
Contributor

MLnick commented Jan 26, 2018

We should just revert SPARK-22797 for now to unblock others.

SPARK-22799 itself is not the cause per se (it passed tests) but after it was merged SPARK-22797 causes the failure.

@MLnick
Contributor

MLnick commented Jan 26, 2018

I think this is somewhat related to #15113

cc @BryanCutler

@MLnick
Contributor

MLnick commented Jan 26, 2018

I reverted #19892 in master (f5911d4) / branch-2.3 (a8a3e9b), so that other test runs can be unblocked.

@mgaido91
Contributor Author

I think that the problem is not SPARK-22797. The problem is that, before this PR, the Python API considers all parameters with a default value as Defined but not Set, while the Scala/Java class backing it considers all of them as Set.

This has come up in this case, but it can cause other problems, both now and in the future, because it creates an inconsistency between the Python API and its representation in the JVM backend.

Thus I do believe that this PR is needed, and not only as a fix for the test failures. I think this is a first step; a later second step would be to drop all the `setDefault` calls in the Python API in favor of retrieving the defaults from the JVM backend. That way we can be sure there is no logical inconsistency between the API and the backend.

Unfortunately, this second part is much bigger and has a large impact, so I think it would need a design doc or something similar.

Member

@BryanCutler left a comment


I agree that this is the root of the problem from SPARK-22797. To be careful, though, I think we should still transfer defaults, and just make sure they don't get "set" when transferred. Can you add a test for this?

```diff
-if param in paramMap:
-    pair = self._make_java_param_pair(param, paramMap[param])
+if param in self._paramMap:
+    pair = self._make_java_param_pair(param, self._paramMap[param])
```
Member


The intent would be clearer if you called

```python
if self.isSet(param):
    pair = self._make_java_param_pair(param, self.getOrDefault(param))
```

And to be 100% consistent with Scala, it might be a good idea to transfer default values anyway. That way, in case the Python default was somehow changed or differed from the Scala one, it wouldn't cause an issue that would be really hard to detect.

Contributor Author


Thanks for your comment. There is only one problem: you can't transfer the default values to Scala without using `set`. Indeed, `setDefault` is protected, so it can't be called from Python.
Moreover, we already have test cases which ensure that the defaults in Python and Scala are the same. And since the user can't change a default, we are on the safe side. As I said before, I think a good next step would be to read the defaults from Scala in Python; this would make sure they are always consistent.
Thanks for the suggestion, but I don't really agree that using `getOrDefault` would make the intent clearer; on the contrary, I think it might be confusing. I can use `isSet` as you suggested instead, or I can add a comment. What do you think?

@mgaido91
Contributor Author

Any more comments on this?

@mgaido91
Contributor Author

mgaido91 commented Feb 7, 2018

kindly ping @BryanCutler @MLnick @zhengruifeng

@BryanCutler
Member

@mgaido91, this is actually the same fix as in #18982, except that that PR also transfers the default values to Java, which I still think is a good idea: if users define their own models, we shouldn't assume that the default values are the same.

@mgaido91
Contributor Author

mgaido91 commented Feb 8, 2018

@BryanCutler yes, I see it only now, sorry. I am fine with either your change or this one. Personally, I think it is not a good idea to transfer defaults, because it means we might end up with differences between the Scala and Python APIs, but that is a minor point. So if the community agrees that we should go on with your change, I can close this. If instead the community agrees that this is the right approach, we can go on with this one.

Anyway, I think that either this PR or yours should be merged, since this is a real problem which arises from time to time and should be solved.

@BryanCutler
Member

Thanks @mgaido91, #18982 had been sitting for so long that I forgot it was really the same issue as here! As for the defaults, I think the assumption that Python and Scala have the same value is fine within Spark, but if users are making their own models and wrappers, then it's not safe to assume that. It's not that big of a deal though, so I'd like to hear others' opinions; this should get fixed either way.

@holdenk
Contributor

holdenk commented Mar 16, 2018

@BryanCutler I think it would be an OK assumption that the default values are the same between Python and Scala, provided we add that to the params docs. What do @HyukjinKwon / @viirya think?

@HyukjinKwon
Member

So .. #18982 takes over this and I think SPARK-23234 is a duplicate of SPARK-21685?

@mgaido91
Contributor Author

Yes, if we go on with that, I can close this one. Thanks @HyukjinKwon.
