[SPARK-23234][ML][PYSPARK] Remove setting defaults on Java params #20410
Conversation
Test build #86711 has finished for PR 20410 at commit

Test build #86717 has finished for PR 20410 at commit
We should just revert SPARK-22797 for now to unblock others. SPARK-22799 itself is not the cause per se (it passed tests), but after it was merged, SPARK-22797 causes the failure.

I think this is somewhat related to #15113 cc @BryanCutler
I think that the problem is not SPARK-22797. The problem is that, before this PR, the Python API considered all parameters with a default value as Defined but not Set, while the Scala/Java class representing them considered all of them as Set. This has come up in this case, but it can cause other problems both now and in the future, because it creates an inconsistency between the Python API and the representation in the JVM backend. Thus I do believe that this PR is needed, and that it is not only a fix for the test failures. I think this is a first step, and a second step would later be to drop all the. Unfortunately, this second part is much bigger and has a large impact, so I think it would need a design doc or something similar.
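The Defined-vs-Set distinction described above can be illustrated with a minimal, self-contained sketch of PySpark-style param handling (names and structure are simplified stand-ins for `pyspark.ml.param.Params`, not the real implementation):

```python
class Params:
    """Toy model of PySpark param semantics (illustrative only)."""

    def __init__(self):
        self._paramMap = {}         # params explicitly set by the user
        self._defaultParamMap = {}  # params that merely have a default

    def _setDefault(self, **kwargs):
        self._defaultParamMap.update(kwargs)

    def set(self, name, value):
        self._paramMap[name] = value

    def isSet(self, name):
        # True only for params the user explicitly set
        return name in self._paramMap

    def isDefined(self, name):
        # True if the param is set OR has a default value
        return name in self._paramMap or name in self._defaultParamMap

    def getOrDefault(self, name):
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]


p = Params()
p._setDefault(outputCol="bucketed_output")
print(p.isDefined("outputCol"))  # True: it has a default
print(p.isSet("outputCol"))      # False: the user never set it
```

The inconsistency arises when the Python side reports `isSet(...) == False` for such a param while the JVM side, having received the default via `set`, reports it as Set.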
I agree that this is the root of the problem from SPARK-22797. I think we should still transfer defaults just to be careful, though; just make sure they don't get "set" when transferred. Can you add a test for this?
    if param in paramMap:
        pair = self._make_java_param_pair(param, paramMap[param])
    if param in self._paramMap:
        pair = self._make_java_param_pair(param, self._paramMap[param])
The intent would be clearer if you called:

    if self.isSet(param):
        pair = self._make_java_param_pair(param, self.getOrDefault(param))
And to be 100% consistent with Scala, it might be a good idea to transfer default values anyway. That way, if the Python default was somehow changed or differed from the Scala one, it wouldn't cause an issue that would be really hard to detect.
Thanks for your comment. There is only one problem: you can't transfer the default values to Scala without using `set`. Indeed, `setDefault` is `protected`, so it can't be called from Python.

Moreover, we already have test cases which ensure that the defaults in Python and Scala are the same. And since the user can't change a default, we are on the safe side. As I said before, I think a good next step would be to read the defaults from Scala into Python, which would make sure they are always consistent.

Thanks for the suggestion, but I don't really agree that using `getOrDefault` would make the intent clearer: on the contrary, I think it might be confusing. I can use `isSet` as you suggested instead, or I can add a comment; what do you think?
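The transfer behavior under discussion can be sketched standalone (the `JavaObjStub` class and the param names are hypothetical stand-ins; real PySpark builds a Java param pair via `_make_java_param_pair` and pushes it to the wrapped JVM object):

```python
class JavaObjStub:
    """Hypothetical stand-in for the wrapped JVM object."""

    def __init__(self):
        self.explicitly_set = {}

    def set(self, name, value):
        self.explicitly_set[name] = value


def transfer_params_to_java(user_set_params, default_params, java_obj):
    """Transfer only explicitly set params to the JVM side.

    Defaults are intentionally NOT pushed: setDefault is protected on
    the Scala side, and pushing defaults via set() would make the JVM
    object report them as user-set.
    """
    for name, value in user_set_params.items():
        java_obj.set(name, value)


java_obj = JavaObjStub()
transfer_params_to_java({"inputCol": "features"},
                        {"outputCol": "bucketed"},
                        java_obj)
print(java_obj.explicitly_set)  # {'inputCol': 'features'}
```

Under this scheme the JVM side keeps treating `outputCol` as a default rather than an explicitly set value, which is the behavior this PR aims for.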
any more comments on this?

kindly ping @BryanCutler @MLnick @zhengruifeng
@BryanCutler yes, I see it only now, sorry. I am fine with either your change or this one. Personally, I think it is not a good idea to transfer defaults, because it means we might have differences between the Scala and the Python API, but that is a minor thing, I think. So if the community agrees that we should go on with your change, I can close this; if the community agrees that this is the right thing to do instead, we can go on with this one. Anyway, I think that either this or your PR should be merged, since this is a real problem which arises from time to time and should be solved.
Thanks @mgaido91, #18982 had been sitting for so long, I forgot it was really the same issue as here! As for the defaults, I think the assumption that Python and Scala have the same value is fine within Spark, but if users are making their own models and wrappers, then it's not safe to assume that. It's not that big of a deal though, so I'd like to hear others' opinions; this should get fixed either way.
@BryanCutler I think it would be an OK assumption to make that the default values should be the same between Python and Scala, provided we maybe added it to the params docs? What do @HyukjinKwon / @viirya think?

So .. #18982 takes over this, and I think SPARK-23234 is a duplicate of SPARK-21685?

yes, if we go on with that, I can close this. Thanks @HyukjinKwon.
What changes were proposed in this pull request?
SPARK-22799 and SPARK-22797 are causing valid Python test failures. The reason is that Python sets the default params with `set`, so an `outputCol` value is always set by the Python API for `Bucketizer`.
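The failure mode can be modeled with a short, self-contained sketch (the `JvmModel` class below is a toy stand-in for the Scala-side `Params`; the real code lives in `pyspark.ml.wrapper` and `org.apache.spark.ml.param`):

```python
class JvmModel:
    """Toy model of Scala-side param state (illustrative only)."""

    def __init__(self):
        self._set = {}
        self._defaults = {"outputCol": "Bucketizer_output"}

    def set(self, name, value):
        self._set[name] = value

    def isSet(self, name):
        return name in self._set


# Buggy transfer: the Python wrapper pushes its defaults with set(),
# so the JVM side reports outputCol as explicitly set.
buggy = JvmModel()
for name, value in {"outputCol": "Bucketizer_output"}.items():
    buggy.set(name, value)
print(buggy.isSet("outputCol"))  # True, although the user never set it

# Fixed transfer: only user-set params are pushed (none here),
# so outputCol remains a default on the JVM side.
fixed = JvmModel()
print(fixed.isSet("outputCol"))  # False, as intended
```

In the real `Bucketizer` case, the spuriously set `outputCol` is what triggered the valid test failures this PR addresses.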
How was this patch tested?
The previously failing UTs now pass.