-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes #3588
Conversation
…sValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
replace TypeTag with explicit datatype
…rossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
Test build #24103 has finished for PR 3588 at commit
|
transformSchema(dataset.schema, paramMap, logging = true) | ||
stages.foldLeft(dataset)((cur, transformer) => transformer.transform(cur, paramMap)) | ||
// Precedence of ParamMaps: paramMap > this.paramMap > fittingParamMap | ||
val map = (fittingParamMap ++ this.paramMap) ++ fittingParamMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get the logic here. this.paramMap
contains only parameters to the Pipeline
instance and the input paramMap
is not included. fittingParamMap
is for record purpose. All relevant parameters should be inherited from the parent algorithm to model.paramMap
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I think this was the wrong way to fix the bug.
The bug was:
The model was training with hashingTF.numFeatures features (10, 100 or 1000), but when PipelineModel.transform() was called, HashingTF used the default numFeatures.
I probably should fix it by changing Pipeline.fit() to merge all relevant parameters from the paramMap passed to fit() into the transformers. (The params are stored in Models but not other Transformers right now.) I'll make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I see the issue now. Could we create a separate PR? This is not a blocker for the release, and it might need some discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think it was just a typo. The comment is correct. It should be:
val map = (fittingParamMap ++ this.paramMap) ++ paramMap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, thanks!
…V and Params for spark.ml
@mengxr Thanks for reviewing! Updated based on all comments. |
Test build #24129 has finished for PR 3588 at commit
|
Documentation: * Added ml-guide.md, linked from mllib-guide.md * Updated mllib-guide.md with small section pointing to ml-guide.md Examples: * CrossValidatorExample * SimpleParamsExample * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md) Bug fixes: * PipelineModel: did not use ParamMaps correctly * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!) CC: mengxr shivaram etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete. Author: Joseph K. Bradley <[email protected]> Author: jkbradley <[email protected]> Author: Xiangrui Meng <[email protected]> Closes #3588 from jkbradley/ml-package-docs and squashes the following commits: d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml c38469c [Joseph K. Bradley] Updated ml-guide with CV examples 99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold. ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs 3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype 41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works. (cherry picked from commit 469a6e5) Signed-off-by: Xiangrui Meng <[email protected]>
LGTM. Merged into master and branch-1.2. Thanks a lot!! |
Documentation:
Examples:
Bug fixes:
CC: @mengxr @shivaram @etrain Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.