[SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes #3588

jkbradley · 2014-12-03T21:27:31Z

Documentation:

* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:

* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:

* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to @mengxr for that fix!)

CC: @mengxr @shivaram @etrain

Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

SparkQA · 2014-12-03T22:54:49Z

Test build #24103 has finished for PR 3588 at commit c38469c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaCrossValidatorExample
- public class JavaSimpleParamsExample

mengxr · 2014-12-04T01:04:09Z

mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala

-    transformSchema(dataset.schema, paramMap, logging = true)
-    stages.foldLeft(dataset)((cur, transformer) => transformer.transform(cur, paramMap))
+    // Precedence of ParamMaps: paramMap > this.paramMap > fittingParamMap
+    val map = (fittingParamMap ++ this.paramMap) ++ fittingParamMap


I don't quite get the logic here. this.paramMap contains only the parameters of the Pipeline instance, and the input paramMap is not included. fittingParamMap is kept for record-keeping purposes. All relevant parameters should be inherited from the parent algorithm into model.paramMap.

jkbradley · 2014-12-04

Right, I think this was the wrong way to fix the bug.

The bug was: the model was trained with hashingTF.numFeatures features (10, 100, or 1000), but when PipelineModel.transform() was called, HashingTF used the default numFeatures.

I should probably fix it by changing Pipeline.fit() to merge all relevant parameters from the paramMap passed to fit() into the transformers. (The params are currently stored in Models but not in other Transformers.) I'll make that change.
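For readers following the thread, here is a minimal sketch of why that mismatch matters. It does not use Spark: `hashTerms` is a toy stand-in for a hashing term-frequency transformer, and `DefaultNumFeatures` is a hypothetical default picked only for illustration. The point is that features hashed with one `numFeatures` at fit time and another at transform time live in different index spaces.

```scala
// Toy stand-in for a hashing term-frequency transformer; NOT Spark's HashingTF.
object NumFeaturesMismatchSketch {
  // Hypothetical default, chosen only for this illustration.
  val DefaultNumFeatures: Int = 1 << 18

  /** Map each term to a bucket in [0, numFeatures) and count occurrences. */
  def hashTerms(terms: Seq[String], numFeatures: Int): Map[Int, Int] =
    terms
      .groupBy(term => java.lang.Math.floorMod(term.hashCode, numFeatures))
      .map { case (bucket, ts) => bucket -> ts.size }

  def main(args: Array[String]): Unit = {
    val doc = Seq("spark", "ml", "pipeline", "spark")

    // Fitting used the tuned value from the ParamMap passed to fit()...
    val trainFeatures = hashTerms(doc, numFeatures = 10)
    // ...but transform() fell back to the default, so the same terms land in
    // buckets from a much larger index space than the model was trained on.
    val testFeatures = hashTerms(doc, numFeatures = DefaultNumFeatures)

    println(s"max index at fit time:       ${trainFeatures.keys.max}")
    println(s"max index at transform time: ${testFeatures.keys.max}")
  }
}
```

With `numFeatures = 10` every index is below 10, while the default-sized hashing can produce indices far outside that range, which is the kind of inconsistency described in the comment above.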

mengxr · 2014-12-04

Thanks! I see the issue now. Could we create a separate PR? This is not a blocker for the release, and it might need some discussion.

mengxr · 2014-12-04

Oh, I think it was just a typo. The comment is correct. It should be:

val map = (fittingParamMap ++ this.paramMap) ++ paramMap

jkbradley · 2014-12-04

Whoops, thanks!
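The difference between the two expressions can be sketched with plain Scala maps. This is an illustration only: `Map[String, Int]` stands in for `ParamMap`, the parameter names and values are made up, and it assumes `ParamMap`'s `++` follows the same rule as Scala's `Map`, where entries from the right operand overwrite those on the left (which is what the precedence comment in the diff implies).

```scala
object ParamPrecedenceSketch {
  // Plain Maps stand in for ParamMap here; keys are hypothetical parameter names.
  type Params = Map[String, Int]

  val fittingParamMap: Params = Map("numFeatures" -> 100, "maxIter" -> 10)
  val thisParamMap: Params    = Map("maxIter" -> 20)
  val paramMap: Params        = Map("numFeatures" -> 1000)

  // As first written in the diff: the caller's paramMap is never consulted,
  // and fittingParamMap is applied last, so it overrides this.paramMap.
  val typo: Params = (fittingParamMap ++ thisParamMap) ++ fittingParamMap

  // As corrected in review: paramMap > this.paramMap > fittingParamMap.
  val fixed: Params = (fittingParamMap ++ thisParamMap) ++ paramMap

  def main(args: Array[String]): Unit = {
    println(s"typo:  $typo")
    println(s"fixed: $fixed")
  }
}
```

In the `typo` version the result is identical to `fittingParamMap` for every key it defines, so neither `this.paramMap` nor the caller's `paramMap` has any effect on those keys; in the `fixed` version the caller's value for `numFeatures` wins.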

jkbradley · 2014-12-04T07:00:39Z

@mengxr Thanks for reviewing! Updated based on all comments.

SparkQA · 2014-12-04T08:26:36Z

Test build #24129 has finished for PR 3588 at commit d393b5c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaCrossValidatorExample
- public class JavaSimpleParamsExample

Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram etrain

Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <[email protected]>
Author: jkbradley <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.

(cherry picked from commit 469a6e5)
Signed-off-by: Xiangrui Meng <[email protected]>

mengxr · 2014-12-04T09:02:10Z

LGTM. Merged into master and branch-1.2. Thanks a lot!!

jkbradley and others added 5 commits December 2, 2014 14:59

Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version. CrossValidatorExample not working yet. Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.

41ad9b1

replace TypeTag with explicit datatype

3b83ec0

Merge pull request #4 from mengxr/ml-package-docs

replace TypeTag with explicit datatype

ea34dc6

Fixed bug in PipelineModel.transform* with usage of params. Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.

99f88c2

Updated ml-guide with CV examples

c38469c

jkbradley changed the title ~~[SPARK-4575] [mllib] spark.ml pipelines doc + bug fixes~~ [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes Dec 4, 2014

mengxr reviewed Dec 4, 2014

fixed bug in Pipeline (typo from last commit). updated examples for CV and Params for spark.ml

d393b5c

asfgit closed this in 469a6e5 Dec 4, 2014

jkbradley deleted the ml-package-docs branch July 25, 2016 20:35
