Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7407][MLLIB] use uid + name to identify parameters #6019

Closed
wants to merge 22 commits into from

Conversation

mengxr
Copy link
Contributor

@mengxr mengxr commented May 8, 2015

A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved parent as a mutable field.

This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

@jkbradley

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 8, 2015

Test build #32277 has started for PR 6019 at commit a4794dd.

@SparkQA
Copy link

SparkQA commented May 8, 2015

Test build #32277 has finished for PR 6019 at commit a4794dd.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32277/
Test FAILed.

@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build started.

@SparkQA
Copy link

SparkQA commented May 9, 2015

Test build #32285 has started for PR 6019 at commit aa4a611.

@SparkQA
Copy link

SparkQA commented May 9, 2015

Test build #32285 has finished for PR 6019 at commit aa4a611.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class Pipeline(override val uid: String) extends Estimator[PipelineModel]
    • final class DecisionTreeClassifier(override val uid: String)
    • final class GBTClassifier(override val uid: String)
    • class LogisticRegression(override val uid: String)
    • final class RandomForestClassifier(override val uid: String)
    • class BinaryClassificationEvaluator(override val uid: String)
    • final class Binarizer(override val uid: String)
    • class ElementwiseProduct(override val uid: String)
    • class HashingTF(override val uid: String) extends UnaryTransformer[Iterable[_], Vector, HashingTF]
    • final class IDF(override val uid: String) extends Estimator[IDFModel] with IDFBase
    • class Normalizer(override val uid: String) extends UnaryTransformer[Vector, Vector, Normalizer]
    • class OneHotEncoder(override val uid: String)
    • class PolynomialExpansion(override val uid: String)
    • class StandardScaler(override val uid: String) extends Estimator[StandardScalerModel]
    • class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel]
    • class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[String], Tokenizer]
    • class RegexTokenizer(override val uid: String)
    • class VectorAssembler(override val uid: String)
    • class VectorIndexer(override val uid: String) extends Estimator[VectorIndexerModel]
    • final class Word2Vec(override val uid: String) extends Estimator[Word2VecModel] with Word2VecBase
    • class Param[T](val parent: String, val name: String, val doc: String, val isValid: T => Boolean)
    • class DoubleParam(parent: String, name: String, doc: String, isValid: Double => Boolean)
    • class IntParam(parent: String, name: String, doc: String, isValid: Int => Boolean)
    • class FloatParam(parent: String, name: String, doc: String, isValid: Float => Boolean)
    • class LongParam(parent: String, name: String, doc: String, isValid: Long => Boolean)
    • class BooleanParam(parent: String, name: String, doc: String) // No need for isValid
    • class ALS(override val uid: String) extends Estimator[ALSModel] with ALSParams
    • final class DecisionTreeRegressor(override val uid: String)
    • final class GBTRegressor(override val uid: String)
    • class LinearRegression(override val uid: String)
    • final class RandomForestRegressor(override val uid: String)
    • class CrossValidator(override val uid: String) extends Estimator[CrossValidatorModel]
    • trait Identifiable

@AmplabJenkins
Copy link

Build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32285/
Test FAILed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 9, 2015

Test build #32306 has started for PR 6019 at commit 629d402.

@SparkQA
Copy link

SparkQA commented May 9, 2015

Test build #32306 has finished for PR 6019 at commit 629d402.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Pipeline(override val uid: String) extends Estimator[PipelineModel]
    • final class DecisionTreeClassifier(override val uid: String)
    • final class GBTClassifier(override val uid: String)
    • class LogisticRegression(override val uid: String)
    • final class RandomForestClassifier(override val uid: String)
    • class BinaryClassificationEvaluator(override val uid: String)
    • final class Binarizer(override val uid: String)
    • class ElementwiseProduct(override val uid: String)
    • class HashingTF(override val uid: String) extends UnaryTransformer[Iterable[_], Vector, HashingTF]
    • final class IDF(override val uid: String) extends Estimator[IDFModel] with IDFBase
    • class Normalizer(override val uid: String) extends UnaryTransformer[Vector, Vector, Normalizer]
    • class OneHotEncoder(override val uid: String)
    • class PolynomialExpansion(override val uid: String)
    • class StandardScaler(override val uid: String) extends Estimator[StandardScalerModel]
    • class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel]
    • class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[String], Tokenizer]
    • class RegexTokenizer(override val uid: String)
    • class VectorAssembler(override val uid: String)
    • class VectorIndexer(override val uid: String) extends Estimator[VectorIndexerModel]
    • final class Word2Vec(override val uid: String) extends Estimator[Word2VecModel] with Word2VecBase
    • class Param[T](val parent: String, val name: String, val doc: String, val isValid: T => Boolean)
    • class DoubleParam(parent: String, name: String, doc: String, isValid: Double => Boolean)
    • class IntParam(parent: String, name: String, doc: String, isValid: Int => Boolean)
    • class FloatParam(parent: String, name: String, doc: String, isValid: Float => Boolean)
    • class LongParam(parent: String, name: String, doc: String, isValid: Long => Boolean)
    • class BooleanParam(parent: String, name: String, doc: String) // No need for isValid
    • class ALS(override val uid: String) extends Estimator[ALSModel] with ALSParams
    • final class DecisionTreeRegressor(override val uid: String)
    • final class GBTRegressor(override val uid: String)
    • class LinearRegression(override val uid: String)
    • final class RandomForestRegressor(override val uid: String)
    • class CrossValidator(override val uid: String) extends Estimator[CrossValidatorModel]
    • trait Identifiable

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32306/
Test PASSed.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32594 has started for PR 6019 at commit 2569168.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32594 has finished for PR 6019 at commit 2569168.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Pipeline(override val uid: String) extends Estimator[PipelineModel]
    • final class DecisionTreeClassifier(override val uid: String)
    • final class GBTClassifier(override val uid: String)
    • class LogisticRegression(override val uid: String)
    • final class OneVsRest(override val uid: String)
    • final class RandomForestClassifier(override val uid: String)
    • class BinaryClassificationEvaluator(override val uid: String)
    • final class Binarizer(override val uid: String)
    • final class Bucketizer(override val uid: String)
    • class ElementwiseProduct(override val uid: String)
    • class HashingTF(override val uid: String) extends UnaryTransformer[Iterable[_], Vector, HashingTF]
    • final class IDF(override val uid: String) extends Estimator[IDFModel] with IDFBase
    • class Normalizer(override val uid: String) extends UnaryTransformer[Vector, Vector, Normalizer]
    • class OneHotEncoder(override val uid: String)
    • class PolynomialExpansion(override val uid: String)
    • class StandardScaler(override val uid: String) extends Estimator[StandardScalerModel]
    • class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel]
    • class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[String], Tokenizer]
    • class RegexTokenizer(override val uid: String)
    • class VectorAssembler(override val uid: String)
    • class VectorIndexer(override val uid: String) extends Estimator[VectorIndexerModel]
    • final class Word2Vec(override val uid: String) extends Estimator[Word2VecModel] with Word2VecBase
    • class Param[T](val parent: String, val name: String, val doc: String, val isValid: T => Boolean)
    • class DoubleParam(parent: String, name: String, doc: String, isValid: Double => Boolean)
    • class IntParam(parent: String, name: String, doc: String, isValid: Int => Boolean)
    • class FloatParam(parent: String, name: String, doc: String, isValid: Float => Boolean)
    • class LongParam(parent: String, name: String, doc: String, isValid: Long => Boolean)
    • class BooleanParam(parent: String, name: String, doc: String) // No need for isValid
    • class ALS(override val uid: String) extends Estimator[ALSModel] with ALSParams
    • final class DecisionTreeRegressor(override val uid: String)
    • final class GBTRegressor(override val uid: String)
    • class LinearRegression(override val uid: String)
    • final class RandomForestRegressor(override val uid: String)
    • class CrossValidator(override val uid: String) extends Estimator[CrossValidatorModel]
    • trait Identifiable

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32594/
Test PASSed.

@@ -17,6 +17,7 @@

package org.apache.spark.examples.ml

import org.apache.spark.ml.util.Identifiable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Organize imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jkbradley
Copy link
Member

@mengxr A bunch of small comments, but other than those, it looks good. At some point, it might be nice to add a test suite helper method which does standard checks to make sure models are handled correctly. E.g.: take (Estimator, Model, paramMap passed to fit()), and check UID, parent, and parameter values copied to the model. I just made a JIRA for that: [https://issues.apache.org/jira/browse/SPARK-7609]

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 14, 2015

Test build #32681 has started for PR 6019 at commit c4c8120.

@SparkQA
Copy link

SparkQA commented May 14, 2015

Test build #32681 has finished for PR 6019 at commit c4c8120.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32681/
Test PASSed.

asfgit pushed a commit that referenced this pull request May 14, 2015
A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` as a mutable field.

This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

jkbradley

Author: Xiangrui Meng <[email protected]>

Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:

c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable

(cherry picked from commit 1b8625f)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in 1b8625f May 14, 2015
@mengxr
Copy link
Contributor Author

mengxr commented May 14, 2015

Merged into master and branch-1.4. I will send another PR for SPARK-7609.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` as a mutable field.

This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

jkbradley

Author: Xiangrui Meng <[email protected]>

Closes apache#6019 from mengxr/SPARK-7407 and squashes the following commits:

c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` as a mutable field.

This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

jkbradley

Author: Xiangrui Meng <[email protected]>

Closes apache#6019 from mengxr/SPARK-7407 and squashes the following commits:

c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
A param instance is strongly attached to an parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and moved `parent` as a mutable field.

This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

jkbradley

Author: Xiangrui Meng <[email protected]>

Closes apache#6019 from mengxr/SPARK-7407 and squashes the following commits:

c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
520f0a2 [Xiangrui Meng] address comments
2569168 [Xiangrui Meng] fix tests
873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
409ea08 [Xiangrui Meng] minor updates
83a163c [Xiangrui Meng] update JavaDeveloperApiExample
5db5325 [Xiangrui Meng] update OneVsRest
7bde7ae [Xiangrui Meng] merge master
697fdf9 [Xiangrui Meng] update Bucketizer
7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
629d402 [Xiangrui Meng] fix LRSuite
154516f [Xiangrui Meng] merge master
aa4a611 [Xiangrui Meng] fix examples/compile
a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
fdbc415 [Xiangrui Meng] all tests passed
c255f17 [Xiangrui Meng] fix tests in ParamsSuite
818e1db [Xiangrui Meng] merge master
e1160cf [Xiangrui Meng] fix tests
fbc39f0 [Xiangrui Meng] pass test:compile
108937e [Xiangrui Meng] pass compile
8726d39 [Xiangrui Meng] use parent uid in Param
eaeed35 [Xiangrui Meng] update Identifiable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants