-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalization for Game #242
Conversation
* @tparam V The type of the values in the Map | ||
*/ | ||
implicit class ExtractOrElse[K, V](o: Option[Map[K, V]]) { | ||
def extractOrElse(key: K)(f: => V): V = { if (o.isDefined) o.get(key) else f } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More idiomatic:
o.flatMap(_.get(key)).getOrElse(f)
Also, why not just use this form at the call site instead of the wrapper function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll probably use it in the implementation of extractOrElse
. The idea is to make the code simpler to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On that note, it might be simpler to read the above than extractOrElse
and go hunting for the implicit - how often is it used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 7 usages so far. But note that this is not an invisible implicit that is hard to track down. The source code explicitly mentions extractOrElse
. Here is an example:
contextBroadcasts.extractOrElse(featureShardId)(defaultNormalizationContext),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep this implicit for now.
@@ -133,11 +122,13 @@ class GameEstimator(val params: GameParams, val sparkContext: SparkContext, val | |||
} | |||
} | |||
|
|||
gameModelsMap | |||
GameModelsMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a typo -- shouldn't have pascal case for an object reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix. Thanks!
f15ffdc
to
a0fbdc8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of my comments might be obsolete - I went through the commits in order and the later ones override the earlier ones.
I don't approve the changes yet, there's a lot of changes to the Driver and GameEstimator I'd like to check locally first.
@@ -166,4 +166,11 @@ object DistributedGLMLossFunction { | |||
case _ => new DistributedGLMLossFunction(singleLossFunction, sparkContext, treeAggregateDepth) | |||
} | |||
} | |||
|
|||
def apply( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving a comment at @fastier-li 's behest:
We don't need both apply
and create
functions, let's drop create
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll finish that cleanup by removing the creates. In general, we should use apply
for factory methods as much as possible, as it is more concise, but there are limitations (scala 2.10 doesn't want 2 apply
that define the same default values, for example).
import com.linkedin.photon.ml.supervised.classification.{BinaryClassifier, LogisticRegressionModel} | ||
import com.linkedin.photon.ml.test.SparkTestUtils | ||
|
||
import com.linkedin.photon.ml.stat.BasicStatisticalSummary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to format the import statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran Optimize Imports on whole photon-ml in IDEA.
type FeatureShardId = String | ||
type CoordinateId = String | ||
type IndexMapLoaders = Map[FeatureShardId, IndexMapLoader] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea. Might be better class names out there, something like Aliases
,AliasTypes
, TypeAlias
, etc.
|
||
/** | ||
* The driver class, which provides the main entry point to GAME model training. | ||
* The Driver class, which drives the training of Game model. | ||
* Note: there is a separate Driver to drive the scoring of Game models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the @note
annotation for notes, it stands out if you're using an IDE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In "header comments" I fixed to @note
in the whole code base, but in one liner comments I kept // NOTE
uniformly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
|
||
// Write the best model to HDFS | ||
bestModel match { | ||
if (params.modelOutputMode != ModelOutputMode.NONE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we perform model selection for NONE
output mode? I could see it going either way - since the NONE
output mode is used mostly as a debugging tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm creating a TODO to revisit that later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @tparam V The type of the values in the Map | ||
*/ | ||
implicit class ExtractOrElse[K, V](o: Option[Map[K, V]]) { | ||
def extractOrElse(key: K)(f: => V): V = { if (o.isDefined) o.get(key) else f } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On that note, it might be simpler to read the above than extractOrElse
and go hunting for the implicit - how often is it used?
@@ -29,6 +29,7 @@ class ClassUtilsTest { | |||
|
|||
@Test | |||
def testIsAnonClass(): Unit = { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. We agreed not to do this for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
/** | ||
* Integration tests for GameEstimator. | ||
* | ||
* The test data set here is a subset of the Yahoo! music data set available on the internet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either add a link or remove the internet part, I'm sure people will think to look there on their own
@@ -22,7 +22,7 @@ import org.apache.spark.broadcast.Broadcast | |||
*/ | |||
class DefaultIndexMapLoader(sc: SparkContext, featureNameToIdMap: Map[String, Int]) extends IndexMapLoader { | |||
|
|||
@transient | |||
@transient // Ensures _indexMap won't be serialized (for performance, it can be big) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to remove this (based on conversation in PR #243 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
val logger = new PhotonLogger(s"${params.outputDir}/log", sc) | ||
val estimator = new GameEstimator(sc, params, logger) | ||
val (estimator, logger) = createEstimator(params, "simpleTest") | ||
val (nSamples, nDimensions) = (10, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer these two each on their own line - I don't like defining multiple vals on one line in Scala unless it's through an unapply method
9f21061
to
92de64f
Compare
cbabae0
to
ab54364
Compare
Unit and integration tests pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review, more to come
import com.linkedin.photon.ml.supervised.classification.{BinaryClassifier, LogisticRegressionModel} | ||
import com.linkedin.photon.ml.test.SparkTestUtils | ||
|
||
import com.linkedin.photon.ml.stat.BasicStatisticalSummary | ||
import com.linkedin.photon.ml.{ModelTraining, TaskType} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be at the very top of the com.linkedin.photon.ml
block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASCII code 7B {
comes after all the letters of the alphabet :-/ I just ran Optimize Imports in IDEA on the whole project, with the Spark CodeStyle - let's leave it like that.
} | ||
|
||
/** | ||
* Factory to build a default feature index map from a feature names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, extra article a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
* Intercepts are optional in GameEstimator, but GameDriver will setup an intercept by default if | ||
* none is specified in GameParams.featureShardIdToInterceptMap. | ||
* This happens in GameDriver.prepareFeatureMapsDefault, and there only. | ||
*/ | ||
@Test | ||
def testFixedEffectsWithIntercept(): Unit = sparkTest("testFixedEffectsWithIntercept", useKryo = true) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm gonna delete all your whitespace in my next commit, just you watch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sigh - I believe it is good typographic practice to make titles stand out one way or another, and function signatures are equivalent to titles :-)
Map("feature-shard-id-to-intercept-map" -> "shard2:false|shard3:true") ++ Map("output-dir" -> outputDir))) | ||
runDriver(CommonTestUtils.argArray(randomEffectToyRunArgs() ++ | ||
Map("feature-shard-id-to-intercept-map" -> "shard2:false|shard3:true") ++ | ||
Map("output-dir" -> outputDir))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 256 - 260 should be indented once more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -14,10 +14,10 @@ | |||
*/ | |||
package com.linkedin.photon.ml.data | |||
|
|||
import org.apache.spark.sql.types._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be after the following import statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t
is before {
in ASCII code...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was assuming that Scala would keep them in front, since that's a shorthand for multiple imports from one package, and that package comes first, but I just double-checked on my end and it looks like it doesn't.
Disregard
|
||
/** | ||
* Class that represents a labeled data point used for supervised learning. | ||
* Class that represents a labeled data point used for supervised learning in Game. It has a couple fields more than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GAME should be fully capitalized when used in the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enforced across code base using IDEA.
* | ||
* @return A machine-parsable, space separated string | ||
*/ | ||
def toRawString: String = s"$label ${features.toDenseVector.toArray.mkString(", ")}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look into making this be the definition of toString
, and adding the Summarizable
trait and changing the current definition of toString
to be the definition of toSummaryString
. This would be in line with what other classes do, and avoid a third way of converting objects to strings in Photon-ML
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @return | ||
*/ | ||
def apply( | ||
label: Double, | ||
features: org.apache.spark.mllib.linalg.Vector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the SparkVector
type from Types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Done.
@@ -55,7 +55,7 @@ class GameEstimator(val sc: SparkContext, val params: GameParams, implicit val l | |||
import GameEstimator._ | |||
|
|||
// 2 types that makes the code more readable | |||
// TODO: Those look like they should be in file Types? | |||
// TODO: Should they be in file Types? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they used anywhere outside of GameEstimator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope - which is why I put them there. Also, if we put all types under the sun in Types
, the code can become less readable in some places :-/
bestModel: Option[(GAMEModel, EvaluationResults, String)]): Unit = | ||
featureShardIdToFeatureMapLoader: Map[String, IndexMapLoader], | ||
models: Seq[(GAMEModel, Option[EvaluationResults], String)], | ||
bestModel: Option[(GAMEModel, EvaluationResults, String)]): Unit = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was correct before (this comment applies for all indentation in this file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this one settable in Scala code style in IDEA? In any case, I fixed the file. Thanks!
d0924f8
to
abec2d9
Compare
@@ -79,10 +79,10 @@ class Driver(val params: Params, val sparkContext: SparkContext, val logger: Log | |||
/** | |||
* Log some statistics of the GAME data set for debugging purpose. | |||
* | |||
* @param gameDataSet The GAME data set | |||
* @param GAMEDataSet The GAME data set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the param names were accidentally edited when you were doing "Game" -> "GAME" in comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes :-(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
import org.apache.hadoop.fs.Path | ||
import org.apache.spark.SparkContext | ||
import org.apache.spark.mllib.linalg.Vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be replaced by class from Types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
s"Model summary:\n${model.toSummaryString}\n\n" + | ||
s"Evaluation result is : ${eval.head._2}") | ||
case _ => | ||
logger.debug("No best model selection because no validation data was provided") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm reading TapOption
right, this line will never be called:
The preceding expression will always produce Some(x)
or None
. If it produces None
, this function won't be called at all. If it produces Some(x)
, x
will always match the first case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Fixed.
runDriver(argArray(fixedEffectToyRunArgs() ++ | ||
Map("output-dir" -> outputDir, "delete-output-dir-if-exists" -> "true"))) | ||
/** | ||
* This test should fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a quick comment about why it should fail - at first glance I missed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
assertTrue(S.isDefined) | ||
val stats = S.get.toMap | ||
assertEquals(stats.size, featureIndexMapLoaders.size) | ||
featureIndexMapLoaders.keys.foreach { featureShardId => assertTrue(stats(featureShardId).mean.length > 0) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would check that the summary matches pre-computed "correct" values, though I'm not sure if it makes sense to do here (since Driver
is not responsible for computing summaries correctly, just that summaries are computed).
However, this test should check that the summaries are saved to the correct directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. For the calculations of the statistics itself, they are delegated to spark.ml and I am adding fixed unit test for BasicStatisticalSummary
(strange name to be refactored later: descriptive statistics are summaries by definition).
|
||
val params = fixedAndRandomEffectParams | ||
val (estimator, _) = createEstimator(params, "prepareFixedAndRandomEffectTrainingDataSet") | ||
val trainingDataSet = estimator.prepareTrainingDataSet(gameDataSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for this block of changes? Usually you like to compress code blocks to be more functional, not less
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO more readable here in this case.
createEstimator(params, "simpleTest")._1.fit(trainingData, validationData = None, normalizationContexts) | ||
|
||
val model = models.head._1.getModel(coordinateId).head.asInstanceOf[FixedEffectModel].model | ||
for (i <- 0 until 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3
should be replaced by nDimensions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
// This example has only a single fixed effect | ||
val (coordinateId, featureShardId) = ("global", "features") | ||
val labeledPoints: Seq[LabeledPoint] = trivialLabeledPoints()(0)(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't particularly important, I'm just curious:
What's the policy on using DataProviders
as methods in tests? @joshvfleming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to use them - but here I would need two, which is not supported (and this is explained in the comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's fine
assertEquals(model.coefficients.means(0), 0.3215554473500486, 1e-12) | ||
assertEquals(model.coefficients.means(1), 0.17904355431985355, 1e-12) | ||
assertEquals(model.coefficients.means(2), 0.4122241763914806, 1e-12) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tests no normalization - will there be similar tests for the various kinds of normalization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there is one right after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is, but it doesn't go into the same depth as this one - it just checks that there are results, not that they match pre-computed values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding something. But please note that the model coefficients will not normally reflect normalization, as we undo normalization once we have trained a model.
"fixed-effect-data-configurations" -> s"$coordinateId:$featureShardId,1", | ||
"fixed-effect-optimization-configurations" -> s"$coordinateId:100,1e-11,0.3,1,LBFGS,l2", | ||
"updating-sequence" -> coordinateId, | ||
//"normalization-type" -> NormalizationType.NONE.toString, // not required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
abec2d9
to
652ef3f
Compare
d975fef
to
356645c
Compare
0cb8f6b
to
10ed2ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
(The latest changes are crushed with the previous ones, so I only checked things that I commented on before)
- statistics are calculated for the training data in the training Driver - normalization contexts are set up according to the statistics - BUT normalization contexts are NOT used yet.
- connected normalization in training Driver - added unit tests - various cleanups and simplifications
- unit tests for normalization - new "unit" test for GameEstimator while testing Game normalization - small improvement in build files - Small cleanups
10ed2ee
to
40e42bf
Compare
Implement normalization in GAME. Normalization is already implemented at the algorithmic level, but GAME was missing the configuration of the normalization contexts that are needed by the algorithm. This PR implements that.