Normalization for Game #242

fastier-li · 2017-03-04T00:59:22Z

Implement normalization in GAME. Normalization is already implemented at the algorithmic level, but GAME was missing the configuration of the normalization contexts that the algorithm needs. This PR adds that configuration.

joshvfleming · 2017-03-04T01:10:13Z

photon-lib/src/main/scala/com/linkedin/photon/ml/util/Implicits.scala

+   * @tparam V The type of the values in the Map
+   */
+  implicit class ExtractOrElse[K, V](o: Option[Map[K, V]]) {
+    def extractOrElse(key: K)(f: => V): V = { if (o.isDefined) o.get(key) else f }


More idiomatic:

o.flatMap(_.get(key)).getOrElse(f)

Also, why not just use this form at the call site instead of the wrapper function?

Yes, I'll probably use it in the implementation of extractOrElse. The idea is to make the code simpler to read.

On that note, it might be simpler to read the above than extractOrElse and go hunting for the implicit - how often is it used?

I have 7 usages so far. But note that this is not an invisible implicit that is hard to track down. The source code explicitly mentions extractOrElse. Here is an example:

contextBroadcasts.extractOrElse(featureShardId)(defaultNormalizationContext),

Let's keep this implicit for now.
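For reference, a minimal sketch of the wrapper written with the flatMap/getOrElse form suggested above. The object name, shard ids, and string values are made up for illustration and are not from the PR. One difference worth keeping in mind: in the version from the diff, `o.get(key)` is `Option.get` followed by `Map.apply`, so a key that is missing from a defined map throws `NoSuchElementException`, while the form below falls back to `f` in that case too.

```scala
object ExtractOrElseSketch {

  // Same signature as the implicit class in the diff, but built on flatMap/getOrElse.
  // Falls back to `f` when the Option is empty OR when the key is absent from the Map.
  implicit class ExtractOrElse[K, V](o: Option[Map[K, V]]) {
    def extractOrElse(key: K)(f: => V): V = o.flatMap(_.get(key)).getOrElse(f)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-ins for the broadcast normalization contexts
    val contexts: Option[Map[String, String]] = Some(Map("shardA" -> "standardization"))
    val noContexts: Option[Map[String, String]] = None

    println(contexts.extractOrElse("shardA")("none"))   // key present
    println(contexts.extractOrElse("shardB")("none"))   // key missing: falls back
    println(noContexts.extractOrElse("shardA")("none")) // no map at all: falls back
  }
}
```

Because `f` is a by-name parameter, the default is only evaluated when it is actually needed.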

joshvfleming · 2017-03-04T01:20:46Z

photon-client/src/main/scala/com/linkedin/photon/ml/estimators/GameEstimator.scala

@@ -133,11 +122,13 @@ class GameEstimator(val params: GameParams, val sparkContext: SparkContext, val
      }
    }

-    gameModelsMap
+    GameModelsMap


This looks like a typo: an object reference shouldn't be in PascalCase.

I'll fix. Thanks!

ashelkovnykov

Some of my comments might be obsolete - I went through the commits in order and the later ones override the earlier ones.

I'm not approving the changes yet; there are a lot of changes to the Driver and GameEstimator that I'd like to check locally first.

ashelkovnykov · 2017-03-06T20:04:21Z

photon-api/src/main/scala/com/linkedin/photon/ml/function/glm/DistributedGLMLossFunction.scala

@@ -166,4 +166,11 @@ object DistributedGLMLossFunction {
      case _ => new DistributedGLMLossFunction(singleLossFunction, sparkContext, treeAggregateDepth)
    }
  }
+
+  def apply(


Leaving a comment at @fastier-li's behest:

We don't need both apply and create functions, let's drop create.

Yes, I'll finish that cleanup by removing the create methods. In general, we should use apply for factory methods as much as possible, since it is more concise, but there are limitations (for example, Scala 2.10 rejects two apply overloads that both define default values).
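A small sketch of the limitation mentioned above, assuming a made-up `LossFunction` class (not the real DistributedGLMLossFunction) and made-up parameter names. To my knowledge the Scala 2 compiler accepts several `apply` overloads only as long as at most one of them declares default arguments, so the usual workaround is to keep the defaults on a single overload.

```scala
object FactorySketch {

  // Hypothetical class that is only constructed through its companion's apply methods
  final class LossFunction private (val name: String, val treeAggregateDepth: Int)

  object LossFunction {
    // The only overload that declares a default argument
    def apply(name: String, treeAggregateDepth: Int = 1): LossFunction =
      new LossFunction(name, treeAggregateDepth)

    // A second overload is accepted because it declares no defaults of its own
    def apply(settings: Map[String, String]): LossFunction =
      new LossFunction(settings("name"), settings("treeAggregateDepth").toInt)

    // Adding a default here as well, e.g.
    //   def apply(settings: Map[String, String], strict: Boolean = true): LossFunction
    // is expected to be rejected, since two overloads would then define defaults.
  }

  def main(args: Array[String]): Unit = {
    val a = LossFunction("logistic")
    val b = LossFunction(Map("name" -> "poisson", "treeAggregateDepth" -> "2"))
    println(s"${a.name} ${a.treeAggregateDepth}, ${b.name} ${b.treeAggregateDepth}")
  }
}
```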

ashelkovnykov · 2017-03-06T22:21:21Z

photon-api/src/integTest/scala/com/linkedin/photon/ml/normalization/NormalizationTest.scala

 import com.linkedin.photon.ml.supervised.classification.{BinaryClassifier, LogisticRegressionModel}
 import com.linkedin.photon.ml.test.SparkTestUtils

+import com.linkedin.photon.ml.stat.BasicStatisticalSummary


Don't forget to format the import statements

Ran Optimize Imports on whole photon-ml in IDEA.

ashelkovnykov · 2017-03-06T22:23:47Z

photon-client/src/main/scala/com/linkedin/photon/ml/Types.scala

+  type FeatureShardId = String
+  type CoordinateId = String
+  type IndexMapLoaders = Map[FeatureShardId, IndexMapLoader]
+}


This is a good idea. There might be better class names out there, something like Aliases, AliasTypes, TypeAlias, etc.

ashelkovnykov · 2017-03-06T22:24:49Z

photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/Driver.scala


 /**
- * The driver class, which provides the main entry point to GAME model training.
+ * The Driver class, which drives the training of Game model.
+ * Note: there is a separate Driver to drive the scoring of Game models.


We should use the @note annotation for notes; it stands out if you're using an IDE.

Done. Good idea.

In header comments I switched to @note across the whole code base, but in one-line comments I kept // NOTE uniformly.

Sounds good

ashelkovnykov · 2017-03-06T23:56:03Z

photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/Driver.scala


-    // Write the best model to HDFS
-    bestModel match {
+    if (params.modelOutputMode != ModelOutputMode.NONE) {


Should we perform model selection for NONE output mode? I could see it going either way - since the NONE output mode is used mostly as a debugging tool.

I'm creating a TODO to revisit that later.

ashelkovnykov · 2017-03-07T19:09:10Z

photon-lib/src/main/scala/com/linkedin/photon/ml/util/Implicits.scala

+   * @tparam V The type of the values in the Map
+   */
+  implicit class ExtractOrElse[K, V](o: Option[Map[K, V]]) {
+    def extractOrElse(key: K)(f: => V): V = { if (o.isDefined) o.get(key) else f }


On that note, it might be simpler to read the above than extractOrElse and go hunting for the implicit - how often is it used?

ashelkovnykov · 2017-03-07T19:09:51Z

photon-lib/src/test/scala/com/linkedin/photon/ml/util/ClassUtilsTest.scala

@@ -29,6 +29,7 @@ class ClassUtilsTest {

  @Test
  def testIsAnonClass(): Unit = {
+


Nope. We agreed not to do this for tests.

ashelkovnykov · 2017-03-07T19:16:43Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/estimators/GameEstimatorTest.scala


 /**
 * Integration tests for GameEstimator.
+ *
+ * The test data set here is a subset of the Yahoo! music data set available on the internet.


Either add a link or remove the internet part; I'm sure people will think to look there on their own.

ashelkovnykov · 2017-03-07T19:18:57Z

photon-api/src/main/scala/com/linkedin/photon/ml/util/DefaultIndexMapLoader.scala

@@ -22,7 +22,7 @@ import org.apache.spark.broadcast.Broadcast
 */
 class DefaultIndexMapLoader(sc: SparkContext, featureNameToIdMap: Map[String, Int]) extends IndexMapLoader {

-  @transient
+  @transient // Ensures _indexMap won't be serialized (for performance, it can be big)


Don't forget to remove this (based on the conversation in PR #243).

ashelkovnykov · 2017-03-07T19:20:38Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/estimators/GameEstimatorTest.scala

-    val logger = new PhotonLogger(s"${params.outputDir}/log", sc)
-    val estimator = new GameEstimator(sc, params, logger)
+    val (estimator, logger) = createEstimator(params, "simpleTest")
+    val (nSamples, nDimensions) = (10, 3)


I'd prefer these two each on their own line. I don't like defining multiple vals on one line in Scala unless it's through an unapply method.
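To illustrate the distinction being drawn here, a short sketch with a hypothetical `createPair` helper standing in for `createEstimator`: destructuring a returned tuple goes through the tuple's extractor, while independent constants can simply be declared one per line.

```scala
object ValDefinitionSketch {

  // Hypothetical helper that returns a pair, standing in for createEstimator
  def createPair(name: String): (String, Int) = (name, name.length)

  def main(args: Array[String]): Unit = {
    // Two vals on one line, bound by pattern matching on the returned tuple
    val (label, size) = createPair("simpleTest")

    // Unrelated constants, one definition per line
    val nSamples = 10
    val nDimensions = 3

    println(s"$label $size $nSamples $nDimensions")
  }
}
```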

fastier-li · 2017-03-13T22:18:48Z

Unit and integration tests pass.

ashelkovnykov

Partial review, more to come

ashelkovnykov · 2017-03-14T05:16:12Z

photon-api/src/integTest/scala/com/linkedin/photon/ml/normalization/NormalizationTest.scala

 import com.linkedin.photon.ml.supervised.classification.{BinaryClassifier, LogisticRegressionModel}
 import com.linkedin.photon.ml.test.SparkTestUtils
-
-import com.linkedin.photon.ml.stat.BasicStatisticalSummary
+import com.linkedin.photon.ml.{ModelTraining, TaskType}


Shouldn't this be at the very top of the com.linkedin.photon.ml block?

The character { (ASCII code 7B) sorts after all the letters of the alphabet :-/ I just ran Optimize Imports in IDEA on the whole project with the Spark code style, so let's leave it like that.
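The ordering argument can be checked with a plain lexicographic sort. This sketch only models character-code ordering of the import paths from the diff; it does not reproduce IDEA's actual import optimizer, whose rules are configurable.

```scala
object ImportOrderSketch {

  // Scala's default String ordering compares character codes, so '{' (0x7B)
  // sorts after every ASCII letter, including lowercase 'z' (0x7A).
  def sortImports(imports: Seq[String]): Seq[String] = imports.sorted

  def main(args: Array[String]): Unit = {
    val imports = Seq(
      "com.linkedin.photon.ml.{ModelTraining, TaskType}",
      "com.linkedin.photon.ml.test.SparkTestUtils",
      "com.linkedin.photon.ml.supervised.classification.{BinaryClassifier, LogisticRegressionModel}")

    // The grouped import from the parent package ends up last
    sortImports(imports).foreach(println)
    println('{'.toInt > 'z'.toInt)
  }
}
```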

ashelkovnykov · 2017-03-14T05:20:21Z

photon-api/src/main/scala/com/linkedin/photon/ml/util/DefaultIndexMap.scala

-  }
+
+  /**
+   * Factory to build a default feature index map from a feature names.


Typo: extra article "a".

ashelkovnykov · 2017-03-14T05:23:35Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/cli/game/training/DriverTest.scala

+   * Intercepts are optional in GameEstimator, but GameDriver will setup an intercept by default if
+   * none is specified in GameParams.featureShardIdToInterceptMap.
+   * This happens in GameDriver.prepareFeatureMapsDefault, and there only.
+   */
  @Test
  def testFixedEffectsWithIntercept(): Unit = sparkTest("testFixedEffectsWithIntercept", useKryo = true) {



I'm gonna delete all your whitespace in my next commit, just you watch

Sigh - I believe it is good typographic practice to make titles stand out one way or another, and function signatures are equivalent to titles :-)

ashelkovnykov · 2017-03-14T05:25:06Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/cli/game/training/DriverTest.scala

-          Map("feature-shard-id-to-intercept-map" -> "shard2:false|shard3:true") ++ Map("output-dir" -> outputDir)))
+    runDriver(CommonTestUtils.argArray(randomEffectToyRunArgs() ++
+      Map("feature-shard-id-to-intercept-map" -> "shard2:false|shard3:true") ++
+      Map("output-dir" -> outputDir)))


Lines 256 - 260 should be indented once more

ashelkovnykov · 2017-03-14T05:26:44Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/data/GameConvertersTest.scala

@@ -14,10 +14,10 @@
 */
 package com.linkedin.photon.ml.data

+import org.apache.spark.sql.types._


Shouldn't this be after the following import statement?

t comes before { in ASCII...

I was assuming that Scala would keep them in front, since that's a shorthand for multiple imports from one package, and that package comes first, but I just double-checked on my end and it looks like it doesn't.

Disregard

ashelkovnykov · 2017-03-14T05:42:57Z

photon-lib/src/main/scala/com/linkedin/photon/ml/data/LabeledPoint.scala


 /**
- * Class that represents a labeled data point used for supervised learning.
+ * Class that represents a labeled data point used for supervised learning in Game. It has a couple fields more than


GAME should be fully capitalized when used in the comments

Enforced across code base using IDEA.

ashelkovnykov · 2017-03-14T05:45:13Z

photon-lib/src/main/scala/com/linkedin/photon/ml/data/LabeledPoint.scala

+   *
+   * @return A machine-parsable, space separated string
+   */
+  def toRawString: String = s"$label ${features.toDenseVector.toArray.mkString(", ")}"


Look into making this the definition of toString, adding the Summarizable trait, and moving the current definition of toString to toSummaryString. This would be in line with what other classes do and would avoid a third way of converting objects to strings in Photon-ML.
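A rough sketch of the arrangement suggested here. The `Summarizable` trait below is a hypothetical stand-in (the real Photon-ML trait may have a different shape), and `LabeledPoint` is simplified to hold a plain array instead of a vector type.

```scala
object SummarizableSketch {

  // Hypothetical stand-in for Photon-ML's Summarizable trait
  trait Summarizable {
    def toSummaryString: String
  }

  // Simplified LabeledPoint with only a label and dense features
  final case class LabeledPoint(label: Double, features: Array[Double]) extends Summarizable {

    // Machine-parsable form: label followed by the features, space-separated
    override def toString: String = (label +: features).mkString(" ")

    // Human-readable form for logs
    override def toSummaryString: String =
      s"LabeledPoint(label = $label, ${features.length} features)"
  }

  def main(args: Array[String]): Unit = {
    val point = LabeledPoint(1.0, Array(0.5, 2.0))
    println(point)
    println(point.toSummaryString)
  }
}
```

With this split there are only two string conversions per class: `toString` for machines and `toSummaryString` for people.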

ashelkovnykov · 2017-03-14T05:46:01Z

photon-lib/src/main/scala/com/linkedin/photon/ml/data/LabeledPoint.scala

   * @return
   */
+  def apply(
+      label: Double,
+      features: org.apache.spark.mllib.linalg.Vector,


Use the SparkVector type from Types

Good idea. Done.

ashelkovnykov · 2017-03-14T05:49:22Z

photon-client/src/main/scala/com/linkedin/photon/ml/estimators/GameEstimator.scala

@@ -55,7 +55,7 @@ class GameEstimator(val sc: SparkContext, val params: GameParams, implicit val l
  import GameEstimator._

  // 2 types that makes the code more readable
-  // TODO: Those look like they should be in file Types?
+  // TODO: Should they be in file Types?


Are they used anywhere outside of GameEstimator?

Nope - which is why I put them there. Also, if we put all types under the sun in Types, the code can become less readable in some places :-/

ashelkovnykov · 2017-03-14T05:49:47Z

photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/Driver.scala

-      bestModel: Option[(GAMEModel, EvaluationResults, String)]): Unit =
+    featureShardIdToFeatureMapLoader: Map[String, IndexMapLoader],
+    models: Seq[(GAMEModel, Option[EvaluationResults], String)],
+    bestModel: Option[(GAMEModel, EvaluationResults, String)]): Unit =


This was correct before (this comment applies for all indentation in this file)

Is this one settable in Scala code style in IDEA? In any case, I fixed the file. Thanks!

ashelkovnykov · 2017-03-15T17:20:03Z

photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/scoring/Driver.scala

@@ -79,10 +79,10 @@ class Driver(val params: Params, val sparkContext: SparkContext, val logger: Log
  /**
   * Log some statistics of the GAME data set for debugging purpose.
   *
-   * @param gameDataSet The GAME data set
+   * @param GAMEDataSet The GAME data set


Looks like the param names were accidentally edited when you were doing "Game" -> "GAME" in comments

ashelkovnykov · 2017-03-15T17:48:57Z

photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/Driver.scala


 import org.apache.hadoop.fs.Path
 import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg.Vector


Can be replaced by the type from Types.

ashelkovnykov · 2017-03-15T18:13:20Z

photon-client/src/main/scala/com/linkedin/photon/ml/cli/game/training/Driver.scala

+            s"Model summary:\n${model.toSummaryString}\n\n" +
+            s"Evaluation result is : ${eval.head._2}")
+        case _ =>
+          logger.debug("No best model selection because no validation data was provided")


If I'm reading TapOption right, this case will never be reached:
The preceding expression will always produce Some(x) or None. If it produces None, this function won't be called at all. If it produces Some(x), x will always match the first case.

Correct. Fixed.
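A sketch of the situation described above, under the assumption that TapOption behaves like the hypothetical `tap` below (run a side effect on the wrapped value if there is one, then return the Option unchanged). The real implementation and the actual fix in the PR are not shown in this conversation; handling the empty case outside the tapped function is just one possible arrangement.

```scala
import scala.collection.mutable.ListBuffer

object TapOptionSketch {

  // Hypothetical stand-in for TapOption
  implicit class TapOption[T](o: Option[T]) {
    def tap(f: T => Unit): Option[T] = { o.foreach(f); o }
  }

  // Collects log lines so the behavior can be inspected
  def describe(best: Option[(String, Double)]): List[String] = {
    val log = ListBuffer.empty[String]

    // The function is only invoked for Some(...), and the tuple pattern always matches,
    // so a trailing `case _ =>` inside this block could never run.
    best.tap { case (model, score) => log += s"best model: $model ($score)" }

    // The "nothing selected" message therefore has to live outside the tapped function
    if (best.isEmpty) log += "no best model selection"

    log.toList
  }

  def main(args: Array[String]): Unit = {
    println(describe(Some(("m1", 0.9))))
    println(describe(None))
  }
}
```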

ashelkovnykov · 2017-03-15T18:58:31Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/cli/game/training/DriverTest.scala

-    runDriver(argArray(fixedEffectToyRunArgs() ++
-        Map("output-dir" -> outputDir, "delete-output-dir-if-exists" -> "true")))
+  /**
+   * This test should fail.


Add a quick comment about why it should fail - at first glance I missed it

ashelkovnykov · 2017-03-15T19:03:49Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/cli/game/training/DriverTest.scala

+    assertTrue(S.isDefined)
+    val stats = S.get.toMap
+    assertEquals(stats.size, featureIndexMapLoaders.size)
+    featureIndexMapLoaders.keys.foreach { featureShardId => assertTrue(stats(featureShardId).mean.length > 0) }


Ideally we would check that the summary matches pre-computed "correct" values, though I'm not sure if it makes sense to do here (since Driver is not responsible for computing summaries correctly, just that summaries are computed).

However, this test should check that the summaries are saved to the correct directory

Added. The calculations of the statistics themselves are delegated to spark.ml, and I am adding a fixed unit test for BasicStatisticalSummary (a strange name, to be refactored later: descriptive statistics are summaries by definition).

ashelkovnykov · 2017-03-15T19:14:12Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/estimators/GameEstimatorTest.scala

+
+      val params = fixedAndRandomEffectParams
+      val (estimator, _) = createEstimator(params, "prepareFixedAndRandomEffectTrainingDataSet")
+      val trainingDataSet = estimator.prepareTrainingDataSet(gameDataSet)


Any reason for this block of changes? Usually you like to compress code blocks to be more functional, not less

IMHO it's more readable in this case.

ashelkovnykov · 2017-03-15T19:17:09Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/estimators/GameEstimatorTest.scala

+      createEstimator(params, "simpleTest")._1.fit(trainingData, validationData = None, normalizationContexts)
+
+      val model = models.head._1.getModel(coordinateId).head.asInstanceOf[FixedEffectModel].model
+      for (i <- 0 until 3)


3 should be replaced by nDimensions

ashelkovnykov · 2017-03-15T19:18:32Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/estimators/GameEstimatorTest.scala

+
+      // This example has only a single fixed effect
+      val (coordinateId, featureShardId) = ("global", "features")
+      val labeledPoints: Seq[LabeledPoint] = trivialLabeledPoints()(0)(0)


This isn't particularly important, I'm just curious:

What's the policy on using DataProviders as methods in tests? @joshvfleming

We want to use them - but here I would need two, which is not supported (and this is explained in the comment).

IMO it's fine

ashelkovnykov · 2017-03-15T19:19:07Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/estimators/GameEstimatorTest.scala

+    assertEquals(model.coefficients.means(0), 0.3215554473500486, 1e-12)
+    assertEquals(model.coefficients.means(1), 0.17904355431985355, 1e-12)
+    assertEquals(model.coefficients.means(2), 0.4122241763914806, 1e-12)
+  }


This tests no normalization - will there be similar tests for the various kinds of normalization?

I believe there is one right after?

There is, but it doesn't go into the same depth as this one - it just checks that there are results, not that they match pre-computed values

I'm adding something. But please note that the model coefficients will not normally reflect normalization, as we undo normalization once we have trained a model.
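To make the last point concrete, here is a generic sketch of mapping a linear model trained on normalized features back to the original feature space. It assumes a per-feature transform x' = (x - shift) * factor and made-up class names; it is not Photon-ML's NormalizationContext API. Under that assumption the original-space weights are w_j = w'_j * factor_j, and the intercept absorbs the shifts.

```scala
object UndoNormalizationSketch {

  // Hypothetical per-feature normalization: x' = (x - shift) * factor
  final case class Normalization(factors: Array[Double], shifts: Array[Double]) {
    def transform(x: Array[Double]): Array[Double] =
      x.indices.map(j => (x(j) - shifts(j)) * factors(j)).toArray
  }

  // Linear model: margin = intercept + sum_j weights(j) * x(j)
  final case class LinearModel(weights: Array[Double], intercept: Double) {
    def margin(x: Array[Double]): Double =
      intercept + weights.indices.map(j => weights(j) * x(j)).sum
  }

  // Convert a model trained on normalized features into an equivalent model on raw features
  def toOriginalSpace(trained: LinearModel, norm: Normalization): LinearModel = {
    val weights = trained.weights.indices.map(j => trained.weights(j) * norm.factors(j)).toArray
    val intercept = trained.intercept - weights.indices.map(j => weights(j) * norm.shifts(j)).sum
    LinearModel(weights, intercept)
  }

  def main(args: Array[String]): Unit = {
    val norm = Normalization(factors = Array(0.5, 2.0), shifts = Array(1.0, -3.0))
    val trained = LinearModel(weights = Array(0.4, -0.1), intercept = 0.25)
    val x = Array(3.0, 7.0)

    // Both routes give the same margin, so the exported coefficients need no normalization at scoring time
    val onNormalized = trained.margin(norm.transform(x))
    val onRaw = toOriginalSpace(trained, norm).margin(x)
    println(math.abs(onNormalized - onRaw) < 1e-12)
  }
}
```

This is why a test that compares final coefficients may look the same with and without normalization when the optimizer converges to the same solution: the comparison happens after the transformation back.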

ashelkovnykov · 2017-03-15T19:19:15Z

photon-client/src/integTest/scala/com/linkedin/photon/ml/estimators/GameEstimatorTest.scala

+      "fixed-effect-data-configurations" -> s"$coordinateId:$featureShardId,1",
+      "fixed-effect-optimization-configurations" -> s"$coordinateId:100,1e-11,0.3,1,LBFGS,l2",
+      "updating-sequence" -> coordinateId,
+      //"normalization-type" -> NormalizationType.NONE.toString, // not required


ashelkovnykov

LGTM

(The latest changes are squashed with the previous ones, so I only checked the things that I commented on before)

joshvfleming reviewed Mar 4, 2017

fastier-li force-pushed the normalization branch 3 times, most recently from f15ffdc to a0fbdc8 (March 7, 2017 00:21)

fastier-li self-assigned this Mar 7, 2017

ashelkovnykov reviewed Mar 7, 2017

fastier-li force-pushed the normalization branch 9 times, most recently from 9f21061 to 92de64f (March 13, 2017 18:39)

fastier-li requested a review from li-ashelkov March 13, 2017 18:42

fastier-li force-pushed the normalization branch 4 times, most recently from cbabae0 to ab54364 (March 13, 2017 22:07)

ashelkovnykov reviewed Mar 14, 2017

fastier-li force-pushed the normalization branch 2 times, most recently from d0924f8 to abec2d9 (March 15, 2017 17:51)

ashelkovnykov reviewed Mar 15, 2017

fastier-li force-pushed the normalization branch from abec2d9 to 652ef3f (March 15, 2017 21:19)

fastier-li mentioned this pull request Mar 16, 2017: Add normalization to HessianDiagonalAggregator #184 (Open)

fastier-li force-pushed the normalization branch 2 times, most recently from d975fef to 356645c (March 16, 2017 18:14)

fastier-li force-pushed the normalization branch 2 times, most recently from 0cb8f6b to 10ed2ee (March 16, 2017 21:58)

ashelkovnykov approved these changes Mar 16, 2017

fastier-li added 3 commits March 16, 2017 17:11

Normalization in Game - part 1:

c17c128

- statistics are calculated for the training data in the training Driver
- normalization contexts are set up according to the statistics
- BUT normalization contexts are NOT used yet

Normalization in Game - part 2:

c282d24

- connected normalization in training Driver
- added unit tests
- various cleanups and simplifications

Normalization in Game - part 3:

40e42bf

- unit tests for normalization
- new "unit" test for GameEstimator while testing Game normalization
- small improvement in build files
- small cleanups

fastier-li force-pushed the normalization branch from 10ed2ee to 40e42bf (March 17, 2017 00:15)

fastier-li added the enhancement label Mar 17, 2017

li-ashelkov merged commit 2a0c20d into linkedin:master Mar 17, 2017

fastier-li deleted the normalization branch March 20, 2017 18:12
