Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalization for Game #242

Merged
merged 3 commits into from
Mar 17, 2017
Merged

Conversation

fastier-li
Copy link
Member

@fastier-li fastier-li commented Mar 4, 2017

Implement normalization in GAME. Normalization is already implemented at the algorithmic level, but GAME was missing the configuration of the normalization contexts that are needed by the algorithm. This PR implements that.

* @tparam V The type of the values in the Map
*/
implicit class ExtractOrElse[K, V](o: Option[Map[K, V]]) {
def extractOrElse(key: K)(f: => V): V = { if (o.isDefined) o.get(key) else f }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More idiomatic:

o.flatMap(_.get(key)).getOrElse(f)

Also, why not just use this form at the call site instead of the wrapper function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll probably use it in the implementation of extractOrElse. The idea is to make the code simpler to read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that note, it might be simpler to read the above than extractOrElse and go hunting for the implicit - how often is it used?

Copy link
Member Author

@fastier-li fastier-li Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 7 usages so far. But note that this is not an invisible implicit that is hard to track down. The source code explicitly mentions extractOrElse. Here is an example:

contextBroadcasts.extractOrElse(featureShardId)(defaultNormalizationContext),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this implicit for now.

@@ -133,11 +122,13 @@ class GameEstimator(val params: GameParams, val sparkContext: SparkContext, val
}
}

gameModelsMap
GameModelsMap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a typo -- shouldn't have pascal case for an object reference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix. Thanks!

@fastier-li fastier-li force-pushed the normalization branch 3 times, most recently from f15ffdc to a0fbdc8 Compare March 7, 2017 00:21
@fastier-li fastier-li self-assigned this Mar 7, 2017
Copy link
Contributor

@ashelkovnykov ashelkovnykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of my comments might be obsolete - I went through the commits in order and the later ones override the earlier ones.

I don't approve the changes yet, there's a lot of changes to the Driver and GameEstimator I'd like to check locally first.

@@ -166,4 +166,11 @@ object DistributedGLMLossFunction {
case _ => new DistributedGLMLossFunction(singleLossFunction, sparkContext, treeAggregateDepth)
}
}

def apply(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a comment at @fastier-li 's behest:

We don't need both apply and create functions, let's drop create.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll finish that cleanup by removing the creates. In general, we should use apply for factory methods as much as possible, as it is more concise, but there are limitations (scala 2.10 doesn't want 2 apply that define the same default values, for example).

import com.linkedin.photon.ml.supervised.classification.{BinaryClassifier, LogisticRegressionModel}
import com.linkedin.photon.ml.test.SparkTestUtils

import com.linkedin.photon.ml.stat.BasicStatisticalSummary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to format the import statements

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran Optimize Imports on whole photon-ml in IDEA.

type FeatureShardId = String
type CoordinateId = String
type IndexMapLoaders = Map[FeatureShardId, IndexMapLoader]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea. Might be better class names out there, something like Aliases,AliasTypes, TypeAlias, etc.


/**
* The driver class, which provides the main entry point to GAME model training.
* The Driver class, which drives the training of Game model.
* Note: there is a separate Driver to drive the scoring of Game models.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the @note annotation for notes, it stands out if you're using an IDE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In "header comments" I fixed to @note in the whole code base, but in one liner comments I kept // NOTE uniformly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good


// Write the best model to HDFS
bestModel match {
if (params.modelOutputMode != ModelOutputMode.NONE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we perform model selection for NONE output mode? I could see it going either way - since the NONE output mode is used mostly as a debugging tool.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm creating a TODO to revisit that later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @tparam V The type of the values in the Map
*/
implicit class ExtractOrElse[K, V](o: Option[Map[K, V]]) {
def extractOrElse(key: K)(f: => V): V = { if (o.isDefined) o.get(key) else f }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that note, it might be simpler to read the above than extractOrElse and go hunting for the implicit - how often is it used?

@@ -29,6 +29,7 @@ class ClassUtilsTest {

@Test
def testIsAnonClass(): Unit = {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. We agreed not to do this for tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/**
* Integration tests for GameEstimator.
*
* The test data set here is a subset of the Yahoo! music data set available on the internet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either add a link or remove the internet part, I'm sure people will think to look there on their own

@@ -22,7 +22,7 @@ import org.apache.spark.broadcast.Broadcast
*/
class DefaultIndexMapLoader(sc: SparkContext, featureNameToIdMap: Map[String, Int]) extends IndexMapLoader {

@transient
@transient // Ensures _indexMap won't be serialized (for performance, it can be big)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to remove this (based on conversation in PR #243 )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

val logger = new PhotonLogger(s"${params.outputDir}/log", sc)
val estimator = new GameEstimator(sc, params, logger)
val (estimator, logger) = createEstimator(params, "simpleTest")
val (nSamples, nDimensions) = (10, 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer these two each on their own line - I don't like defining multiple vals on one line in Scala unless it's through an unapply method

@fastier-li fastier-li force-pushed the normalization branch 9 times, most recently from 9f21061 to 92de64f Compare March 13, 2017 18:39
@fastier-li fastier-li requested a review from li-ashelkov March 13, 2017 18:42
@fastier-li fastier-li force-pushed the normalization branch 4 times, most recently from cbabae0 to ab54364 Compare March 13, 2017 22:07
@fastier-li
Copy link
Member Author

Unit and integration tests pass.

Copy link
Contributor

@ashelkovnykov ashelkovnykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, more to come

import com.linkedin.photon.ml.supervised.classification.{BinaryClassifier, LogisticRegressionModel}
import com.linkedin.photon.ml.test.SparkTestUtils

import com.linkedin.photon.ml.stat.BasicStatisticalSummary
import com.linkedin.photon.ml.{ModelTraining, TaskType}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be at the very top of the com.linkedin.photon.ml block?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASCII code 7B { comes after all the letters of the alphabet :-/ I just ran Optimize Imports in IDEA on the whole project, with the Spark CodeStyle - let's leave it like that.

}

/**
* Factory to build a default feature index map from a feature names.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, extra article a

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

* Intercepts are optional in GameEstimator, but GameDriver will setup an intercept by default if
* none is specified in GameParams.featureShardIdToInterceptMap.
* This happens in GameDriver.prepareFeatureMapsDefault, and there only.
*/
@Test
def testFixedEffectsWithIntercept(): Unit = sparkTest("testFixedEffectsWithIntercept", useKryo = true) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna delete all your whitespace in my next commit, just you watch

Copy link
Member Author

@fastier-li fastier-li Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh - I believe it is good typographic practice to make titles stand out one way or another, and function signatures are equivalent to titles :-)

Map("feature-shard-id-to-intercept-map" -> "shard2:false|shard3:true") ++ Map("output-dir" -> outputDir)))
runDriver(CommonTestUtils.argArray(randomEffectToyRunArgs() ++
Map("feature-shard-id-to-intercept-map" -> "shard2:false|shard3:true") ++
Map("output-dir" -> outputDir)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 256 - 260 should be indented once more

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -14,10 +14,10 @@
*/
package com.linkedin.photon.ml.data

import org.apache.spark.sql.types._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be after the following import statement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t is before { in ASCII code...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was assuming that Scala would keep them in front, since that's a shorthand for multiple imports from one package, and that package comes first, but I just double-checked on my end and it looks like it doesn't.

Disregard


/**
* Class that represents a labeled data point used for supervised learning.
* Class that represents a labeled data point used for supervised learning in Game. It has a couple fields more than
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GAME should be fully capitalized when used in the comments

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enforced across code base using IDEA.

*
* @return A machine-parsable, space separated string
*/
def toRawString: String = s"$label ${features.toDenseVector.toArray.mkString(", ")}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look into making this be the definition of toString, and adding the Summarizable trait and changing the current definition of toString to be the definition of toSummaryString. This would be in line with what other classes do, and avoid a third way of converting objects to strings in Photon-ML

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @return
*/
def apply(
label: Double,
features: org.apache.spark.mllib.linalg.Vector,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the SparkVector type from Types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

@@ -55,7 +55,7 @@ class GameEstimator(val sc: SparkContext, val params: GameParams, implicit val l
import GameEstimator._

// 2 types that makes the code more readable
// TODO: Those look like they should be in file Types?
// TODO: Should they be in file Types?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they used anywhere outside of GameEstimator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope - which is why I put them there. Also, if we put all types under the sun in Types, the code can become less readable in some places :-/

bestModel: Option[(GAMEModel, EvaluationResults, String)]): Unit =
featureShardIdToFeatureMapLoader: Map[String, IndexMapLoader],
models: Seq[(GAMEModel, Option[EvaluationResults], String)],
bestModel: Option[(GAMEModel, EvaluationResults, String)]): Unit =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was correct before (this comment applies for all indentation in this file)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this one settable in Scala code style in IDEA? In any case, I fixed the file. Thanks!

@fastier-li fastier-li force-pushed the normalization branch 2 times, most recently from d0924f8 to abec2d9 Compare March 15, 2017 17:51
@@ -79,10 +79,10 @@ class Driver(val params: Params, val sparkContext: SparkContext, val logger: Log
/**
* Log some statistics of the GAME data set for debugging purpose.
*
* @param gameDataSet The GAME data set
* @param GAMEDataSet The GAME data set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the param names were accidentally edited when you were doing "Game" -> "GAME" in comments

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :-(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be replaced by class from Types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

s"Model summary:\n${model.toSummaryString}\n\n" +
s"Evaluation result is : ${eval.head._2}")
case _ =>
logger.debug("No best model selection because no validation data was provided")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading TapOption right, this line will never be called:
The preceding expression will always produce Some(x) or None. If it produces None, this function won't be called at all. If it produces Some(x), x will always match the first case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Fixed.

runDriver(argArray(fixedEffectToyRunArgs() ++
Map("output-dir" -> outputDir, "delete-output-dir-if-exists" -> "true")))
/**
* This test should fail.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a quick comment about why it should fail - at first glance I missed it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

assertTrue(S.isDefined)
val stats = S.get.toMap
assertEquals(stats.size, featureIndexMapLoaders.size)
featureIndexMapLoaders.keys.foreach { featureShardId => assertTrue(stats(featureShardId).mean.length > 0) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would check that the summary matches pre-computed "correct" values, though I'm not sure if it makes sense to do here (since Driver is not responsible for computing summaries correctly, just that summaries are computed).

However, this test should check that the summaries are saved to the correct directory

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. For the calculations of the statistics itself, they are delegated to spark.ml and I am adding fixed unit test for BasicStatisticalSummary (strange name to be refactored later: descriptive statistics are summaries by definition).


val params = fixedAndRandomEffectParams
val (estimator, _) = createEstimator(params, "prepareFixedAndRandomEffectTrainingDataSet")
val trainingDataSet = estimator.prepareTrainingDataSet(gameDataSet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this block of changes? Usually you like to compress code blocks to be more functional, not less

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO more readable here in this case.

createEstimator(params, "simpleTest")._1.fit(trainingData, validationData = None, normalizationContexts)

val model = models.head._1.getModel(coordinateId).head.asInstanceOf[FixedEffectModel].model
for (i <- 0 until 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 should be replaced by nDimensions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


// This example has only a single fixed effect
val (coordinateId, featureShardId) = ("global", "features")
val labeledPoints: Seq[LabeledPoint] = trivialLabeledPoints()(0)(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't particularly important, I'm just curious:

What's the policy on using DataProviders as methods in tests? @joshvfleming

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to use them - but here I would need two, which is not supported (and this is explained in the comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's fine

assertEquals(model.coefficients.means(0), 0.3215554473500486, 1e-12)
assertEquals(model.coefficients.means(1), 0.17904355431985355, 1e-12)
assertEquals(model.coefficients.means(2), 0.4122241763914806, 1e-12)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests no normalization - will there be similar tests for the various kinds of normalization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is one right after?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is, but it doesn't go into the same depth as this one - it just checks that there are results, not that they match pre-computed values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding something. But please note that the model coefficients will not normally reflect normalization, as we undo normalization once we have trained a model.

"fixed-effect-data-configurations" -> s"$coordinateId:$featureShardId,1",
"fixed-effect-optimization-configurations" -> s"$coordinateId:100,1e-11,0.3,1,LBFGS,l2",
"updating-sequence" -> coordinateId,
//"normalization-type" -> NormalizationType.NONE.toString, // not required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@fastier-li fastier-li force-pushed the normalization branch 2 times, most recently from 0cb8f6b to 10ed2ee Compare March 16, 2017 21:58
Copy link
Contributor

@ashelkovnykov ashelkovnykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

(The latest changes are crushed with the previous ones, so I only checked things that I commented on before)

 - statistics are calculated for the training data in the training Driver
 - normalization contexts are set up according to the statistics
 - BUT normalization contexts are NOT used yet.
 - connected normalization in training Driver
 - added unit tests
 - various cleanups and simplifications
 - unit tests for normalization
 - new "unit" test for GameEstimator while testing Game normalization
 - small improvement in build files
 - Small cleanups
@li-ashelkov li-ashelkov merged commit 2a0c20d into linkedin:master Mar 17, 2017
@fastier-li fastier-li deleted the normalization branch March 20, 2017 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants