[MLLIB] SPARK-5362 (4526, 2372) Gradient and Optimizer to support generic output (instead of label) and data batches #4152

avulanov · 2015-01-22T01:01:40Z

Currently, the Gradient and Optimizer interfaces accept data in the form of RDD[(Double, Vector)], i.e. pairs of label and features. This limits their application to problems with a scalar label, such as classification. For example, an artificial neural network requires a Vector as output (instead of label: Double). Moreover, the current interface does not support data batches. I propose replacing label: Double with output: Vector. This allows passing a generic output instead of a label, and also passing batches of data and outputs stored in the corresponding vectors.
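A minimal sketch of the interface shape described above, not the code in this patch. To keep it self-contained and runnable without Spark, `Array[Double]` stands in for MLlib's `Vector`, and the names `VectorOutputSketch`, `Vec`, and the local `Gradient` trait are hypothetical. The only point it illustrates is that `label: Double` becomes `output: Vec`, so a scalar label is just an output of length 1.

```scala
// Hypothetical, Spark-free sketch: Array[Double] stands in for MLlib's Vector.
object VectorOutputSketch {
  type Vec = Array[Double]

  // Assumed shape of the proposed interface: `output` replaces `label: Double`.
  trait Gradient {
    /** Returns (gradient, loss) for the given data, output, and weights. */
    def compute(data: Vec, output: Vec, weights: Vec): (Vec, Double)
  }

  // Least squares for a single linear unit; the scalar label is a length-1 output.
  object LeastSquaresGradient extends Gradient {
    def compute(data: Vec, output: Vec, weights: Vec): (Vec, Double) = {
      require(output.length == 1, "this sketch handles a single output value")
      require(data.length == weights.length, "data and weights must have equal length")
      // prediction = data . weights
      val prediction = data.zip(weights).map { case (x, w) => x * w }.sum
      val diff = prediction - output(0)
      // gradient = diff * data, loss = diff^2 / 2
      (data.map(_ * diff), 0.5 * diff * diff)
    }
  }
}
```

A model with several outputs, such as a neural network, would implement the same `compute` signature with a longer `output` vector; how batches are packed into `data` and `output` is left open here.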

SparkQA · 2015-01-22T01:58:35Z

Test build #25936 has finished for PR 4152 at commit 43c6fec.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

avulanov · 2015-01-22T17:38:47Z

Some unrelated (streaming kafka) test failed.

jkbradley · 2015-01-25T02:50:03Z

@avulanov @mengxr Do you know how much of a hit we would take if we used a type parameter for the type of data? I'm imagining a Datum type which would be Datum = (Double, Vector) for GLMs and would be Datum = (Vector, Vector) for Neural Networks. This could be nice for unsupervised algorithms using gradient descent, where we might have Datum = Vector. I don't see a reason to limit the generality unless it causes a large drop in performance.
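A rough sketch of what a type-parameterized gradient could look like, under the assumption that `Datum` is simply the type of one training example. It is plain Scala with no Spark dependency; `Array[Double]` stands in for `Vector`, and `DatumSketch`, `Gradient[Datum]`, `LeastSquares`, `SquaredDistance`, and `step` are all hypothetical names, not existing MLlib API. It says nothing about the performance question raised above.

```scala
// Hypothetical sketch of a Datum-parameterized gradient; not MLlib API.
object DatumSketch {
  type Vec = Array[Double]

  // Datum is whatever one training example looks like for a given algorithm.
  trait Gradient[Datum] {
    /** Returns (gradient, loss) for one datum at the given weights. */
    def compute(datum: Datum, weights: Vec): (Vec, Double)
  }

  // GLM-style case: Datum = (label, features).
  object LeastSquares extends Gradient[(Double, Vec)] {
    def compute(datum: (Double, Vec), weights: Vec): (Vec, Double) = {
      val (label, features) = datum
      val diff = features.zip(weights).map { case (x, w) => x * w }.sum - label
      (features.map(_ * diff), 0.5 * diff * diff)
    }
  }

  // Unsupervised case: Datum = Vec. Loss is half the squared distance
  // between the datum and a single center stored in `weights`.
  object SquaredDistance extends Gradient[Vec] {
    def compute(datum: Vec, weights: Vec): (Vec, Double) = {
      val diff = weights.zip(datum).map { case (w, x) => w - x }
      (diff, 0.5 * diff.map(d => d * d).sum)
    }
  }

  // One plain gradient-descent step, written once for any Datum type.
  def step[D](g: Gradient[D], data: Seq[D], weights: Vec, stepSize: Double): Vec = {
    require(data.nonEmpty, "need at least one datum")
    // Sum the per-datum gradients element-wise.
    val total = data
      .map(d => g.compute(d, weights)._1)
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    // Move against the averaged gradient.
    weights.zip(total).map { case (w, t) => w - stepSize * t / data.size }
  }
}
```

With this shape, `step` does not need to know whether a datum carries a label, an output vector, or nothing at all; a `(Vector, Vector)` datum for neural networks would be one more `Gradient` instance.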

avulanov · 2015-01-26T18:57:27Z

@jkbradley +1 for a more generic Datum type. The answer to http://stackoverflow.com/questions/18083696/generic-type-class-instances-performance links to two Scala type conversion benchmarks. As far as I understand, the overhead will not be large, especially compared to the cost of the code that implements the algorithm itself.

srowen · 2015-06-16T20:18:15Z

This is pretty old now -- not clear it's going in. Should it be closed or is there any likelihood of someone taking it up?

avulanov · 2015-06-17T00:37:22Z

@srowen Are you going to support batches and output vectors in a different way? This pull request implements this feature and is applicable to the current Spark version. I can rebase it if you wish.

jkbradley · 2015-07-24T02:16:34Z

We definitely need some improvements and generalizations to optimization, but we also need to discuss the design a bit since there are several generalizations on the table. I'm sorry I haven't had the time to come through on this. I'm afraid it won't happen for 1.5, but I hope it will be possible for 1.6.

If you do close this, please keep the branch so we can re-open it at a later date.

Vector instead of label in gradient and optimizer

43c6fec

avulanov closed this Jul 24, 2015
