[MLLIB] SPARK-5362 (4526, 2372) Gradient and Optimizer to support generic output (instead of label) and data batches #4152

Closed
avulanov wants to merge 1 commit


avulanov
Contributor

Currently, the Gradient and Optimizer interfaces accept data as RDD[(Double, Vector)], i.e. (label, features) pairs. This limits their application to problems with a scalar label, such as classification. For example, an artificial neural network needs a Vector as output (instead of label: Double). Moreover, the current interface does not support data batches. I propose replacing label: Double with output: Vector. This allows passing a generic output instead of a label, and also passing batches of data and outputs stored in the corresponding vectors.
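
To make the proposal concrete, here is a minimal sketch of what a gradient with a vector-valued output could look like. It is not the code from this patch: `VectorOutputGradient`, `MultiOutputLeastSquares` and the row-major weight layout are hypothetical names and choices made for illustration, and `Array[Double]` stands in for MLlib's `Vector` so the snippet runs without Spark on the classpath.

```scala
// Hypothetical sketch, not the patch itself.
// Assumption: Array[Double] stands in for MLlib's Vector.
object GenericOutputSketch {
  type Vec = Array[Double]

  // A gradient whose target is a vector (output) rather than a scalar label.
  trait VectorOutputGradient {
    /** Returns (gradient, loss) for a single example. */
    def compute(data: Vec, output: Vec, weights: Vec): (Vec, Double)
  }

  // Least squares for a linear map with k outputs and n inputs.
  // Assumption: weights are stored row-major as a flat array of k * n entries.
  object MultiOutputLeastSquares extends VectorOutputGradient {
    def compute(data: Vec, output: Vec, weights: Vec): (Vec, Double) = {
      val n = data.length
      val k = output.length
      require(weights.length == n * k, "weights must have k * n entries")
      val grad = new Array[Double](n * k)
      var loss = 0.0
      for (i <- 0 until k) {
        // Prediction for output component i.
        var pred = 0.0
        for (j <- 0 until n) pred += weights(i * n + j) * data(j)
        val diff = pred - output(i)
        loss += 0.5 * diff * diff
        // Gradient of 0.5 * diff^2 with respect to row i of the weights.
        for (j <- 0 until n) grad(i * n + j) = diff * data(j)
      }
      (grad, loss)
    }
  }
}
```

With `k = 1` this reduces to the scalar-label case, so under these assumptions the existing (label, features) use would be a special case of the vector-output interface.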

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25936 has finished for PR 4152 at commit 43c6fec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@avulanov
Contributor Author

An unrelated test (streaming Kafka) failed.

@jkbradley
Member

@avulanov @mengxr Do you know how much of a hit we would take if we used a type parameter for the type of data? I'm imagining a Datum type which would be Datum = (Double, Vector) for GLMs and would be Datum = (Vector, Vector) for Neural Networks. This could be nice for unsupervised algorithms using gradient descent, where we might have Datum = Vector. I don't see a reason to limit the generality unless it causes a large drop in performance.
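
The type-parameter idea can be sketched roughly as follows. This is only an illustration of the suggestion, not an existing MLlib API: the `Gradient[Datum]` trait, the two example gradients and the `step` helper are hypothetical, `Array[Double]` stands in for MLlib's `Vector`, and a plain `Seq` stands in for an RDD of data.

```scala
// Hypothetical sketch of a Datum-parameterized gradient; not an MLlib API.
// Assumption: Array[Double] stands in for Vector, Seq for RDD.
object DatumSketch {
  type Vec = Array[Double]

  trait Gradient[Datum] {
    /** Returns (gradient, loss) for one datum. */
    def compute(datum: Datum, weights: Vec): (Vec, Double)
  }

  // GLM-style case: Datum = (Double, Vector), i.e. (label, features).
  object LeastSquares extends Gradient[(Double, Vec)] {
    def compute(datum: (Double, Vec), weights: Vec): (Vec, Double) = {
      val (label, features) = datum
      val pred = features.zip(weights).map { case (x, w) => x * w }.sum
      val diff = pred - label
      (features.map(_ * diff), 0.5 * diff * diff)
    }
  }

  // Unsupervised case: Datum = Vector. The weights act as a centre that
  // is pulled towards each point (squared Euclidean distance loss).
  object SquaredDistance extends Gradient[Vec] {
    def compute(datum: Vec, weights: Vec): (Vec, Double) = {
      val diff = weights.zip(datum).map { case (w, x) => w - x }
      (diff, 0.5 * diff.map(d => d * d).sum)
    }
  }

  // One full-batch gradient-descent step, written once for any Datum.
  def step[D](g: Gradient[D], data: Seq[D], weights: Vec, stepSize: Double): Vec = {
    val gradSum = data
      .map(d => g.compute(d, weights)._1)
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    weights.zip(gradSum).map { case (w, s) => w - stepSize * s / data.size }
  }
}
```

A neural network would plug in as `Gradient[(Vec, Vec)]` in the same way. The optimizer code (`step` here) never inspects the datum, which is what would let one implementation serve supervised and unsupervised algorithms. Whether the generic version costs anything measurable is the open question raised above and is not answered by this sketch.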

@avulanov
Contributor Author

@jkbradley +1 for a more generic Datum type. The answer to http://stackoverflow.com/questions/18083696/generic-type-class-instances-performance links to two Scala type conversion benchmarks. As far as I understand, the overhead will not be large, especially compared to the cost of the code that implements the algorithm itself.

@srowen
Member

srowen commented Jun 16, 2015

This is pretty old now, and it's not clear that it's going in. Should it be closed, or is there any likelihood of someone taking it up?

@avulanov
Contributor Author

@srowen Are you going to support batches and output vectors in a different way? This pull request implements this feature and is applicable to the current Spark version. I can rebase it if you wish.

@jkbradley
Member

We definitely need some improvements and generalizations to optimization, but we also need to discuss the design a bit since there are several generalizations on the table. I'm sorry I haven't had the time to come through on this. I'm afraid it won't happen for 1.5, but I hope it will be possible for 1.6.

If you do close this, please keep the branch so we can re-open it at a later date.

@avulanov avulanov closed this Jul 24, 2015