[MLLIB] SPARK-2329 Add multi-label evaluation metrics #1270
Conversation
…eraged by docs, micro and per-class precision and recall averaged by class
… macro measures, bunch of tests
Can one of the admins verify this patch?
```scala
 * @return Accuracy.
 */
lazy val accuracy = predictionAndLabels.map { case (predictions, labels) =>
  labels.intersect(predictions).size.toDouble / labels.union(predictions).size}.
```
As the intersection is computed multiple times in different metrics, how about taking it out so it is only calculated once?
Do you suggest extracting "labels.intersect(predictions).size" as a lazy val? Will it then be calculated only once? The operation is performed on a Scala Set (not on an RDD). Another option might be to store all intermediate calculations (including the intersection) used in the six different measures in an RDD. In that case, I would need to fold over a six-element tuple, which will look kind of scary, but it would be the most efficient way to compute everything at once.
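A minimal sketch of the idea under discussion (names are illustrative, and plain Scala collections stand in for the RDD records): compute the intersection size once per record, then derive the measures from the cached counts instead of repeating the set operation.

```scala
// Illustrative stand-in for the RDD records: (predictions, labels) pairs.
val records = Seq(
  (Set(0.0, 1.0), Set(0.0, 2.0)),
  (Set(1.0), Set(1.0))
)

// Compute the intersection size once per record; every measure that needs it
// reuses the count instead of recomputing the set operation.
val counts = records.map { case (predictions, labels) =>
  val tp = labels.intersect(predictions).size
  (tp, predictions.size, labels.size)
}

// Accuracy as in the snippet above: |intersection| / |union| averaged over
// documents, using |union| = |predictions| + |labels| - |intersection|.
val accuracy = counts.map { case (tp, p, l) =>
  tp.toDouble / (p + l - tp)
}.sum / records.size
```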
@avulanov Cool Alexander. Are you working on a multi-label classifier? We are expecting a multi-class multi-label classifier. I'm planning to implement MultiBoost.MH on Spark, not sure if you've already started working on it.

@BaiGang Thanks! I'm implementing the decomposition of multiclass and multilabel problems into binary classification problems that can be solved with built-in MLlib classifiers. I use one-vs-one and one-vs-all approaches. As far as I understand, MultiBoost.MH is a C++ implementation of AdaBoost.MH, and the latter uses another kind of problem decomposition in addition to boosting. So our efforts are complementary. Let's stay in touch. Btw, I would be glad to benchmark your classifier on the classification tasks that I'm solving.

@avulanov Thanks Alexander! I just started to implement the base learner. The algorithms described in the MultiBoost document and the paper are straightforward, but it will take some effort to implement them optimally on Apache Spark. I will keep you notified when I get the skeletons up.
Jenkins, test this please.

QA tests have started for PR 1270. This patch merges cleanly.

QA results for PR 1270:

@mengxr I've fixed the Scala style
```scala
 * Evaluator for multilabel classification.
 * @param predictionAndLabels an RDD of (predictions, labels) pairs, both are non-null sets.
 */
class MultilabelMetrics(predictionAndLabels: RDD[(Set[Double], Set[Double])]) {
```
Another feasible representation of predictions/labels is mllib.linalg.Vector. It's basically a Vector of +1s and -1s, either dense or sparse. So it would be great if we added another function to do the transformation.

It's up to you. Transforming the data outside this evaluation module is also OK. : )
RDD[(Set[Double], Set[Double])] may be hard for Java users. We can ask users to input RDD[(Array[Double], Array[Double])], requiring that the labels are ordered. It is easier for Java users and faster to compute intersection and other set operations.
@mengxr We need to ensure that they don't contain repeating elements as well. It should be an optional constructor, I think.
Both Set and Double are Scala types. It is very hard for Java users to construct such RDDs. Also, the input labels and output predictions are usually stored as Array[Double]. Shall we change the input to RDD[(Array[Double], Array[Double])], internally convert it to RDD[(Set[Double], Set[Double])], and cache? We can put a contract that both labels and predictions are unique and ordered within a single instance. We don't need it if we use Set internally, but later we can switch to an Array[Double]-based solution for speed, because those are very small arrays.
@mengxr Can we have RDD[(java.util.HashSet[Double], java.util.HashSet[Double])] as an optional constructor? Internally, we will use scala.collection.JavaConversions.asScalaSet.
@avulanov Let's think about what is most natural for the input data to a multi-label classifier and the output from the model it produces. They should match the input type here, so we can chain them easily. If we use either a Java or a Scala Set, we are going to have compatibility issues on the other side. Also, a set stores small boxed objects, which increases GC pressure. These are the reasons I recommend using Array[Double].
this is ok to test

test this please

QA tests have started for PR 1270 at commit
@avulanov Sorry for getting back late! For the implementation, shall we define an aggregator and then compute all the necessary information in a single pass, instead of triggering a job for each metric? For the metric names, I think our reference is Mining Multi-label Data, and we should follow the naming there.
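The single-pass aggregator could look roughly like the following sketch (names are illustrative; foldLeft over a Seq stands in for RDD.aggregate on the real data):

```scala
// Per-document counts folded into one accumulator, so the micro-averaged
// metrics come out of a single pass instead of one job per metric.
case class Counts(docs: Long, tp: Long, fp: Long, fn: Long)

val data = Seq(
  (Set(0.0, 1.0), Set(0.0, 2.0)), // (predictions, labels)
  (Set(1.0), Set(1.0))
)

val totals = data.foldLeft(Counts(0, 0, 0, 0)) { case (c, (predictions, labels)) =>
  val tp = labels.intersect(predictions).size
  Counts(c.docs + 1, c.tp + tp,
    c.fp + predictions.size - tp, c.fn + labels.size - tp)
}

// Micro-averaged precision and recall fall out of the same accumulator.
val microPrecision = totals.tp.toDouble / (totals.tp + totals.fp)
val microRecall = totals.tp.toDouble / (totals.tp + totals.fn)
```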
QA tests have finished for PR 1270 at commit

@mengxr Thank you for the comments!

…rns the list of labels

QA tests have started for PR 1270 at commit

QA tests have finished for PR 1270 at commit
```scala
 * Returns Hamming-loss
 */
lazy val hammingLoss: Double = (predictionAndLabels.map { case (predictions, labels) =>
  labels.diff(predictions).size + predictions.diff(labels).size}.
```
This may be faster: labels.size + predictions.size - 2 * labels.intersect(predictions).size
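The suggested identity can be checked on a small example (assuming duplicate-free label sets): |labels \ predictions| + |predictions \ labels| = |labels| + |predictions| - 2 * |labels ∩ predictions|.

```scala
val labels = Set(0.0, 1.0, 3.0)
val predictions = Set(1.0, 2.0)

// Original formulation: labels missed plus labels wrongly predicted.
val viaDiffs = labels.diff(predictions).size + predictions.diff(labels).size

// Suggested formulation: one intersection instead of two set differences.
val viaIntersection =
  labels.size + predictions.size - 2 * labels.intersect(predictions).size
```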
```scala
private lazy val numDocs: Long = predictionAndLabels.count

private lazy val numLabels: Long = predictionAndLabels.flatMap { case (_, labels) =>
  labels}.distinct.count
```
predictionAndLabels.values.flatMap(l => l).distinct().count()
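A plain-Scala model of what the suggested one-liner computes (a Seq of pairs stands in for the pair RDD, where `values` selects the second element of each pair; names are illustrative): take the label side of each record, flatten, and count distinct labels.

```scala
val predictionAndLabels = Seq(
  (Set(0.0), Set(0.0, 1.0)), // (predictions, labels)
  (Set(2.0), Set(1.0, 2.0))
)

// Equivalent of predictionAndLabels.values.flatMap(l => l).distinct().count()
// on a pair RDD: only the labels contribute to the distinct-label count.
val numLabels = predictionAndLabels.map(_._2).flatMap(l => l).distinct.size
```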
@mengxr Could you elaborate on this?
QA tests have started for PR 1270 at commit

QA tests have finished for PR 1270 at commit
@avulanov Just want to check with you on the input type. I still feel …
@avulanov We are close to the feature freeze deadline. Do you plan to update the PR? If you are busy, do you mind me taking it over? Thanks!
Welcome to the bay area! The deadline (soft for MLlib) is this Saturday. But since the only thing that needs to change is the input type, it should be trivial to update. Using …
@mengxr Thanks! I've replaced Set with Array, fixed two functions that didn't pass tests (due to union working differently on Arrays), and added a note that the Arrays must have unique elements.
Test build #22624 has started for PR 1270 at commit

Test build #22624 has finished for PR 1270 at commit

Test FAILed.

test this please

Test build #22659 has started for PR 1270 at commit

Test build #22659 has finished for PR 1270 at commit

Test PASSed.

LGTM. Merged into master. Thanks!

@mengxr Thank you!
Implementation of various multi-label classification measures, including: Hamming-loss, strict and default accuracy, macro-averaged precision, recall, and F1-measure based on documents and labels, and micro-averaged measures: https://issues.apache.org/jira/browse/SPARK-2329

Multi-class measures are currently in the following pull request: #1155