
Implement a `half-way' score method or dispatch for BenchmarkResult #1

Open
RaphaelS1 opened this issue Apr 24, 2020 · 5 comments

@RaphaelS1
Contributor

Currently there are two scoring methods in BenchmarkResult:

For some BenchmarkResult object called bmr:

  1. bmr$score() - Returns one aggregated score per resampling fold
  2. bmr$aggregate() - Returns a single score, aggregated over all folds

The problem is that no mid-point is currently supported, due to how measures in mlr3measures are implemented. For example, the final line of logloss is:

-mean(log(p))

The mean is hardcoded into the equation. This is a general problem: it prevents easy support for standard errors and for examining the residual of an individual prediction.

Therefore this issue would depend on a restructuring of scores. One suggestion would be as follows, using logloss as an example:

Have a classif.logloss class with three methods:

  • score - Returns -log(p)
  • aggr - Returns mean(self$score())
  • se - Returns sd(self$score())/sqrt(task$nrow)
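The three-method design above could be sketched roughly as follows. This is only an illustration of the proposal, not the actual mlr3measures API; the class and field names are made up:

```r
library(R6)

# Hypothetical sketch of the proposed three-method design.
MeasureLogloss <- R6Class("MeasureLogloss",
  public = list(
    p = NULL,  # predicted probabilities of the true class
    initialize = function(p) {
      self$p <- p
    },
    # per-observation losses (the "residuals")
    score = function() -log(self$p),
    # point estimate: mean over observations
    aggr = function() mean(self$score()),
    # standard error of the mean loss
    se = function() sd(self$score()) / sqrt(length(self$p))
  )
)

m <- MeasureLogloss$new(p = c(0.9, 0.8, 0.6))
m$score()  # one loss per observation
m$aggr()   # single aggregated logloss
m$se()     # standard error of the aggregate
```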

Alternatively, something like a class with one method with options:

  • self$score(type = "resid") - Returns -log(p) (or maybe type = "response")
  • self$score(type = "aggr") - Returns mean(self$score(type = "resid")) (this would be the default)
  • self$score(type = "se") - Returns sd(self$score(type = "resid"))/sqrt(task$nrow)
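The single-method variant dispatching on a type argument could look roughly like this; the function name and type values are illustrative, not an existing API:

```r
# Hypothetical sketch of the type-dispatch variant for logloss.
score_logloss <- function(p, type = "aggr") {
  resid <- -log(p)  # per-observation losses
  switch(type,
    resid = resid,
    aggr  = mean(resid),
    se    = sd(resid) / sqrt(length(p)),
    stop("unknown type: ", type)
  )
}

score_logloss(c(0.9, 0.8, 0.6))           # default: aggregated score
score_logloss(c(0.9, 0.8, 0.6), "resid")  # per-observation losses
```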

Anyway, this is outside the remit of this package, but it would be required for Wilcoxon pairwise tests and other comparisons that look at all residuals. It is also a decision that probably has to be made by @mllg or @berndbischl.


fkiraly commented Apr 26, 2020

First, a comment about your suggestion: I think it makes a lot of sense, since the individual predictions, loss evaluations, and/or residuals are needed for some post-hoc analyses. Even if these are not directly called by the user, there should be some way to get them via the measure rather than manually.

There are, though, a few issues I think we should discuss:

  • the aggregation functions are hard-coded as methods.
    What if you want to compute the median absolute error?
    That would require a change to the absolute-error class, rather than being easily extensible (as per the generic ML toolbox design).
  • what would you do if you would like to return two closely related aggregations together, e.g., return RMSE together with a standard error estimate?
  • I think (but one might disagree) that the user should be able to provide their own function and have it wrapped as an aggregator.
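The extensibility point in the first bullet could be addressed with a higher-order function that wraps any user-supplied aggregator around a per-sample loss, so the median absolute error needs no new class. A minimal sketch, with made-up names, assuming losses decompose per sample:

```r
# Sketch: turn a per-sample loss into an aggregate measure by wrapping
# an arbitrary user-supplied aggregation function around it.
make_aggregate_measure <- function(sample_loss, aggregator = mean) {
  function(truth, response) aggregator(sample_loss(truth, response))
}

abs_error <- function(truth, response) abs(truth - response)

mae   <- make_aggregate_measure(abs_error, mean)    # mean absolute error
medae <- make_aggregate_measure(abs_error, median)  # median absolute error

mae(c(1, 2, 3), c(1.5, 2, 5))
medae(c(1, 2, 3), c(1.5, 2, 5))
```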


fkiraly commented Apr 26, 2020

Second, I think there is an interesting distinction which we thought about a little with mlaut (which is a little like mlr3benchmark for Python); see also the paper about it.

The problem is: some measures do not first compute individual losses/utilities and then aggregate.
For example, classification AUROC, the concordance index, or F1 cannot be written as an aggregation of individual sample losses; and for sensitivity/specificity or RMSE the aggregation function is odd.

Conceptually, you have aggregate measures (which take the full set of test predictions and observations as input) and sample-level measures (which take a single prediction and observation as input). Formally, you can apply a compositor (the aggregation mode) to the latter to create one of the former.

One question then is: since this is mathematically different, should that not be two different kinds of object?

@RaphaelS1
Contributor Author

> what would you do if you would like to return two closely related aggregations together, e.g., return RMSE together with a standard error estimate?

> I think (but one might disagree) that the user should be able to provide their own function and have it wrapped as an aggregator.

Each measure contains an aggregator field where users can supply their own aggregator (https://mlr3.mlr-org.com/reference/Measure.html). However, this is not an aggregation at the level of individual predictions but an aggregation across folds.
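If I read the linked docs correctly, the fold-level aggregator can be swapped like this (a sketch, assuming the `aggregator` field of `Measure` is settable as documented; the commented call assumes a `bmr` object from a prior benchmark):

```r
library(mlr3)

# The aggregator controls aggregation across resampling folds only,
# not across individual predictions within a fold.
m <- msr("classif.ce")
m$aggregator <- median   # median of the per-fold scores instead of the mean

# bmr$aggregate(m)       # one value per task/learner combination
```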

@mllg
Member

mllg commented Apr 27, 2020

It would be relatively easy to allow measures to return (numeric) vectors and introduce a second aggregation function operating on lists of such vectors. Would that help?
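If I understand the proposal, the two-stage design would look roughly like this sketch (function names are made up; the first stage returns a numeric vector per fold, the second reduces the list of fold vectors):

```r
# Sketch of the two-stage design: per-fold measures return numeric
# vectors of per-observation losses, and a second aggregation function
# operates on the list of such vectors.
score_fold <- function(p) -log(p)  # stage 1: vector of losses per fold

aggregate_folds <- function(scores, fun = mean) {
  fun(unlist(scores))              # stage 2: reduce over all folds
}

folds <- list(c(0.9, 0.8), c(0.7, 0.95))
per_fold <- lapply(folds, score_fold)

aggregate_folds(per_fold)          # overall mean logloss
aggregate_folds(per_fold, median)  # or any other aggregation
```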

@RaphaelS1
Contributor Author

I assume it's easy to implement this as a possible return type, but not to actually go back and change all implemented measures so they no longer automatically return aggregated scores?
