
Implement a `half-way' score method or dispatch for BenchmarkResult #1

Open
RaphaelS1 opened this issue Apr 24, 2020 · 5 comments

@RaphaelS1
Contributor

Currently there are two scoring methods in BenchmarkResult:

For some BenchmarkResult object called bmr:

  1. bmr$score() - Returns one aggregated score per resampling fold
  2. bmr$aggregate() - Returns a single score, aggregated over all folds

The problem is that no mid-point is currently supported, due to how measures in mlr3measures are implemented. For example, the final line of logloss is:

-mean(log(p))

The mean is hardcoded into the equation. This is a general problem: it prevents easy support for standard errors and for examining the residual of an individual prediction.

Therefore this issue would depend on a restructuring of scores. One suggestion would be as follows, using logloss as an example:

Have a classif.logloss class with three methods:

  • score - Returns -log(p)
  • aggr - Returns mean(self$score())
  • se - Returns sd(self$score())/sqrt(task$nrow)
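The three-method design above could be sketched roughly as follows. This is only an illustration of the proposal, not the actual mlr3measures API; the class and field names are made up:

```r
library(R6)

# Hypothetical sketch of the proposed three-method design.
MeasureLogloss <- R6Class("MeasureLogloss",
  public = list(
    p = NULL,  # predicted probabilities of the true class
    initialize = function(p) {
      self$p <- p
    },
    # per-observation losses (the "residuals")
    score = function() -log(self$p),
    # point estimate: mean over observations
    aggr = function() mean(self$score()),
    # standard error of the mean loss
    se = function() sd(self$score()) / sqrt(length(self$p))
  )
)

m <- MeasureLogloss$new(p = c(0.9, 0.8, 0.6))
m$score()  # one loss per observation
m$aggr()   # single aggregated logloss
m$se()     # standard error of the aggregate
```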

Alternatively, something like a class with one method with options:

  • self$score(type = "resid") - Returns -log(p) (or maybe type = "response")
  • self$score(type = "aggr") - Returns mean(self$score(type = "resid")) (this would be the default)
  • self$score(type = "se") - Returns sd(self$score(type = "resid"))/sqrt(task$nrow)
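The single-method variant dispatching on a type argument could look roughly like this; the function name and type values are illustrative, not an existing API:

```r
# Hypothetical sketch of the type-dispatch variant for logloss.
score_logloss <- function(p, type = "aggr") {
  resid <- -log(p)  # per-observation losses
  switch(type,
    resid = resid,
    aggr  = mean(resid),
    se    = sd(resid) / sqrt(length(p)),
    stop("unknown type: ", type)
  )
}

score_logloss(c(0.9, 0.8, 0.6))           # default: aggregated score
score_logloss(c(0.9, 0.8, 0.6), "resid")  # per-observation losses
```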

Anyway, this is outside the remit of this package, but it would be required for Wilcoxon pairwise tests and other comparisons that look at all residuals. It is also a decision that probably has to be made by @mllg or @berndbischl.


fkiraly commented Apr 26, 2020

First, a comment about your suggestion: I think it makes a lot of sense, since the individual predictions, loss evaluations, and/or residuals are needed for some post-hoc analyses. Even if these are not directly called by the user, there should be some way to get them via the measure rather than manually.

There are, though, a few issues I think we should discuss:

  • the aggregation functions are hard-coded as methods.
    What if you want to compute the median absolute error?
    That would require a change to the absolute-error class, rather than being easily extensible (as per the generic ML toolbox design).
  • what would you do if you would like to return two closely related aggregations together, e.g., return RMSE together with a standard error estimate?
  • I think (but one might disagree) that the user should be able to provide their own function and have it wrapped as an aggregator.
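The extensibility point in the first bullet could be addressed with a higher-order function that wraps any user-supplied aggregator around a per-sample loss, so the median absolute error needs no new class. A minimal sketch, with made-up names, assuming losses decompose per sample:

```r
# Sketch: turn a per-sample loss into an aggregate measure by wrapping
# an arbitrary user-supplied aggregation function around it.
make_aggregate_measure <- function(sample_loss, aggregator = mean) {
  function(truth, response) aggregator(sample_loss(truth, response))
}

abs_error <- function(truth, response) abs(truth - response)

mae   <- make_aggregate_measure(abs_error, mean)    # mean absolute error
medae <- make_aggregate_measure(abs_error, median)  # median absolute error

mae(c(1, 2, 3), c(1.5, 2, 5))
medae(c(1, 2, 3), c(1.5, 2, 5))
```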


fkiraly commented Apr 26, 2020

Second, I think there is an interesting distinction which we thought about a little with mlaut (which is a little like mlr3benchmark for Python); see also the paper about it.

The problem is: some measures do not first compute individual losses/utilities and then aggregate.
For example, classification AUROC, the concordance index, or F1 cannot be written as an aggregation of individual sample losses; and for sensitivity/specificity or RMSE the aggregation function is odd.

Conceptually, you have aggregate measures (which take the full set of test predictions and observations as input) and sample-level measures (which take a single prediction and observation as input). Formally, you can apply a compositor (the aggregation mode) to the latter to create one of the former.

One question then is: since this is mathematically different, should that not be two different kinds of object?

@RaphaelS1
Contributor Author

> what would you do if you would like to return two closely related aggregations together, e.g., return RMSE together with a standard error estimate?

> I think (but one might disagree) that the user should be able to provide their own function and have it wrapped as an aggregator.

Each measure contains an aggregator field where users can supply their own aggregator (https://mlr3.mlr-org.com/reference/Measure.html). However, this is not an aggregation at the level of individual predictions but an aggregation across folds.
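If I read the linked docs correctly, the fold-level aggregator can be swapped like this (a sketch, assuming the `aggregator` field of `Measure` is settable as documented; the commented call assumes a `bmr` object from a prior benchmark):

```r
library(mlr3)

# The aggregator controls aggregation across resampling folds only,
# not across individual predictions within a fold.
m <- msr("classif.ce")
m$aggregator <- median   # median of the per-fold scores instead of the mean

# bmr$aggregate(m)       # one value per task/learner combination
```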

@mllg
Member

mllg commented Apr 27, 2020

It would be relatively easy to allow measures to return (numeric) vectors and introduce a second aggregation function operating on lists of such vectors. Would that help?
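If I understand the proposal, the two-stage design would look roughly like this sketch (function names are made up; the first stage returns a numeric vector per fold, the second reduces the list of fold vectors):

```r
# Sketch of the two-stage design: per-fold measures return numeric
# vectors of per-observation losses, and a second aggregation function
# operates on the list of such vectors.
score_fold <- function(p) -log(p)  # stage 1: vector of losses per fold

aggregate_folds <- function(scores, fun = mean) {
  fun(unlist(scores))              # stage 2: reduce over all folds
}

folds <- list(c(0.9, 0.8), c(0.7, 0.95))
per_fold <- lapply(folds, score_fold)

aggregate_folds(per_fold)          # overall mean logloss
aggregate_folds(per_fold, median)  # or any other aggregation
```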

@RaphaelS1
Contributor Author

I assume it's easy to implement this as a possible return type, but not to actually go back and change all implemented measures so they no longer automatically return aggregated scores?
