[ENH] design - dealing with incomplete distributions such as predictive survival function estimates #249
Labels
API design
API design & software architecture
module:probability&simulation
probability distributions and simulators
module:regression
probabilistic regression module
module:survival&time-to-event
module for time-to-event prediction aka survival prediction
Design and discussion issue on how to deal with the following:
Some algorithms and packages produce distributional predictions that are incomplete, in the sense that they specify a full predictive distribution almost but not entirely.
This is in tension with the `predict_proba` interface, which states that it returns a full distribution (full as in, fully specified).

Examples of such returns are Kaplan-Meier or conditional survival function (= one minus cdf) estimates, where function evaluations are available only at some points of the prediction range, rather than over the entire range.
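To make "available only at some points" concrete, here is a minimal sketch with made-up numbers. The helper name `eval_survival_step` is hypothetical, and the right-continuous step interpolation between grid points is an assumption of this illustration, not something prescribed by any package. Inside the grid, values can be filled in under that assumption; outside it, the array itself specifies nothing, so the helper returns NaN.

```python
import numpy as np

def eval_survival_step(times, surv, t):
    """Evaluate a survival estimate known only on a grid (hypothetical helper).

    Assumes right-continuous step interpolation between grid points.
    Returns NaN outside [times[0], times[-1]], where the estimate is unspecified.
    """
    times = np.asarray(times, dtype=float)
    surv = np.asarray(surv, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Index of the last grid point <= t (-1 if t lies before the first grid point).
    idx = np.searchsorted(times, t, side="right") - 1
    out = np.full(t.shape, np.nan)
    inside = (idx >= 0) & (t <= times[-1])
    out[inside] = surv[idx[inside]]
    return out

# Made-up grid and survival values for a single instance.
times = [1.0, 2.0, 5.0]
surv = [0.9, 0.6, 0.3]
values = eval_survival_step(times, surv, [0.5, 1.0, 3.0, 5.0, 7.0])
```

The query points 0.5 and 7.0 lie outside the grid, and no choice of interpolation rule alone determines the survival function there.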
A concrete example output, given by both the `scikit-survival` and `lifelines` packages, is a 2D `numpy` array, with one index corresponding to instances in the test/inference set, and the other index corresponding to time points at which the survival function is evaluated. Entries are the predicted survival probabilities for the given instance.

Even if we make the approximating assumption that the predicted distribution is supported only at the time points observed in the training data (i.e., a weighted sum of delta distributions), there are boundary effects which prevent a bijective mapping onto fully specified probability distributions.
For instance, consider predictions where survival is estimated as constant zero or constant one. Here, the survival model makes the reasonable prediction that the instance dies before the first, or survives until after the last, time point in the training data.
Similar boundary effects occur when attempting to map onto an empirical distribution.
These are not severe if the first and last survival probabilities are close to one and zero, respectively, but they become more impactful the further the predictions are from this.
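The boundary effect can be sketched with made-up numbers as follows. The function name `boundary_masses` is hypothetical, and treating the drop between two consecutive time points as the probability mass of that step is an assumption of this sketch, not a fixed convention of any package.

```python
import numpy as np

# Made-up survival predictions: rows = instances, columns = time points.
times = np.array([1.0, 2.0, 5.0])
surv = np.array([
    [0.9, 0.6, 0.3],  # typical case
    [0.0, 0.0, 0.0],  # constant zero: death predicted before the first time point
    [1.0, 1.0, 1.0],  # constant one: survival predicted beyond the last time point
])

def boundary_masses(surv):
    """Split each row's total probability mass into three parts (hypothetical helper).

    left     : mass not covered by the first survival value, 1 - S(t_1)
    interior : drops between consecutive time points, S(t_i) - S(t_{i+1})
    right    : mass remaining at the last time point, S(t_n)
    """
    left = 1.0 - surv[:, 0]
    interior = surv[:, :-1] - surv[:, 1:]
    right = surv[:, -1]
    return left, interior, right

left, interior, right = boundary_masses(surv)
total = left + interior.sum(axis=1) + right  # equals 1 for every row
```

In the second and third rows, all mass sits in a boundary term. Whichever convention is used for the survival function at the grid points (strict or non-strict inequality), at least one of the two boundary terms cannot be placed on the grid without an extra assumption.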
There are multiple questions in this:
`Empirical` distributions, what is the best choice?