Using a model based on current data to estimate the position of data in a table. It is an index, with some known bound on error.
- Often a continuous function, can include neural-networks, regression, [[Piecewise Linear Function]].
- Used as a type of [[Main Memory Indexing]]
- Models are large, inference is slower than existing indexes (e.g. [[B+ Tree]]).
- Overfitting is avoided, but in [[Learned Indexes]] overfitting is beneficial.
- guarantee on lookup time
- update time guarantee
- can handle all data types & data distributions
- well-studied, available on many data processing systems, and some with hardware support
- Considers the pattern of available data
- Lookup time guarantee on read-only data (distribution does not change, error does not change)
- Updates are difficult (distribution changes, error bound changes)
- Retraining for updates is slow
- Not all key types work
- Not yet applied to multidimensional data
If a prediction has large error for key
- we can record the error from lookup on a few reference keys, store it, and then use it to correct our predictions.
For example the MAB+ tree (Model Assisted [[B+ Tree]])
- uses a CDF model to predict the location of a first result
$key \mapsto \widehat{pos}(key)$ -
Corrective Model used to eliminate local bias from prediction,
$\widehat{pos} \mapsto \widehat{pos}'$ - Larger nodes used (can better predict location within a node), the model and correction metadata is embedded in each node along side the values, and pointers.