
Basics

Note: Some of this overlaps with my general blog on evals

What are automated benchmarks?

Automated benchmarks usually work the following way: you'd like to know how well your model performs on something. This something can be a well-defined, concrete task, such as How well can my model classify spam from non-spam emails?, or a more abstract and general capability, such as How good is my model at math?.

From this, you construct an evaluation, using:

  • a dataset, made of samples.
    • These samples contain an input for the model, sometimes coupled with a reference (called gold) against which to compare the model's output.
    • Samples are usually designed to emulate what you want to test the model on: for example, if you are looking at email classification, you create a dataset of spam and non-spam emails, trying to include some hard edge cases, etc.
  • a metric.
    • The metric is a way to score your model. Example: how accurately can your model classify spam (score of a well-classified sample = 1, of a badly classified one = 0).
    • Metrics use your model's outputs to do this scoring. In the case of LLMs, people mostly consider two kinds of outputs (both are sketched in the example after this list):
      • the text generated by the model following the input (generative evaluation)
      • the log-probability of one or several sequences provided to the model (multiple-choice evaluations, sometimes called MCQA, or perplexity evaluations)
      • For more info on this, you should check out the Model inference and evaluation page.
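
As a concrete illustration, here is a minimal sketch of both output types on a spam-classification sample. It assumes the transformers and torch libraries and uses a small causal LM ("gpt2") purely for illustration; the prompt, the answer choices, and the sequence_logprob helper are hypothetical examples, not a prescribed implementation.

```python
# Minimal sketch of the two kinds of model outputs used for scoring,
# assuming a small causal LM ("gpt2") purely for illustration.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Is the following email spam? 'You won a free cruise, click here!' Answer:"

# 1) Generative evaluation: let the model produce text after the input,
#    then compare that text to the gold reference with a metric.
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=5, do_sample=False)
generated_text = tokenizer.decode(generated[0, inputs["input_ids"].shape[1]:])
print("Generated continuation:", generated_text)

# 2) Log-probability (multiple-choice) evaluation: score each candidate
#    continuation by summing the log-probabilities the model assigns to its
#    tokens, then pick the highest-scoring choice.
def sequence_logprob(prompt: str, continuation: str) -> float:
    full = tokenizer(prompt + continuation, return_tensors="pt")["input_ids"]
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(full).logits
    log_probs = F.log_softmax(logits, dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    total = 0.0
    for i in range(prompt_len, full.shape[1]):
        total += log_probs[0, i - 1, full[0, i]].item()
    return total

choices = [" spam", " not spam"]
scores = {c: sequence_logprob(prompt, c) for c in choices}
print("Choice log-probabilities:", scores)
print("Selected choice:", max(scores, key=scores.get))
```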

This is more interesting to do on data that the model has never been exposed to before (data absent from the model's training set), because you want to test whether it generalizes well: for example, whether it can classify spam emails about 'health' products after having seen only spam emails about fake banks.

Note: A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be overfitting. Much like a student who learned the test questions by heart without understanding the topic, a model evaluated on data already present in its training set is being scored on capabilities it does not actually possess.

Pros and cons of using automated benchmarks

Automated benchmarks have the following advantages:

  • Consistency and reproducibility: You can run the same automated benchmark 10 times on the same model and get the same results (barring variations in hardware or inherent model randomness). This means you can easily create fair rankings of models for a given task.
  • Scale at limited cost: They are one of the cheapest ways to evaluate models at the moment.
  • Understandability: Most automated metrics are very easy to understand. E.g., an exact match will tell you whether the generated text matches the reference perfectly, and an accuracy score will tell you in how many cases the selected choice was the correct one (this is a bit less true for metrics such as BLEU or ROUGE). A short sketch of both follows this list.
  • Dataset quality: A number of automated benchmarks use expert-generated datasets or pre-existing high-quality data (like MMLU or MATH). However, this does not mean these datasets are perfect: for MMLU, several errors were later identified in the samples, ranging from parsing issues to outright nonsensical questions, leading to the creation of several follow-up datasets, like MMLU-Pro and MMLU-Redux.
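
As an illustration of the understandability point above, here is a minimal sketch of exact match and accuracy; the predictions, references, and function names are hypothetical examples.

```python
# Minimal sketch of two easily interpretable metrics: exact match for
# generative outputs and accuracy for multiple-choice outputs.

def exact_match(prediction: str, reference: str) -> int:
    # 1 if the (normalized) generated text matches the gold reference exactly.
    return int(prediction.strip().lower() == reference.strip().lower())

def accuracy(selected_choices: list[str], gold_choices: list[str]) -> float:
    # Fraction of samples where the selected choice is the correct one.
    correct = sum(s == g for s, g in zip(selected_choices, gold_choices))
    return correct / len(gold_choices)

print(exact_match("Spam", "spam"))                 # 1
print(accuracy(["spam", "spam", "not spam"],
               ["spam", "not spam", "not spam"]))  # 0.666...
```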

However, they also present the following limitations:

  • Reduced use on more complex tasks: Automated benchmarks work well for tasks where performance is easy to define and assess (for example, classification). More complex capabilities, on the other hand, are harder to decompose into well-defined, precise tasks. For example, what does "good at math" mean? Being good at arithmetic? At logic? Being able to reason about new mathematical concepts? This led to the use of more generalist evaluations, which no longer decompose capabilities into sub-tasks but instead assume that general performance will be a good proxy for what we aim to measure.
  • Contamination: Once a dataset is published publicly in plain text, it will end up in model training datasets. This means that, when scoring a model, you have no guarantee that it has not already seen the evaluation data during training (a simple overlap-based check is sketched below).
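
One way to detect (though not prevent) contamination is to look for overlap between evaluation samples and the training corpus, when you have access to it. Below is a minimal sketch of such a check based on shared n-grams; the function names, n-gram size, threshold, and data are hypothetical, and real checks typically run against the full pretraining corpus.

```python
# Minimal sketch of an n-gram overlap contamination check.
# Flags evaluation samples whose n-grams heavily overlap with training documents.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_sample: str, training_docs: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    sample_ngrams = ngrams(eval_sample, n)
    if not sample_ngrams:
        return False
    train_ngrams = set().union(*(ngrams(doc, n) for doc in training_docs))
    # Fraction of the sample's n-grams already seen in the training data.
    overlap = len(sample_ngrams & train_ngrams) / len(sample_ngrams)
    return overlap >= threshold

# Hypothetical data, for illustration only.
training_docs = ["a large collection of web documents used for pretraining"]
eval_sample = "Is the following email spam? 'You won a free cruise, click here!'"
print(is_contaminated(eval_sample, training_docs))
```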