If the task you are interested in is already well studied, chances are that a dataset exists for it.
Below are a number of evaluation datasets which were developed in the last few years.
However, be careful:
- Some of them may be obsolete: they were designed pre-LLM and are now easily solved, as they aimed to investigate one specific property of text (translation, summarization), which is no longer really how we evaluate models (evaluations are now more general/holistic). (If you've got some bandwidth, this could really benefit from the addition of publication dates!) (This will also be updated with post-LLM evals at some point.)
- They are likely contaminated, as they have been publicly available on the web for a number of years. However, that doesn't mean they won't hold signal for your task!
Evaluation name | Task type | Publication date | Data size | Task data | Task/Paper content | Source | Dataset | Comments |
---|---|---|---|---|---|---|---|---|
AGIEval (SATMath) | Exam dataset + existing datasets | 2023 | 220 | Math problems from the SAT | Paper is actually a compilation of a bunch of human-centric exams to use as eval data. | Paper | HuggingFace | - Careful, this paper also includes datasets from other papers! For math, they use AIME & AMC through the MATH dataset, GRE & GMAT through the AQuA-Rat dataset, and GaoKao - Metrics: acc/em/f1 |
AIME (all) | Olympiad dataset | 1983-now | 15 x 2 per year | Mathematical problems requiring a combination of arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics | 2nd exam to choose the US team for the International Math Olympiads | Blog | Source | Answer is systematically an integer between 0 and 999. |
AIME (22, 23 and 24) | Olympiad dataset | 2024 | 90 | See AIME (all) | | Paper | HuggingFace | Used in the AIMO competition |
ALGES (SingleEQ) | Online sources compilation | 2015 | 508 | Grade school algebra problems extracted from sources on the web | Paper is about implicitly learning and solving the simple equation behind the problem | Paper | Source | - Web sources: http://math-aids.com, http://k5learning.com, and http://ixl.com - Pre-LLM paper - data sources are probably OK |
ALG514 or AllEq | Online forum | 2014 | 514 | Algebra word problems extracted from a crowdsourced tutoring website, cleaned with turking, and manually verified | Paper is about extracting the equation template from the problem to solve it | Paper | Source | - Web source: Algebra.com |
AMC 12 | Olympiad dataset | 2000 - now | 25 per year | Mathematical word problems requiring arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics. | 1st exam to select the US team for the International Math Olympiad, used to be the American High School Mathematics Examination | Blog | Source | - Problems are designed to be solvable by students without any background in calculus. |
Ape210K | Exam dataset | 2020 | 210K problems, 56K templates | Chinese elementary school-level math word problems written by math teachers | Solve the problems | Paper (withdrawn, but v1 still accessible) | HuggingFace | - Some problems are templated, which could be interesting for contamination issues - Initial dataset is 900K and was manually filtered - Also provides "intermediate equations" (useful to test CoT traces if needed) - Dataset is in Chinese - Intended to be partially used for training |
AQUA or AQUA-Rat | Exam dataset + turked dataset | 2017 | 100K | Algebraic word problems constructed from a seed of 34K problems from GMAT and GRE, and extended via turking | Task: Solve the problem | Paper | HuggingFace | - Intended to be partially used for training - Includes the rationale for the problems - Uses accuracy, BLEU and perplexity for scoring ... |
ASDiv-A | Online sources compilation | 2020 | 2.3K | Grade-school math word problems collected from various websites and normalized | Task: Solve the problem | Paper | Github | - Contains problem type and grade level annotations by a Master’s student annotator - Focused on a high lexical diversity - Used 28 websites |
CHAMP | Olympiad dataset | 2024 | 270 | Math word problems extracted from a book of olympiad competition examples, rewritten to make the solutions parsable, and annotated | Introduces a math benchmark. Problems are extended with hints and labeled with concepts, to allow ablation studies on performance | Paper | | - Source: Book "Problem-Solving Strategies" (Engel, 2008) |
DeepMind Math | Exam dataset + synthetic dataset | 2019 | 10K? | Synthetic raw math problems in algebra, arithmetic, calculus, comparison, conversions between units, polynomials, probability, etc. | Task: Solve the problem | Paper | HuggingFace | - Full list of domains in appendix B - Paper's first section is quite nice - Provides generation code to generate more examples - Provides additional train set - Synthetic procedural dataset (inspired from/to extend? school exams dataset) |
DocMath-Eval | Annotated financial reports + existing financial math datasets | 2023 | 3.2K | Combines financial reports and existing datasets, read by annotators to generate (or validate) questions, and provide answers as Python programs, then evaluated with domain expert annotators | Solutions should be presented as Python programs which will be run to test their validity | Paper | Source | - Re-uses TAT-QA, FinQA, MultiHiertt, TAT-HQA - Looks quite high quality for financial math data! - Provides additional train set |
Dolphin1878 | Online sources compilation | 2015 | 1.5K | Number math word problems sampled from online sources and re-annotated if needed | Paper is about extracting the equation (DOL) tree from the problem using semantic parsing | Paper | ? | - Sources: algebra.com and answers.yahoo.com |
Dolphin18K | Online sources compilation | 2016 | 18K | Math word problems semi-automatically extracted from online sources | | Paper | Kaggle | - Sources: Math category of Yahoo Answers since 2008 - Method: manually annotate 6K problems then use a classifier. Train a model to extract the gold answers - I'm not sure the quality is amazing there given the amount of automatic extraction (high) vs manual verification (low) |
Draw-1K | Online sources compilation | 2016 | 1K | General algebra word problems extracted from online sources | Paper is about evaluating solvers (systems generating equations from math word problems), and tests template and equation equivalence | Paper | Source | - Labels each problem with the template it follows, can be useful for contamination analysis - Source: algebra.com |
FinQA | Expert-annotated financial reports | 2021 | 1.1K | Financial questions linked to tables from earnings reports. Annotators provide a question + step by step process + annotations for each page. | Paper introduces the dataset plus a method made of a retriever which extracts relevant facts first to accommodate short context models, then a process generator. | Paper | HuggingFace | - Likely high quality: used paid expert annotators + external expert annotators had high agreement on the task - Total set is 8.2K - Unsure from the paper how the tables are formatted - markdown maybe? - Data source: earnings reports of S&P 500 companies (1999-2019) |
FrontierMath | Expert-created dataset | 2024 | 100+ (precise number unknown) | Entirely novel math problems created for the paper across most math domains. Solutions are either integers or SymPy objects to be automatically verifiable through unit-test-like Python programs (see the verification sketch after this table). Problems are labelled. | Introduces the dataset | Paper | Private | - Nice discussion of contamination in the paper - Experts: 60 mathematicians across 12 countries - All problems are peer reviewed - Probably the highest quality dataset here atm - Data is not public; however, since closed source models have been evaluated, it's likely they'll be contaminated for it in future occurrences :( |
GAOKAO-Bench (MathCloze, MathQA) | Exam dataset | 2023 | ~500 | Math word problems at high school level from the Chinese college entrance exams | | Paper | Source, datasets are only 2023: HuggingFace and HuggingFace | - Mathematical formulas are converted to LaTeX - Problems are in Chinese - Dataset is updated yearly - Paper explores a bunch of grading methods, including LLM-as-judge - Paper contains surprisingly little info about the dataset |
GAOKAO 2023 (MathEn) | Exam dataset, Competition dataset | 2023 | 385 | Math word problems at high school level | Compiles questions from the 2023 Chinese National College Entrance Examination, the 2023 American Mathematics Competitions, and the 2023 American College Testing | | HuggingFace | |
GSM1K | Manually created dataset in the style of another dataset | 2024 | 1.2K | Diverse "grade school"-like math word problems, following the solving distribution of GSM8K | Paper does a contamination analysis of models on GSM8K vs GSM1K | Paper | Private | - Paper also seems to suggest that perplexity analysis is not very good at detecting contamination |
GSM8K | Manually created dataset in the style of an exam dataset | 2021 | 8.5K | Diverse grade school-level math word problems | Paper is about training verifiers to solve math word problems | Paper | Github Hugging Face | - Best results with an external calculator added - All answers are positive integers, 50% of answers are between 0 and 8 - Annotation used Upwork for 1K problems, then Scale for the rest. Problem writers were provided with seed questions from a 175B GPT3 model |
iGSM (med and hard sets) | Synthetic dataset | 2024 | 20K | Problems are generated using a combination of dependency graphs between objects and categories (with direct and implicit dependencies) and a number of operations | Paper is about studying actual math reasoning on an extension of GSM8K, including probing internal model states | Paper | HuggingFace | - Idea is theoretically nice but problems generated are very unrealistic with high numbers of operations - Paper focuses on the "mental process" of the model which I find dubious (though the probing section is nice!) - So much anthropomorphism -_- |
GSMHard | Adaptation of existing dataset, numbers replaced | 2022 | 8.5K | GSM8K with bigger/less common numbers to make the problems harder. However, the change was done automatically through generated programs, and only 25 changes were checked (+ 50 cases were done manually). | Paper is about using program-aided LMs (= generating CoT alternating equations and reasoning steps, and computing the solution on the last equation with Python) | Paper | Hugging Face | - Described in appendix H1 - Good idea, but not sure of the quality. |
GSM-IC | Adaptation of existing dataset with perturbations | 2023 | 58K | 100 samples from GSM8K with irrelevant context added (using a template for the irrelevant sentence, plus role/number fillers) | Paper tests how sensitive LLMs are to irrelevant context when reasoning on math tasks | Paper | HuggingFace | |
GSM-Plus | Adaptation of existing dataset with perturbations | 2024 | 10K | GSM8K with 8 variations per question, added by GPT4 and manually annotated by selected humans (cross-annotator agreement checked) | Paper introduces the dataset and compares results on several GSM8K variants across models and prompting formats | Paper | HuggingFace | - Changes include: replacing numbers by other numbers, changing the operations, changing the question, adding distractors, etc (nice typology of changes, I feel it could be extended) |
GSM-Symbolic | Adaptation of existing dataset, templated | 2024 | 8.5K | Templated GSM8K problems, which allows generating new evals at will | Paper creates parsable templates from GSM8K to be able to generate new problems at will and analyse contamination on GSM8K | Paper | To be released | - Contains other specific subsets (M1, P1, P2, which are difficulty levels, as well as NoOp, with seemingly relevant but actually irrelevant info added), and some experiments are done with few-shot formats - Lacking a dataset description table with all subsets imo |
Hungarian HighSchool National Finals Exam | Exam dataset | 2023 | 33 | Problems from the 2023 Hungarian national high school finals in math | | Source | HuggingFace | - Requires grading by hand atm |
HMWP | Exam dataset | 2020 | 5.4K | Annotated math word problems from a Chinese school-level (K-12) problems bank | Introduces a new formalism to represent MWP equations uniformly | Paper | HuggingFace | - Dataset is in Chinese - Sources: Chinese K12 problems |
Math23K | Online sources compilation | 2017 | 23K | Automatically extracted elementary school level math word problems. | Introduces an RNN to solve MWP | Paper | HuggingFace | - Sources: Chinese math word problems from online education websites for elementary school students. - Dataset is in Chinese - Extraction is rule based, but it's very unclear how much manual validation was done |
Math401-LLM | Synthetic dataset | 2023 | 401 | Arithmetic expressions combining additions, subtractions, multiplications, exponentiations, logarithms, etc. | Paper wants to measure strict arithmetic ability of models | Paper | Github | - Models are not that good atm for log/trig problems or big numbers |
MATH | Olympiad datasets | 2021 | 12.5K | Mathematical problems from real competitions in natural language and LaTeX, annotated with difficulty levels. | | Paper | HuggingFace | - Sources: AMC 10, AMC 12, AIME, "and more" - Also introduces AMPS, a train set created from scraping Khan Academy plus Mathematica-generated problems |
MathOdyssey | ||||||||
MathQA | Adaptation of existing dataset, annotated | 2019 | 37K | Annotated solvable problems from the AQuA dataset with formal annotation programs (using humans as annotators and testing their agreement) | Aims to introduce a representation language for math problems, applies the method to AQuA | Paper | HuggingFace | - Sources: AQuA |
MAWPS | Existing dataset compilation | 2016 | 3.3K | Math word problems from existing datasets | Framework to create new math problems, notably to remove lexical or template overlap when adding new datasets | Paper | Github | - Sources: ALG514, ALGES, and other pre-LLM datasets |
MiniF2F | Olympiad dataset | 2022 | 244 | Olympiad math word problems formalized with theorem provers when possible (Lean, Metamath, Isabelle) | Paper is about testing math proof solvers' ability to reason on formal logic | Paper | Possibly HuggingFace | - Sources: AIME, AMC, IMO |
NPHardEval | Synthetic dataset | 2023 | 900 | Complexity math word problems of varied difficulty level built from synthetic graph/linear data | Paper introduces the benchmark and uses it to evaluate reasoning ability of models. Also explores benchmark robustness! | Paper | Github | - Problems: sorted array search, edit distance, shortest path, traveling salesman, graph coloring, knapsack problem, meeting scheduling problem - Can be regenerated as needed |
NuminaMATH CoT | Existing dataset compilation | 2024 | 860K | Math word problems (K12 + olympiad levels) combining existing datasets | NA | NA | HuggingFace | - Sources: AOPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, Synthetic AMC and MATH data, and other Olympiads sets - careful if you use this as a train set as you will be contaminated on all major math benchmarks |
NuminaMATH TiR | Existing dataset compilation | 2024 | 72K | Subset of NuminaMATH CoT focused on problems solvable with tool-integrated reasoning | NA | NA | HuggingFace | - Sources: AOPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, Synthetic AMC and MATH data, and other Olympiads sets - careful if you use this as a train set as you will be contaminated on all major math benchmarks |
OmniMath | Olympiad datasets | 2024 | 2.2K | Olympiad math word problems. Problems are extracted from forums or olympiad websites (using rule based + LLM rephrasing), then annotated and verified by humans. | Paper introduces the benchmark and a judge trained to evaluate the answers (since they are free form) | Paper | HuggingFace | - Sources: IMO, IMC, AoPS forum and wiki - Domain labeling is done with LLMs |
OlympiadBench | Olympiad datasets | 2024 | 8.4K | Olympiad math/physics word problems. Answers are automatically evaluated - either numbers or equations (evaluated with SymPy) | | Paper | | - Sources: Global Mathematics and Physics Olympiad Problems, Regional and National Chinese Math Competitions, and Gaokao Mock Questions for Mathematics and Physics - Includes a physics subset - VLM evaluation! |
OlympicArena | Olympiad datasets | 2024 | 11K | | | Paper | | |
PRM800K | Synthetic data | 2023 | 800K | Preference data from annotators on 800K solutions generated by a model | Paper introducing process supervision to improve reward models (compares output and process supervision) | Paper | HuggingFace | - More a train set than an evaluation |
SVAMP | Adaptation of existing dataset | 2021 | 1K | One-unknown arithmetic word problems of grade level up to 4, created with experts applying variations to ASDiv-A. | Paper wants to assess question sensitivity, reasoning ability, and structure invariance in models for math evaluations. | Paper | Github | - Variations: same object & different structure, opposite, both different, adding relevant or irrelevant information, changing information, inverting operations, changing order of sentences or objects |
TabMWP | Online source adaptation | 2022 | 38K | Tabular math word problems requiring multi-hop reasoning, extracted from an online educational website and manually annotated. | Paper wants to test tabular math reasoning | Paper | HuggingFace | - Source: IXL learning website - Tabular data is provided as an image, semi-structured text, and a table - Answers are generative or MCQA - Dataset is tested against turkers |
TAL-SCQ5K-En | Competitions dataset | 2023 | 4K | Math word problems in MCQA format, with math expressions as latex | NA | None | HuggingFace | - Contains English and Chinese - Also contains 6K train samples and CoT |
TemplateGSM | LLM-generated data | 2024 | 7M | GPT4-generated math word problems inspired in shape by GSM8K | Paper uses GPT4-generated meta-templates to generate problems by changing parameters. Uses a verifier to ensure usability | Paper | HuggingFace | - Since everything is LLM generated, I would expect stronger proofs of quality |
TheoremQA | Online sources adaptation | 2023 | 800 | QAs about university-level theorems | Protocol: uses GPT4 to enumerate subfields of relevant domains, then plausible theorem lists, then uses domain experts to actually look for said theorems, then looks for QAs on the web concerning them | Paper | HuggingFace | |
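
Several of the benchmarks above verify free-form answers programmatically rather than by string match: FrontierMath and DocMath-Eval run unit-test-like Python programs, OlympiadBench checks equations with SymPy, and AIME answers are plain integers. Below is a minimal sketch of what such a checker can look like; the function name and the exact normalization steps are illustrative assumptions, not taken from any of the papers.

```python
# Minimal sketch of programmatic answer checking for math benchmarks.
# Assumes answers are either integers (AIME-style) or SymPy-parsable expressions.
from sympy import simplify, sympify


def answers_match(prediction: str, reference: str) -> bool:
    """Return True if the predicted answer is mathematically equal to the reference."""
    # Fast path: exact integer comparison (AIME answers are integers in [0, 999]).
    try:
        return int(prediction) == int(reference)
    except ValueError:
        pass
    # Fallback: symbolic equivalence check with SymPy.
    try:
        return simplify(sympify(prediction) - sympify(reference)) == 0
    except Exception:
        return False


print(answers_match("042", "42"))               # True
print(answers_match("2*(x + 1)", "2*x + 2"))    # True
print(answers_match("sqrt(2)/2", "1/sqrt(2)"))  # True
```

The symbolic fallback is what lets `sqrt(2)/2` and `1/sqrt(2)` count as the same answer, which pure string matching would miss.
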
Evaluation name | Task type | Task data | Task content | Source | Dataset | Comments |
---|---|---|---|---|---|---|
DeepFix | Code task, Code-to-code, Correction | 7K student-written erroneous C programs | Correct the C programs | Paper | ||
MLSum | Generation, Multilingual, Summarization | 1.5M news summary/article pairs from the DailyMail, Le Monde, Süddeutsche Zeitung, El Pais, Moskovskij Komsomolets and Internet Haber (en, fr, de, es, ru, tur) | Summarize the articles | Paper | Hugging Face | PaLM: prefixed with a prompt, truncated article to 2048 tokens |
TransCoder | Code task, Code-to-code | 852 parallel functions in Python/Java/C++ | Translate from a language to another | Paper | From paper | |
WMT | Multilingual, Translation | Datasets from the WMT conference on machine translation - available datasets depend on the year | Translate from one language to another | Conference | | Replace the 2 digits by the conference year |
Adversarial NLI | Language Inference | 10K entailment dataset generated using human-in-the-loop adversarial attacks, looking for predicates which force models to predict wrong entailment labels (uses contexts from StoryCloze, CommonCrawl, Wikipedia, the Open Annotated National Corpus, WikiHow and GLUE) | Predict entailment | Paper | Data Github | R1 to R3 = rounds of data generation |
APPS | Text-to-code | 10K Python coding problems in natural language, scraped from LeetCode-style sites, with a suite of test cases. | Solve the Python problem | Paper | Github Data | |
AQuA | Arithmetic, Reasoning | 100K multiple choice problems (GMAT, GRE, other sources) with question/options/rationale | Select the correct MCQA | Paper | Github | Best results obtained with an external calculator added |
ARC | Common Sense, Reasoning | 8K Grade school science questions: e = easy set, c = challenge set | Select the correct MCQA | Paper | Data | Careful, this is the AI2 Reasoning Challenge, not the Abstraction and Reasoning Corpus |
bAbI | Reasoning | 20 tasks each with 2K automatically generated questions + short scenarios (successive actions generated with a simulated text adventure game). | Reason over the sentences to select the correct conclusion | Paper | Github Data | See Part 4 for the simulation env and its constraints, it’s quite a fun idea. Probably not too hard to reproduce for other types of reasoning. |
BBQ | Bias detection | 58K examples with two contexts (ambiguous and explicit about a bias), two questions (negative and non-negative) and possible answers, constructed from manual templates and checked with crowdsourcing. | Predict the correct, non-biased answer. The difference between accuracies depending on context/question allows building a bias score. | Paper | Github | |
BLiMP | Language Understanding | 67 datasets of 1K artificially generated minimal pairs each, testing syntactic, morphological and semantic knowledge, validated with MTurking. | Accuracy is measured by checking whether the model assigns a higher log-probability to the correct sentence (see the scoring sketch after this table). | Paper | Github | Things tested: anaphor agreement, argument structure, binding, control/raising, determiner-noun agreement, ellipsis, filler-gap, irregular forms, island effects, NPI licensing, quantifiers, subject-verb agreement |
BOLD | Generation, Toxicity detection | 23K prompts extracted from beginnings of Wikipedia sentences containing a race/religion/political/gender/profession group member (ex: woman artist for gender=female). | Task: generate the end of the sentence; toxicity is evaluated through a range of metrics (sentiment analysis, using classifiers, …). In HELM, toxicity is measured using the Perspective API. | Paper | Github | |
BooksCorpus | N/A | 11K unpublished books of more than 20K words scraped from the web (of 16 different genres). | In the original paper, it's used to train a sentence embedding model - in model papers, it's often used for contamination or perplexity evaluations. | Paper | Hugging Face | |
BooksCorpus_HELM | Generation, Memorisation | 1K randomly sampled books from BooksCorpus. | Task: from a random number of tokens beginning a paragraph, the model must generate a follow up - measure exact and near-exact reproduction. | Paper | Data | |
BoolQ | Language Inference, Language Understanding | 16K naturally occurring yes/no questions, with question + context from Wikipedia | Answer the yes/no question | Paper | Website | |
CB | Language Understanding | 1.2K discourse excerpts (news from the Wall Street Journal, fiction from the British National Corpus, dialogue from Switchboard), containing context + target sentence | Predict commitment entailment | Paper | Website | |
Civil comments | Toxicity detection | 1.8M online comments, with crowd-sourced human labels for toxicity following the Perspective API guidelines, and among these, 450K labeled with identity terms (crowdsourced, picked from a list). | Task: toxicity prediction; labels are used to identify areas of biases in models | Paper | Kaggle Hugging Face | Original paper contained a synthetic test set (77K examples generated from templates using 50 identity terms, 50/50 on toxicity) and a human labeled dataset (description in the Task content col) - I suppose the dataset available is the latter |
Clean E2E NLG | Description, Generation | 50K crowdsourced descriptions of restaurants generated from keys and values (type of food = X, budget = Y, …). | | Paper | Hugging Face | PaLM: prefixed with a prompt, truncated article to 2048 tokens |
CNN/DailyMail | Cloze/Completion, Summarization | Original dataset: 200K news documents (CNN/DailyMail between 2007 and 2015) converted into Cloze format by removing some of the text’s named entities, and using them as keys. | In HELM: uses the above documents (in complete form) as text to summarize, and their highlights as gold summaries. | Paper | Hugging Face Data | (I suspect this does not produce very good summaries though) |
CommonsenseQA | Common Sense, Reasoning | 12K turked Q/A (initialized from ConceptNet associations), then filtered by quality, with added context from a Google Search query > Some text likely overlaps with CC data | Answer the MCQA | Paper | | Best results with an external calculator added |
Contrast Sets | Generation, Robustness | 10 contrast sets of up to 1K examples, for their datasets (see comments), made by (often the original paper’s) researchers (increase reasoning steps in questions, replace words by their opposites, change counts …). | Follow the original task setup with new examples, and see how/if performance drops. In HELM: use the IMDb and DROP contrast sets | Paper | Data | NLVR2, IMDb sentiment analysis, MATRES Temporal RE, English UD parsing, PERSPECTRUM, DROP, Quoref, ROPES, BoolQ, MC-TACO. Dataset construction is way more detailed in Appendix |
COPA | Common Sense, Language Understanding | 1K premise + causal questions with alternatives (common sense) | Paper | Website | ||
CoQA | In-Context Reading Comprehension | 127K Conversational QA, from a given context (rationale must be provided too) - written by annotators | Paper | v1.0 from Data | ||
DataImputation | Real life task, Reasoning, Structured data | 8 structured datasets from several sources. | Task: from a row of attributes with gaps, the model must fill the gaps (ex: extrapolating city from phone number, phone brand from its specs). | Paper | Data restaurant Data Buy | See table 2 for all sources. In HELM: use the subsets Buy and Restaurant, convert input to natural language, test accuracy. |
Digits arithmetics (2D+, 2D-, 3D+, 3D-, …) | Arithmetic | Basic arithmetic tasks for n-digit addition, subtraction, and composite operations, with 2K examples each | Task: solve the math | Paper | Github | All links come from lm-evaluation-harness/lm_eval/datasets/arithmetic |
DROP | Arithmetic, In-Context Reading Comprehension | 55K adversarial questions which require 1) selecting relevant items from the text and 2) computing on them (sorting/counting/…) to get the correct answer | Task: select and count to provide the correct answer | Paper | Data | |
Dyck language_HELM | Symbolic manipulation | 500 D_n words between 52 and 100 characters (”word” made of nested brackets/parentheses) where the last i characters have been removed. | Task: predict the unique sequence of closing parentheses. | Paper | Github | Also has a different version in BigBench |
HellaSwag | Cloze/Completion | 60K adversarially filtered multiple choice Q/A | Choose the correct next sentence (from captions or WikiHow) | Paper | Github | |
HumanEval | Code task, Text-to-code | 164 hand written programming problems with function signature, docstring, body + unit tests | Aim is to complete function to pass unit tests | Paper | Hugging Face | |
IMDB | Sentiment Analysis | 50K reviews from IMDB, with an even split of positive (score ≥ 7) and negative (score ≤ 4) reviews (no neutral). | Classify positive/negative review. | Paper | Website | |
LAMBADA | Cloze/Completion | 10K Narrative contexts (from the BookCorpus) followed by a sentence where the last word is masked and must be predicted. Specifically built to force use of the context. | Predict the last word. | Paper | Zenodo | |
Language Modeling Evaluation_HELM | Language Modeling | Compilation in HELM of several datasets: WikiText-103, ThePile (particularly arXiv, BooksCorpus2, Enron Emails, PubMed Central, Wikipedia), TwitterAAE, ICE. | Task: get the conditional log probability of the full sequence (perplexity measure) | Paper | The pile website BLIMP Github Wikitext data Twitter AAE data ICE data | |
LegalSupport | Entailment, Real life task, Reasoning | 20K legal entailment scenarios, constructed from state/federal legal opinions (assertion is used as context, and 2 supporting sources (”see X, rule”) are selected at random). | Task: finding which rule best supports assertion. | Paper | Data | |
LinuxKernel_HELM | Generation, Memorisation | 2K randomly sampled functions from the Linux Kernel. | Task: from a random number of lines beginning a function, the model must generate a follow up - measure exact and near-exact reproduction. | Paper | Data | |
LSAT | Analytical Reasoning, In-Context Reading Comprehension, Logical Reasoning | 10K questions from the Law School Admission Test (analytical, logical reasoning, and reading comprehension), with context. | Answer the MCQA correctly | Paper | Github | |
Magellan Benchmark | Real life task, Reasoning, Structured data | 23 datasets from several sources containing entities with attributes. Dirty datasets contain deliberate errors, such as attributes being in the wrong column, misspellings, etc. | Task: given two entities from two different tables, determine if they are the same or not. | Paper | Github | It’s likely that Abt-Buy and Buy are the same dataset |
MBPP | Code task, Text-to-code | 1K entry-level Python crowd-sourced programming problems (description, solution, 3 unit test cases) - (58% mathematical, 43% list processing, 19% string processing, 9% integer sequences, and 2% other). | Solve the Python problem | Paper | Github Hugging Face | Also contains an edited version (400 items) with unambiguous prompts and good signatures (can be interesting to look at later to see the impact of prompts on code gen) + a MathQA-Python dataset (adaptation of the MathQA dataset) |
MMLU | Language Understanding | 15K multi-choice Q/A manually collected from various online sources, on many topics (legal, philosophy, economics, psychology, STEM, medicine, etc. - at high school to professional level) | Answer the MCQA | Paper | Hugging Face Github | Seems like a strong/high-quality baseline |
MRF (Misinfo Reaction Frames) | Generation, Misinformation capabilities | 200K pairs of claims from news headlines (climate change, covid 19, cancer illness, detailed sources in comments) + label (real/misinformation), the former annotated on veracity, likelihood of spread, writer intent by MTurk workers. | Task: must either predict the gold label or generate likely writer intent/reader perception/… | Paper | Github | (Contains data from NELA-GT-2018-2020, SciDCC, Climate-FEVER, CoAID, CoronaVirusFacts/DatosCoronaVirusAlliance Database, ESOC Covid-19 Misinformation Dataset, DETERRENT) |
MS MARCO | Question Answering, Retrieval | 1M anonymised questions with free-form human generated answers (from relevant web document extracts), some with added rewriting. | Original paper contains 3 tasks: 1) generating the correct answer, if possible, 2) same but the answer should make sense even without context, 3) ranking 1000 passages on how relevant they are for the question. | Paper | Github | Contains extended descriptions of QA datasets in lit review. In HELM, only the ranking task is looked at, and relevance is estimated looking at the log-likelihood of the prediction when asking “Does the passage answer the query?” |
MS MARCO TREC, aka TREC 2019 | Retrieval | Datasets derived from MS MARCO, edited for either passage or document retrieval tasks, either doing full retrieval or top-n reranking (100 for documents, 1000 for passages). (see MS MARCO) | | Paper | Data Github | |
MultiRC | Language Understanding, Question Answering | 6K multiple choice questions over a diversity of topics | Paper | Data | |
NarrativeQA | Question Answering, Retrieval | 47K free-form human generated questions and answers, linked to 1.5K books (Gutenberg project) and movie scripts (scraped) matched with plot summaries. | Task: from the summary or the story, answer or select the correct answer. | Paper | Github | For long range context testing, we could use this dataset to do QA from the full stories. Could be interesting for anything conversational imo. |
Natural Questions | Open domain/Closed book, Question Answering | 207K aggregated Google search queries + annotated Wikipedia sample answers | Paper | Data | |
NewsQA | Question Answering | 100K human generated QA pairs from 12K news articles (CNN). Questions were generated from title + summary, answers from question + article, then kept through a validation mechanism. Likely intersects with CNN/DailyMail, as the data extraction script was the same. | Paper | Github | |
OpenBookQA | Common Sense, Reasoning | 6K questions on science reasoning needing common sense knowledge to extrapolate to new situations | Paper | Data | |
PIQA | Common Sense, Reasoning | 20K physical common sense reasoning situations | Select the correct action given a context and candidate answers | Paper | Data | |
PopularBooksCorpus_HELM | Generation, Memorisation | 20 books from BooksCorpus which appear in a list of bestsellers. | Task: from a random number of tokens beginning the first paragraph of the book, the model must generate a follow up - measure exact and near-exact reproduction. | Paper | Data | |
QuAC | In-Context Reading Comprehension | 100K questions in information seeking QA contexts (used Wikipedia to generate dataset) | Paper | Data | ||
RACE | In-Context Reading Comprehension | 100K questions from English reading comprehension exam for Chinese mid/high school students | Paper | Data | ||
RAFT | Real life task, Text classification | Compilation of 11 datasets of naturally occurring classification tasks, of between 150 and 5K test items. | Task: in few shot from 50 labeled examples, provide meaningful labels. (Domains: medical, finance, research, English language, law, physics, AI safety, social networks) | Paper | Hugging Face | Corpus: (ADE Corpus v2, Banking77, NeurIPS 2020 impact statement risks, OneStopEnglish, Overruling, Systematic review inclusion, TAI safety research, Terms of Service, TweetEval Hate, Twitter complaints, + Semiconductor org types, created for this) |
RealToxicityPrompts | Generation, Toxicity detection | 100K naturally occurring sentences (selected from the OpenWebText corpus, basically = reddit, and scored for toxicity with the PerspectiveAPI), split in two to create a prompt and continuation. | Task: generate the continuation from the sentence start, then toxicity is evaluated with PerspectiveAPI. | Paper | Data Github (the repo lacks a lot of info) | |
ReCoRD | Language Understanding | 120K passage/cloze query/answer examples from news (CNN, DailyMail) with human filtering | Paper | Data | ||
RTE | Language Understanding | 3K compilation of competition data on entailment | Paper | Data | |
SAT analogies | Language Understanding | 374 SAT analogy problems from before 2005 (a is to b what c is to ?, as multiple choice questions; words are not the most frequent) | Paper | Data dev Data test | |
SIQA | Question Answering | |||||
SQuADv2 | In-Context Reading Comprehension | Combines SQuAD with 50K unanswerable questions | from a context, give an answer, but only if possible | Paper | Github | |
StoryCloze | Cloze/Completion, Common Sense | 50K five-sentence commonsense stories | Choose the correct ending | Paper | Hugging Face | |
StrategyQA | Common Sense, Reasoning | 2.8K questions needing reasoning from implicit knowledge | Paper | Best results with an external calculator added | ||
Synthetic reasoning (natural) | Logical Reasoning, Reasoning | Synthetic data generated on the fly, containing a set of synthetic rules (conditional statements), facts (attributes), and the logical gold output. | Paper | Can be generated with Github | Also called rule_induct in HELM | |
Synthetic reasoning (symbolic)_HELM | Logical Reasoning, Symbolic manipulation | Synthetic data generated on the fly using a pattern template. | Either test if the model is able to identify patterns (”beach + beach - pear” has “A + A - B” as pattern) or if the model can substitute strings in a given pattern. | Paper | Can be generated with Github | |
TriviaQA | Open domain/Closed book, Question Answering | 95K trivia QA (compositional questions, syntactic variability) | Paper | Data | ||
TruthfulQA | Question Answering | 817 questions about tricky factual claims (common misconceptions, falsehoods, …) over 38 categories, with true and false reference answers + a source to support true answers (+ 380 added questions). | Paper | Github | ||
TyDiQA-GoldP | Multilingual, Question Answering | 204K multilingual QA pairs (unconstrained question elicitation from prompts, then Wikipedia article retrieval, and specific answer selection in the article if possible) (en, ar, ben, fin, ind, ja, ko, ru, tel, th and Kiswahili). | MCQA | Paper | Github | The generated dataset can present interesting underspecification of questions and mismatch between question and answer language level. Might be harder than other datasets |
Web Questions | Open domain/Closed book, Question Answering | Extracted 100K “Wh?” questions from Google Search API, then annotated by MTurkers - I suspect answers are partially out of date | MCQA | Paper | Website | |
WebNLG | Generation, Verbalization | 13K mappings between triples (subject, property, object, constructed from DBPedia, which is a KB from Wikipedia) and sentence verbalization (by crowd workers), about specific topics (astronauts, universities, monuments, buildings, characters from comics, food, airports, sports teams, written works). | Task: verbalize in a grammatical way. | Paper | Hugging Face | There was a sentence selection for fluency and the sentences generated are relatively simple, but there is no description of annotators/crowdsourcers origins > maybe some data is not in “standard English”. |
WiC | Language Understanding | 7K, classification of whether a word occurring in two different contexts has the same meaning or not | Paper | Site | ||
WikiFact_HELM | Cloze/Completion | 12 domains with 1K triples (subject, relation, object) sampled from Wikipedia and cleaned. | Task: predict the missing item in the sentence made of the relation. | Paper | Codalab Github | |
WikiLingua | Generation, Multilingual, Summarization | 43K article/summary pairs constructed from WikiHow in 18 languages (on the site, articles are written with a summary sentence + detailed paragraph per step: in the dataset, summaries are the concatenation of the summary sentences, and articles of the detailed paragraphs). | Summarization | Paper | Github | PaLM: prefixed with a prompt, truncated article to 2048 tokens. I suspect data creation can lead to very “robotic” language for the summary baseline, which could under-score more fluid summaries - though ROUGE shouldn’t be too prone to that. |
Winogender | Bias detection | |||||
Winograd | Reasoning, Winograd | 273 to 285 examples where one must disambiguate who/what a pronoun is referring to, in sentences specially constructed to be ambiguous to statistical approaches but not to humans | Disambiguation of pronoun | Paper | Website | Not sure if GPT3 was evaled on this one or the SuperGLUE one |
WinoGrande | Reasoning, Winograd | 43K adversarial Winograd-style sentences | Paper | Website | |
WSC | Language Understanding, Winograd | Winograd Schema Challenge (see Winograd) | Paper | Website | |
XSUM | Summarization | 226K news articles (BBC, 2010 to 2017) matched with their single sentence summary (comes from the article). Task: Summarize. (Domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts) | Paper | Github | ||
XSum | Generation, Summarization | 226K news summary/article pairs from the BBC (2010 - 2017) extracted from the WayBack machine | Paper | Hugging Face | Could be interesting to manually check if the model recent knowledge creates discrepancies in the summaries of old news. |
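
Several rows above are scored without any generation at all, by comparing the log-probabilities the model assigns to full sentences: BLiMP compares the two sentences of a minimal pair, and the HELM language-modeling and MS MARCO ranking setups rely on the conditional log-likelihood of a sequence. Below is a rough sketch of that protocol with the transformers library; the model choice and the example pair are illustrative, and any causal LM scores sentences the same way.

```python
# Rough sketch of BLiMP-style minimal-pair scoring: an item counts as correct
# if the model assigns a higher log-probability to the grammatical sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities of the sentence under the model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood over the n-1 predicted tokens.
        out = model(**inputs, labels=inputs["input_ids"])
    n_predicted = inputs["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted

good = "These casseroles disgust Kayla."
bad = "These casseroles disgusts Kayla."
print(sentence_logprob(good) > sentence_logprob(bad))  # expected True for most LMs
```

Accuracy over a BLiMP subset is then just the fraction of pairs where the comparison comes out in favor of the grammatical sentence.
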
Evaluation name | Task type | Task content | Source | Dataset | Comments | |
---|---|---|---|---|---|---|
✍️ GSM8K-Python | Code task, Text-to-code | Python version of the GSM8K dataset (8.5K grade school math problems) | Paper | N/A | ||
✍️ MRF | Generation, Manual evaluation, Misinformation capabilities | 250 headlines extracted from the MRF dataset, grouped in 80 clusters by thesis. Task: from the thesis + 5 headlines, the model must generate plausible headlines which supports the thesis. Annotators evaluate if 1) the headline supports the thesis and 2) looks real. | Paper | Data | See report page 6 for a detailed explanation of the original process, plus sections 8.5.2, E.5, and 5.5 in the HELM paper. | |
✍️ News article generation | Generation | Generate 25 articles from titles and subtitles; 80 humans then had to classify whether each article was generated or original | Paper | | |
✍️ Numeracy Prediction | Symbolic manipulation | “requires the model to perform symbolic regression given a few examples, and apply the number relationship to a new input” | Paper | Github | ||
✍️ SVG datasets | Could construct an SVG dataset to see if models can indeed generate or interpret SVG drawings | Twitter thread | ||||
✍️ Theory of mind datasets | Could likely be easy to generate | Paper | | | |
✍️ Wedging prompts | Generation, Manual evaluation, Misinformation capabilities | 11 prompts with specific intent (ex: influence voting behaviors, target specific groups by generating pro/anti X rhetoric) augmented with 3 examples. Task: generate follow-up examples. | Paper | Data | In HELM: use manual evaluation to determine if the generated message 1) addresses the targeted group; 2) supports the desired message; 3) is divisive |
✍️ Word scrambling | Symbolic manipulation | 10K examples for each of 5 character manipulation tasks (words with cycled letters, anagrams, random insertions, reversed words). The model needs to recover the original word | Paper | Easy to generate/automate, see Section 3.9.2 of the GPT3 paper and the sketch below |
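
As the comment above notes, the word scrambling tasks are straightforward to regenerate. Below is a minimal sketch of the perturbations described in Section 3.9.2 of the GPT-3 paper; the function names and the character set used for insertions are my own illustrative choices.

```python
# Minimal sketch of GPT-3-style word scrambling perturbations (Section 3.9.2).
# The model's task is always to recover the original word from the scrambled one.
import random

def cycle_letters(word: str, shift: int = 2) -> str:
    """Move the last `shift` letters to the front (e.g. "inevitably" -> "lyinevitab")."""
    return word[-shift:] + word[:-shift]

def anagram_inner(word: str, keep: int = 1) -> str:
    """Shuffle all letters except the first and last `keep` characters."""
    middle = list(word[keep:-keep])
    random.shuffle(middle)
    return word[:keep] + "".join(middle) + word[-keep:]

def random_insertions(word: str) -> str:
    """Insert a random punctuation or space character after each letter."""
    return "".join(c + random.choice(" .,!?;") for c in word)

def reverse_word(word: str) -> str:
    """Reverse the whole word."""
    return word[::-1]

if __name__ == "__main__":
    random.seed(0)
    word = "inevitably"
    for scramble in (cycle_letters, anagram_inner, random_insertions, reverse_word):
        print(f"{scramble.__name__}: {scramble(word)}")
```

Pairing each scrambled output with its source word is enough to rebuild a 10K-example set per task, which also makes this family of evals easy to regenerate fresh when contamination is a concern.
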