
# Some evaluation datasets

If the task you are interested in is already well studied, chances are that a dataset exists for it.

Below are a number of evaluation datasets which were developed in the last few years.

However, be careful:

  • Some of them can be obsolete, as they were designed pre-LLM and are now easily solved: they aimed to investigate one specific property of text (translation, summarization), which is no longer really how we evaluate models (evaluations are now more general/holistic). (If you've got some bandwidth, this could really benefit from adding the publication dates! This will also be updated with post-LLM evals at some point.)
  • They are likely contaminated, as they have been publicly on the web for a number of years. However, that doesn't mean they won't hold signal for your task! (A minimal overlap check is sketched below.)
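
Since these datasets have been online for years, a cheap sanity check is to measure verbatim n-gram overlap between your evaluation samples and whatever corpus you train on. Below is a minimal sketch of such a check; the 13-gram window follows the GPT-3-style convention, and the function names are mine:

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of word-level n-grams appearing in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_samples: list, training_text: str, n: int = 13) -> float:
    """Fraction of eval samples sharing at least one n-gram with the training text."""
    train_ngrams = ngrams(training_text, n)
    flagged = sum(1 for sample in eval_samples if ngrams(sample, n) & train_ngrams)
    return flagged / len(eval_samples) if eval_samples else 0.0
```

Flagged samples can then be inspected or dropped before reporting scores.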

## Math-specific datasets

| Evaluation name | Task type | Publication date | Data size | Task data | Task/Paper content | Source | Dataset | Comments |
|---|---|---|---|---|---|---|---|---|
| AGIEval (SATMath) | Exam dataset + existing datasets | 2023 | 220 | Math problems from the SAT | Paper is actually a compilation of a bunch of human exams to use as eval data | Paper | HuggingFace | - Careful, this paper also includes datasets from other papers! For math, they use AIME & AMC through the MATH dataset, GRE & GMAT through the AQuA-RAT dataset, and GaoKao<br>- Metrics: acc/em/f1 |
| AIME (all) | Olympiad dataset | 1983-now | 15 x 2 per year | Mathematical problems requiring a combination of arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics | 2nd exam to choose the US team for the International Math Olympiad | Blog | Source | - Answer is systematically an integer between 0 and 999 |
| AIME (22, 23 and 24) | Olympiad dataset | 2024 | 90 | See AIME (all) | | Paper | HuggingFace | - Used in the AIMO competition |
| ALGES (SingleEQ) | Online sources compilation | 2015 | 508 | Grade school algebra problems extracted from sources on the web | Paper is about implicitly learning and solving the simple equation behind the problem | Paper | Source | - Web sources: http://math-aids.com, http://k5learning.com, and http://ixl.com<br>- Pre-LLM paper: data sources are probably OK |
| ALG514 or AllEq | Online forum | 2014 | 514 | Algebra word problems extracted from a crowdsourced tutoring website, cleaned with turking, and manually verified | Paper is about extracting the equation template from the problem to solve it | Paper | Source | - Web source: Algebra.com |
| AMC 12 | Olympiad dataset | 2000-now | 25 per year | Mathematical word problems requiring arithmetic, algebra, counting, geometry, number theory, probability and other secondary school math topics | 1st exam to select the US team for the International Math Olympiad; used to be the American High School Mathematics Examination | Blog | Source | - Problems are designed to be solvable by students without any background in calculus |
| Ape210K | Exam dataset | 2020 | 210K problems, 56K templates | Chinese elementary school-level math word problems written by math teachers | Solve the problems | Paper (withdrawn, but v1 still accessible) | HuggingFace | - Some problems are templated, which could be interesting for contamination issues<br>- Initial dataset is 900K and was manually filtered<br>- Also provides "intermediate equations" (useful to test CoT traces if needed)<br>- Dataset is in Chinese<br>- Intended to be partially used for training |
| AQuA or AQuA-RAT | Exam dataset + turked dataset | 2017 | 100K | Algebraic word problems constructed from a seed of 34K problems from the GMAT and GRE, and extended via turking | Task: solve the problem | Paper | HuggingFace | - Intended to be partially used for training<br>- Includes the rationale for the problems<br>- Uses accuracy, BLEU and perplexity for scoring |
| ASDiv-A | Online sources compilation | 2020 | 2.3K | Grade-school math word problems collected from various websites and normalized | Task: solve the problem | Paper | Github | - Contains problem type and grade level annotations by a Master's student annotator<br>- Focused on high lexical diversity<br>- Used 28 websites |
| CHAMP | Olympiad dataset | 2024 | 270 | Math word problems extracted from a book of olympiad competition examples, rewritten to make the solutions parsable, and annotated | Introduces a math bench. Problems are extended with hints and labeled with concepts, to allow ablation studies on performance | Paper | | - Source: the book "Problem-Solving Strategies" (Engel, 2008) |
| DeepMind Math | Exam dataset + synthetic dataset | 2019 | 10K? | Synthetic raw math problems in algebra, arithmetic, calculus, comparison, conversions between units, polynomials, probability, etc. | Task: solve the problem | Paper | HuggingFace | - Full list of domains in appendix B<br>- The paper's first section is quite nice<br>- Provides generation code to generate more examples<br>- Provides an additional train set<br>- Synthetic procedural dataset (inspired by/meant to extend school exam datasets) |
| DocMath-Eval | Annotated financial reports + existing financial math datasets | 2023 | 3.2K | Combines financial reports and existing datasets, read by annotators to generate (or validate) questions and provide answers as Python programs, then evaluated by domain expert annotators | Solutions should be presented as Python programs, which will be run to test their validity | Paper | Source | - Re-uses TAT-QA, FinQA, MultiHiertt, TAT-HQA<br>- Looks quite high quality for financial math data!<br>- Provides an additional train set |
| Dolphin1878 | Online sources compilation | 2015 | 1.5K | Number math word problems sampled from online sources and re-annotated if needed | Paper is about extracting the equation (DOL) tree from the problem using semantic parsing | Paper | ? | - Sources: algebra.com and answers.yahoo.com |
| Dolphin18K | Online sources compilation | 2016 | 18K | Math word problems semi-automatically extracted from online sources | | Paper | Kaggle | - Sources: the Math category of Yahoo Answers since 2008<br>- Method: manually annotate 6K problems, then train a classifier to extract the gold answers<br>- I'm not sure the quality is amazing there given the amount of automatic extraction (high) vs manual verification (low) |
| Draw-1K | Online sources compilation | 2016 | 1K | General algebra word problems extracted from online sources | Paper is about evaluating solvers (systems generating equations from math word problems), and tests template and equation equivalence | Paper | Source | - Labels each problem with the template it follows, which can be useful for contamination<br>- Source: algebra.com |
| FinQA | Expert-annotated financial reports | 2021 | 1.1K | Financial questions linked to tables from earnings reports. Annotators provide a question + step-by-step process + annotations for each page. | Paper introduces the dataset plus a method made of a retriever, which extracts relevant facts first to accommodate short-context models, then a process generator | Paper | HuggingFace | - Likely high quality: used paid expert annotators + external expert annotators had high agreement on the task<br>- Total set is 8.2K<br>- Unsure from the paper how the tables are formatted (markdown maybe?)<br>- Data source: earnings reports of S&P 500 companies (1999-2019) |
| FrontierMath | Expert-created dataset | 2024 | 100+ (precise number unknown) | Entirely novel math problems created for the paper across most math domains. Solutions are either integers or SymPy objects, automatically verifiable through unit-test-like Python programs. Problems are labelled. | Introduces the dataset | Paper | Private | - Nice discussion of contamination in the paper<br>- Experts: 60 mathematicians over 12 countries<br>- All problems are peer reviewed<br>- Probably the highest quality dataset here atm<br>- Data is not public; however, since closed source models have been evaluated, it's likely they'll be contaminated for it in future occurrences :( |
| GAOKAO-Bench (MathCloze, MathQA) | Exam dataset | 2023 | ~500 | Math word problems at high school level from the Chinese college entry exams | | Paper | Source, HuggingFace and HuggingFace | - Datasets are only 2023<br>- Mathematical formulas are converted to LaTeX<br>- Problems are in Chinese<br>- Dataset is updated yearly<br>- Paper explores a bunch of grading methods, including LLM-as-judge<br>- Paper contains surprisingly little info about the dataset |
| GAOKAO 2023 (MathEn) | Exam dataset, competition dataset | 2023 | 385 | Math word problems at high school level | Compiles questions from the 2023 Chinese National College Entrance Examination, the 2023 American Mathematics Competitions, and the 2023 American College Testing | | HuggingFace | |
| GSM1K | Manually created dataset in the style of another dataset | 2024 | 1.2K | Diverse "grade school"-like math word problems, following the solving distribution of GSM8K | Paper does a contamination analysis of models on GSM8K vs GSM1K | Paper | Private | - Paper also seems to suggest that perplexity analysis is not very good at detecting contamination |
| GSM8K | Manually created dataset in the style of an exam dataset | 2021 | 8.5K | Diverse grade school-level math word problems | Paper is about training verifiers to solve math word problems | Paper | Github, Hugging Face | - Best results with an external calculator added<br>- All answers are positive integers; 50% of answers are between 0 and 8<br>- Annotation used Upwork for 1K problems, then Scale for the next. Problem writers were provided with seed questions from a 175B GPT-3 model |
| iGSM (med and hard sets) | Synthetic dataset | 2024 | 20K | Problems are generated using a combination of dependency graphs between objects and categories (with direct and implicit dependencies) and a number of operations, to generate new problems | Paper is about studying actual math reasoning on an extension of GSM8K, including probing internal model states | Paper | HuggingFace | - The idea is theoretically nice, but the problems generated are very unrealistic at high numbers of operations<br>- Paper focuses on the "mental process" of the model, which I find dubious (though the probing section is nice!)<br>- So much anthropomorphism -_- |
| GSMHard | Adaptation of existing dataset, numbers replaced | 2022 | 8.5K | GSM8K with bigger/less common numbers to make the problems harder. However, the change was done automatically through generated programs, and only 25 changes were checked (+ 50 cases were done manually). | Paper is about using program-aided LMs (= generating CoT alternating equations and reasoning steps, and computing the solution of the last equation with Python) | Paper | Hugging Face | - Described in appendix H1<br>- Good idea, but not sure of the quality |
| GSM-IC | Adaptation of existing dataset with perturbations | 2023 | 58K | 100 samples from GSM8K with irrelevant context added (using a template for the irrelevant sentence, plus role/number fillers) | Paper tests how sensitive LLMs are to irrelevant context when reasoning on math tasks | Paper | HuggingFace | |
| GSM-Plus | Adaptation of existing dataset with perturbations | 2024 | 10K | GSM8K with 8 variations per question, added by GPT-4 and manually annotated by selected humans (cross-annotator agreement checked) | Paper introduces the dataset and compares results on several GSM8K variants across models and prompting formats | Paper | HuggingFace | - Changes include: replacing numbers with other numbers, changing the operations, changing the question, adding distractors, etc. (nice typology of changes; I feel it could be extended) |
| GSM-Symbolic | Adaptation of existing dataset, templated | 2024 | 8.5K | Templated GSM8K problems, which allow generating new evals at will (see the sketch after this table) | Paper creates parsable templates from GSM8K to be able to generate new problems at will and analyse contamination on GSM8K | Paper | To be released | - Contains other specific subsets (M1, P1, P2, which are difficulty levels, as well as NoOp, with seemingly relevant but actually irrelevant info added), and some experiments are done with few-shot formats<br>- Lacking a dataset description table with all subsets imo |
| Hungarian High School National Finals Exam | Exam dataset | 2023 | 33 | Problems from the 2023 Hungarian national high school finals in math | | Source | HuggingFace | - Requires grading by hand atm |
| HMWP | Exam dataset | 2020 | 5.4K | Annotated math word problems from a Chinese school-level (K-12) problem bank | Introduces a new formalism to represent MWP equations uniformly | Paper | HuggingFace | - Dataset is in Chinese<br>- Sources: Chinese K-12 problems |
| Math23K | Online sources compilation | 2017 | 23K | Automatically extracted elementary school-level math word problems | Introduces an RNN to solve MWPs | Paper | HuggingFace | - Sources: Chinese math word problems from online education websites for elementary school students<br>- Dataset is in Chinese<br>- Extraction is rule-based, but it's very unclear how much manual validation was done |
| Math401-LLM | Synthetic dataset | 2023 | 401 | Arithmetic expressions combining additions, subtractions, multiplications, exponentiations, logarithms, etc. | Paper wants to measure the strict arithmetic ability of models | Paper | Github | - Models are not that good atm at log/trig problems or big numbers |
| MATH | Olympiad datasets | 2021 | 12.5K | Mathematical problems from real competitions in natural language and LaTeX, annotated with difficulty levels | | Paper | HuggingFace | - Sources: AMC 10, AMC 12, AIME, "and more"<br>- Also introduces a train set created from scraping Khan Academy, and AMPS |
| MathOdyssey | | | | | | | | |
| MathQA | Adaptation of existing dataset, annotated | 2019 | 37K | Annotated solvable problems from the AQuA dataset with formal annotation programs (using humans as annotators and testing their agreement) | Aims to introduce a representation language for math problems; applies the method to AQuA | Paper | HuggingFace | - Sources: AQuA |
| MAWPS | Existing dataset compilation | 2016 | 3.3K | Math word problems from existing datasets | Framework to create new math problems, notably to remove lexical or template overlap when adding new datasets | Paper | Github | - Sources: ALG514, ALGES, and other pre-LLM datasets |
| MiniF2F | Olympiad dataset | 2022 | 244 | Olympiad math word problems formalized with theorem provers when possible (Lean, Metamath, Isabelle) | Paper is about testing math proof solvers' ability to reason on formal logic | Paper | Possibly HuggingFace | - Sources: AIME, AMC, IMO |
| NPHardEval | Synthetic dataset | 2023 | 900 | Complexity math word problems of varied difficulty levels, built from synthetic graph/linear data | Paper introduces the benchmark and uses it to evaluate the reasoning ability of models. Also explores benchmark robustness! | Paper | Github | - Problems: sorted array search, edit distance, shortest path, traveling salesman, graph coloring, knapsack problem, meeting scheduling problem<br>- Can be regenerated as needed |
| NuminaMATH CoT | Existing dataset compilation | 2024 | 860K | Math word problems (K-12 + olympiad levels) combining existing datasets | NA | NA | HuggingFace | - Sources: AoPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, synthetic AMC and MATH data, and other olympiad sets<br>- Careful if you use this as a train set, as you will be contaminated on all major math benchmarks |
| NuminaMATH TiR | Existing dataset compilation | 2024 | 72K | Subset of NuminaMATH CoT focused on problems solvable with tool-integrated reasoning | NA | NA | HuggingFace | - Sources: AoPS, AMC, AIME, CN-K12, GSM8K, MATH, ORCA_math, synthetic AMC and MATH data, and other olympiad sets<br>- Careful if you use this as a train set, as you will be contaminated on all major math benchmarks |
| OmniMath | Olympiad datasets | 2024 | 2.2K | Olympiad math word problems, extracted from forums or olympiad websites (using rule-based extraction + LLM rephrasing), then annotated and verified by humans | Paper introduces the benchmark and a judge trained to evaluate the answers (since they are free-form) | Paper | HuggingFace | - Sources: IMO, IMC, AoPS forum and wiki<br>- Domain labeling is done with LLMs |
| OlympiadBench | Olympiad datasets | 2024 | 8.4K | Olympiad math/physics word problems. Answers are automatically evaluated: either numbers or equations (evaluated with SymPy). | | Paper | | - Sources: global mathematics and physics olympiad problems, regional and national Chinese math competitions, and Gaokao mock questions for mathematics and physics<br>- Includes a physics subset<br>- VLM evaluation! |
| OlympicArena | Olympiad datasets | 2024 | 11K | | | Paper | | |
| PRM800K | Synthetic data | 2023 | 800K | Preference data from annotators on 800K solutions generated by a model | Paper introduces process supervision to improve reward models (compares outcome and process supervision) | Paper | HuggingFace | - More a train set than an evaluation |
| SVAMP | Adaptation of existing dataset | 2021 | 1K | One-unknown arithmetic word problems up to grade 4, created by experts applying variations to ASDiv-A | Paper wants to assess question sensitivity, reasoning ability, and structure invariance in models for math evaluations | Paper | Github | - Variations: same object & different structure, opposite, both different, adding relevant or irrelevant information, changing information, inverting operations, changing the order of sentences or objects |
| TabMWP | Online source adaptation | 2022 | 38K | Tabular math word problems requiring multi-hop reasoning, extracted from an online educational website and manually annotated | Paper wants to test tabular math reasoning | Paper | HuggingFace | - Source: IXL learning website<br>- Tabular data is provided as an image, semi-structured text, and a table<br>- Answers are generative or MCQA<br>- Dataset is tested against turkers |
| TAL-SCQ5K-En | Competition dataset | 2023 | 4K | Math word problems in MCQA format, with math expressions in LaTeX | NA | None | HuggingFace | - Contains English and Chinese<br>- Also contains 6K train samples and CoT |
| TemplateGSM | LLM-generated data | 2024 | 7M | GPT-4-generated math word problems inspired in shape by GSM8K | Paper uses GPT-4-generated meta-templates to generate problems by changing parameters. Uses a verifier to ensure usability. | Paper | HuggingFace | - Since everything is LLM-generated, I would expect stronger proofs of quality |
| TheoremQA | Online sources adaptation | 2023 | 800 | QAs about university-level theorems | Protocol: uses GPT-4 to enumerate subfields of relevant domains, then plausible theorem lists, then uses domain experts to actually look for said theorems, then looks for QAs on the web concerning them | Paper | HuggingFace | |
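
Several entries above (DeepMind Math, GSM-Symbolic, TemplateGSM) rely on templated generation, which lets you produce fresh, decontaminated instances at will. Here is a minimal sketch of the idea, with a made-up template rather than one taken from those papers:

```python
import random

# Hypothetical GSM-Symbolic-style template: placeholders are sampled, and the
# gold answer is computed from the same values, so every generated instance
# stays internally consistent. Naive pluralization is fine for a sketch.
TEMPLATE = (
    "{name} buys {n} {item}s at {price} dollars each. "
    "How much does {name} spend in total?"
)

def generate_problem(rng: random.Random) -> dict:
    values = {
        "name": rng.choice(["Ava", "Noah", "Mia", "Liam"]),
        "item": rng.choice(["pencil", "apple", "ticket"]),
        "n": rng.randint(2, 12),
        "price": rng.randint(1, 20),
    }
    return {
        "question": TEMPLATE.format(**values),
        "answer": values["n"] * values["price"],
    }

rng = random.Random(0)
print(generate_problem(rng))  # a fresh question/answer pair per call
```

Changing the seed (or just drawing more samples) yields a new eval set with the same solving distribution.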

## Pre-LLM datasets

| Evaluation name | Task type | Task data | Task content | Source | Dataset | Comments |
|---|---|---|---|---|---|---|
| DeepFix | Code task, Code-to-code, Correction | 7K student-written erroneous C programs | Correct the C programs | Paper | | |
| MLSum | Generation, Multilingual, Summarization | 1.5M news summary/article pairs from the Daily Mail, Le Monde, Süddeutsche Zeitung, El País, Moskovskij Komsomolets and Internet Haber (en, fr, de, es, ru, tr) | Summarize the articles | Paper | Hugging Face | PaLM: prefixed with a prompt, article truncated to 2048 tokens |
| TransCoder | Code task, Code-to-code | 852 parallel functions in Python/Java/C++ | Translate from one language to another | Paper | From paper | |
| WMT | Multilingual, Translation | Datasets from the WMT conference on machine translation; available datasets depend on the year | Translate from one language to another | Conference | | Replace the 2 digits by the conference year |
| Adversarial NLI | Language Inference | 10K entailment examples generated using human-in-the-loop adversarial attacks, looking for predicates which force models to predict wrong entailment labels (uses contexts from StoryCloze, CommonCrawl, Wikipedia, the Open Annotated National Corpus, WikiHow and GLUE) | Predict entailment | Paper | Data, Github | R1 to R3 = rounds of data generation |
| APPS | Text-to-code | 10K Python coding problems in natural language, scraped from leetcode-like sites, with a suite of test cases | Solve the Python problem | Paper | Github, Data | |
| AQuA | Arithmetic, Reasoning | 100K multiple choice problems (GMAT, GRE, other sources) with question/options/rationale | Select the correct MCQA answer | Paper | Github | Best results obtained with an external calculator added |
| ARC | Common Sense, Reasoning | 8K grade school science questions: e = easy set, c = challenge set | Select the correct MCQA answer | Paper | Data | Careful, this is the AI2 Reasoning Challenge, not the Abstraction and Reasoning Corpus |
| bAbI | Reasoning | 20 tasks, each with 2K automatically generated questions + short scenarios (successive actions generated with a simulated text adventure game) | Reason over the sentences to select the correct conclusion | Paper | Github, Data | See Part 4 for the simulation env and its constraints; it's quite a fun idea. Probably not too hard to reproduce for other types of reasoning. |
| BBQ | Bias detection | 58K examples with two contexts (ambiguous and explicit about a bias), two questions (negative and non-negative) and possible answers, constructed from manual templates and checked with crowdsourcing | Predict the correct, non-biased answer. The difference between accuracies depending on context/question allows building a bias score. | Paper | Github | |
| BLiMP | Language Understanding | 67 datasets of 1K artificially generated minimal pairs each, testing syntactic, morphological and semantic knowledge, validated with MTurking | Accuracy is measured by checking whether the log-probability the model assigns to the correct sentence is higher (see the scoring sketch after this table) | Paper | Github | Things tested: anaphor agreement, argument structure, binding, control/raising, determiner-noun agreement, ellipsis, filler-gap, irregular forms, island effects, NPI licensing, quantifiers, subject-verb agreement |
| BOLD | Generation, Toxicity detection | 23K prompts extracted from beginnings of Wikipedia sentences containing a race/religion/political/gender/profession group member (ex: "woman artist" for gender=female) | Task: generate the end of the sentence; toxicity is evaluated through a range of metrics (sentiment analysis, classifiers, …). In HELM, toxicity is measured using the Perspective API. | Paper | Github | |
| BooksCorpus | N/A | 11K unpublished books of more than 20K words scraped from the web (16 different genres) | In the original paper, it's used to train a sentence embedding model; in model papers, it's often used for contamination or perplexity evaluations | Paper | Hugging Face | |
| BooksCorpus_HELM | Generation, Memorisation | 1K randomly sampled books from BooksCorpus | Task: from a random number of tokens beginning a paragraph, the model must generate a follow-up; measures exact and near-exact reproduction | Paper | Data | |
| BoolQ | Language Inference, Language Understanding | 16K naturally occurring yes/no questions, with question + context from Wikipedia | Answer the MCQA | Paper | Website | |
| CB | Language Understanding | 1.2K pieces of discourse (news from the Wall Street Journal, fiction from the British National Corpus, dialogue from Switchboard), containing context + target sentence | Predict commitment entailment | Paper | Website | |
| Civil Comments | Toxicity detection | 1.8M online comments, with crowdsourced human labels for toxicity following the Perspective API guidelines; among these, 450K are labeled with identity terms (crowdsourced, picked from a list) | Task: toxicity prediction; labels are used to identify areas of bias in models | Paper | Kaggle, Hugging Face | The original paper contained a synthetic test set (77K examples generated from templates using 50 identity terms, 50/50 on toxicity) and a human-labeled dataset (described in the Task data column); I suppose the dataset available is the latter |
| Clean E2E NLG | Description, Generation | 50K crowdsourced descriptions of restaurants generated from keys and values (type of food = X, budget = Y, …) | | Paper | Hugging Face | PaLM: prefixed with a prompt, article truncated to 2048 tokens |
| CNN/DailyMail | Cloze/Completion, Summarization | Original dataset: 200K news documents (CNN/Daily Mail, between 2007 and 2015) converted into cloze format by removing some of the text's named entities and using them as keys. In HELM: uses the above documents (in complete form) as text to summarize, and their highlights as gold summaries. | | Paper | Hugging Face, Data | (I suspect this does not produce very good summaries though) |
| CommonsenseQA | Common Sense, Reasoning | 12K turked Q/A (initialized from ConceptNet associations), then filtered by quality, with added context from a Google Search query (some text likely overlaps with CC data) | Answer the MCQA | Paper | | Best results with an external calculator added |
| Contrast Sets | Generation, Robustness | 10 contrast sets of up to 1K examples each, built for existing datasets (see comments) by (often the original paper's) researchers (increasing reasoning steps in questions, replacing words by their opposites, changing counts, …) | Follow the original task setup with the new examples, and see how/if performance drops. In HELM: uses the IMDb and DROP contrast sets. | Paper | Data | Datasets: NLVR2, IMDb sentiment analysis, MATRES Temporal RE, English UD parsing, PERSPECTRUM, DROP, Quoref, ROPES, BoolQ, MC-TACO. Dataset construction is way more detailed in the appendix. |
| COPA | Common Sense, Language Understanding | 1K premises + causal questions with alternatives (common sense) | | Paper | Website | |
| CoQA | In-Context Reading Comprehension | 127K conversational QA pairs written by annotators, from a given context (a rationale must be provided too) | | Paper | v1.0 from Data | |
| DataImputation | Real life task, Reasoning, Structured data | 8 structured datasets from several sources | Task: from a row of attributes with gaps, the model must fill the gaps (ex: extrapolating city from phone number, phone brand from its specs) | Paper | Data restaurant, Data Buy | See table 2 for all sources. In HELM: uses the Buy and Restaurant subsets, converts input to natural language, tests accuracy. |
| Digits arithmetic (2D+, 2D-, 3D+, 3D-, …) | Arithmetic | Basic arithmetic tasks for n-digit addition, subtraction, and composite operations, with 2K examples each | Task: solve the math | Paper | Github | All links come from lm-evaluation-harness/lm_eval/datasets/arithmetic |
| DROP | Arithmetic, In-Context Reading Comprehension | 55K adversarial questions which require 1) selecting relevant items from the text and 2) computing on them (sorting/counting/…) to get the correct answer | Task: select and count to provide the correct answer | Paper | Data | |
| Dyck language_HELM | Symbolic manipulation | 500 D_n words of between 52 and 100 characters ("words" made of nested brackets/parentheses) where the last i characters have been removed | Task: predict the unique sequence of closing parentheses | Paper | Github | Also has a different version in BigBench |
| HellaSwag | Cloze/Completion | 60K adversarially filtered multiple choice Q/A | Choose the correct next sentence (from captions or WikiHow) | Paper | Github | |
| HumanEval | Code task, Text-to-code | 164 hand-written programming problems with function signature, docstring, body + unit tests | Aim is to complete the function to pass the unit tests | Paper | Hugging Face | |
| IMDB | Sentiment Analysis | 50K reviews from IMDb, evenly split between positive (score ≥ 7) and negative (score ≤ 4) reviews (no neutral) | Classify positive/negative reviews | Paper | Website | |
| LAMBADA | Cloze/Completion | 10K narrative contexts (from the BookCorpus) followed by a sentence where the last word is masked and must be predicted. Specifically built to force use of the context. | Predict the last word | Paper | Zenodo | |
| Language Modeling Evaluation_HELM | Language Modeling | Compilation in HELM of several datasets: WikiText-103, The Pile (particularly arXiv, BooksCorpus2, Enron Emails, PubMed Central, Wikipedia), TwitterAAE, ICE | Task: get the conditional log-probability of the full sequence (perplexity measure) | Paper | The Pile website, BLiMP Github, WikiText data, TwitterAAE data, ICE data | |
| LegalSupport | Entailment, Real life task, Reasoning | 20K legal entailment scenarios, constructed from state/federal legal opinions (the assertion is used as context, and 2 supporting sources ("see X, rule") are selected at random) | Task: find which rule best supports the assertion | Paper | Data | |
| LinuxKernel_HELM | Generation, Memorisation | 2K randomly sampled functions from the Linux kernel | Task: from a random number of lines beginning a function, the model must generate a follow-up; measures exact and near-exact reproduction | Paper | Data | |
| LSAT | Analytical Reasoning, In-Context Reading Comprehension, Logical Reasoning | 10K questions from the Law School Admission Test (analytical, logical reasoning, and reading comprehension), with context | Answer the MCQA correctly | Paper | Github | |
| Magellan Benchmark | Real life task, Reasoning, Structured data | 23 datasets from several sources containing entities with attributes. Dirty datasets contain deliberate errors, such as attributes being in the wrong column, misspellings, etc. | Task: given two entities from two different tables, determine whether they are the same or not | Paper | Github | It's likely that Abt-Buy and Buy are the same dataset |
| MBPP | Code task, Text-to-code | 1K entry-level, crowdsourced Python programming problems (description, solution, 3 unit test cases); 58% mathematical, 43% list processing, 19% string processing, 9% integer sequences, and 2% other | Solve the Python problem | Paper | Github, Hugging Face | Also contains an edited version (400 items) with unambiguous prompts and good signatures (can be interesting to look at later to see the impact of prompts on code gen) + a MathQA-Python dataset (adaptation of the MathQA dataset) |
| MMLU | Language Understanding | 15K multiple-choice Q/A manually collected from various online sources, on many topics (legal, philosophy, economics, psychology, STEM, medicine, etc., at high school to professional level) | Answer the MCQA | Paper | Hugging Face, Github | Seems like a strong/high-quality baseline |
| MRF (Misinfo Reaction Frames) | Generation, Misinformation capabilities | 200K pairs of claims from news headlines (climate change, COVID-19, cancer; detailed sources in comments) + labels (real/misinformation), annotated on veracity, likelihood of spread, and writer intent by MTurk workers | Task: must either predict the gold label or generate likely writer intent/reader perception/… | Paper | Github | Contains data from NELA-GT-2018-2020, SciDCC, Climate-FEVER, CoAID, CoronaVirusFacts/DatosCoronaVirusAlliance Database, ESOC Covid-19 Misinformation Dataset, DETERRENT |
| MS MARCO | Question Answering, Retrieval | 1M anonymised questions with free-form, human-generated answers (from relevant web document extracts), some with added rewriting. The original paper contains 3 tasks: 1) generate the correct answer, if possible; 2) same, but the answer should make sense even without context; 3) rank 1000 passages by how relevant they are to the question. | | Paper | Github | Contains extended descriptions of QA datasets in the lit review. In HELM, only the ranking task is looked at, and relevance is estimated by looking at the log-likelihood of the prediction when asking "Does the passage answer the query?" |
| MS MARCO TREC, aka TREC 2019 | Retrieval | Datasets derived from MS MARCO, edited for either passage or document retrieval tasks, doing either full retrieval or top-n reranking (100 for documents, 1000 for passages) | (see MS MARCO) | Paper | Data, Github | |
| MultiRC | Language Understanding, Question Answering | 6K multiple choice questions over a diversity of topics | | Paper | Data | |
| NarrativeQA | Question Answering, Retrieval | 47K free-form, human-generated questions and answers, linked to 1.5K books (Project Gutenberg) and movie scripts (scraped) matched with plot summaries | Task: from the summary or the story, answer or select the correct answer | Paper | Github | For long-range context testing, we could use this dataset to do QA from the full stories. Could be interesting for anything conversational imo. |
| Natural Questions | Open domain/Closed book, Question Answering | 207K aggregated Google Search queries + annotated Wikipedia sample answers | | Paper | Data | |
| NewsQA | Question Answering | 100K human-generated QA pairs from 12K news articles (CNN). Questions were generated from title + summary, answers from question + article, then kept through a validation mechanism. | | Paper | Github | Likely intersects with CNN/DailyMail, as the data extraction script was the same |
| OpenBookQA | Common Sense, Reasoning | 6K sentences of science reasoning, needing common sense knowledge to extrapolate to new situations | | Paper | Data | |
| PIQA | Common Sense, Reasoning | 20K physical common sense reasoning situations | Select the correct action to take given a context and answers | Paper | Data | |
| PopularBooksCorpus_HELM | Generation, Memorisation | 20 books from BooksCorpus which appear in a list of bestsellers | Task: from a random number of tokens beginning the first paragraph of the book, the model must generate a follow-up; measures exact and near-exact reproduction | Paper | Data | |
| QuAC | In-Context Reading Comprehension | 100K questions in information-seeking QA contexts (used Wikipedia to generate the dataset) | | Paper | Data | |
| RACE | In-Context Reading Comprehension | 100K questions from English reading comprehension exams for Chinese middle/high school students | | Paper | Data | |
| RAFT | Real life task, Text classification | Compilation of 11 datasets of naturally occurring classification tasks, of between 150 and 5K test items each (domains: medical, finance, research, English language, law, physics, AI safety, social networks) | Task: few-shot, from 50 labeled examples, provide meaningful labels | Paper | Hugging Face | Corpus: ADE Corpus v2, Banking77, NeurIPS 2020 impact statement risks, OneStopEnglish, Overruling, Systematic review inclusion, TAI safety research, Terms of Service, TweetEval Hate, Twitter complaints, + Semiconductor org types, created for this |
| RealToxicityPrompts | Generation, Toxicity detection | 100K naturally occurring sentences (selected from the OpenWebText corpus, basically = Reddit, and scored for toxicity with the Perspective API), split in two to create a prompt and a continuation | Task: generate the continuation from the sentence start; toxicity is then evaluated with the Perspective API | Paper | Data, Github (the repo lacks a lot of info) | |
| ReCoRD | Language Understanding | 120K passage/cloze query/answer examples from news (CNN, Daily Mail) with human filtering | | Paper | Data | |
| RTE | Language Understanding | 3K-example compilation of competition data on entailment | | Paper | Data | |
| SAT analogies | Language Understanding | 374 SAT analogy problems from prior to 2005 ("a is to b as c is to ?" multiple choice questions; words are not the most frequent) | | Paper | Data dev, Data test | |
| SIQA | Question Answering | | | | | |
| SQuADv2 | In-Context Reading Comprehension | Combines SQuAD with 50K unanswerable questions | From a context, give an answer, but only if possible | Paper | Github | |
| StoryCloze | Cloze/Completion, Common Sense | 50K 5-sentence commonsense stories | Choose the correct ending | Paper | Hugging Face | |
| StrategyQA | Common Sense, Reasoning | 2.8K questions needing reasoning from implicit knowledge | | Paper | | Best results with an external calculator added |
| Synthetic reasoning (natural) | Logical Reasoning, Reasoning | Synthetic data generated on the fly, containing a set of synthetic rules (conditional statements), facts (attributes), and the logical gold output | | Paper | Can be generated with Github | Also called rule_induct in HELM |
| Synthetic reasoning (symbolic)_HELM | Logical Reasoning, Symbolic manipulation | Synthetic data generated on the fly using a pattern template. Either tests whether the model is able to identify patterns ("beach + beach - pear" has "A + A - B" as its pattern) or whether the model can substitute strings in a given pattern. | | Paper | Can be generated with Github | |
| TriviaQA | Open domain/Closed book, Question Answering | 95K trivia QA pairs (compositional questions, syntactic variability) | | Paper | Data | |
| TruthfulQA | Question Answering | 817 questions about tricky factual claims (common misconceptions, falsehoods, …) over 38 categories, with true and false reference answers + a source to support true answers (+ 380 added questions) | | Paper | Github | |
| TyDiQA-GoldP | Multilingual, Question Answering | 204K multilingual QA pairs (unconstrained question elicitation from prompts, then Wikipedia article retrieval, and specific answer selection in the article if possible) (en, ar, bn, fi, id, ja, ko, ru, te, th, and Kiswahili) | MCQA | Paper | Github | The dataset generated can present interesting underspecification of questions and mismatches between question and answer language levels. Might be harder than other datasets. |
| Web Questions | Open domain/Closed book, Question Answering | 100K "Wh-" questions extracted from the Google Search API, then annotated by MTurkers (I suspect answers are partially out of date) | MCQA | Paper | Website | |
| WebNLG | Generation, Verbalization | 13K mappings between triples (subject, property, object, constructed from DBPedia, which is a KB from Wikipedia) and sentence verbalizations (by crowd workers), about specific topics (astronauts, universities, monuments, buildings, characters from comics, food, airports, sports teams, written works) | Task: verbalize in a grammatical way | Paper | Hugging Face | There was a sentence selection for fluency and the sentences generated are relatively simple, but there is no description of the annotators'/crowdworkers' origins, so maybe some data is not in "standard English" |
| WiC | Language Understanding | 7K examples: classify whether a word occurring in two different contexts has the same meaning or not | | Paper | Site | |
| WikiFact_HELM | Cloze/Completion | 12 domains with 1K triples each (subject, relation, object) sampled from Wikipedia and cleaned | Task: predict the missing item in the sentence made from the relation | Paper | Codalab, Github | |
| WikiLingua | Generation, Multilingual, Summarization | 43K article/summary pairs constructed from WikiHow in 18 languages (on the site, articles are written with a summary sentence + a detailed paragraph per step; in the dataset, summaries are the concatenation of the summary sentences, and articles of the detailed paragraphs) | Summarization | Paper | Github | PaLM: prefixed with a prompt, article truncated to 2048 tokens. I suspect data creation can lead to very "robotic" language for the summary baseline, which could under-score more fluid summaries (though ROUGE shouldn't be too prone to that). |
| Winogender | Bias detection | | | | | |
| Winograd | Reasoning, Winograd | 273 to 285 examples where one must disambiguate who/what a pronoun refers to, in sentences specially constructed to be ambiguous to statistics but not to humans | Disambiguation of the pronoun | Paper | Website | Not sure if GPT-3 was evaluated on this one or the SuperGLUE one |
| WinoGrande | Reasoning, Winograd | 43K sentences, adversarial Winograd | | Paper | Website | |
| WSC | Language Understanding, Winograd | Winograd Schema Challenge (see Winograd) | | Paper | Website | |
| XSUM | Summarization | 226K news articles (BBC, 2010 to 2017) matched with their single-sentence summary, which comes from the article (domains: news, politics, sports, weather, business, technology, science, health, family, education, entertainment and arts) | Task: summarize | Paper | Github | |
| XSum | Generation, Summarization | 226K news summary/article pairs from the BBC (2010-2017) extracted from the WayBack Machine | | Paper | Hugging Face | Could be interesting to manually check whether the model's recent knowledge creates discrepancies in the summaries of old news |
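
Several of the datasets above (BLiMP, LAMBADA, the HELM perplexity suites) are scored from the log-probabilities a model assigns to full sequences rather than from generated text. Here is a minimal sketch of BLiMP-style minimal-pair scoring with the `transformers` library; the model choice and example pair are mine:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; gpt2 is just a small example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of the token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position predicts the next token: shift logits left, targets right.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:].unsqueeze(-1)
    return logprobs.gather(-1, targets).sum().item()

good = "The cats sleep on the sofa."
bad = "The cats sleeps on the sofa."
# The pair counts as correct if the grammatical sentence scores higher.
print(sequence_logprob(good) > sequence_logprob(bad))
```

Accuracy over a BLiMP subset is then just the fraction of pairs where the grammatical sentence wins.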

## Dataset ideas to manually reproduce

| Evaluation name | Task type | Task content | Source | Dataset | Comments |
|---|---|---|---|---|---|
| ✍️ GSM8K-Python | Code task, Text-to-code | Python version of the GSM8K dataset (8.5K grade school math problems) | Paper | N/A | |
| ✍️ MRF | Generation, Manual evaluation, Misinformation capabilities | 250 headlines extracted from the MRF dataset, grouped in 80 clusters by thesis. Task: from the thesis + 5 headlines, the model must generate plausible headlines which support the thesis. Annotators evaluate whether 1) the headline supports the thesis and 2) it looks real. | Paper | Data | See report page 6 for a detailed explanation of the original process, plus sections 8.5.2, E.5, and 5.5 in the HELM paper |
| ✍️ News article generation | Generation | Generate 25 articles from titles and subtitles; 80 humans had to classify whether each article was generated or original | Paper | | |
| ✍️ Numeracy Prediction | Symbolic manipulation | "requires the model to perform symbolic regression given a few examples, and apply the number relationship to a new input" | Paper | Github | |
| ✍️ SVG datasets | | Could construct an SVG dataset to see if models can indeed generate or interpret SVG drawings | Twitter thread | | |
| ✍️ Theory of mind datasets | | Could likely be easy to generate | Paper | | |
| ✍️ Wedging prompts | Generation, Manual evaluation, Misinformation capabilities | 11 prompts with specific intents (ex: influence voting behaviors, target specific groups by generating pro/anti-X rhetoric) augmented with 3 examples. Task: generate follow-up examples. | Paper | Data | In HELM: uses manual evaluation to determine whether the generated message 1) addresses the targeted group; 2) supports the desired message; 3) is divisive |
| ✍️ Word scrambling | Symbolic manipulation | 10K examples for each of 5 character manipulation tasks (words with cycled letters, anagrams, random insertions, reversed words). The model needs to recover the original word. | Paper | | Easy to generate/automate, see Section 3.9.2 of the GPT-3 paper and the sketch below |
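
For instance, the word scrambling tasks are straightforward to regenerate. Below is a rough sketch of three of the five GPT-3 manipulations (the anagram variants, which keep the first and/or last letters in place, are left out); the function names and example word are mine:

```python
import random

def cycle_letters(word: str, rng: random.Random) -> str:
    """Rotate the word by a random, non-zero offset (e.g. 'itablyinev')."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def random_insertion(word: str, rng: random.Random) -> str:
    """Insert a random punctuation/space character after each letter."""
    noise = " .,!?;"
    return "".join(c + rng.choice(noise) for c in word)

def reversed_word(word: str, rng: random.Random) -> str:
    """Spell the word backwards (rng unused; kept for a uniform signature)."""
    return word[::-1]

rng = random.Random(0)
for scramble in (cycle_letters, random_insertion, reversed_word):
    print(scramble.__name__, "->", scramble("inevitably", rng))
```

The eval then asks the model to recover the original word from the scrambled form, scored with exact match.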