RepLiQA is an evaluation dataset that contains Context-Question-Answer triplets, where the contexts are non-factual but natural-looking documents about made-up entities, such as people or places that do not exist in reality. RepLiQA is human-created and designed to test the ability of Large Language Models (LLMs) to find and use contextual information in provided documents. Unlike existing question-answering datasets, the non-factuality of RepLiQA ensures that model performance is not confounded by the ability of LLMs to memorize facts from their training data: one can test with more confidence the ability of a model to leverage the provided context.
Documents in RepLiQA comprise 17 topics or document categories: Company Policies; Cybersecurity News; Local Technology and Innovation; Local Environmental Issues; Regional Folklore and Myths; Local Politics and Governance; News Stories; Local Economy and Market; Local Education Systems; Local Arts and Culture; Local News; Small and Medium Enterprises; Incident Report; Regional Cuisine and Recipes; Neighborhood Stories; Local Sports and Activities; and Local Health and Wellness.

Non-factual documents are annotated in one of these topics, covering fantasy/made-up entities that are not documented anywhere. Each document is accompanied by 5 question-answer pairs.
Moreover, annotations in RepLiQA are such that approximately 20% of the questions cannot be answered from the provided documents, and models are expected to indicate that an answer cannot be obtained whenever that is the case.
RepLiQA is designed to support at least the following tasks:
- Question-Answering
- Topic Retrieval
- Selective Question-Answering (i.e., testing the ability to refuse to answer questions that cannot be answered from the provided context), as sketched below.
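To make the selective setting concrete, here is a minimal scoring sketch. It is not the official evaluation protocol: the exact-match rules for detecting refusals and correct answers are simplifying assumptions, and a real evaluation would use a proper QA metric.

```python
# Minimal sketch of selective question-answering scoring (not the official protocol).
# A prediction counts as a refusal when it matches the literal string "UNANSWERABLE";
# answerable questions are scored with a placeholder exact-match check.

def score_sample(prediction: str, reference: str) -> dict:
    ref_unanswerable = reference.strip() == "UNANSWERABLE"
    pred_unanswerable = prediction.strip().upper() == "UNANSWERABLE"
    return {
        "correct_refusal": ref_unanswerable and pred_unanswerable,
        "false_refusal": (not ref_unanswerable) and pred_unanswerable,
        "answer_match": (not ref_unanswerable)
        and prediction.strip().lower() == reference.strip().lower(),
    }
```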
- `document_id` (string): Uniquely identifies the document to which this sample pertains. Note that there are 5 questions per document, so each `document_id` appears 5 times in the dataset.
- `document_topic` (string): One of the 17 document topics/categories listed above.
- `document_path` (string): Relative path within this repository to the original PDF document.
- `document_extracted` (string): Text automatically extracted from the original PDF document.
- `question_id` (string): Uniquely identifies each document-question combination, and thus each data sample.
- `question` (string): Question that may or may not be answerable using the associated document.
- `answer` (string): Answer to the question when it can be answered using the document, and the string `UNANSWERABLE` otherwise.
- `long_answer` (string): The annotator who produced the `answer` was asked to copy-paste here the paragraph from the document in which they found the `answer`. This `long_answer` is provided as-is, with no check whether it is actually contained within the document. The `long_answer` is `NA` if and only if the `answer` is `UNANSWERABLE`.
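For reference, here is a minimal sketch of how these fields can be accessed with the Hugging Face `datasets` library. The dataset identifier `ServiceNow/repliqa` and the `repliqa_N` split names are assumptions; adjust them if your copy of the dataset is hosted differently.

```python
from datasets import load_dataset

# Load the first RepLiQA split (dataset id and split name assumed).
ds = load_dataset("ServiceNow/repliqa", split="repliqa_0")

sample = ds[0]
print(sample["document_id"], sample["document_topic"])
print(sample["question"])
# "UNANSWERABLE" marks questions that cannot be answered from the document;
# in that case, long_answer is NA.
print(sample["answer"])
```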
- Topic selection.
- Produce reference documents of approximately 1000 words. When creating fictitious characters, places and organizations, annotators used random name generators and anonymization tools to cross-reference them against existing entities, in order to avoid unintentional references.
- Automatic summarization of reference documents.
- Annotation of 5 specific and direct questions, based solely on the summary.
- Annotation of the associated answers, based on the full document and questions.
- Quality control: all samples were vetted, with a reported initial rejection rate of around 5-10%.
- Data splitting and further clean-up to remove leftover noisy content.
- Various irregularities have been observed, including code-like chunks (e.g., within angle `<>` or square `[]` brackets).
- Scoring RepLiQA documents with Fast-DetectGPT results in scores that are notably different from those of FineWeb.
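As an illustration only, a simple filter like the following could flag documents containing such code-like chunks before evaluation; the regular expression is a rough heuristic, not part of RepLiQA's released tooling.

```python
import re

# Rough heuristic: flag documents containing code-like chunks enclosed in
# angle or square brackets (e.g., "<tag>" or "[placeholder]").
CODE_LIKE = re.compile(r"<[^>\n]{1,80}>|\[[^\]\n]{1,80}\]")

def has_code_like_chunks(document_extracted: str) -> bool:
    return bool(CODE_LIKE.search(document_extracted))
```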
RepLiQA consists of five splits, to be released gradually over a year:
- `repliqa_0`: June 12th, 2024.
- `repliqa_1`: December 9th, 2024.
- `repliqa_2`: February 10th, 2025.
- `repliqa_3`: April 14th, 2025.
- `repliqa_4`: June 9th, 2025.
By construction, these splits should all be identically distributed. This gradual release schedule is meant to avoid leaking novel data partitions and to ensure that models being evaluated have not been trained on RepLiQA's contexts.
Comments and requests can be addressed in the discussions.
In total, five RepLiQA splits will be released. Because evaluating LLMs can be costly, some authors may prefer to evaluate on a subset of the released splits. We recommend the following choices of such subsets:
- (latest) If you evaluate on only one split, use the latest released split (preferred evaluation setting);
- (zeroth+latest) If you evaluate on two splits, use `repliqa_0` and the latest released split;
- (all) If you evaluate on more than two splits, use all released splits.
In general, please clearly specify which RepLiQA splits were used, and report results for each split separately.
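The sketch below illustrates one way to follow this recommendation with the Hugging Face `datasets` library: evaluate on each released split separately and keep per-split results. The dataset identifier and the `evaluate_model` helper are placeholders, not official tooling.

```python
from datasets import load_dataset

RELEASED_SPLITS = ["repliqa_0", "repliqa_1", "repliqa_2", "repliqa_3", "repliqa_4"]

def evaluate_model(dataset):
    # Placeholder: run your model on each question and compute your metric here.
    return {"accuracy": None}

# Report results for each split separately, as recommended above.
results = {}
for split in RELEASED_SPLITS:
    ds = load_dataset("ServiceNow/repliqa", split=split)  # dataset id assumed
    results[split] = evaluate_model(ds)
print(results)
```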
- Paper: João Monteiro, Pierre-André Noël, Étienne Marcotte, Sai Rajeswar, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, and Perouz Taslakian. RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content. arXiv preprint arXiv:2406.11811, 2024.
- Blogpost.
- RepLiQA Dataset
- Associated Code
Copyright © ServiceNow 2023-2024
Licensed under CC BY 4.0
Copyright © ServiceNow 2024
Licensed under MIT License