To assess long-context capabilities more comprehensively, we propose Needle-in-a-Haystack PLUS, which shifts the focus from simple fact retrieval to more challenging single-document and multi-document question answering tasks.
Our test data can be downloaded from NeedleInAHaystack-PLUS.
All data in NeedleInAHaystack-PLUS are standardized to the following formats:
```json
{
    "id": "The unique identifier for each test sample.",
    "context": "The long context (haystack) of the single-document question answering task.",
    "context_length": "The length of the haystack, ranging from 1,000 to 128,000 tokens at equal intervals, for 15 different lengths in total.",
    "depth_percent": "The position of the needle in the haystack.",
    "input": "The question of the single-document question answering task.",
    "dataset": "needle_squad",
    "answers": "A list of all true answers."
}
```
```json
{
    "id": "The unique identifier for each test sample.",
    "context": "The long context (haystack) of the multi-document question answering task.",
    "context_length": "The length of the haystack, ranging from 1,000 to 128,000 tokens at equal intervals, for 15 different lengths in total.",
    "depth_percent1": "The position of the first needle in the haystack.",
    "depth_percent2": "The position of the second needle in the haystack.",
    "input": "The question of the multi-document question answering task.",
    "dataset": "needle_hotpotqa",
    "answers": "A list of all true answers."
}
```
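As a rough illustration, the two formats above can be told apart purely by their field sets (single-document records carry `depth_percent`; multi-document records carry `depth_percent1`/`depth_percent2`). The sketch below is a minimal, hypothetical loader, it assumes the data is stored as JSON lines, and the file name and helper names are our own, not part of the release:

```python
import json

# Field sets taken from the two record formats described above.
SINGLE_DOC_FIELDS = {"id", "context", "context_length",
                     "depth_percent", "input", "dataset", "answers"}
MULTI_DOC_FIELDS = {"id", "context", "context_length",
                    "depth_percent1", "depth_percent2",
                    "input", "dataset", "answers"}

def record_kind(record: dict) -> str:
    """Classify a test record as single- or multi-document QA by its fields."""
    keys = set(record)
    if keys == SINGLE_DOC_FIELDS:
        return "single"
    if keys == MULTI_DOC_FIELDS:
        return "multi"
    raise ValueError(f"unrecognized record fields: {sorted(keys)}")

def load_records(path: str) -> list:
    """Load test records from a JSON-lines file (path is hypothetical)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

if __name__ == "__main__":
    # A toy record in the single-document format (values are made up).
    example = {
        "id": "squad_0001",
        "context": "…long haystack text…",
        "context_length": 1000,
        "depth_percent": 50,
        "input": "Who wrote the passage?",
        "dataset": "needle_squad",
        "answers": ["Example Answer"],
    }
    print(record_kind(example))  # prints: single
```

This keeps evaluation code agnostic to which task a record belongs to, dispatching on the schema rather than on the `dataset` string.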
API invocation dates:
- OpenAI's GPT-4-128K (run on 2024-01-31)
- Anthropic's Claude 2.1 (run on 2024-02-08)
NeedleInAHaystack-PLUS builds on datasets proposed by previous researchers, including NeedleInAHaystack, SQuAD, and HotpotQA.
```
@misc{zhao2024longagent,
      title={LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration},
      author={Jun Zhao and Can Zu and Hao Xu and Yi Lu and Wei He and Yiwen Ding and Tao Gui and Qi Zhang and Xuanjing Huang},
      year={2024},
      eprint={2402.11550},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```