Data contamination is a major concern when evaluating large language models (LLMs), since benchmark questions that leak into training data inflate leaderboard scores. The Gaokao (China's National College Entrance Examination) is prepared under strict confidentiality, and its questions are newly written each year to keep the exam fair. A freshly administered Gaokao is therefore a good source of evaluation questions with minimal risk of contamination.
In this repository, we provide English translations of the Gaokao 2024 mathematics questions (administered June 7, 2024), along with the responses generated by widely used LLMs.

Question | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Accuracy |
---|---|---|---|---|---|---|---|---|---|
Ground truth | A | C | D | A | B | B | C | B | |
GPT-4o (6/8/2024) | A | C | D | A | B | B | C | C | 87.50% |
Claude_3_Opus (6/8/2024) | A | C | D | A | B | B | A | B | 87.50% |
Gemini-Ultra-1.0 (6/8/2024) | A | A | D | A | B | NaN | C | B | 75.00% |
Llama_3_70b (6/8/2024) | A | NaN | D | NaN | B | A | C | B | 62.50% |

Questions 9-11 allow more than one correct option. For each of these questions, a model that selects exactly the correct options receives full credit. A model that selects only some of the correct options, and nothing incorrect, receives partial credit proportional to the share of correct options it selected. A model that selects any incorrect option receives a score of 0.

Question | 9 | 10 | 11 | Accuracy |
---|---|---|---|---|
Ground truth | BC | ACD | ABD | |
GPT-4o (6/8/2024) | BC | ACD | ABD | 100% |
Claude_3_Opus (6/8/2024) | BC | BC | AD | 55.6% |
Gemini-Ultra-1.0 (6/8/2024) | BC | ACD | AD | 55.6% |
Llama_3_70b (6/8/2024) | BCD | BCD | AD | 22.2% |
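
The partial-credit rule for questions 9-11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to produce the table: the function names `score_multi_select` and `section_accuracy` are hypothetical, "proportional" is read as (correct options selected) / (total correct options), and an empty response is assumed to score 0. Under that reading, the sketch reproduces the Claude_3_Opus and Llama_3_70b rows.

```python
from fractions import Fraction


def score_multi_select(selected: str, truth: str) -> Fraction:
    """Return the credit in [0, 1] for one multiple-answer question."""
    chosen, correct = set(selected), set(truth)
    if not chosen or chosen - correct:
        # Empty response (assumption) or at least one incorrect option: no credit.
        return Fraction(0)
    # Only correct options were selected: credit is the share of them found.
    return Fraction(len(chosen), len(correct))


def section_accuracy(responses, truths) -> float:
    """Average the per-question credit over the whole section."""
    total = sum(score_multi_select(r, t) for r, t in zip(responses, truths))
    return float(total / len(truths))


truth = ["BC", "ACD", "ABD"]
claude = ["BC", "BC", "AD"]     # full credit, 0 (B is incorrect), 2/3
llama = ["BCD", "BCD", "AD"]    # 0 (D is incorrect), 0 (B is incorrect), 2/3

print(f"{section_accuracy(claude, truth):.1%}")  # 55.6%
print(f"{section_accuracy(llama, truth):.1%}")   # 22.2%
```

Exact fractions are used for the per-question credit so that the 2/3 partial score is not rounded before averaging.
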

Fill-in-the-blank question | 12 | 13 | 14 | Accuracy |
---|---|---|---|---|
Ground truth | 3/2 | ln2 | 1/2 | |
GPT-4o (6/8/2024) | 3/2 | ln2 | 0.69 | 66.70% |
Claude_3_Opus (6/8/2024) | 1 | ln2 | 0.15 | 33.30% |
Gemini-Ultra-1.0 (6/8/2024) | 5/4 | ln2 | 1 | 33.30% |
Llama_3_70b (6/8/2024) | 4/a | 1-ln(3/2) | 11/12 | 0% |