CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation) is a benchmark of 800 Python functions and input-output pairs. The benchmark consists of two tasks, CRUXEval-I (input prediction) and CRUXEval-O (output prediction).
The benchmark was constructed in three steps. First, we used Code Llama 34B to generate a large set of functions and inputs, and obtained the outputs by executing each function on its inputs. Second, we filtered the set so that the benchmark contains only short problems with low computation and memory requirements: problems that a good human programmer should be able to solve without extra memory in about a minute. Third, we randomly selected 800 samples that pass the filter, so the benchmark is small enough to run easily but large enough to reliably show performance differences among models.
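To make the two tasks concrete, here is a minimal sketch of how a single prediction could be checked. The helper names and the toy sample are illustrative assumptions, not the repository's evaluation code. The sketch also assumes that the input field holds the argument string passed to f, and that any input producing the target output counts as correct for CRUXEval-I. It uses exec and eval without sandboxing for brevity.

```python
def _load_function(code):
    """Execute a sample's source and return the namespace that defines f."""
    namespace = {}
    exec(code, namespace)  # unsandboxed: only run code you trust
    return namespace


def check_output_prediction(code, input_str, predicted_output):
    """CRUXEval-O style check: does f(input) equal the predicted output?"""
    namespace = _load_function(code)
    actual = eval(f"f({input_str})", namespace)
    return actual == eval(predicted_output, namespace)


def check_input_prediction(code, predicted_call, output_str):
    """CRUXEval-I style check: does the predicted call produce the output?"""
    namespace = _load_function(code)
    return eval(predicted_call, namespace) == eval(output_str, namespace)


# Hypothetical toy sample, not taken from the benchmark.
toy = {
    "code": "def f(nums):\n    return sorted(set(nums))",
    "input": "[3, 1, 3]",
    "output": "[1, 3]",
}

print(check_output_prediction(toy["code"], toy["input"], "[1, 3]"))
print(check_input_prediction(toy["code"], "f([1, 3, 1])", toy["output"]))
```

In this sketch the input prediction differs from the original input but is still accepted, because only the resulting output is compared.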
To clone the repository, run
git clone git@github.com:facebookresearch/cruxeval.git
cd cruxeval
If you want to install everything at once, run pip install -r requirements.txt. Otherwise, if you just want to score generations, run pip install -r requirements-base.txt. If you just want to run OpenAI models, run pip install -r requirements-openai.txt. If you just want to run inference on HuggingFace models, run pip install -r requirements-inference.txt. The code has been tested with Python 3.9 and CUDA 12.1.
The dataset is available in .jsonl format in data/cruxeval.jsonl and in HuggingFace Datasets. Each sample contains code, input, and output fields. A sample script to print the samples of the dataset is in quickstart.ipynb.
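As a rough illustration of the dataset layout described above, the following sketch reads the .jsonl file with the standard library and prints the first few samples. The function name load_samples is hypothetical, and the sketch is not the content of quickstart.ipynb. It only assumes the code, input, and output fields mentioned above.

```python
import json
import os


def load_samples(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    samples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples


DATA_PATH = "data/cruxeval.jsonl"  # path relative to the repository root

# Only runs when the dataset file is present in the working directory.
if os.path.exists(DATA_PATH):
    for sample in load_samples(DATA_PATH)[:3]:
        print(sample["code"])
        print("input: ", sample["input"])
        print("output:", sample["output"])
        print("-" * 40)
```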
To evaluate a set of generations, load your generations (function calls for CRUXEval-I or outputs for CRUXEval-O) as strings into a JSON file such as generations.json with the following format:
{
"sample_0": ["f([1, 1, 1, 1, 3, 3])", "f([])"],
...
"sample_799": ["f('~neqe-;ew22')", "f('~neqe-;ew22')"]
}
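A file in this format could be produced with a small helper like the one below. This is a sketch under assumptions: write_generations is a hypothetical name, and the helper assumes your generations are held in a dict keyed by the integer sample index, which it maps to the sample_N keys shown above.

```python
import json
import os
import tempfile


def write_generations(generations_by_index, path):
    """Write {index: [generation strings]} as {"sample_<index>": [...]} JSON."""
    payload = {
        f"sample_{index}": [str(g) for g in generations]
        for index, generations in sorted(generations_by_index.items())
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    return payload


# Two CRUXEval-I style generations per sample, reusing the strings shown above.
example = {
    0: ["f([1, 1, 1, 1, 3, 3])", "f([])"],
    799: ["f('~neqe-;ew22')", "f('~neqe-;ew22')"],
}

# Written to a temporary directory here; point this at generations.json in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_generations(example, os.path.join(tmp_dir, "generations.json"))
    print(list(written))
```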
Then, cd evaluation and run the following command, setting mode to input to evaluate CRUXEval-I or to output to evaluate CRUXEval-O.
python evaluate_generations.py \
--generations_path generations.json \
--scored_results_path generations_scored.json \
--mode input
The script should take about a minute. An example of input and output generations in the correct format for Code Llama 7B can be found in the samples/model_generations folder, and an example of the corresponding execution result file is in samples/evaluation_results. The execution results will be written to the file you specify in --scored_results_path. It contains raw_generations (the dictionary of raw generations for each sample that was provided), raw_scored_generations (the dictionary of scored results for each sample), and overall pass_at_1 and pass_at_5 scores. For example, to reproduce the scoring of the Code Llama 7B CRUXEval-I generations, run the following command in the evaluation folder:
python3 evaluate_generations.py \
--generations_path ../samples/model_generations/sample_codellama-7b_temp0.2_input/generations.json \
--scored_results_path ../samples/evaluation_results/sample_scored_codellama-7b_temp0.2_input.json \
--mode input
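Once a scored results file exists, the overall scores can be read back with a few lines of Python. The sketch below relies only on the pass_at_1 and pass_at_5 keys described above. The function name read_overall_scores is hypothetical, and the sketch makes no assumption about whether scores are stored as fractions or percentages.

```python
import json
import os


def read_overall_scores(path):
    """Return the overall pass_at_1 and pass_at_5 values from a scored file."""
    with open(path, encoding="utf-8") as handle:
        results = json.load(handle)
    return {key: results[key] for key in ("pass_at_1", "pass_at_5")}


SCORED_PATH = "generations_scored.json"  # the value passed to --scored_results_path

# Only runs when the scored file is present in the working directory.
if os.path.exists(SCORED_PATH):
    for name, value in read_overall_scores(SCORED_PATH).items():
        print(f"{name}: {value}")
```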
We also open-source generations and outputs for the models we display on the leaderboard below. First, cd samples. To access the generations, run unzip model_generations.zip. To access the scored versions of the generations, run unzip evaluation_results.zip. The generations and scored generations will appear in samples/model_generations and samples/evaluation_results, respectively.
We provide a SLURM-compatible script to run inference on CRUXEval with HuggingFace models. First, cd inference. Then, run ./scripts/run_input_prediction.sh for CRUXEval-I or ./scripts/run_output_prediction.sh for CRUXEval-O. The default script in the repository runs a variety of models with 2 GPUs at temperatures 0.2, 0.8 with n_sample=10 generations per sample. You should change --output, --error, --partition accordingly, and you may also wish to change one or more of GPUS, batch_size, n_samples, temperatures, dirs (directory names), models.
This script parallelizes the 800 samples of the benchmark in a data-parallel fashion across the GPUs. After running the scripts, the generations will appear in inference/model_generations_raw/shard_i.json, where i ranges from 0 to GPUS-1. To convert these into a form that is ready for evaluation, run python combine_generations.py, which will create a file ../model_generations/{MODEL_INFO}/generations.json. The generations can then be evaluated by following the above instructions.
For best results, we recommend running WizardCoder with transformers==4.31.0/vllm==0.1.4 and all other models with transformers==4.36.2/vllm==0.2.6. WizardCoder performance has been known to degrade with newer versions of transformers.
You need to use your own API key and comply with OpenAI terms of use. We provide a script to run inference on OpenAI models if you would like to try different temperatures or the latest models. Set the OPENAI_API_KEY environment variable to your API key, for example via export OPENAI_API_KEY=YOUR_KEY (note that the shell does not allow spaces around the = sign). Then, cd openai and run python openai_run.py. As before, the generations will appear in ../model_generations/{MODEL_INFO}/generations.json.
Finally, we provide SLURM-based scripts to run evaluation on many models in parallel in evaluation/evaluate_all_predictions_input.sh and evaluation/evaluate_all_predictions_output.sh. You should change the --output, --error, --partition values and may also wish to change run_names. For convenience, we have provided a script, evaluation/print_evaluation_directories.py, that prints all the directories found in model_generations so that you can use them to populate run_names in both scripts.
All raw results (raws) and the pass@1 and pass@5 scores (pass@1 and pass@5) can then be found in the evaluation/evaluation_results folder. We have provided a script, evaluation/read_results.py, to print the results in tabular form.
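For readers unfamiliar with pass@k, the sketch below shows the commonly used unbiased estimator for a single sample with n generations, of which c are correct. That the repository's scoring uses exactly this estimator is an assumption; the function name pass_at_k and the example counts are illustrative only.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k for one sample: 1 - C(n - c, k) / C(n, k).

    n: total generations, c: correct generations, k: budget of attempts.
    """
    if k > n:
        raise ValueError("k must not exceed the number of generations n")
    if n - c < k:
        # Every size-k subset contains at least one correct generation.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical counts: 10 generations for one sample, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is then typically the mean of this per-sample value over all samples.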
This repository is built on top of bigcode-evaluation-harness and FastCode, and we thank the contributors of these repositories for their great work! We also draw inspiration from the EvalPlus repository.
If you find this repository useful, please cite this as
@article{gu2024cruxeval,
title={CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution},
author={Alex Gu and Baptiste Rozière and Hugh Leather and Armando Solar-Lezama and Gabriel Synnaeve and Sida I. Wang},
year={2024},
journal = {arXiv preprint arXiv:2401.03065},
}
CRUXEval is MIT licensed, as found in the LICENSE file.