- python: 3.10
- pytorch: install a GPU build that matches your CUDA version
- transformers
- openai: required for access to GPT-3.5 and GPT-4
- Install the remaining dependencies as required
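Assuming a CUDA-enabled machine, a typical installation could look like this (a sketch: cu118 stands for CUDA 11.8, so pick the wheel index that matches your CUDA version):
$ pip install torch --index-url https://download.pytorch.org/whl/cu118
$ pip install transformers openai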
1. Install human-eval for assessing model generation results (see https://github.com/openai/human-eval). Make sure to use Python 3.7 or later:
$ conda create -n codex python=3.10
$ conda activate codex
Check out and install this repository:
$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
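After installation, a quick sanity check can load the benchmark; read_problems and write_jsonl are helpers shipped with human-eval:
from human_eval.data import read_problems, write_jsonl

# All 164 HumanEval problems, keyed by task_id (e.g. "HumanEval/0")
problems = read_problems()
print(len(problems))

# Completions are stored as JSON Lines for later evaluation;
# the completion below is a dummy placeholder
write_jsonl("samples.jsonl", [{"task_id": "HumanEval/0", "completion": "    pass"}])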
2. Implement the result-acquisition function for your experimental model in APIUse/generate_implements.py, for example:
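A minimal sketch of such a function, assuming the legacy openai<1.0 SDK; only get_model_completion is a name used by this repo, the rest are illustrative:
import openai

def gpt_35_turbo_0613_completion(prompt):
    # One chat request; the raw reply text is returned for later processing
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

def get_model_completion(model_name, prompt):
    # main.py reaches a model through its name string; add new models here
    if model_name == "gpt-3.5-turbo-0613":
        return gpt_35_turbo_0613_completion(prompt)
    raise ValueError(f"no implementation for model: {model_name}")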
3. Set the corresponding model name in main.py and run the main function
4. Analyze the experimental results
- Result files are written under results/; the output path can be set in code
- A final summary file, result.csv, is generated in the project's APIUse directory; it records the conditions of each experiment together with the final automated evaluation results
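A minimal sketch for inspecting the summary (the exact column names depend on the experimental conditions recorded):
import pandas as pd

# result.csv aggregates the conditions and evaluation scores of every run
df = pd.read_csv("APIUse/result.csv")
print(df.head())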
- result: stores the result files returned by the requested models
- dataset: the HumanEval benchmark extended to natural languages beyond English
- word_extract: attention-word files produced by the different extraction and ranking methods
Note: the following files append a path to sys.path; reset it to the actual location of the code:
import sys
sys.path.append("xxx")  # replace "xxx" with the actual location of the code
- APIUse/generate_implements.py: implements the functions that call the different LLMs and collect their results (refer to the model functions already implemented). Add a corresponding branch in get_model_completion so that main.py can reach the model through the specified model string.
- APIUse/experiment_execut.py: the entry point of the experiment; it controls the different experimental conditions. If no new variables are needed, it can be used without modification. The parameters of the experiment_execute function are described below:
- model_name_list: list of LLM models; each element is a model-name string that must match an access function implemented in generate_implements.py (refer to the gpt-3.5-turbo-0613 implementation)
- languages: List of natural languages involved in the experiment, including English, Chinese, French, Spanish and Japanese
- word_extract_list: list of word-extraction settings; each dict contains method and suffix. No modification is required
- template_id_list: the templates involved in the experiment, as a list of ints
- result_path: the result storage path
- chat_number: the number of conversation rounds, currently 1 or 2
- remark: a free-form remark recorded in the result file
- wordNum: the number of attention words
- promptId: selects the prompt variant; with_attention is the baseline, and 1 indicates without_attention
- APIUse/main.py: the main entry point of the experiment; set the experimental conditions here and pass them to experiment_execute to start the run (see the sketch below)
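Putting the parameters above together, a call in main.py might look like the sketch below (argument values are illustrative, the word_extract_list contents are guessed from the result file names, and the import path is an assumption):
from experiment_execut import experiment_execute

experiment_execute(
    model_name_list=["gpt-3.5-turbo-0613"],  # must match a function in generate_implements.py
    languages=["English", "Chinese"],
    word_extract_list=[{"method": "text_rank", "suffix": "_NP_VP"}],
    template_id_list=[2],
    result_path="results/test/",
    chat_number=1,
    remark="demo run",
    wordNum=9,
    promptId=1,  # 1 = without_attention (see parameter notes above)
)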
- xxx.jsonl: the raw results returned by the requested model
- xxx_processed.jsonl: the targetCode result set extracted from the raw results
- xxx_processed.jsonl_results.jsonl: the targetCode verification results produced by running the evaluation command
- xxx_prompt.jsonl: the complete prompts used when requesting the model
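The *_results.jsonl file written by human-eval's evaluate_functional_correctness records a boolean passed field per sample, so a quick pass-rate check only needs the standard library:
import json

def pass_rate(path):
    # Each line is one evaluated sample with a "passed" boolean
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    return sum(r["passed"] for r in rows) / len(rows)

print(pass_rate("results/test/xxx_processed.jsonl_results.jsonl"))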
- When adding a model, the corresponding function must be added to APIUse/generate_implements.py to obtain that model's results
- You can sort results into different folders by calling experiment_execute several times with different result_path values
- APIUse/command.txt: stores the commands that verify the correctness of the results; they are invoked in the command function in APIUse to generate the verification results for a model, mainly for processing and evaluation, e.g.:
python process_humaneval.py --path=results/test/gpt-3.5-turbo-0613_humanEval_Chinese_text_rank__9_NP_VP_template2.jsonl --out_path=results/test/gpt-3.5-turbo-0613_humanEval_Chinese_text_rank__9_NP_VP_template2_processed.jsonl
evaluate_functional_correctness results/test/gpt-3.5-turbo-0613_humanEval_Chinese_text_rank__9_NP_VP_template2_processed.jsonl --problem_file=../dataSet/human-eval-v2-Chinese.jsonl
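The same two steps can be chained from Python, mirroring what the command function in APIUse does (a sketch using the paths from the example above):
import subprocess

raw = "results/test/gpt-3.5-turbo-0613_humanEval_Chinese_text_rank__9_NP_VP_template2.jsonl"
processed = raw.replace(".jsonl", "_processed.jsonl")

# Extract the targetCode set from the raw model output
subprocess.run(["python", "process_humaneval.py",
                f"--path={raw}", f"--out_path={processed}"], check=True)

# Verify functional correctness against the Chinese benchmark
subprocess.run(["evaluate_functional_correctness", processed,
                "--problem_file=../dataSet/human-eval-v2-Chinese.jsonl"], check=True)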