These are the supplementary files for the paper "Adaptive In-conversation Team Building for Language Model Agents." They contain the code for running the experiments in the paper.
The codebase is built upon AutoGen; our implementations are located in autogen/agentchat/contrib/meta_agent.py and autogen/agentchat/contrib/meta_user_proxy_agent.py.
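For orientation, here is a minimal usage sketch. It assumes the two modules expose MetaAgent and MetaUserProxyAgent classes that follow the usual AutoGen ConversableAgent interface; check the source files for the actual class names and constructor arguments.

# Minimal usage sketch (assumption: class names follow the file names;
# see meta_agent.py / meta_user_proxy_agent.py for the real signatures).
import autogen
from autogen.agentchat.contrib.meta_agent import MetaAgent
from autogen.agentchat.contrib.meta_user_proxy_agent import MetaUserProxyAgent

config_list = autogen.config_list_from_json("OAI_CONFIG_LIST")

meta_agent = MetaAgent(
    name="meta_agent",
    llm_config={"config_list": config_list},
)
meta_user_proxy = MetaUserProxyAgent(
    name="meta_user_proxy",
)

# Start a conversation; the meta agent adaptively builds a team for the task.
meta_user_proxy.initiate_chat(meta_agent, message="Find the derivative of x**2 * sin(x).")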
We use autogenbench to run all scenarios in our benchmark. For detailed instructions on using autogenbench, please refer to its documentation. We also provide some brief instructions below.
The codebase is built upon autogenbench and autogen, so instead of installing them from PyPI, you should install pyautogen and autogenbench in editable mode:
cd /path/to/autogen
pip install -e .
cd /path/to/autogen/samples/autogenbench
pip install -e .
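To confirm that the editable install resolves to your local checkout rather than a PyPI release, a quick check (assuming the standard pyautogen package layout) is:

# Should print a path inside your local autogen checkout, not site-packages.
import autogen
print(autogen.__version__, autogen.__file__)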
Modify the first line in requirement.txt to point to the path of your autogen-autobuild-dev.
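For example, assuming the requirements file uses pip's local editable-path syntax, the first line might look like -e /path/to/autogen-autobuild-dev, where the path points to your own checkout.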
This is the general method for running evaluations on the different scenarios. Use the following commands to run the benchmark for each scenario:
cd [SCENARIO FOLDER. For example, /path/to/scenarios/MATH]
python Scripts/init_tasks.py  # initialize the tasks
autogenbench run Tasks/[TASK YOU WANT TO RUN].jsonl --native  # run the task; --native runs the scenario without Docker (remove it if you have a Docker environment)
autogenbench tabulate Results/[TASK YOU WANT TO RUN]  # print the results as a table
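If you prefer to drive this workflow from Python (for example, to loop over several tasks), a rough sketch is below; the scenario path and task name are hypothetical placeholders.

# Sketch: run init_tasks, the benchmark, and the results table from Python.
# "example_task" is hypothetical; use a .jsonl file produced by Scripts/init_tasks.py.
import subprocess
from pathlib import Path

scenario_dir = Path("/path/to/scenarios/MATH")  # adjust to your scenario folder
task = "example_task"

subprocess.run(["python", "Scripts/init_tasks.py"], cwd=scenario_dir, check=True)
subprocess.run(["autogenbench", "run", f"Tasks/{task}.jsonl", "--native"], cwd=scenario_dir, check=True)
subprocess.run(["autogenbench", "tabulate", f"Results/{task}"], cwd=scenario_dir, check=True)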
If you want to debug, set -s 1 to use a single data point for testing:
cd [SCENARIO FOLDER. For example, /path/to/scenarios/MATH]
autogenbench run Tasks/[TASK YOU WANT TO RUN].jsonl -s 1
If you want to debug a specific problem, you can manually run Results/[YOUR TASK]/[PROBLEM ID]/0/scenario.py in debug mode.
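For example, you can launch it under the Python debugger with python -m pdb scenario.py from that problem's working directory, or open the file in your IDE, set breakpoints, and run it directly.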
Note that every time autogenbench run TASK is executed, it checks the Results folder and only runs the problems that are not already in it. If you want to rerun some tasks, delete the corresponding files in the Results folder.
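A small Python sketch for clearing the cached results of a single task (the task name below is hypothetical) is:

# Remove a task's Results subfolder so autogenbench reruns all of its problems.
import shutil
from pathlib import Path

results_dir = Path("Results/example_task")  # hypothetical task name
if results_dir.exists():
    shutil.rmtree(results_dir)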
Some templates require manual additions to Templates/scenarios.py; it is recommended to check the code and fill out the placeholders. For detailed instructions on running each benchmark, please refer to the respective README in each scenario folder.