Improve the benchmark by evaluating multiple models and display the results #126

bugsz · 2024-06-26T20:05:58Z

Closes #

📑 Description

As the title suggests, this PR:

Supports evaluating multiple models in one run via `sotopia benchmark-all --model-list gpt-4o --model-list gpt-3.5-turbo`; omit `--model-list` to use the default model names.
Supports displaying and saving the results, in the format used by https://github.com/sotopia-lab/sotopia-space/blob/main/data_dir/models_vs_gpt35.jsonl, via `sotopia benchmark-display`. (pandas does not appear to be a project requirement, so I am not sure how to display the results in a structured way in the CLI.)
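To make the two points above concrete, here is a minimal, standard-library-only sketch of the idea. It is not the code in this PR. It assumes a repeatable `--model-list` flag parsed with `argparse` (the real CLI may use a different library), and it uses hypothetical names throughout: `DEFAULT_MODELS`, `parse_model_list`, `save_jsonl`, `format_table`, and the `model_name` / `score` fields, which are placeholders rather than the actual schema of `models_vs_gpt35.jsonl`. The table formatter shows one way to print structured results without pandas.

```python
import argparse
import json
from pathlib import Path

# Hypothetical defaults for illustration; the PR defines its own default names.
DEFAULT_MODELS = ["gpt-4o", "gpt-3.5-turbo"]


def parse_model_list(argv: list[str]) -> list[str]:
    """Collect every repeated --model-list value, or fall back to defaults."""
    parser = argparse.ArgumentParser(prog="benchmark-all-sketch")
    # action="append" gathers each occurrence of the flag into one list.
    # default=None avoids argparse appending onto a pre-filled default list.
    parser.add_argument("--model-list", action="append", default=None)
    args = parser.parse_args(argv)
    return args.model_list or list(DEFAULT_MODELS)


def save_jsonl(rows: list[dict], path: Path) -> None:
    """Write one JSON object per line (the JSON Lines convention)."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def format_table(rows: list[dict]) -> str:
    """Render rows as a left-aligned plain-text table, no pandas needed."""
    if not rows:
        return ""
    headers = list(rows[0])
    cells = [[str(row.get(h, "")) for h in headers] for row in rows]
    widths = [
        max([len(h)] + [len(r[i]) for r in cells])
        for i, h in enumerate(headers)
    ]

    def fmt(values: list[str]) -> str:
        line = "  ".join(v.ljust(w) for v, w in zip(values, widths))
        return line.rstrip()

    lines = [fmt(headers), fmt(["-" * w for w in widths])]
    lines += [fmt(r) for r in cells]
    return "\n".join(lines)


if __name__ == "__main__":
    models = parse_model_list(["--model-list", "gpt-4o"])
    # Dummy placeholder scores, not real benchmark results.
    rows = [{"model_name": m, "score": 0.0} for m in models]
    print(format_table(rows))
```

Calling `parse_model_list([])` returns the defaults, which mirrors the "just go ahead with the default model names" behaviour described above. A library such as Rich could replace `format_table` if richer terminal output is wanted.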

✅ Checks

My pull request adheres to the code style of this project
My code requires changes to the documentation
I have updated the documentation as required
All the tests have passed
Branch name follows type/descript (e.g. feature/add-llm-agents)
Ready for code review

ℹ Additional Information

codecov · 2024-06-26T20:11:04Z

Codecov Report

Attention: Patch coverage is 17.02128% with 39 lines in your changes missing coverage. Please review.

Project coverage is 60.90%. Comparing base (701f2a8) to head (d8bfa47).
Report is 4 commits behind head on main.

@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
- Coverage   61.71%   60.90%   -0.81%     
==========================================
  Files          55       55              
  Lines        2714     2778      +64     
==========================================
+ Hits         1675     1692      +17     
- Misses       1039     1086      +47

Files	Coverage Δ
sotopia/cli/benchmark/benchmark.py	`21.17% <17.02%> (-2.27%)`	⬇️

... and 6 files with indirect coverage changes

sotopia/cli/benchmark/benchmark.py

ProKil

Please change this and we can merge this PR.

support benchmarking for multiple models & support aggregating and saving the results to compare different models

aefb489

fix mypy issue

af5f725

ProKil reviewed Jun 26, 2024

sotopia/cli/benchmark/benchmark.py (outdated, resolved)

bugsz added 2 commits June 26, 2024 16:22

fix mypy issue

6260566

merge all the benchmark functions

ed1bebb

ProKil requested changes Jun 28, 2024

sotopia/cli/benchmark/benchmark.py (resolved)

sotopia/cli/benchmark/benchmark.py (resolved)

sotopia/cli/benchmark/benchmark.py (resolved)

sotopia/cli/benchmark/benchmark.py (outdated, resolved)

ProKil assigned bugsz Jun 29, 2024

now support printing to table with Rich

8fb172a

ProKil requested changes Jul 10, 2024

sotopia/cli/benchmark/benchmark.py (outdated, resolved)

make output_to_json in benchmark_display an argument

d8bfa47

ProKil approved these changes Jul 11, 2024

ProKil merged commit 28f053a into main Jul 11, 2024
8 checks passed

ProKil deleted the feature/benchmark branch July 11, 2024 22:14

bugsz restored the feature/benchmark branch July 12, 2024 19:28
