
Automatic benchmarking of gpt-engineer with APPS #819

Closed
ATheorell opened this issue Oct 23, 2023 · 4 comments
Labels: enhancement (New feature or request)

Comments

@ATheorell (Collaborator)

Feature description

gpt-engineer has an automatic evals suite in "evals/eval_new_code.py". However, only 2 test cases are given in evals/new_code_eval.yaml. As an alternative to filling in more test cases manually, we should parse prompts and tests from the (very large) APPS dataset (https://paperswithcode.com/dataset/apps).

Since APPS is way too large to run in its entirety, there should be functionality to run n randomly selected tests and run n tests according to some predetermined test ordering (so that consecutive benchmark runs are comparable).
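A minimal sketch of what that selection logic could look like (the function name and `seed` parameter are illustrative, not existing gpt-engineer code):

```python
import random

def select_problems(total: int, n: int, seed: int | None = None) -> list[int]:
    """Pick n of `total` APPS problem indices.

    With a seed, the sample is random but reproducible; without one, the
    first n problems are used as a fixed, predetermined ordering, so
    consecutive benchmark runs stay comparable.
    """
    if seed is None:
        return list(range(n))
    return random.Random(seed).sample(range(total), n)
```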

The APPS dataset should not be added to the gpt-engineer git repo! Probably the best way to handle this is to pull it from Hugging Face (https://huggingface.co/datasets/codeparrot/apps) in the code itself (potentially caching it and gitignoring the cache so it doesn't need to be pulled on every run).
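For reference, a rough sketch of pulling APPS from the Hub with the `datasets` library, assuming the field names used by codeparrot/apps (`question`, `input_output`); the `cache_dir` value is just an example of a path that could be gitignored:

```python
from datasets import load_dataset

# Downloaded once and cached locally; later runs read from the cache.
apps = load_dataset("codeparrot/apps", split="test", cache_dir=".apps_cache")

sample = apps[0]
prompt = sample["question"]     # natural-language problem statement
tests = sample["input_output"]  # JSON string with "inputs" and "outputs"
```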

Motivation/Application

Automatic benchmarking is the ideal way to determine whether an imposed change to the code base is advantageous.

@ATheorell added the enhancement label, and added and then removed the triage label, on Oct 23, 2023
@pbharrin (Contributor)

I can add a sampled version of the APPS dataset, which will give us a good idea of how our project is doing without costing a fortune. APPS is great at testing how well our repair of broken code works.

@azrv (Contributor)

azrv commented Feb 1, 2024

@ATheorell assign to me

@ATheorell (Collaborator, Author)

Yes, please take a shot at this @azrv :)

@viborc viborc moved this to Todo in gpt-engineer roadmap Feb 8, 2024
@viborc viborc moved this from Todo to In Progress in gpt-engineer roadmap Feb 15, 2024
@azrv (Contributor)

azrv commented Mar 23, 2024

#1051 was merged 🎉

@pbharrin It's now a matter of cherry-picking problems we want to constantly test against.

gpt_engineer/benchmark/benchmarks/apps/problems.py:4
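(For readers without the repo at hand, a hypothetical sketch of the shape of that file; the IDs below are placeholders, not the problems actually chosen:)

```python
# gpt_engineer/benchmark/benchmarks/apps/problems.py (hypothetical sketch)
# A fixed, hand-picked subset of APPS problem_ids, kept stable between runs
# so benchmark results remain comparable over time.
PROBLEM_IDS = [
    0,
    1005,
    3000,
]
```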

@github-project-automation github-project-automation bot moved this from In Progress to Done in gpt-engineer roadmap Apr 4, 2024