Automatic benchmarking of gpt-engineer with APPS #819
I can add a sampled version of the APPS dataset, which will give us a good idea of how our project is doing without costing a fortune. APPS is great at testing how well our repair of broken code works.
@ATheorell assign to me
Yes, please take a shot at this @azrv :)
github-project-automation bot moved this from In Progress to Done in gpt-engineer roadmap on Apr 4, 2024
Feature description
gpt-engineer has an automatic evals suite in "evals/eval_new_code.py". However, only 2 test cases are given in evals/new_code_eval.yaml. As an alternative to filling in more test cases manually, we should parse prompts and tests from the (very large) APPS dataset (https://paperswithcode.com/dataset/apps).
Since APPS is way too large to run in its entirety, there should be functionality to run either n randomly selected tests or n tests according to some predetermined ordering (so that consecutive benchmark runs are comparable).
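A minimal sketch of how that selection could look (the function name and parameters are illustrative, not an existing gpt-engineer API):

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")

def select_problems(problems: Sequence[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Pick n problems, either in a fixed order or as a seeded random sample."""
    if seed is None:
        # Predetermined ordering: take the first n problems so that
        # consecutive benchmark runs are directly comparable.
        return list(problems[:n])
    # Seeded random sampling: still reproducible, but spread over the whole dataset.
    return random.Random(seed).sample(list(problems), n)
```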
The APPS dataset should not be added to the gpt-engineer git repo! Probably the best way to handle this is to pull it from Hugging Face (https://huggingface.co/datasets/codeparrot/apps) in the code itself (potentially caching it and gitignoring the cache so it doesn't need to be pulled on every run).
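As a sketch of that approach, assuming the Hugging Face `datasets` library and the field names listed on the codeparrot/apps dataset card; the local cache directory name here is a hypothetical choice that would be gitignored:

```python
import json
from datasets import load_dataset

def load_apps_problems(split: str = "test", cache_dir: str = ".apps_cache"):
    """Yield (prompt, inputs, outputs) tuples from APPS, cached locally."""
    # The "all" config covers every difficulty level; newer versions of
    # `datasets` may additionally require trust_remote_code=True here.
    dataset = load_dataset("codeparrot/apps", "all", split=split, cache_dir=cache_dir)
    for example in dataset:
        # Each example carries the problem statement in "question" and the
        # test cases as a JSON string in "input_output".
        tests = json.loads(example["input_output"]) if example["input_output"] else {}
        yield example["question"], tests.get("inputs", []), tests.get("outputs", [])
```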
Motivation/Application
Automatic benchmarking is the ideal way to determine whether a proposed change to the code base is advantageous.