
Automatic benchmarking of gpt-engineer with APPS #819

Closed
ATheorell opened this issue Oct 23, 2023 · 4 comments
Labels: enhancement (New feature or request)

Comments

@ATheorell (Collaborator)

Feature description

gpt-engineer has an automatic evals suite in "evals/eval_new_code.py". However, only 2 test cases are given in evals/new_code_eval.yaml. As an alternative to filling in more test cases manually, we should parse prompts and tests from the (very large) APPS dataset (https://paperswithcode.com/dataset/apps).

Since APPS is way too large to run in its entirety, there should be functionality to run n randomly selected tests and run n tests according to some predetermined test ordering (so that consecutive benchmark runs are comparable).
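A minimal sketch of what that selection logic could look like (the function name and `seed` parameter are illustrative, not existing gpt-engineer code):

```python
import random

def select_problems(total: int, n: int, seed: int | None = None) -> list[int]:
    """Pick n of `total` APPS problem indices.

    With a seed, the sample is random but reproducible; without one, the
    first n problems are used as a fixed, predetermined ordering, so
    consecutive benchmark runs stay comparable.
    """
    if seed is None:
        return list(range(n))
    return random.Random(seed).sample(range(total), n)
```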

The APPS dataset should not be added to the gpt-engineer git repo! Probably the best way to handle this is to pull it from Hugging Face (https://huggingface.co/datasets/codeparrot/apps) in the code itself (potentially caching it and gitignoring the cache so it doesn't need to be pulled on every run).
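For reference, a rough sketch of pulling APPS from the Hub with the `datasets` library, assuming the field names used by codeparrot/apps (`question`, `input_output`); the `cache_dir` value is just an example of a path that could be gitignored:

```python
from datasets import load_dataset

# Downloaded once and cached locally; later runs read from the cache.
apps = load_dataset("codeparrot/apps", split="test", cache_dir=".apps_cache")

sample = apps[0]
prompt = sample["question"]     # natural-language problem statement
tests = sample["input_output"]  # JSON string with "inputs" and "outputs"
```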

Motivation/Application

Automatic benchmarking is the ideal way to determine whether an imposed change to the code base is advantageous.

@ATheorell added the enhancement label, and added and then removed the triage label, on Oct 23, 2023
@pbharrin (Contributor)

I can add a sampled version of the APPS dataset, which will give us a good idea of how our project is doing without costing a fortune. APPS is great at testing how well our repair of broken code works.

@azrv (Contributor)

azrv commented Feb 1, 2024

@ATheorell assign to me

@ATheorell (Collaborator, Author)

Yes, please take a shot at this @azrv :)

@viborc viborc moved this to Todo in gpt-engineer roadmap Feb 8, 2024
@viborc viborc moved this from Todo to In Progress in gpt-engineer roadmap Feb 15, 2024
@azrv (Contributor)

azrv commented Mar 23, 2024

#1051 was merged 🎉

@pbharrin It's now a matter of cherry-picking problems we want to constantly test against.

gpt_engineer/benchmark/benchmarks/apps/problems.py:4
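(For readers without the repo at hand, a hypothetical sketch of the shape of that file; the IDs below are placeholders, not the problems actually chosen:)

```python
# gpt_engineer/benchmark/benchmarks/apps/problems.py (hypothetical sketch)
# A fixed, hand-picked subset of APPS problem_ids, kept stable between runs
# so benchmark results remain comparable over time.
PROBLEM_IDS = [
    0,
    1005,
    3000,
]
```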

@github-project-automation github-project-automation bot moved this from In Progress to Done in gpt-engineer roadmap Apr 4, 2024