Benchmarks #173

gasse · 2024-10-07T21:28:08Z

Missing:

unit tests
workarena l2 and l3 task metadata
webarena metadata (categories)
visualwebarena task list and metadata
integration in AgentLab (different PR)

gasse · 2024-10-08T20:07:44Z

Solves #170 and #160.

browsergym/experiments/src/browsergym/experiments/benchmark.py

gasse · 2024-10-17T20:32:47Z

Trust-approved by @recursix


gasse requested a review from recursix October 7, 2024 21:33

gasse force-pushed the benchmark_metadata branch from 3422c54 to fa847fd Compare October 7, 2024 21:34

gasse force-pushed the benchmark_metadata branch 2 times, most recently from 94598a8 to e57c0c9 Compare October 8, 2024 20:13

recursix reviewed Oct 8, 2024

browsergym/experiments/src/browsergym/experiments/benchmark.py (outdated, resolved)

recursix reviewed Oct 8, 2024

browsergym/experiments/src/browsergym/experiments/benchmark.py (resolved)

gasse force-pushed the benchmark_metadata branch 2 times, most recently from 8891d50 to 36ba273 Compare October 10, 2024 14:36

gasse added 11 commits October 16, 2024 18:17

browsergym.experiments.benchmark (82449aa)
cleanup (06cc817)
workarena tasks metadata (6b8afc3)
wa / vwa metadata (91aa95e)
minor fixes (c8d4ea8)
tests (aa06a6b)
workarena dependency bump (ae2964a)
directory renaming (4e0427b)
minor refactors (adfcb0b)
minor comments and refactors (2e2f49a)
benchmark.subset() (8ee51e0)

gasse force-pushed the benchmark_metadata branch from 620b5f4 to 8ee51e0 Compare October 16, 2024 22:18

gasse requested a review from recursix October 16, 2024 22:24

gasse added 2 commits October 17, 2024 12:32

subset_from benchmark methods (a9f2bcd)
miniwob_coord benchmark action space (de06597)

gasse merged commit 3a33a69 into main Oct 17, 2024
12 checks passed

gasse deleted the benchmark_metadata branch October 17, 2024 20:33

xhluca mentioned this pull request Oct 18, 2024

Add weblinx to benchmark #193

Merged

This was referenced Oct 22, 2024

Define benchmarks as an object that can generate a consistent list of env_args #160

Closed

Benchmark specific action set #170

Closed

qipeng pushed a commit to orby-ai-engineering/BrowserGym that referenced this pull request Nov 20, 2024

Benchmarks (ServiceNow#173)

874efbe
