Benchmark.prepare_backend() #204

gasse · 2024-10-23T21:33:46Z

No description provided.

gasse · 2024-10-23T21:34:09Z

@recursix that's a bit hacky but it should do for now

browsergym/experiments/src/browsergym/experiments/benchmark/base.py

recursix · 2024-10-24T01:13:43Z

browsergym/experiments/src/browsergym/experiments/benchmark/base.py

@@ -55,6 +55,7 @@ class Benchmark(DataClassJsonMixin):
    high_level_action_set_args: HighLevelActionSetArgs
    is_multi_tab: bool
    env_args_list: list[EnvArgs]
+    full_reset_script: Optional[str]


Hmm, this is a symptom that we should have gone with a class hierarchy instead of a single class for all benchmarks.
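
For illustration only, here is a minimal sketch of what a class hierarchy could look like, with a per-benchmark `prepare_backend()` hook instead of an optional field on one shared class. `BenchmarkSketch` and `ResettableBenchmarkSketch` are hypothetical names and none of this is the actual browsergym code; the "reset" only records a string so the example stays runnable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BenchmarkSketch:
    """Hypothetical base class, NOT the real browsergym Benchmark."""

    name: str
    is_multi_tab: bool = False

    def prepare_backend(self) -> List[str]:
        """Default: nothing to prepare. Returns the steps that were performed."""
        return []


@dataclass
class ResettableBenchmarkSketch(BenchmarkSketch):
    """Hypothetical subclass for benchmarks whose backend needs a full reset."""

    reset_log: List[str] = field(default_factory=list)

    def prepare_backend(self) -> List[str]:
        # Assumption: a real implementation would trigger the backend reset here.
        # This sketch only records that the step happened.
        steps = [f"full reset of {self.name} backend"]
        self.reset_log.extend(steps)
        return steps


if __name__ == "__main__":
    for bench in (BenchmarkSketch("miniwob"), ResettableBenchmarkSketch("webarena")):
        print(bench.name, bench.prepare_backend())
```

With this shape, benchmark-specific setup lives in the subclass that needs it, and callers can invoke `prepare_backend()` on any benchmark without checking for optional fields.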

browsergym/experiments/src/browsergym/experiments/loop.py

recursix · 2024-10-24T01:17:27Z

browsergym/webarena/src/browsergym/webarena/instance.py

@@ -50,6 +53,29 @@ def __init__(

        self.credentials = ACCOUNTS

+    def full_reset(self):


Helper function instead of duplicated code?

What do you mean by duplicated code? This code is new.

It's 90% the same code as on the visualwebarena side.
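
A rough sketch of the helper-function idea, under stated assumptions: `shared_full_reset`, the `/reset` path, the `trigger` callable, and the two `*InstanceSketch` classes are all hypothetical and do not reflect the actual webarena or visualwebarena instance code. The HTTP call is injected as a callable so the example runs without a server.

```python
from typing import Callable, Optional


def shared_full_reset(
    base_url: Optional[str], trigger: Callable[[str], int], label: str
) -> str:
    """Hypothetical helper shared by both instance classes.

    `trigger` takes the reset URL and returns an HTTP-like status code.
    """
    if not base_url:
        raise RuntimeError(f"{label}: no reset URL configured")
    # Assumption: the reset endpoint lives at <base_url>/reset.
    url = f"{base_url.rstrip('/')}/reset"
    status = trigger(url)
    if status != 200:
        raise RuntimeError(f"{label}: reset failed with status {status}")
    return url


class WebArenaInstanceSketch:
    def __init__(self, base_url: Optional[str], trigger: Callable[[str], int]):
        self.base_url = base_url
        self.trigger = trigger

    def full_reset(self) -> str:
        return shared_full_reset(self.base_url, self.trigger, "webarena")


class VisualWebArenaInstanceSketch:
    def __init__(self, base_url: Optional[str], trigger: Callable[[str], int]):
        self.base_url = base_url
        self.trigger = trigger

    def full_reset(self) -> str:
        return shared_full_reset(self.base_url, self.trigger, "visualwebarena")
```

Each class keeps its own `full_reset()` entry point, but the shared logic (URL construction, status check, error message) exists in one place.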

gasse requested a review from recursix October 23, 2024 21:33

gasse added 3 commits October 23, 2024 17:46

benchmark full_reset_script

1d6c5c3

fix

c29eccc

fix

67b9686

gasse force-pushed the webarena_reset branch from f9106e5 to 67b9686 Compare October 23, 2024 21:46

recursix reviewed Oct 24, 2024

browsergym/experiments/src/browsergym/experiments/benchmark/base.py (outdated)

recursix reviewed Oct 24, 2024

browsergym/experiments/src/browsergym/experiments/loop.py

recursix reviewed Oct 24, 2024

recursix previously approved these changes Oct 24, 2024

benchmark prepare_backend refactor

5c6654b

gasse dismissed recursix’s stale review via 5c6654b October 24, 2024 14:08

gasse changed the title ~~New Benchmark field full_reset_script~~ Benchmark.prepare_backend() Oct 24, 2024

test

f8f1598

recursix previously approved these changes Oct 24, 2024

fix

5a0a48b

gasse dismissed recursix’s stale review via 5a0a48b October 24, 2024 14:49

gasse merged commit 444599b into main Oct 24, 2024
13 checks passed

gasse deleted the webarena_reset branch October 24, 2024 14:52

gasse mentioned this pull request Nov 7, 2024

Benchmark Server objects #175

Closed

qipeng pushed a commit to orby-ai-engineering/BrowserGym that referenced this pull request Nov 20, 2024

Benchmark.prepare_backend() (ServiceNow#204)

5a8b1bc
