Benchmark.prepare_backend() #204

Merged
merged 6 commits into from
Oct 24, 2024

Conversation

@gasse gasse commented Oct 23, 2024

No description provided.

@gasse gasse requested a review from recursix October 23, 2024 21:33
gasse commented Oct 23, 2024

@recursix that's a bit hacky but it should do for now

@@ -55,6 +55,7 @@ class Benchmark(DataClassJsonMixin):
high_level_action_set_args: HighLevelActionSetArgs
is_multi_tab: bool
env_args_list: list[EnvArgs]
full_reset_script: Optional[str]
Collaborator

Hmm, this is a symptom that we should have gone with a class hierarchy instead of a single class for all benchmarks.
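
For readers following the thread, here is a minimal sketch of what such a hierarchy might look like. It is not the code from this PR: the base class is heavily simplified, `ScriptResetBenchmark` is a hypothetical name, and `prepare_backend()` only returns a description of the steps it would run rather than touching any backend.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Benchmark:
    """Simplified base class holding only fields common to every benchmark."""

    name: str
    is_multi_tab: bool = False
    env_args_list: list = field(default_factory=list)

    def prepare_backend(self) -> list[str]:
        # Default behaviour: most benchmarks have nothing to prepare.
        return []


@dataclass
class ScriptResetBenchmark(Benchmark):
    """Hypothetical subclass for benchmarks whose backend needs a full reset."""

    # Only benchmarks that need it carry this field, instead of an
    # Optional field on the single shared class.
    full_reset_script: Optional[str] = None

    def prepare_backend(self) -> list[str]:
        # Illustration only: describe the steps instead of executing them.
        if self.full_reset_script is None:
            return []
        return [f"run {self.full_reset_script}"]
```

With this layout, callers can invoke `prepare_backend()` on any benchmark without checking whether a reset script is set.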

@@ -50,6 +53,29 @@ def __init__(

self.credentials = ACCOUNTS

def full_reset(self):
Collaborator

Use a helper function instead of duplicating code?

Collaborator Author

What do you mean code duplicate? This code is new

Collaborator

It's 90% the same code as on the visualwebarena side.
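
One way the suggestion could be realised is sketched below, under stated assumptions: the helper `full_reset_instance`, the `/reset` endpoint, the two instance classes, and the injected `send` callable are all hypothetical and are not taken from the webarena or visualwebarena code. The point is only that both `full_reset()` methods delegate to one shared function.

```python
from typing import Callable


def full_reset_instance(base_url: str, send: Callable[[str], int], label: str) -> None:
    """Hypothetical shared helper: request a reset and fail loudly on error.

    `send` takes a URL and returns an HTTP-like status code; it is injected
    so that the sketch runs without any network access.
    """
    status = send(f"{base_url}/reset")  # '/reset' is an assumed endpoint
    if status != 200:
        raise RuntimeError(f"{label} reset failed with status {status}")


class WebArenaLikeInstance:
    def __init__(self, base_url: str, send: Callable[[str], int]):
        self.base_url = base_url
        self.send = send

    def full_reset(self) -> None:
        # Delegates to the shared helper instead of duplicating the logic.
        full_reset_instance(self.base_url, self.send, "webarena")


class VisualWebArenaLikeInstance:
    def __init__(self, base_url: str, send: Callable[[str], int]):
        self.base_url = base_url
        self.send = send

    def full_reset(self) -> None:
        # Same helper, different label; only instance-specific details differ.
        full_reset_instance(self.base_url, self.send, "visualwebarena")
```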

recursix previously approved these changes Oct 24, 2024
@gasse gasse changed the title New Benchmark field full_reset_script Benchmark.prepare_backend() Oct 24, 2024
recursix previously approved these changes Oct 24, 2024
@gasse gasse merged commit 444599b into main Oct 24, 2024
13 checks passed
@gasse gasse deleted the webarena_reset branch October 24, 2024 14:52
@gasse gasse mentioned this pull request Nov 7, 2024
qipeng pushed a commit to orby-ai-engineering/BrowserGym that referenced this pull request Nov 20, 2024
qipeng added a commit to orby-ai-engineering/BrowserGym that referenced this pull request Jan 18, 2025
* Patch VWA task IDs

* Add BLIP2 evaluator; patch timeout

* Actually add the captioning_fn into evaluator constructor

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* support custom observation

* add user data dir

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <[email protected]>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <[email protected]>
Co-authored-by: ljang0 <[email protected]>
Co-authored-by: Megh Thakkar <[email protected]>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* add init script support

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump v0.13.1

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Authors update (ServiceNow#260)

* TapeAgents export for experiment results (ServiceNow#238)

* Update README.md

* Cleanup

* Add weblinx_browsergym as a dependency (ServiceNow#261)

* Typo fix (ServiceNow#264)

* Update requirements.txt to latest libvisualwebarena package that includes local hosting (ServiceNow#165)

* adding AgentInfo to __init__ for convenience (ServiceNow#166)

* libvisualwebarena==0.0.14 (ServiceNow#171)

fixed the jsons file!

* Leaner traces (ServiceNow#169)

* images aren't saved in pkl files anymore, and are stuffed back in at load time

* added kwargs to control img/som saving

* saving as png, adding screenshots back into obs

* retrocompatibility for image loading

* making get_screenshots work for png and jpg

* fixing image types and closing files

* Goal refactor to allow for local image files (ServiceNow#110)


---------

Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]>
Co-authored-by: Maxime Gasse <[email protected]>

* version bump 0.8.0

* Integrate AgentLab tests (ServiceNow#176)

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <[email protected]>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <[email protected]>
Co-authored-by: ljang0 <[email protected]>
Co-authored-by: Megh Thakkar <[email protected]>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump v0.13.1

* Fix broken links

* Update README.md

* fix merging issues

* Update README.md (ServiceNow#270)

* Update README.md

* README update

* More permissive WA/VWA instance reset (ServiceNow#272)

* New debug benchmark visualwebarena_tiny (ServiceNow#271)

* Version bump v0.13.2

* Update README.md

* Metadata column fix (ServiceNow#278)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Shunt WA / VWA unit tests

* README update

* Minor fixes (ServiceNow#281)

* version bump v0.13.3

* remove unused fluff

* revert more unintended changes

---------

Co-authored-by: Peng Qi <[email protected]>
Co-authored-by: Thibault LSDC <[email protected]>
Co-authored-by: Maxime Gasse <[email protected]>
Co-authored-by: Yanan Xie <[email protected]>
Co-authored-by: Alexandre Lacoste <[email protected]>
Co-authored-by: oriyor <[email protected]>
Co-authored-by: Xing Han Lu <[email protected]>
Co-authored-by: ljang0 <[email protected]>
Co-authored-by: Megh Thakkar <[email protected]>
Co-authored-by: Imene Kerboua <[email protected]>
Co-authored-by: Oleh Shliazhko <[email protected]>
Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]>
2 participants