Benchmarks #173

Merged
gasse merged 13 commits into main from benchmark_metadata
Oct 17, 2024

@gasse gasse commented Oct 7, 2024

  • JSON-serializable Benchmark class with name, high_level_action_set and env_args_list
    from dataclasses import dataclass
    
    from dataclasses_json import DataClassJsonMixin
    
    
    @dataclass
    class Example(DataClassJsonMixin):
        a: int
        b: str
    
        def do_something(self):
            print(self.a, self.b)
      
    x = Example(0, "hello")
    x.do_something()
    
    x_json = x.to_json()
    
    y = Example.from_json(x_json)
    y.do_something()
  • flexible, easy to create your own benchmark
    Benchmark(
        name="miniwob_tiny_test",
        high_level_action_set=DEFAULT_HIGHLEVEL_ACTION_SETS["miniwob"],
        env_args_list=_make_env_args_list_from_repeat_tasks(
            task_list=["miniwob.click-dialog", "miniwob.click-checkboxes"],
            max_steps=5,
            n_repeats=2,
            seeds_rng=np.random.RandomState(42),
        ),
    )
  • CSV files with each benchmark's task list and metadata
  • default benchmarks collection
    • miniwob_all
    • miniwob_webgum
    • miniwob_tiny_test
    • miniwob_train
    • miniwob_test
    • webarena
    • visualwebarena
    • workarena_l1
    • workarena_l1_sort
    • workarena_l2
    • workarena_l3
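
To illustrate how the pieces listed above could fit together, here is a minimal standard-library sketch: a task list read from a metadata CSV, expanded into repeated, seeded environment configurations, and round-tripped through JSON. It is not the code from this PR. `BenchmarkSketch`, `EnvArgs`, `make_env_args_list`, the CSV column names and the action names are all hypothetical stand-ins, and it uses `json` and `random` in place of `dataclasses_json` and `np.random.RandomState`.

```python
import csv
import io
import json
import random
from dataclasses import asdict, dataclass


@dataclass
class EnvArgs:
    """Hypothetical stand-in for one environment configuration."""

    task_name: str
    task_seed: int
    max_steps: int


@dataclass
class BenchmarkSketch:
    """Hypothetical stand-in for a JSON-serializable benchmark (stdlib only)."""

    name: str
    high_level_action_set: list[str]
    env_args_list: list[EnvArgs]

    def to_json(self) -> str:
        # asdict() recursively converts nested dataclasses to plain dicts.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "BenchmarkSketch":
        data = json.loads(text)
        # Rebuild the nested dataclasses that json.loads returns as dicts.
        data["env_args_list"] = [EnvArgs(**e) for e in data["env_args_list"]]
        return cls(**data)


def make_env_args_list(task_list, max_steps, n_repeats, rng):
    """Repeat each task n_repeats times, drawing one seed per repeat."""
    return [
        EnvArgs(task_name=task, task_seed=rng.randint(0, 2**31 - 1), max_steps=max_steps)
        for task in task_list
        for _ in range(n_repeats)
    ]


# Hypothetical metadata CSV; the column names are made up for illustration.
TASK_METADATA_CSV = """task_name,category
miniwob.click-dialog,dialog
miniwob.click-checkboxes,form
"""

rows = list(csv.DictReader(io.StringIO(TASK_METADATA_CSV)))
task_list = [row["task_name"] for row in rows]

benchmark = BenchmarkSketch(
    name="miniwob_tiny_test",
    high_level_action_set=["click", "fill"],  # placeholder action names
    env_args_list=make_env_args_list(
        task_list, max_steps=5, n_repeats=2, rng=random.Random(42)
    ),
)

# Serialize, then rebuild an equal object from the JSON string.
restored = BenchmarkSketch.from_json(benchmark.to_json())
assert restored == benchmark
```

Passing the random generator in explicitly keeps the seed sampling reproducible, which mirrors the `seeds_rng` argument in the example above.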

Missing:

  • unit tests
  • workarena l2 and l3 task metadata
  • webarena metadata (categories)
  • visualwebarena task list and metadata
  • integration in AgentLab (different PR)

@gasse gasse requested a review from recursix October 7, 2024 21:33
@gasse gasse force-pushed the benchmark_metadata branch from 3422c54 to fa847fd Compare October 7, 2024 21:34

gasse commented Oct 8, 2024

Solves #170 and #160.

@gasse gasse force-pushed the benchmark_metadata branch 2 times, most recently from 94598a8 to e57c0c9 Compare October 8, 2024 20:13
@gasse gasse force-pushed the benchmark_metadata branch 2 times, most recently from 8891d50 to 36ba273 Compare October 10, 2024 14:36
@gasse gasse force-pushed the benchmark_metadata branch from 620b5f4 to 8ee51e0 Compare October 16, 2024 22:18
@gasse gasse requested a review from recursix October 16, 2024 22:24

gasse commented Oct 17, 2024

Trust-approved by @recursix

@gasse gasse merged commit 3a33a69 into main Oct 17, 2024
12 checks passed
@gasse gasse deleted the benchmark_metadata branch October 17, 2024 20:33
@xhluca xhluca mentioned this pull request Oct 18, 2024
qipeng pushed a commit to orby-ai-engineering/BrowserGym that referenced this pull request Nov 20, 2024
qipeng added a commit to orby-ai-engineering/BrowserGym that referenced this pull request Jan 18, 2025
* Patch VWA task IDs

* Add BLIP2 evaluator; patch timeout

* Actually add the captioning_fn into evaluator constructor

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* support custom observation

* add user data dir

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <[email protected]>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <[email protected]>
Co-authored-by: ljang0 <[email protected]>
Co-authored-by: Megh Thakkar <[email protected]>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* add init script support

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump v0.13.1

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Authors update (ServiceNow#260)

* TapeAgents export for experiment results (ServiceNow#238)

* Update README.md

* Cleanup

* Add weblinx_browsergym as a dependency (ServiceNow#261)

* Typo fix (ServiceNow#264)

* Update requirements.txt to latest libvisualwebarena package that includes local hosting (ServiceNow#165)

* adding AgentInfo to __init__ for convenience (ServiceNow#166)

* libvisualwebarena==0.0.14 (ServiceNow#171)

fixed the jsons file!

* Leaner traces (ServiceNow#169)

* images aren't saved in pkl files anymore, and are stuffed back in at load time

* added kwargs to control img/som saving

* saving as png, adding screenshots back into obs

* retrocompatibility for image loading

* making get_screenshots work for png and jpg

* fixing image types and closing files

* Goal refactor to allow for local image files (ServiceNow#110)


---------

Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]>
Co-authored-by: Maxime Gasse <[email protected]>

* version bump 0.8.0

* Integrate AgentLab tests (ServiceNow#176)

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <[email protected]>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <[email protected]>
Co-authored-by: ljang0 <[email protected]>
Co-authored-by: Megh Thakkar <[email protected]>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <[email protected]>

* version bump v0.13.1

* Fix broken links

* Update README.md

* fix merging issues

* Update README.md (ServiceNow#270)

* Update README.md

* README update

* More permissive WA/VWA instance reset (ServiceNow#272)

* New debug benchmark visualwebarena_tiny (ServiceNow#271)

* Version bump v0.13.2

* Update README.md

* Metadata column fix (ServiceNow#278)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Shunt WA / VWA unit tests

* README update

* Minor fixes (ServiceNow#281)

* version bump v0.13.3

* remove unused fluff

* revert more unintended changes

---------

Co-authored-by: Peng Qi <[email protected]>
Co-authored-by: Thibault LSDC <[email protected]>
Co-authored-by: Maxime Gasse <[email protected]>
Co-authored-by: Yanan Xie <[email protected]>
Co-authored-by: Alexandre Lacoste <[email protected]>
Co-authored-by: oriyor <[email protected]>
Co-authored-by: Xing Han Lu <[email protected]>
Co-authored-by: ljang0 <[email protected]>
Co-authored-by: Megh Thakkar <[email protected]>
Co-authored-by: Imene Kerboua <[email protected]>
Co-authored-by: Oleh Shliazhko <[email protected]>
Co-authored-by: Thibault Le Sellier de Chezelles <[email protected]>