
Sub task benchmark #2 (Draft)

wants to merge 3 commits into base: main

Conversation

chc012

@chc012 chc012 commented Feb 27, 2025

No description provided.

@chc012 chc012 requested a review from sanjari-orb February 27, 2025 20:57
@chc012 chc012 self-assigned this Feb 27, 2025
@@ -144,7 +144,7 @@ def __init__(


Minor nit: can we undo the changes to the other files, so that merging back into browsergym is easier in the future?

@@ -69,7 +69,7 @@ def _normalize_number(text: str) -> str:


def _answer_to_bags(


ditto

@@ -237,15 +237,15 @@ def override_property(task, env, property):
browser_args = []


ditto

from subtaskbench import Evaluator

self.evaluator = Evaluator(
start_url=self.task_config["start_url"],


Hmm, so the start URL field might be a problem for this benchmark, because it will be ephemeral and thus cannot be read from a pre-defined config. One thing we can do is let this be an environment ID and pass it to a handler in this file which can return the correct start URL (can be left empty for now).
The same applies to the creation of the playwright.sync_api.Page as well.
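A minimal sketch of the handler idea described above, assuming a registry keyed by environment ID; all names here (`register_start_url_handler`, `resolve_start_url`, `"demo-env"`) are illustrative, not part of this PR's API:

```python
# Hypothetical sketch: resolve an ephemeral start URL from an environment ID
# at runtime instead of reading it from a pre-defined config.
from typing import Callable, Dict

_START_URL_HANDLERS: Dict[str, Callable[[], str]] = {}


def register_start_url_handler(env_id: str, handler: Callable[[], str]) -> None:
    """Register a callable that returns the current start URL for env_id."""
    _START_URL_HANDLERS[env_id] = handler


def resolve_start_url(env_id: str) -> str:
    """Look up the start URL for an environment; empty string if unknown."""
    handler = _START_URL_HANDLERS.get(env_id)
    return handler() if handler else ""


# Example: an ephemeral environment whose URL is only known at runtime.
register_start_url_handler("demo-env", lambda: "http://localhost:8080/")
```

The same registry pattern could hand back a `playwright.sync_api.Page` factory per environment, keeping the task config free of ephemeral values.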

start_url=self.task_config["start_url"],
goal=self.task_config["goal"],
evaluation_script=self.task_config["evaluation_script"],
expected_output=self.task_config["expected_output"],


What is the expected_output field expected to look like? Is this for information-seeking tasks (e.g., string_match, fuzzy-match type evals)?
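For reference, a minimal sketch of the two eval styles the question mentions; the function names and the 0.8 threshold are assumptions for illustration, not the PR's actual evaluator:

```python
# Hypothetical sketch of string_match / fuzzy-match evals for
# information-seeking tasks. Not the PR's API; names are illustrative.
from difflib import SequenceMatcher


def string_match(answer: str, expected: str) -> bool:
    """Exact match after trimming whitespace and case-folding."""
    return answer.strip().lower() == expected.strip().lower()


def fuzzy_match(answer: str, expected: str, threshold: float = 0.8) -> bool:
    """Similarity-ratio match using stdlib difflib's SequenceMatcher."""
    ratio = SequenceMatcher(
        None, answer.strip().lower(), expected.strip().lower()
    ).ratio()
    return ratio >= threshold
```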
