A toolkit that supports a degree of interaction with rendered web pages in order to perform web-based tasks (e.g., clicking elements, scrolling pages, opening a given URL, taking a screenshot, and using an MLLM to understand the page content).
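The interactions listed above map naturally onto a small action vocabulary. The following is a minimal Python sketch that assumes Playwright's synchronous API as the backend. `goto`, `get_by_text(...).first.click()`, `mouse.wheel` and `screenshot(path=...)` are real Playwright `Page`/`Locator` calls. The action classes, `run_action`, and the `mllm` callback (a wrapper around whichever multimodal model is chosen) are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Union


# Hypothetical action vocabulary for the toolkit (not an existing API).
@dataclass
class OpenUrl:
    url: str


@dataclass
class Click:
    text: str  # visible text of the element to click


@dataclass
class Scroll:
    dy: int = 800  # pixels to scroll; negative values scroll up


@dataclass
class Screenshot:
    path: str


@dataclass
class AskMLLM:
    question: str
    image_path: str  # e.g. a screenshot taken by a previous action


Action = Union[OpenUrl, Click, Scroll, Screenshot, AskMLLM]


def run_action(page: Any, action: Action,
               mllm: Optional[Callable[[str, str], str]] = None) -> Optional[str]:
    """Execute one action against a Playwright-style synchronous Page object.

    Returns the model's answer for AskMLLM, otherwise None.
    """
    if isinstance(action, OpenUrl):
        page.goto(action.url)
    elif isinstance(action, Click):
        # .first avoids Playwright's strict-mode error when several elements match.
        page.get_by_text(action.text).first.click()
    elif isinstance(action, Scroll):
        page.mouse.wheel(0, action.dy)
    elif isinstance(action, Screenshot):
        page.screenshot(path=action.path)
    elif isinstance(action, AskMLLM):
        if mllm is None:
            raise ValueError("AskMLLM requires an mllm callable")
        return mllm(action.question, action.image_path)
    else:
        raise TypeError(f"unknown action: {action!r}")
    return None
```

With Playwright installed, a real `page` would come from `sync_playwright()`, then `p.chromium.launch()` and `browser.new_page()`. Representing actions as plain data makes them easy to log, replay, or emit from an LLM planner.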
Example task:
{
  "Question": "Eva Draconis has a personal website which can be accessed on her YouTube page. What is the meaning of the only symbol seen in the top banner that has a curved line that isn't a circle or a portion of a circle? Answer without punctuation.",
  "Final answer": "War is not here this is a land of peace",
  "Annotation Metadata": {
    "Steps": "1. By googling Eva Draconis youtube, you can find her channel.\n2. In her about section, she has written her website URL, orionmindproject.com.\n3. Entering this website, you can see a series of symbols at the top, and the text \"> see what the symbols mean here\" below it.\n4. Reading through the entries, you can see a short description of some of the symbols.\n5. The only symbol with a curved line that isn't a circle or a portion of a circle is the last one.\n6. Note that the symbol supposedly means \"War is not here, this is a land of peace.\"",
    "Number of steps": "6",
    "How long did this take?": "30 minutes.",
    "Tools": "1. A web browser.\n2. A search engine.\n3. Access to YouTube\n4. Image recognition tools",
    "Number of tools": "4"
  }
}
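When a toolkit is evaluated against many records like this one, their internal consistency can be checked mechanically. The following is a minimal Python sketch that copies a subset of the record above. It assumes answers are compared after stripping punctuation and case, based only on the question's "Answer without punctuation"; the benchmark's actual scoring rule is not given here, so `normalize_answer` and `split_numbered` are hypothetical helpers.

```python
import re
import string

# Subset of the example task record, copied as a Python dict.
task = {
    "Final answer": "War is not here this is a land of peace",
    "Annotation Metadata": {
        "Steps": (
            "1. By googling Eva Draconis youtube, you can find her channel.\n"
            "2. In her about section, she has written her website URL, orionmindproject.com.\n"
            "3. Entering this website, you can see a series of symbols at the top, "
            "and the text \"> see what the symbols mean here\" below it.\n"
            "4. Reading through the entries, you can see a short description of some of the symbols.\n"
            "5. The only symbol with a curved line that isn't a circle or a portion of a circle is the last one.\n"
            "6. Note that the symbol supposedly means \"War is not here, this is a land of peace.\""
        ),
        "Number of steps": "6",
        "Tools": "1. A web browser.\n2. A search engine.\n3. Access to YouTube\n4. Image recognition tools",
        "Number of tools": "4",
    },
}


def split_numbered(text: str) -> list[str]:
    """Split a '1. ...\\n2. ...' string into items, dropping the numeric prefixes."""
    return [re.sub(r"^\d+\.\s*", "", line) for line in text.split("\n") if line.strip()]


def normalize_answer(text: str) -> str:
    """Lower-case, drop ASCII punctuation, and collapse whitespace (assumed scoring rule)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


meta = task["Annotation Metadata"]
steps = split_numbered(meta["Steps"])
tools = split_numbered(meta["Tools"])

# The declared counts should agree with the enumerated lists.
assert len(steps) == int(meta["Number of steps"])
assert len(tools) == int(meta["Number of tools"])

# The expected answer is the quoted meaning at the end of the last step, minus punctuation.
quoted = re.search(r'"([^"]+)"$', steps[-1]).group(1)
assert normalize_answer(quoted) == normalize_answer(task["Final answer"])
```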
Stagehand: compared with other libraries, Stagehand provides natural-language APIs (act, extract, and observe) on top of Playwright. Its key feature is a lightweight, model-agnostic framework for executing atomic web tasks via natural-language instructions (e.g., "Click the link to the quickstart").
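The following sketch shows how a natural-language layer might turn an instruction into one atomic action. It is not Stagehand's implementation or API. It is a hypothetical Python illustration that assumes an "observe" step has already produced a list of candidate elements, and it uses plain word-overlap scoring in place of the LLM that would normally choose among them. Only `page.locator(...).click()` is a real Playwright call. `Candidate`, `resolve` and `act` are illustrative names.

```python
import re
from dataclasses import dataclass
from typing import Any


@dataclass
class Candidate:
    """An interactive element an 'observe' step might report (hypothetical shape)."""
    role: str      # e.g. "link" or "button"
    name: str      # accessible name / visible text
    selector: str  # CSS selector a later 'act' step uses to locate it


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def resolve(instruction: str, candidates: list[Candidate]) -> Candidate:
    """Pick the candidate whose role and name share the most words with the instruction.

    Word overlap is a stand-in for an LLM choosing among the candidates.
    """
    wanted = tokens(instruction)

    def score(c: Candidate) -> int:
        return len(wanted & tokens(f"{c.role} {c.name}"))

    best = max(candidates, key=score)  # first candidate wins on ties
    if score(best) == 0:
        raise LookupError(f"no element matches {instruction!r}")
    return best


def act(page: Any, instruction: str, candidates: list[Candidate]) -> Candidate:
    """Resolve the instruction, then click the chosen element via page.locator(selector)."""
    target = resolve(instruction, candidates)
    page.locator(target.selector).click()
    return target
```

Replacing `resolve` with a model call (prompting with the instruction and the candidate list) keeps the same shape: observe, choose, then execute a single deterministic Playwright action.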
steel-browser: from my perspective, its key features are 1) post-processing of page data (easily extracting page data as cleaned HTML, markdown, PDFs, or screenshots), 2) bypassing anti-bot measures, and 3) optimizing data formats to reduce LLM token usage.
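steel-browser's own conversion pipeline is not shown here. To show why cleaned output saves tokens, the following is a minimal, stdlib-only Python sketch that drops script and style content, renders headings and links in markdown style, and collapses whitespace. The class and function names are hypothetical, and real pages would need more rules (tables, images, hidden elements).

```python
from html.parser import HTMLParser


class CompactText(HTMLParser):
    """Reduce HTML to compact, markdown-like text for an LLM prompt (illustrative only)."""

    SKIP = {"script", "style", "noscript"}
    BLOCK = {"p", "div", "li", "tr", "br", "section", "article",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self.skip_depth = 0           # >0 while inside script/style/noscript
        self.href = None              # href of the <a> currently open, if any
        self.link_text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.out.append("\n")
            if tag[0] == "h":         # h1..h6 -> "#", "##", ...
                self.out.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK:
            self.out.append("\n")
        elif tag == "a":
            if self.href is not None:
                text = " ".join("".join(self.link_text).split())
                self.out.append(f"[{text}]({self.href})" if self.href else text)
            self.href = None

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.out.append(data)


def html_to_compact_text(html: str) -> str:
    """Return non-empty, whitespace-collapsed lines of the converted page."""
    parser = CompactText()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.out).splitlines())
    return "\n".join(line for line in lines if line)
```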
Required prerequisites
Motivation
Solution
Study existing solutions such as:
https://www.browserbase.com/ (https://docs.stagehand.dev/get_started/introduction)
https://github.com/steel-dev/steel-browser
https://pptr.dev/guides/getting-started
https://playwright.dev/
Alternatives
No response
Additional context
No response