Feat/browser replay extended #875

Open
wants to merge 9 commits into
base: main
63 changes: 21 additions & 42 deletions README.md
@@ -10,7 +10,8 @@ See also:
- https://github.com/OpenAdaptAI/pynput
- https://github.com/OpenAdaptAI/atomacos

# OpenAdapt: AI-First Process Automation with Large Multimodal Models (LMMs).
# OpenAdapt: Open Source Generative Process Automation.
## AI-First Process Automation with Large Multimodal Models (LMMs).

**OpenAdapt** is the **open** source software **adapt**er between Large Multimodal Models (LMMs) and traditional desktop and web Graphical User Interfaces (GUIs).

@@ -35,9 +36,8 @@ with the power of Large Multimodal Models (LMMs) by:
- Recording screenshots and associated user input
- Aggregating and visualizing user input and recordings for development
- Converting screenshots and user input into tokenized format
- Generating synthetic input via transformer model completions
- Generating task trees by analyzing recordings (work-in-progress)
- Replaying synthetic input to complete tasks (work-in-progress)
- Generating and replaying synthetic input via transformer model completions
- Generating process graphs by analyzing recording logs (work-in-progress)

The goal is similar to that of
[Robotic Process Automation](https://en.wikipedia.org/wiki/Robotic_process_automation),
@@ -165,37 +165,6 @@ pointing the cursor and left or right clicking, as described in this
[open issue](https://github.com/OpenAdaptAI/OpenAdapt/issues/145)


### Capturing Browser Events

To capture (record) browser events in Chrome, follow these steps:

1. Go to: [Chrome Extension Page](chrome://extensions/)

2. Enable `Developer mode` (located at the top right):

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/c97eb9fb-05d6-465d-85b3-332694556272)

3. Click `Load unpacked` (located at the top left).

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/00c8adf5-074a-4655-b132-fd87644007fc)

4. Select the `chrome_extension` directory:

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/71610ed3-f8d4-431a-9a22-d901127b7b0c)

5. You should see the following confirmation, indicating that the extension is loaded:

![image](https://github.com/OpenAdaptAI/OpenAdapt/assets/65433817/7ee19da9-37e0-448f-b9ab-08ef99110e85)

6. Set the flag to `true` if it is currently `false`:

![image](https://github.com/user-attachments/assets/8eba24a3-7c68-4deb-8fbe-9d03cece1482)

7. Start recording. Once recording begins, navigate to the Chrome browser, browse some pages, and perform a few clicks. Then, stop the recording and let it complete successfully.

8. After recording, check the `openadapt.db` table `browser_event`. It should contain all your browser activity logs. You can verify the data's correctness using the `sqlite3` CLI or an extension like `SQLite Viewer` in VS Code to open `data/openadapt.db`.


### Visualize

Quickly visualize the latest recording you created by running the following command:
@@ -243,6 +212,7 @@ Other replay strategies include:
- [`StatefulReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/stateful.py): Early proof-of-concept which uses the OpenAI GPT-4 API with prompts constructed via OS-level window data.
- (*)[`VisualReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/visual.py): Uses [Fast Segment Anything Model (FastSAM)](https://github.com/CASIA-IVA-Lab/FastSAM) to segment active window.
- (*)[`VanillaReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/vanilla.py): Assumes the model is capable of directly reasoning on states and actions accurately. With future frontier models, we hope that this script will suddenly work a lot better.
- (*)[`BrowserReplayStrategy`](https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/strategies/browser.py): Uses the browser extension to read the visible DOM, and refers to recorded browser events to identify target elements.


The (*) prefix indicates strategies which accept an "instructions" parameter that is used to modify the recording, e.g.:
@@ -253,6 +223,22 @@ python -m openadapt.replay VanillaReplayStrategy --instructions "calculate 9-8"

See https://github.com/OpenAdaptAI/OpenAdapt/tree/main/openadapt/strategies for a complete list. More ReplayStrategies coming soon! (see [Contributing](#Contributing)).

### Browser integration

To record browser events in Google Chrome (required by the `BrowserReplayStrategy`), follow these steps:

1. Go to your Chrome extensions page by entering [chrome://extensions](chrome://extensions/) in your address bar.

2. Enable `Developer mode` (located at the top right).

3. Click `Load unpacked` (located at the top left).

4. Select the `chrome_extension` directory in the OpenAdapt repo.

5. Make sure the Chrome extension is enabled (the switch to the right of the OpenAdapt extension widget is turned on).

6. Set the `RECORD_BROWSER_EVENTS` flag to `true` in `openadapt/data/config.json`.

## Features

### State-of-the-art GUI understanding via [Segment Anything in High Quality](https://github.com/SysCV/sam-hq):
@@ -306,13 +292,6 @@ We're looking forward to your contributions. Let's build the future 🚀

## Contributing

### Notable Works-in-progress (incomplete, see https://github.com/OpenAdaptAI/OpenAdapt/pulls and https://github.com/OpenAdaptAI/OpenAdapt/issues/ for more)

- [Video Recording Hardware Acceleration](https://github.com/OpenAdaptAI/OpenAdapt/issues/570) (help wanted)
- [Audio Narration](https://github.com/OpenAdaptAI/OpenAdapt/pull/346) (help wanted)
- [Chrome Extension](https://github.com/OpenAdaptAI/OpenAdapt/pull/364) (help wanted)
- [Gemini Vision](https://github.com/OpenAdaptAI/OpenAdapt/issues/551) (help wanted)

### Replay Problem Statement

Our goal is to automate the task described and demonstrated in a `Recording`.
73 changes: 60 additions & 13 deletions chrome_extension/background.js
@@ -1,33 +1,28 @@
/**
* @file background.js
* @description Creates a new background script that listens for messages from the content script
* and sends them to a WebSocket server.
*/
* @description Background script that maintains the current mode and communicates with content scripts.
*/

let socket;
let currentMode = null; // Maintain the current mode here
let timeOffset = 0; // Global variable to store the time offset

/*
* TODO:
* Ideally we read `WS_SERVER_PORT`, `WS_SERVER_ADDRESS` and
* `RECONNECT_TIMEOUT_INTERVAL` from config.py,
* or it gets passed in somehow.
*/
/*
* Note: these need to match the corresponding values in config[.defaults].json
*/
let RECONNECT_TIMEOUT_INTERVAL = 1000; // ms
let WS_SERVER_PORT = 8765;
let WS_SERVER_ADDRESS = "localhost";
let WS_SERVER_URL = "ws://" + WS_SERVER_ADDRESS + ":" + WS_SERVER_PORT;


function socketSend(socket, message) {
console.log({ message });
socket.send(JSON.stringify(message));
}


/*
* Function to connect to the WebSocket server.
*/
*/
function connectWebSocket() {
socket = new WebSocket(WS_SERVER_URL);

@@ -38,11 +33,34 @@ function connectWebSocket() {
socket.onmessage = function(event) {
console.log("Message from server:", event.data);
const message = JSON.parse(event.data);

// Handle mode messages
if (message.type === 'SET_MODE') {
currentMode = message.mode; // Update the current mode
console.log(`Mode set to: ${currentMode}`);

// Send the mode to all active tabs
chrome.tabs.query(
{
active: true,
},
function(tabs) {
tabs.forEach(function(tab) {
chrome.tabs.sendMessage(tab.id, message, function(response) {
if (chrome.runtime.lastError) {
console.error("Error sending message to content script in tab " + tab.id, chrome.runtime.lastError.message);
} else {
console.log("Message sent to content script in tab " + tab.id, response);
}
});
});
}
);
}
};

socket.onclose = function(event) {
console.log("WebSocket connection closed", event);
// Reconnect after RECONNECT_TIMEOUT_INTERVAL ms if the connection is lost
setTimeout(connectWebSocket, RECONNECT_TIMEOUT_INTERVAL);
};
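
The `onclose` handler above retries the connection after a fixed delay of `RECONNECT_TIMEOUT_INTERVAL` milliseconds. A minimal sketch of that policy in isolation follows; `createReconnectingSocket`, `FakeSocket`, and `fakeSchedule` are hypothetical names introduced here so the logic can run outside the browser, and none of them appear in the PR.

```javascript
// Sketch only: `createReconnectingSocket` and its injected parameters are
// hypothetical, not part of background.js.
function createReconnectingSocket(url, intervalMs, SocketCtor, schedule) {
  const state = { socket: null, attempts: 0 };

  function connect() {
    state.attempts += 1;
    state.socket = new SocketCtor(url);
    // Same policy as the diff: when the socket closes, retry after a fixed delay.
    state.socket.onclose = function () {
      schedule(connect, intervalMs);
    };
  }

  connect();
  return state;
}

// Stand-ins for WebSocket and setTimeout so the sketch runs without a browser.
class FakeSocket {
  constructor(url) {
    this.url = url;
    this.onclose = null;
  }
}
const scheduled = [];
const fakeSchedule = (fn, ms) => scheduled.push({ fn, ms });

const demo = createReconnectingSocket(
  "ws://localhost:8765", // matches WS_SERVER_ADDRESS and WS_SERVER_PORT above
  1000,                  // matches RECONNECT_TIMEOUT_INTERVAL above
  FakeSocket,
  fakeSchedule
);
demo.socket.onclose(); // simulate a dropped connection
scheduled[0].fn();     // simulate the timer firing
console.log(demo.attempts, scheduled[0].ms); // prints "2 1000"
```

Injecting the socket constructor and the scheduler is a design choice made for this sketch only; it keeps the retry logic testable without a live server. In the extension itself, the real `WebSocket` and `setTimeout` would presumably take their place.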

Expand All @@ -66,3 +84,32 @@ chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
sendResponse({ status: "WebSocket connection not open" });
}
});

/* Listen for tab activation */
chrome.tabs.onActivated.addListener((activeInfo) => {
// Send current mode to the newly active tab if it's not null
if (currentMode) {
const message = { type: 'SET_MODE', mode: currentMode };
chrome.tabs.sendMessage(activeInfo.tabId, message, function(response) {
if (chrome.runtime.lastError) {
console.error("Error sending message to content script in tab " + activeInfo.tabId, chrome.runtime.lastError.message);
} else {
console.log("Message sent to content script in tab " + activeInfo.tabId, response);
}
});
}
});

/* Listen for tab updates to handle new pages or reloading */
chrome.tabs.onUpdated.addListener((tabId, changeInfo, tab) => {
if (changeInfo.status === 'complete' && currentMode) {
const message = { type: 'SET_MODE', mode: currentMode };
chrome.tabs.sendMessage(tabId, message, function(response) {
if (chrome.runtime.lastError) {
console.error("Error sending message to content script in tab " + tabId, chrome.runtime.lastError.message);
} else {
console.log("Message sent to content script in tab " + tabId, response);
}
});
}
});
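
The `SET_MODE` send and its `chrome.runtime.lastError` check appear three times in this file: in `socket.onmessage`, in the `onActivated` listener, and in the `onUpdated` listener. One way to consolidate them is sketched below. `sendModeToTab` is a hypothetical helper, `fakeChrome` is a stand-in for the extension's `chrome` global, and the mode value `"record"` is an assumed example; none of these come from the PR.

```javascript
// Sketch only: `sendModeToTab` is a hypothetical helper, not part of the PR.
// `api` defaults to the extension's global `chrome` object; passing a stand-in
// lets the function run outside the browser.
function sendModeToTab(tabId, mode, api = globalThis.chrome) {
  const message = { type: "SET_MODE", mode: mode };
  api.tabs.sendMessage(tabId, message, function (response) {
    if (api.runtime.lastError) {
      console.error(
        "Error sending message to content script in tab " + tabId,
        api.runtime.lastError.message
      );
    } else {
      console.log("Message sent to content script in tab " + tabId, response);
    }
  });
  return message;
}

// Stand-in for `chrome` that records calls instead of messaging a real tab.
const sent = [];
const fakeChrome = {
  runtime: { lastError: undefined },
  tabs: {
    sendMessage(tabId, message, callback) {
      sent.push({ tabId, message });
      callback({ status: "ok" });
    },
  },
};

// "record" is an assumed mode value used only for illustration.
sendModeToTab(7, "record", fakeChrome);
```

With a helper like this, the `onActivated` listener body would reduce to `if (currentMode) sendModeToTab(activeInfo.tabId, currentMode);`. One difference to keep in mind: the `socket.onmessage` path in the diff forwards the server's original `message` object, whereas this sketch rebuilds the message from `type` and `mode` alone, so any extra fields sent by the server would be dropped.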