feat(web): refactor the web operator (#772)
Because

- we want to change the Scrape Page task to Scrape Pages, so one run can scrape multiple URLs
- some merged code was not updated after the review of #753

This commit

- changes the Scrape Page task to Scrape Pages
- updates the code based on the review

Note

- web operator QA results are in the `Test Result` thread in the Linear ticket
- migration QA results are in the `Migration QA` thread in the Linear ticket
chuang8511 authored Oct 30, 2024
1 parent 8d842a6 commit ae4e3c2
Showing 14 changed files with 542 additions and 184 deletions.
38 changes: 24 additions & 14 deletions pkg/component/operator/web/v0/.compogen/bottom.mdx
@@ -1,44 +1,54 @@


## Example Recipes

```yaml
version: v1beta

variable:
  url:
    title: URL
    instill-format: string

component:
  crawler:
    type: web
    input:
      url: ${variable.url}
      allowed-domains:
      max-k: 30
      timeout: 1000
      max-depth: 0
    condition:
    task: TASK_CRAWL_SITE

  json-filter:
    type: json
    input:
      json-value: ${crawler.output.pages}
      jq-filter: .[] | ."link"
    condition:
    task: TASK_JQ

  scraper:
    type: web
    input:
      urls: ${json-filter.output.results}
      scrape-method: http
      include-html: false
      only-main-content: true
      remove-tags:
      only-include-tags:
      timeout: 0
    condition:
    task: TASK_SCRAPE_PAGES

output:
  pages:
    title: Pages
    value: ${crawler.output.pages}
  links:
    title: Links
    value: ${json-filter.output.results}
  scraper-pages:
    title: Scraper Pages
    value: ${scraper.output.pages}
```
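The `json-filter` step in the recipe above uses the jq filter `.[] | ."link"` to turn the crawler's page objects into a flat list of URLs, which then feeds the scraper's `urls` input. As a rough illustration, this Go sketch performs the same transformation; the `Page` struct is an assumption that models only the `link` field the recipe references, not the crawler's full output schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Page models only the field this recipe references; the real
// crawler output may carry more fields (title, etc.).
type Page struct {
	Link string `json:"link"`
}

func main() {
	// Sample data shaped like ${crawler.output.pages}.
	raw := `[{"link":"https://example.com/"},{"link":"https://example.com/docs"}]`

	var pages []Page
	if err := json.Unmarshal([]byte(raw), &pages); err != nil {
		panic(err)
	}

	// Equivalent of `.[] | ."link"`: emit one link per array element.
	for _, p := range pages {
		fmt.Println(p.Link)
	}
}
```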
4 changes: 2 additions & 2 deletions pkg/component/operator/web/v0/.compogen/scrape_page.mdx
@@ -26,7 +26,7 @@


#### About Dynamic Content
`TASK_SCRAPE_PAGES` supports fetching dynamic content from web pages by simulating user behaviours, such as scrolling down. The initial implementation includes the following capabilities:

Scrolling:
- Mimics user scrolling down the page to load additional content dynamically.
@@ -36,4 +36,4 @@ Future enhancements will include additional user interactions, such as:
- Taking Screenshots: Capture screenshots of the current view.
- Keyboard Actions: Simulate key presses and other keyboard interactions.

`TASK_SCRAPE_PAGES` aims to provide a robust framework for interacting with web pages and extracting dynamic content effectively.
77 changes: 48 additions & 29 deletions pkg/component/operator/web/v0/README.mdx
@@ -8,7 +8,7 @@ description: "Learn about how to set up a VDP Web component https://github.com/i
The Web component is an operator component that allows users to scrape websites.
It can carry out the following tasks:
- [Crawl Site](#crawl-site)
- [Scrape Pages](#scrape-pages)
- [Scrape Sitemap](#scrape-sitemap)


@@ -32,7 +32,7 @@ The component definition and tasks are defined in the [definition.json](https://

### Crawl Site

This task involves systematically navigating through a website, starting from a designated page (typically the homepage), and following internal links to discover and retrieve page titles and URLs. The process is limited to 120 seconds and only collects links and titles from multiple pages; it does not extract the content of the pages themselves. If you need to collect specific content from individual pages, please use the [Scrape Pages](#scrape-pages) task instead.

<div class="markdown-col-no-wrap" data-col-1 data-col-2>

@@ -72,16 +72,16 @@ This task involves systematically navigating through a website, starting from a
</div>
</details>
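For intuition about how the crawl limits interact, the sketch below shows one way a breadth-first, bounded crawl could be written with goquery. It is a simplified illustration under stated assumptions: it is not the component's actual code, and its `maxDepth` semantics (depth 0 collects only the start page) may differ from the component's.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"

	"github.com/PuerkitoBio/goquery"
)

// crawl walks links breadth-first until maxK pages are collected,
// maxDepth is reached, or the context deadline expires.
func crawl(ctx context.Context, root string, maxK, maxDepth int) []string {
	type item struct {
		link  string
		depth int
	}
	seen := map[string]bool{root: true}
	queue := []item{{root, 0}}
	var found []string

	for len(queue) > 0 && len(found) < maxK {
		if ctx.Err() != nil { // overall time budget exhausted
			break
		}
		cur := queue[0]
		queue = queue[1:]
		found = append(found, cur.link)
		if cur.depth == maxDepth {
			continue // deep enough; record the page but don't expand it
		}
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, cur.link, nil)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			continue // skip unreachable pages
		}
		doc, err := goquery.NewDocumentFromReader(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}
		base, err := url.Parse(cur.link)
		if err != nil {
			continue
		}
		doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
			href, _ := s.Attr("href")
			ref, err := url.Parse(href)
			if err != nil {
				return
			}
			next := base.ResolveReference(ref).String()
			if !seen[next] {
				seen[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		})
	}
	return found
}

func main() {
	// Mirror the documented 120-second overall cap.
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
	defer cancel()
	for _, link := range crawl(ctx, "https://example.com/", 30, 1) {
		fmt.Println(link)
	}
}
```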

### Scrape Pages

This task focuses on extracting specific data from targeted webpages by parsing their HTML structure. Unlike crawling, which navigates across multiple pages, scraping retrieves content only from the specified pages. After scraping, the data can be further processed with [jQuery](https://www.w3schools.com/jquery/jquery_syntax.asp)-style filters, applied in a fixed sequence: `only-main-content`, then `remove-tags`, then `only-include-tags`. Refer to the [jQuery Syntax Examples](#jquery-syntax-examples) for more details on how to filter and manipulate the data. So that a single failing URL does not spoil the whole batch, the task does not return an error when an individual URL fails; instead, it returns the content of every page that was scraped successfully.
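To make the filter order concrete, here is a minimal Go sketch using the goquery library; the library choice and helper shape are assumptions for illustration, not the component's actual implementation. A batch scraper in the same spirit would call this once per URL and skip failures instead of aborting.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// filterHTML applies the documented order:
// only-main-content, then remove-tags, then only-include-tags.
func filterHTML(html string, onlyMain bool, removeTags, onlyIncludeTags []string) (string, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		return "", err
	}
	if onlyMain {
		// Drop boilerplate regions first.
		doc.Find("header, nav, footer").Remove()
	}
	for _, tag := range removeTags {
		doc.Find(tag).Remove()
	}
	if len(onlyIncludeTags) > 0 {
		// Keep only the HTML of the selected tags.
		var kept []string
		doc.Find(strings.Join(onlyIncludeTags, ", ")).Each(func(_ int, s *goquery.Selection) {
			h, _ := goquery.OuterHtml(s)
			kept = append(kept, h)
		})
		return strings.Join(kept, "\n"), nil
	}
	return doc.Html()
}

func main() {
	html := `<html><body><nav>menu</nav><article><p>hello</p><aside>ad</aside></article></body></html>`
	out, err := filterHTML(html, true, []string{"aside"}, []string{"p"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // <p>hello</p>
}
```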

<div class="markdown-col-no-wrap" data-col-1 data-col-2>

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_PAGES` |
| URLs (required) | `urls` | array[string] | The URLs to scrape the webpage contents. |
| Scrape Method (required) | `scrape-method` | string | Defines the method used for web scraping. Available options include 'http' for standard HTTP-based scraping and 'chrome-simulator' for scraping through a simulated Chrome browser environment. |
| Include HTML | `include-html` | boolean | Indicate whether to include the raw HTML of the webpage in the output. If you want to include the raw HTML, set this to true. |
| Only Main Content | `only-main-content` | boolean | Only return the main content of the page by excluding the content of the tag of header, nav, footer. |
@@ -99,17 +99,26 @@ This task focuses on extracting specific data from a single targeted webpage by

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| [Pages](#scrape-pages-pages) | `pages` | array[object] | A list of page objects that have been scraped. |
</div>

<details>
<summary> Output Objects in Scrape Pages</summary>

<h4 id="scrape-pages-pages">Pages</h4>

<div class="markdown-col-no-wrap" data-col-1 data-col-2>

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Content | `content` | string | The scraped plain content without html tags of the webpage. |
| HTML | `html` | string | The scraped html of the webpage. |
| Links on Page | `links-on-page` | array | The list of links on the webpage. |
| Markdown | `markdown` | string | The scraped markdown of the webpage. |
| [Metadata](#scrape-pages-metadata) | `metadata` | object | The metadata of the webpage. |
</div>

<h4 id="scrape-page-metadata">Metadata</h4>
<h4 id="scrape-pages-metadata">Metadata</h4>

<div class="markdown-col-no-wrap" data-col-1 data-col-2>

@@ -148,7 +157,7 @@ This task focuses on extracting specific data from a single targeted webpage by


#### About Dynamic Content
`TASK_SCRAPE_PAGES` supports fetching dynamic content from web pages by simulating user behaviours, such as scrolling down. The initial implementation includes the following capabilities:

Scrolling:
- Mimics user scrolling down the page to load additional content dynamically.
@@ -158,7 +167,7 @@ Future enhancements will include additional user interactions, such as:
- Taking Screenshots: Capture screenshots of the current view.
- Keyboard Actions: Simulate key presses and other keyboard interactions.

`TASK_SCRAPE_PAGES` aims to provide a robust framework for interacting with web pages and extracting dynamic content effectively.
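As an illustration of how the `chrome-simulator` path could mimic scrolling, here is a minimal sketch using the chromedp library; the library choice, timeout, and sleep interval are assumptions, not details taken from this commit. Against a page that lazy-loads content, this should return noticeably more HTML than a plain HTTP GET.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/"),
		// Mimic a user scrolling to the bottom so lazy content loads.
		chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil),
		// Give dynamically loaded content a moment to render.
		chromedp.Sleep(2*time.Second),
		chromedp.OuterHTML("html", &html),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(html), "bytes of rendered HTML")
}
```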

### Scrape Sitemap

@@ -185,47 +194,57 @@ This task extracts data directly from a website’s sitemap. A sitemap is typica
</div>
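A sitemap is a standard XML file (a `<urlset>` of `<url>` entries with `<loc>` and optional `<lastmod>`), so extracting its URLs is mechanical. This sketch is illustrative only, not the component's implementation:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// urlset mirrors the standard sitemap schema.
type urlset struct {
	URLs []struct {
		Loc     string `xml:"loc"`
		LastMod string `xml:"lastmod"`
	} `xml:"url"`
}

func main() {
	sitemap := `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-10-30</lastmod></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>`

	var s urlset
	if err := xml.NewDecoder(strings.NewReader(sitemap)).Decode(&s); err != nil {
		panic(err)
	}
	for _, u := range s.URLs {
		fmt.Println(u.Loc, u.LastMod)
	}
}
```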




## Example Recipes

```yaml
version: v1beta

variable:
  url:
    title: URL
    instill-format: string

component:
  crawler:
    type: web
    input:
      url: ${variable.url}
      allowed-domains:
      max-k: 30
      timeout: 1000
      max-depth: 0
    condition:
    task: TASK_CRAWL_SITE

  json-filter:
    type: json
    input:
      json-value: ${crawler.output.pages}
      jq-filter: .[] | ."link"
    condition:
    task: TASK_JQ

  scraper:
    type: web
    input:
      urls: ${json-filter.output.results}
      scrape-method: http
      include-html: false
      only-main-content: true
      remove-tags:
      only-include-tags:
      timeout: 0
    condition:
    task: TASK_SCRAPE_PAGES

output:
  pages:
    title: Pages
    value: ${crawler.output.pages}
  links:
    title: Links
    value: ${json-filter.output.results}
  scraper-pages:
    title: Scraper Pages
    value: ${scraper.output.pages}
```
4 changes: 2 additions & 2 deletions pkg/component/operator/web/v0/config/definition.json
@@ -1,7 +1,7 @@
{
"availableTasks": [
"TASK_CRAWL_SITE",
"TASK_SCRAPE_PAGE",
"TASK_SCRAPE_PAGES",
"TASK_SCRAPE_SITEMAP"
],
"documentationUrl": "https://www.instill.tech/docs/component/operator/web",
@@ -11,7 +11,7 @@
"title": "Web",
"type": "COMPONENT_TYPE_OPERATOR",
"uid": "98909958-db7d-4dfe-9858-7761904be17e",
"version": "0.3.0",
"version": "0.4.0",
"sourceUrl": "https://github.com/instill-ai/pipeline-backend/blob/main/pkg/component/operator/web/v0",
"description": "Scrape websites",
"releaseStage": "RELEASE_STAGE_ALPHA"