feat(web): refactor web operator #753

Merged (2 commits) on Oct 18, 2024
7 changes: 4 additions & 3 deletions pkg/component/operator/web/v0/README.mdx
@@ -34,8 +34,8 @@ This task involves systematically navigating through a website, starting from a
| Task ID (required) | `task` | string | `TASK_CRAWL_SITE` |
| URL (required) | `url` | string | The root URL to scrape. All links on this page will be scraped, and all links on those pages, and so on. |
| Allowed Domains | `allowed-domains` | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max-k` | integer | Max-K specifies the maximum number of pages to return. If max-k is set to 0, all available pages will be returned, up to a maximum of 100. If max-k is set to a positive number, the result will include up to max-k pages, but no more than that. |
| Timeout | `timeout` | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. |
| Max Number of Pages (required) | `max-k` | integer | Max-K sets a limit on the number of pages to fetch. If Max-K is set to 0, all available pages will be fetched within the time limit of 120 seconds. If Max-K is a positive number, the fetch will return up to that many pages, but no more. |
| Timeout | `timeout` | integer | The time to wait for a page to load in milliseconds. Min 0, Max 60000. Note that this timeout applies to each page rather than to the whole crawl task. |
| Max Depth | `max-depth` | integer | Max Depth specifies how deep the crawler will navigate from the root URL. If max depth is set to 1, the crawler will only scrape the root URL and will not follow any links to other pages. If max depth is set to 0, the crawler will scrape all reachable pages until the total number of scraped pages reaches max-k. If both max-k and max depth are defined, the crawler will prioritize the max-k setting when determining how many pages to scrape. |
</div>
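For illustration, here is a minimal Go sketch of a crawl input that matches the fields above. The struct name `crawlInput`, the JSON tags, and all concrete values are assumptions made for this example; only the field semantics come from the table.

```go
package main

import "fmt"

// crawlInput mirrors the parameters documented above; names follow the JSON keys.
type crawlInput struct {
	URL            string   `json:"url"`
	AllowedDomains []string `json:"allowed-domains"`
	MaxK           int      `json:"max-k"`
	Timeout        int      `json:"timeout"`
	MaxDepth       int      `json:"max-depth"`
}

func main() {
	in := crawlInput{
		URL:            "https://example.com",
		AllowedDomains: []string{"example.com"},
		MaxK:           50,   // return at most 50 pages; 0 would mean "all pages" within the 120-second task limit
		Timeout:        1000, // wait up to 1s for each page, not for the whole crawl
		MaxDepth:       0,    // follow links until max-k pages have been collected
	}
	fmt.Printf("%+v\n", in)
}
```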

@@ -75,11 +75,12 @@ This task focuses on extracting specific data from a single targeted webpage by
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_PAGE` |
| URL (required) | `url` | string | The URL to scrape the webpage contents. |
| Scrape Method (required) | `scrape-method` | string | Defines the method used for web scraping. Available options include 'http' for standard HTTP-based scraping and 'chrome-simulator' for scraping through a simulated Chrome browser environment. |
| Include HTML | `include-html` | boolean | Indicate whether to include the raw HTML of the webpage in the output. If you want to include the raw HTML, set this to true. |
| Only Main Content | `only-main-content` | boolean | Only return the main content of the page by excluding the content of the header, nav, and footer tags. |
| Remove Tags | `remove-tags` | array[string] | A list of tags, classes, and ids to remove from the output. You can use [jQuery](https://www.w3schools.com/jquery/jquery_syntax.asp) to remove data. If empty, no tags will be removed. Example: 'script, .ad, #footer'. Please check the [jQuery Syntax Examples](#jquery-syntax-examples). |
| Only Include Tags | `only-include-tags` | array[string] | A list of tags, classes, and ids to include in the output. You can use [jQuery](https://www.w3schools.com/jquery/jquery_syntax.asp) to include data. If empty, all tags will be included. Example: 'script, .ad, #footer'. Please check the [jQuery Syntax Examples](#jquery-syntax-examples). |
| Timeout | `timeout` | integer | The time to wait for the page to load in milliseconds. Min 0, Max 60000. Please set it as 0 if you only want to collect static content. Please notice that if the timeout is set a small value, the page may not be fully loaded. |
| Timeout | `timeout` | integer | This parameter specifies the time to wait for a page to load, measured in milliseconds. The minimum value is 0, and the maximum value is 60,000. Please note that if you set a short timeout, the page may not fully load. Conversely, setting a long timeout could significantly increase the time it takes for the task to complete. This timeout setting applies only to the Chrome simulator. |
</div>
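To make the `remove-tags` behavior concrete, here is a small, self-contained sketch that applies jQuery-style selectors with goquery (the HTML parser this component already uses). The `removeTags` helper and the sample HTML are hypothetical; the selector strings follow the 'script, .ad, #footer' example from the table.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// removeTags deletes every element matching the given jQuery-style selectors,
// e.g. "script", ".ad", "#footer".
func removeTags(doc *goquery.Document, selectors []string) {
	for _, sel := range selectors {
		doc.Find(sel).Remove()
	}
}

func main() {
	html := `<html><body><div id="main">Hello</div><script>x()</script><div class="ad">Buy!</div></body></html>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}
	removeTags(doc, []string{"script", ".ad"})
	out, _ := doc.Html()
	fmt.Println(out) // the script and .ad elements are gone
}
```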


39 changes: 29 additions & 10 deletions pkg/component/operator/web/v0/config/tasks.json
@@ -63,7 +63,7 @@
},
"max-k": {
"default": 10,
"description": "Max-K specifies the maximum number of pages to return. If max-k is set to 0, all available pages will be returned, up to a maximum of 100. If max-k is set to a positive number, the result will include up to max-k pages, but no more than that.",
"description": "Max-K sets a limit on the number of pages to fetch. If Max-K is set to 0, all available pages will be fetched within the time limit of 120 seconds. If Max-K is a positive number, the fetch will return up to that many pages, but no more.",
Collaborator: I think we should document the 120s limitation on the task description, not the max-k field. Though I value having a note here saying that when the task timeout is reached, the available results are returned.

"instillAcceptFormats": [
"integer"
],
@@ -72,14 +72,13 @@
"value",
"reference"
],
"maximum": 100,
"minimum": 0,
"title": "Max Number of Pages",
"type": "integer"
},
"timeout": {
"default": 1000,
"description": "The time to wait for the page to load in milliseconds. Min 0, Max 60000.",
"description": "The time to wait for a page to load in milliseconds. Min 0, Max 60000. Please notice the timeout here is set for each page rather than the whole crawl task.",
"instillAcceptFormats": [
"integer"
],
@@ -238,12 +237,31 @@
"title": "URL",
"type": "string"
},
"scrape-method": {
"description": "Defines the method used for web scraping. Available options include 'http' for standard HTTP-based scraping and 'chrome-simulator' for scraping through a simulated Chrome browser environment.",
"instillAcceptFormats": [
"string"
],
"enum": [
"http",
"chrome-simulator"
],
"instillUIOrder": 1,
"instillUpstreamTypes": [
"value",
"reference",
"template"
],
"default": "http",
"title": "Scrape Method",
"type": "string"
},
"include-html": {
"description": "Indicate whether to include the raw HTML of the webpage in the output. If you want to include the raw HTML, set this to true.",
"instillAcceptFormats": [
"boolean"
],
"instillUIOrder": 1,
"instillUIOrder": 2,
"instillUpstreamTypes": [
"value",
"reference"
@@ -256,7 +274,7 @@
"instillAcceptFormats": [
"boolean"
],
"instillUIOrder": 2,
"instillUIOrder": 3,
"instillUpstreamTypes": [
"value",
"reference"
@@ -269,7 +287,7 @@
"instillAcceptFormats": [
"array:string"
],
"instillUIOrder": 3,
"instillUIOrder": 4,
"instillUpstreamTypes": [
"value",
"reference"
@@ -285,7 +303,7 @@
"instillAcceptFormats": [
"array:string"
],
"instillUIOrder": 4,
"instillUIOrder": 5,
"instillUpstreamTypes": [
"value",
"reference"
@@ -298,11 +316,11 @@
},
"timeout": {
"default": 1000,
"description": "The time to wait for the page to load in milliseconds. Min 0, Max 60000. Please set it as 0 if you only want to collect static content. Please notice that if the timeout is set a small value, the page may not be fully loaded.",
"description": "This parameter specifies the time to wait for a page to load, measured in milliseconds. The minimum value is 0, and the maximum value is 60,000. Please note that if you set a short timeout, the page may not fully load. Conversely, setting a long timeout could significantly increase the time it takes for the task to complete. This timeout setting applies only to the Chrome simulator.",
"instillAcceptFormats": [
"integer"
],
"instillUIOrder": 5,
"instillUIOrder": 6,
"instillUpstreamTypes": [
"value",
"reference"
@@ -314,7 +332,8 @@
}
},
"required": [
"url"
"url",
"scrape-method"
],
"title": "Input",
"type": "object"
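As a quick sketch of what a TASK_SCRAPE_PAGE input could look like under the updated schema (with `scrape-method` now required), here is one way to build it as a `structpb.Struct`. All concrete values are examples, not defaults taken from the component.

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// Example input matching the updated TASK_SCRAPE_PAGE schema above.
	input, err := structpb.NewStruct(map[string]interface{}{
		"url":               "https://example.com/article",
		"scrape-method":     "http", // required since this PR; "chrome-simulator" is the other option
		"include-html":      false,
		"only-main-content": true,
		"remove-tags":       []interface{}{"script", ".ad", "#footer"},
		"timeout":           0, // 0: only collect static content
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(input.Fields["scrape-method"].GetStringValue())
}
```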
85 changes: 70 additions & 15 deletions pkg/component/operator/web/v0/crawl_website.go
@@ -1,8 +1,8 @@
package web

import (
"context"
"fmt"
"log"
"net/url"
"strings"
"sync"
@@ -19,8 +19,11 @@ import (
"github.com/instill-ai/pipeline-backend/pkg/component/internal/util"
)

// PageInfo defines the information of a page
type PageInfo struct {
Link string `json:"link"`
// Link: The URL of the page.
Link string `json:"link"`
// Title: The title of the page.
Title string `json:"title"`
}

@@ -38,9 +41,12 @@ type CrawlWebsiteInput struct {
MaxDepth int `json:"max-depth"`
}

func (i *CrawlWebsiteInput) Preset() {
if i.MaxK < 0 {
i.MaxK = 0
func (i *CrawlWebsiteInput) preset() {
if i.MaxK <= 0 {
// When users set it to 0, it means infinite.
// However, crawling without a limit causes performance issues,
// so we set a default cap instead.
i.MaxK = 8000
Collaborator: This limitation should be documented in the public readme.

Contributor (author): Do you mean in README.mdx? This setting is just to fix the performance issue, so users don't need to know about it; it is mainly for developers. I will add clearer comments here.

Contributor (author): I updated it here.

}
}

@@ -61,7 +67,7 @@ func (e *execution) CrawlWebsite(input *structpb.Struct) (*structpb.Struct, erro
return nil, fmt.Errorf("error converting input to struct: %v", err)
}

inputStruct.Preset()
inputStruct.preset()

output := ScrapeWebsiteOutput{}

@@ -70,45 +76,67 @@ func (e *execution) CrawlWebsite(input *structpb.Struct) (*structpb.Struct, erro
var mu sync.Mutex
pageLinks := []string{}

// We will have the component timeout feature in the future.
// Before that, we initialize the context here.
ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
defer cancel()

// On every <a> element that has an href attribute, call the callback.
// Won't be called if an error occurs.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
mu.Lock()
Member (suggested change: follow `mu.Lock()` with `defer mu.Unlock()`): The mutex needs to be unlocked no matter what. We can use defer to guard this. I assume the reason you didn't use defer is that you didn't want to scope `_ = e.Request.Visit(link)` within the mutex. Perhaps we can rearrange the code to avoid this issue while still utilizing defer.
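A possible rearrangement along the lines of this suggestion (a sketch only, not the code merged in this PR): keep the shared-state bookkeeping in a small closure that locks and defers the unlock, and call `e.Request.Visit(link)` outside of it. `recordLink` is a hypothetical name, and the fragment reuses `mu`, `ctx`, `pageLinks`, `inputStruct`, and `c` from the surrounding function shown in this diff.

```go
// Hypothetical rearrangement: all shared-state bookkeeping happens under the
// lock (released via defer), while the network call stays outside of it.
recordLink := func(link string) bool {
	mu.Lock()
	defer mu.Unlock()
	if ctx.Err() != nil {
		return false
	}
	if len(pageLinks) >= inputStruct.MaxK*getPageTimes(inputStruct.MaxK) {
		return false
	}
	if util.InSlice(pageLinks, link) {
		return false
	}
	pageLinks = append(pageLinks, link)
	return true
}

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	if link := e.Attr("href"); recordLink(link) {
		_ = e.Request.Visit(link)
	}
})
```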


if ctx.Err() != nil {
mu.Unlock()
return
}

// If we appended PageInfo structs directly to output.Pages, it would take longer when the first HTML page has a lot of links.
// To improve execution time for small Max-K values, we use a separate slice to store the links.
// However, when K is big, the output length could be less than K,
// so we keep collecting links up to a multiple of MaxK (see getPageTimes) before stopping.
if inputStruct.MaxK > 0 && len(pageLinks) >= inputStruct.MaxK*2 {
if len(pageLinks) >= inputStruct.MaxK*getPageTimes(inputStruct.MaxK) {
mu.Unlock()
return
}

link := e.Attr("href")

if util.InSlice(pageLinks, link) {
mu.Unlock()
return
}

pageLinks = append(pageLinks, link)
mu.Unlock()

_ = e.Request.Visit(link)
})

// Set error handler
c.OnError(func(r *colly.Response, err error) {
log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
// In the future, we can design the error handling logic.
})

c.OnRequest(func(r *colly.Request) {
mu.Lock()

// Keep sending requests until the output reaches the page limit.
if inputStruct.MaxK > 0 && len(output.Pages) >= inputStruct.MaxK {
if (len(output.Pages) >= inputStruct.MaxK) || ctx.Err() != nil {
r.Abort()
mu.Unlock()
return
}

mu.Unlock()
// Set a random user agent to avoid being blocked by websites
r.Headers.Set("User-Agent", randomString())
})

c.OnResponse(func(r *colly.Response) {
if ctx.Err() != nil {
return
}

strippedURL := stripQueryAndTrailingSlash(r.Request.URL)

@@ -128,20 +156,36 @@ func (e *execution) CrawlWebsite(input *structpb.Struct) (*structpb.Struct, erro
title := util.ScrapeWebpageTitle(doc)
page.Title = title

defer mu.Unlock()
Member: Any reason why we don't use defer here?

Contributor (author): Colly is hard to understand, so I tried a lot of workarounds to confirm that it works. I will clean up the code and make sure the performance stays the same.

mu.Lock()
// If we do not set this condition, the length of output.Pages could be over the limit.
if len(output.Pages) < inputStruct.MaxK {
output.Pages = append(output.Pages, page)

// If the length of output.Pages is equal to MaxK, we should stop the scraping.
if len(output.Pages) == inputStruct.MaxK {
mu.Unlock()
cancel()
return
}
mu.Unlock()
return
}
mu.Unlock()
cancel()

})

// Start scraping
if !strings.HasPrefix(inputStruct.URL, "http://") && !strings.HasPrefix(inputStruct.URL, "https://") {
inputStruct.URL = "https://" + inputStruct.URL
}
_ = c.Visit(inputStruct.URL)
c.Wait()

go func() {
_ = c.Visit(inputStruct.URL)
Member: Since it's in a goroutine now, do we still need c.Wait() here?

Contributor (author): Yes, we need it. c.Wait() is for async crawling in the colly package.

c.Wait()
}()

<-ctx.Done()

outputStruct, err := base.ConvertToStructpb(output)
if err != nil {
@@ -179,19 +223,21 @@ func initColly(inputStruct CrawlWebsiteInput) *colly.Collector {
)

// Limit the number of requests to avoid being blocked.
// Set it to 10 first in case sending too many requests at once.
var parallel int
if inputStruct.MaxK < 10 {
if inputStruct.MaxK < 30 {
parallel = inputStruct.MaxK
} else {
parallel = 10
parallel = 30
}

_ = c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: parallel,
// We set the delay to avoid being blocked.
Delay: 100 * time.Millisecond,
})

// The timeout here is set for each page rather than for the whole colly instance.
c.SetRequestTimeout(time.Duration(inputStruct.Timeout) * time.Millisecond)

if len(inputStruct.AllowedDomains) > 0 {
@@ -201,3 +247,12 @@

return c
}

// getPageTimes returns a multiplier that ensures we collect enough links to yield the required number of pages.
func getPageTimes(maxK int) int {
if maxK < 10 {
return 10
} else {
return 2
}
}
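To make the effect of getPageTimes concrete, here is a small, runnable illustration of how many candidate links the crawler is willing to buffer for a given max-k. The helper is copied from above (with the redundant else dropped); the sample values are arbitrary.

```go
package main

import "fmt"

// getPageTimes mirrors the helper above: small Max-K values over-collect more
// aggressively, since many collected links never yield a successful page.
func getPageTimes(maxK int) int {
	if maxK < 10 {
		return 10
	}
	return 2
}

func main() {
	for _, maxK := range []int{1, 5, 10, 100, 8000} {
		fmt.Printf("max-k=%d -> buffer up to %d links\n", maxK, maxK*getPageTimes(maxK))
	}
	// Output:
	// max-k=1 -> buffer up to 10 links
	// max-k=5 -> buffer up to 50 links
	// max-k=10 -> buffer up to 20 links
	// max-k=100 -> buffer up to 200 links
	// max-k=8000 -> buffer up to 16000 links
}
```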
2 changes: 1 addition & 1 deletion pkg/component/operator/web/v0/main.go
@@ -39,7 +39,7 @@ type execution struct {
base.ComponentExecution
execute func(*structpb.Struct) (*structpb.Struct, error)
externalCaller func(url string) (ioCloser io.ReadCloser, err error)
getDocAfterRequestURL func(url string, timeout int) (*goquery.Document, error)
getDocAfterRequestURL func(url string, timeout int, scrapeMethod string) (*goquery.Document, error)
}

func Init(bc base.Component) *component {
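For context on the widened getDocAfterRequestURL signature, here is a rough sketch of how a dispatcher over scrape-method could look. This is not the component's actual implementation: `getDoc` and `getDocViaHTTP` are made-up names, the "http" branch simply uses net/http plus goquery, and the "chrome-simulator" branch is stubbed out because its real headless-browser backing is not part of this diff.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/PuerkitoBio/goquery"
)

// getDocViaHTTP is an illustrative stand-in for the "http" scrape method:
// fetch the page with a plain HTTP client and parse it with goquery.
func getDocViaHTTP(url string, timeout int) (*goquery.Document, error) {
	client := &http.Client{Timeout: time.Duration(timeout) * time.Millisecond}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return goquery.NewDocumentFromReader(resp.Body)
}

// getDoc sketches a dispatch on scrapeMethod matching the refactored signature.
func getDoc(url string, timeout int, scrapeMethod string) (*goquery.Document, error) {
	switch scrapeMethod {
	case "http":
		return getDocViaHTTP(url, timeout)
	case "chrome-simulator":
		// The real component drives a simulated Chrome environment here;
		// that dependency is out of scope for this sketch.
		return nil, fmt.Errorf("chrome-simulator not implemented in this sketch")
	default:
		return nil, fmt.Errorf("unknown scrape method: %q", scrapeMethod)
	}
}

func main() {
	doc, err := getDoc("https://example.com", 5000, "http")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("page title:", doc.Find("title").Text())
}
```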
2 changes: 1 addition & 1 deletion pkg/component/operator/web/v0/main_test.go
@@ -96,7 +96,7 @@ func TestScrapeWebpage(t *testing.T) {
})
}

func fakeHTTPRequest(url string, timeout int) (*goquery.Document, error) {
func fakeHTTPRequest(url string, timeout int, scrapeMethod string) (*goquery.Document, error) {
html := `
<!DOCTYPE html>
<html>