
feat(web): refactor web operator #753

Merged
merged 2 commits into main, Oct 18, 2024

Conversation

chuang8511
Contributor

Because

  • we want to increase max k
  • we want to make scrape method explicit

This commit

  • refactor crawler
  • add scrape method for scraper


linear bot commented Oct 17, 2024

@chuang8511 chuang8511 marked this pull request as ready for review October 17, 2024 17:44
@chuang8511
Contributor Author

QA cases
(screenshot of QA test results)

// On every <a> element that has an href attribute, call the callback.
// It won't be called if an error occurs.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
mu.Lock()
Member
Suggested change
mu.Lock()
mu.Lock()
defer mu.Unlock()

The mutex needs to be unlocked no matter what; we can use defer to guard this. I assume the reason you didn't use defer is that you didn't want to scope _ = e.Request.Visit(link) within the mutex. Perhaps we can rearrange the code to avoid this issue while still using defer.

@@ -128,20 +156,36 @@ func (e *execution) CrawlWebsite(input *structpb.Struct) (*structpb.Struct, erro
title := util.ScrapeWebpageTitle(doc)
page.Title = title

defer mu.Unlock()
Member

Any reason why we don't use defer here?

Contributor Author

colly is hard to understand, so I tried many workarounds to confirm that it works.

I will clean up the code and make sure the performance stays the same.

c.Wait()

go func() {
_ = c.Visit(inputStruct.URL)
Member

Since it's in a goroutine now, do we still need c.Wait() here?

Contributor Author

Yes, we need it. c.Wait() blocks until the async crawling in the colly package finishes.

@@ -63,7 +63,7 @@
},
"max-k": {
"default": 10,
"description": "Max-K specifies the maximum number of pages to return. If max-k is set to 0, all available pages will be returned, up to a maximum of 100. If max-k is set to a positive number, the result will include up to max-k pages, but no more than that.",
"description": "Max-K sets a limit on the number of pages to fetch. If Max-K is set to 0, all available pages will be fetched within the time limit of 120 seconds. If Max-K is a positive number, the fetch will return up to that many pages, but no more.",
Collaborator

I think we should document the 120s limitation in the task description, not on the max-k field. Though I value having a note here saying that when the task timeout is reached, the available results are returned.

// When the user sets it to 0, it means infinite.
// However, crawling with no bound causes performance issues,
// so we set a default cap to keep performance manageable.
i.MaxK = 8000
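The normalization the comment describes can be sketched as follows (maxKCap and normalizeMaxK are hypothetical names; the 8000 value comes from the snippet above):

```go
package main

import "fmt"

// maxKCap is the default ceiling applied when the user asks for
// "infinite" pages (Max-K = 0).
const maxKCap = 8000

// normalizeMaxK maps the user-facing Max-K setting to the value the
// crawler actually uses: 0 (or negative) becomes the cap, any other
// value is kept as-is.
func normalizeMaxK(maxK int) int {
	if maxK <= 0 {
		return maxKCap
	}
	return maxK
}

func main() {
	fmt.Println(normalizeMaxK(0))  // 8000
	fmt.Println(normalizeMaxK(10)) // 10
}
```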
Collaborator

This limitation should be documented in the public readme.

Contributor Author

Do you mean to put it in README.mdx?

This setting only works around a performance issue, so users don't need to know about it; it is mainly for developers. I will add clearer comments here.

Contributor Author

I updated here.

@donch1989
Member

I'll merge this first. @chuang8511 please check the comments and send another PR later.

@donch1989 donch1989 merged commit 700805f into main Oct 18, 2024
11 checks passed
@donch1989 donch1989 deleted the chunhao/ins-6645-web-improve branch October 18, 2024 17:02
donch1989 pushed a commit that referenced this pull request Oct 22, 2024
🤖 I have created a release *beep* *boop*
---


## [0.44.0-beta](v0.43.2-beta...v0.44.0-beta) (2024-10-22)


### Features

* **collection:** add concat
([#748](#748))
([04d1467](04d1467))
* **compogen:** improve Title Case capitalization
([#757](#757))
([f956ead](f956ead))
* **component:** update documentation URL to component ID
([#749](#749))
([d4083c2](d4083c2))
* **instillmodel:** implement instill model embedding
([#727](#727))
([17d88bc](17d88bc))
* **run:** run logging data list by requester API
([#730](#730))
([e1e844b](e1e844b))
* **slack:** enable OAuth 2.0 integration
([#758](#758))
([8043dc5](8043dc5))
* standardize the tag naming convention
([#767](#767))
([fd0500f](fd0500f))
* **web:** refactor web operator
([#753](#753))
([700805f](700805f))


### Bug Fixes

* **groq:** use credential-supported model in example
([#752](#752))
([fc90435](fc90435))
* **run:** not return minio error in list pipeline run
([#744](#744))
([4d0afa1](4d0afa1))
* **whatsapp:** fix dir name
([#763](#763))
([029aef9](029aef9))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
jvallesm pushed a commit that referenced this pull request Oct 29, 2024
donch1989 pushed a commit that referenced this pull request Oct 30, 2024
Because

- we want to change scrape page to scrape pages
- there is merged code that was not updated based on the review of
#753

This commit

- changes scrape page to scrape pages
- modifies the code based on the review

Note

- web operator QA results are in the `Test Result` thread in the Linear ticket
- migration QA results are in the `Migration QA` thread in the Linear ticket