-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(web): refactor web operator #753
Conversation
// On every a element which has href attribute call callback | ||
// Wont be called if error occurs | ||
c.OnHTML("a[href]", func(e *colly.HTMLElement) { | ||
mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mu.Lock() | |
mu.Lock() | |
defer mu.Unlock() |
The mutex need to be unlocked no matter what. We can use defer
to guard this. I assume the reason you didn’t use defer
is that you didn’t want to scope _ = e.Request.Visit(link)
within the mutex. Perhaps we can rearrange the code to avoid this issue while still utilizing defer
.
@@ -128,20 +156,36 @@ func (e *execution) CrawlWebsite(input *structpb.Struct) (*structpb.Struct, erro | |||
title := util.ScrapeWebpageTitle(doc) | |||
page.Title = title | |||
|
|||
defer mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we don't use defer
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The colly is hard to understand.
So, I just tried lots of workaround to confirm if it works.
I will clean the code and make sure the performance is same.
c.Wait() | ||
|
||
go func() { | ||
_ = c.Visit(inputStruct.URL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's in a goroutine now. Do we still need to c.Wait()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need it. c.Wait()
is for async crawling in colly package.
@@ -63,7 +63,7 @@ | |||
}, | |||
"max-k": { | |||
"default": 10, | |||
"description": "Max-K specifies the maximum number of pages to return. If max-k is set to 0, all available pages will be returned, up to a maximum of 100. If max-k is set to a positive number, the result will include up to max-k pages, but no more than that.", | |||
"description": "Max-K sets a limit on the number of pages to fetch. If Max-K is set to 0, all available pages will be fetched within the time limit of 120 seconds. If Max-K is a positive number, the fetch will return up to that many pages, but no more.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should document the 120s limitation on the task description, not the max-k field. Though I value having a note here saying that when the task timeout is reached the available results are returned.
// When the users set to 0, it means infinite. | ||
// However, there is performance issue when we set it to infinite. | ||
// So, we set the default value to solve performance issue easily. | ||
i.MaxK = 8000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This limitation should be documented in the public readme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to make it in README.mdx
?
This setting is just to fix the performance issue. So, the users don't have to know this information. Mainly, it is for developer. I will add clearer comments here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated here.
I'll merge this first. @chuang8511 please check the comments and send another PR later. |
🤖 I have created a release *beep* *boop* --- ## [0.44.0-beta](v0.43.2-beta...v0.44.0-beta) (2024-10-22) ### Features * **collection:** add concat ([#748](#748)) ([04d1467](04d1467)) * **compogen:** improve Title Case capitalization ([#757](#757)) ([f956ead](f956ead)) * **component:** update documentation URL to component ID ([#749](#749)) ([d4083c2](d4083c2)) * **instillmodel:** implement instill model embedding ([#727](#727)) ([17d88bc](17d88bc)) * **run:** run logging data list by requester API ([#730](#730)) ([e1e844b](e1e844b)) * **slack:** enable OAuth 2.0 integration ([#758](#758)) ([8043dc5](8043dc5)) * standardize the tag naming convention ([#767](#767)) ([fd0500f](fd0500f)) * **web:** refactor web operator ([#753](#753)) ([700805f](700805f)) ### Bug Fixes * **groq:** use credential-supported model in example ([#752](#752)) ([fc90435](fc90435)) * **run:** not return minio error in list pipeline run ([#744](#744)) ([4d0afa1](4d0afa1)) * **whatsapp:** fix dir name ([#763](#763)) ([029aef9](029aef9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
🤖 I have created a release *beep* *boop* --- ## [0.44.0-beta](v0.43.2-beta...v0.44.0-beta) (2024-10-22) ### Features * **collection:** add concat ([#748](#748)) ([04d1467](04d1467)) * **compogen:** improve Title Case capitalization ([#757](#757)) ([f956ead](f956ead)) * **component:** update documentation URL to component ID ([#749](#749)) ([d4083c2](d4083c2)) * **instillmodel:** implement instill model embedding ([#727](#727)) ([17d88bc](17d88bc)) * **run:** run logging data list by requester API ([#730](#730)) ([e1e844b](e1e844b)) * **slack:** enable OAuth 2.0 integration ([#758](#758)) ([8043dc5](8043dc5)) * standardize the tag naming convention ([#767](#767)) ([fd0500f](fd0500f)) * **web:** refactor web operator ([#753](#753)) ([700805f](700805f)) ### Bug Fixes * **groq:** use credential-supported model in example ([#752](#752)) ([fc90435](fc90435)) * **run:** not return minio error in list pipeline run ([#744](#744)) ([4d0afa1](4d0afa1)) * **whatsapp:** fix dir name ([#763](#763)) ([029aef9](029aef9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Because - we want to make scrape page to pages - there are merged code not updated from the review of #753 This commit - make scrape page to pages - modify the code based on the review note - web operator QA results are in `Test Result` thread in the linear ticket - migration QA result is in `Migration QA` thread in the linear ticket
Because
This commit