
feat(web): refactor web operator #753

Merged
merged 2 commits into main, Oct 18, 2024

Conversation

chuang8511
Contributor

Because

  • we want to increase max k
  • we want to make scrape method explicit

This commit

  • refactor crawler
  • add scrape method for scraper


linear bot commented Oct 17, 2024

@chuang8511 chuang8511 marked this pull request as ready for review October 17, 2024 17:44
@chuang8511
Contributor Author

QA cases
(screenshot of QA test results)

// On every <a> element that has an href attribute, call the callback.
// It won't be called if an error occurs.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
mu.Lock()
Member
Suggested change
mu.Lock()
mu.Lock()
defer mu.Unlock()

The mutex needs to be unlocked no matter what; we can use defer to guard this. I assume the reason you didn't use defer is that you didn't want to scope _ = e.Request.Visit(link) within the mutex. Perhaps we can rearrange the code to avoid this issue while still using defer.

@@ -128,20 +156,36 @@ func (e *execution) CrawlWebsite(input *structpb.Struct) (*structpb.Struct, erro
title := util.ScrapeWebpageTitle(doc)
page.Title = title

defer mu.Unlock()
Member

Any reason why we don't use defer here?

Contributor Author

colly is hard to understand, so I tried many workarounds to confirm that it works.

I will clean up the code and make sure the performance stays the same.

c.Wait()

go func() {
_ = c.Visit(inputStruct.URL)
Member

Since it's in a goroutine now, do we still need c.Wait() here?

Contributor Author

Yes, we need it. c.Wait() blocks until the async crawling in the colly package finishes.

@@ -63,7 +63,7 @@
},
"max-k": {
"default": 10,
"description": "Max-K specifies the maximum number of pages to return. If max-k is set to 0, all available pages will be returned, up to a maximum of 100. If max-k is set to a positive number, the result will include up to max-k pages, but no more than that.",
"description": "Max-K sets a limit on the number of pages to fetch. If Max-K is set to 0, all available pages will be fetched within the time limit of 120 seconds. If Max-K is a positive number, the fetch will return up to that many pages, but no more.",
Collaborator

I think we should document the 120s limitation in the task description, not on the max-k field. Though I value having a note here saying that when the task timeout is reached, the available results are returned.

// When the user sets it to 0, it means infinite.
// However, crawling with no bound causes performance issues,
// so we set a default cap to keep performance manageable.
i.MaxK = 8000
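The normalization the comment describes can be sketched as follows (maxKCap and normalizeMaxK are hypothetical names; the 8000 value comes from the snippet above):

```go
package main

import "fmt"

// maxKCap is the default ceiling applied when the user asks for
// "infinite" pages (Max-K = 0).
const maxKCap = 8000

// normalizeMaxK maps the user-facing Max-K setting to the value the
// crawler actually uses: 0 (or negative) becomes the cap, any other
// value is kept as-is.
func normalizeMaxK(maxK int) int {
	if maxK <= 0 {
		return maxKCap
	}
	return maxK
}

func main() {
	fmt.Println(normalizeMaxK(0))  // 8000
	fmt.Println(normalizeMaxK(10)) // 10
}
```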
Collaborator

This limitation should be documented in the public readme.

Contributor Author

Do you mean to put it in README.mdx?

This setting only works around a performance issue, so users don't need to know about it; it is mainly for developers. I will add clearer comments here.

Contributor Author

I updated here.

@donch1989
Member

I'll merge this first. @chuang8511 please check the comments and send another PR later.

@donch1989 donch1989 merged commit 700805f into main Oct 18, 2024
11 checks passed
@donch1989 donch1989 deleted the chunhao/ins-6645-web-improve branch October 18, 2024 17:02
donch1989 pushed a commit that referenced this pull request Oct 22, 2024
🤖 I have created a release *beep* *boop*
---


## [0.44.0-beta](v0.43.2-beta...v0.44.0-beta) (2024-10-22)


### Features

* **collection:** add concat
([#748](#748))
([04d1467](04d1467))
* **compogen:** improve Title Case capitalization
([#757](#757))
([f956ead](f956ead))
* **component:** update documentation URL to component ID
([#749](#749))
([d4083c2](d4083c2))
* **instillmodel:** implement instill model embedding
([#727](#727))
([17d88bc](17d88bc))
* **run:** run logging data list by requester API
([#730](#730))
([e1e844b](e1e844b))
* **slack:** enable OAuth 2.0 integration
([#758](#758))
([8043dc5](8043dc5))
* standardize the tag naming convention
([#767](#767))
([fd0500f](fd0500f))
* **web:** refactor web operator
([#753](#753))
([700805f](700805f))


### Bug Fixes

* **groq:** use credential-supported model in example
([#752](#752))
([fc90435](fc90435))
* **run:** not return minio error in list pipeline run
([#744](#744))
([4d0afa1](4d0afa1))
* **whatsapp:** fix dir name
([#763](#763))
([029aef9](029aef9))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
jvallesm pushed a commit that referenced this pull request Oct 29, 2024
donch1989 pushed a commit that referenced this pull request Oct 30, 2024
Because

- we want to change scrape page to scrape pages
- there is merged code that was not updated based on the review of
#753

This commit

- changes scrape page to scrape pages
- modifies the code based on the review

Note

- web operator QA results are in the `Test Result` thread in the Linear ticket
- migration QA results are in the `Migration QA` thread in the Linear ticket