[Crawl] Allow users to crawl an entire website #77

StanGirard · 2023-05-20T23:47:34Z

Add recursive crawling of a website: starting from one URL, pull in every page linked from it.
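For illustration, a minimal sketch of what such a recursive crawl could look like. It is not code from the Quivr backend: `crawl`, `LinkCollector`, and the injected `fetch` callable are hypothetical names, and the same-host restriction and `max_pages` cap are assumptions added to keep the crawl bounded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl from start_url, staying on the same host.

    `fetch` is a hypothetical callable that takes a URL and returns HTML text
    (or None on failure). It is injected so the sketch runs without a network.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

In a real setup `fetch` would wrap an HTTP client; passing a dictionary's `get` method is enough to try the traversal logic locally.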

StanGirard · 2023-05-20T23:47:57Z

The endpoint and options already exist in the backend, but there is no code behind them yet :D

StanGirard · 2023-05-21T17:06:08Z

#108 adds an input in the frontend for a URL. There should be a setting to enable JS rendering and crawling of the whole website.

Ruben1701 · 2023-05-23T10:44:27Z

Starting on this!

shashank-crypto · 2023-06-01T08:22:38Z

@Ruben1701 @StanGirard Do you mind if I take over?

StanGirard · 2023-06-01T08:23:20Z

No, go ahead @shashank-crypto

Ruben1701 · 2023-06-01T08:24:27Z

@shashank-crypto if you check the pull requests, there is a working version, just with a bug I can't seem to figure out!

shashank-crypto · 2023-06-01T08:55:24Z

@Ruben1701 I saw your PR, but I don't know if Selenium is the right answer for this. It is a heavy tool, and bundling Chromium might take up unnecessary space. I am thinking of this as more of a text scraping tool.
Why did you choose Selenium as your first choice? Maybe I am missing something, because the solution I have in mind will have trouble covering pages that are PDFs or are not simple HTML.

Ruben1701 · 2023-06-01T08:58:28Z

@shashank-crypto it should only use Selenium if it's a website that loads its content using JavaScript

shashank-crypto · 2023-06-01T09:09:28Z

@Ruben1701 Yeah, that's going to be tough to handle, but I wanted to avoid Selenium because of its scalability issues. What bug were you having? Just this one?

> Works for one URL and crawls the rest, but somehow doesn't save it into the file. I think it might have to do with the endpoint itself in the API.py, the Spoolfile to be more specific.

Ruben1701 · 2023-06-01T09:15:39Z

@shashank-crypto yep if you check the console you will see it returns all the different urls it scrapes but the file only contains the original url content. Kinda caught up in my graduation right now so haven't had a chance to look at it yet.
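As an illustration of the intended behaviour (every crawled page ends up in the file, not only the first one), here is a minimal sketch. It is not a diagnosis of the bug in the PR. `write_pages` and the `pages` mapping are hypothetical, and it assumes "Spoolfile" refers to Python's `tempfile.SpooledTemporaryFile`.

```python
from tempfile import SpooledTemporaryFile


def write_pages(pages, max_size=1024 * 1024):
    """Write the text of every crawled page into one spooled temp file.

    `pages` is a hypothetical mapping of URL -> extracted page text.
    """
    spool = SpooledTemporaryFile(max_size=max_size, mode="w+", encoding="utf-8")
    for url, text in pages.items():
        # One section per page, so content from later URLs is kept too.
        spool.write(f"# {url}\n{text}\n\n")
    spool.seek(0)  # rewind so the consumer reads from the start
    return spool
```

Two things worth checking in a case like this are whether the write happens inside the loop over crawled pages, and whether the file is rewound before it is handed to the upload code.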

shashank-crypto · 2023-06-01T09:26:44Z

Sure @Ruben1701, I will try to work with your PR.

Ruben1701 · 2023-06-01T09:28:18Z

@shashank-crypto do you have Discord? Join the server and hit me up on there. I'm at the office all day, but I will have it open!

shashank-crypto · 2023-06-01T09:51:29Z

Which server ? @Ruben1701

Ruben1701 · 2023-06-01T09:54:19Z

> Which server ? @Ruben1701

https://discord.gg/VTgSYEg7

@shashank-crypto

shashank-crypto · 2023-06-04T15:25:28Z

@StanGirard I have raised PR #247 for this. I have tested it locally: it uploads the page contents and the functionality is working. I tested with the LangChain, AWS and BeautifulSoup documentation pages. There are a few edge cases in collecting links and joining them to paths; I have covered the most common cases for now.
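To make the link-collection edge cases concrete, here is a minimal sketch of one way to normalize links with the standard library. It is not taken from PR #247: `normalize_link` is a hypothetical helper, and the rules it applies (drop fragments, keep only http(s), stay on the same host) are assumptions about what "the most common cases" might be.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(base_url, href):
    """Return an absolute same-host http(s) URL for href, or None to skip it."""
    if not href:
        return None
    # Resolve relative paths against the current page, then drop "#fragment".
    absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    if parts.netloc != urlparse(base_url).netloc:
        return None  # external site
    return absolute
```

Dropping the fragment avoids fetching the same page once per anchor, and the scheme check filters out links that are not pages at all.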

github-actions · 2023-08-22T16:07:25Z

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.

StanGirard added this to Quivr's Roadmap May 20, 2023

StanGirard converted this from a draft issue May 20, 2023

StanGirard added this to the V1 milestone May 20, 2023

StanGirard added the "good first issue" (Good for newcomers) label May 20, 2023

StanGirard moved this to Todo in Quivr's Roadmap May 20, 2023

StanGirard added the "enhancement" (New feature or request) label May 21, 2023

StanGirard moved this from Todo to In Progress in Quivr's Roadmap May 24, 2023

StanGirard moved this from In Progress to Todo in Quivr's Roadmap May 29, 2023

StanGirard moved this from Todo to Backlog in Quivr's Roadmap May 29, 2023

shashank-crypto mentioned this issue Jun 4, 2023

Web crawling implemented #247

Closed

gozineb removed the status in Quivr's Roadmap Jun 22, 2023

gozineb changed the title ~~Allow users to crawl an entire website~~ [Crawl] Allow users to crawl an entire website Jun 29, 2023

gozineb modified the milestones: V1, V3 Jul 3, 2023

gozineb added the "epic" (Used to tag the issue describing the whole epic) label Jul 3, 2023

github-actions bot added Stale and removed Stale labels Aug 22, 2023

StanGirard closed this as completed Aug 22, 2023

github-project-automation bot moved this to Done in Quivr's Roadmap Aug 22, 2023

dosubot bot mentioned this issue Dec 2, 2023

[Feature]: Crawl entire website #1789

Closed
