[Crawl] Allow users to crawl an entire website #77

Closed
StanGirard opened this issue May 20, 2023 · 18 comments
Labels
enhancement (New feature or request), epic (Used to tag the issue describing the whole epic), good first issue (Good for newcomers)

Comments

@StanGirard
Collaborator

StanGirard commented May 20, 2023

Add recursive crawling of a website, which also pulls in every page linked from that website.
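
For illustration, a minimal sketch of what such a recursive crawl could look like, using only the Python standard library. This is not the Quivr backend code: `crawl`, `fetch_html`, `LinkParser`, `max_depth`, and `max_pages` are hypothetical names, and staying on the starting host is an assumption made here to keep the crawl bounded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_depth=2, max_pages=50):
    """Breadth-first crawl that stays on the start URL's host.

    Returns a dict mapping each visited URL to its raw HTML.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html

        if depth >= max_depth:
            continue  # keep the page but do not follow its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#", 1)[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Passing `fetch` as a parameter keeps the traversal independent of how pages are downloaded, so a JS-rendering fetcher could be swapped in later without touching the crawl loop.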

@StanGirard StanGirard converted this from a draft issue May 20, 2023
@StanGirard
Collaborator Author

The endpoint and options already exist in the backend, but there is no code behind them yet :D

@StanGirard StanGirard added this to the V1 milestone May 20, 2023
@StanGirard StanGirard added the good first issue Good for newcomers label May 20, 2023
@StanGirard StanGirard moved this to Todo in Quivr's Roadmap May 20, 2023
@StanGirard StanGirard added the enhancement New feature or request label May 21, 2023
@StanGirard
Collaborator Author

#108 adds a URL input in the frontend. There should be a setting to enable JS rendering and website crawling.

@Ruben1701
Contributor

Starting on this!

@StanGirard StanGirard moved this from Todo to In Progress in Quivr's Roadmap May 24, 2023
@StanGirard StanGirard moved this from In Progress to Todo in Quivr's Roadmap May 29, 2023
@StanGirard StanGirard moved this from Todo to Backlog in Quivr's Roadmap May 29, 2023
@shashank-crypto

@Ruben1701 @StanGirard Do you mind if I take over?

@StanGirard
Collaborator Author

No, go ahead @shashank-crypto

@Ruben1701
Contributor

@shashank-crypto if you check the pull requests, there is a working version, just with a bug I can't seem to figure out!

@shashank-crypto

@Ruben1701 I saw your PR, but I don't know if Selenium is the right answer for this. It is a heavy tool, and using Chromium might take up unnecessary space. I am thinking of this as more of a text-scraping tool.
Why did you choose Selenium as your first choice? Maybe I am missing something, because the solution I am thinking of will have trouble covering the cases where the pages are PDFs or are not simple HTML pages.

@Ruben1701
Contributor

@shashank-crypto it should only use Selenium if it's a website that loads its content using JavaScript
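
One way to decide when the heavier browser path is needed is to fetch the static HTML first and fall back only when it contains almost no visible text. The sketch below is a rough heuristic under that assumption, not the logic in the PR: `needs_js_rendering`, `fetch_page`, `static_fetch`, `js_fetch`, and the 200-character threshold are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Accumulates visible text, ignoring script/style/noscript bodies."""

    SKIPPED = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see without running any JavaScript."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


def needs_js_rendering(html, min_chars=200):
    """Heuristic: a nearly empty static page was probably built client-side."""
    return len(visible_text(html)) < min_chars


def fetch_page(url, static_fetch, js_fetch, min_chars=200):
    """Try the cheap static fetch first; use the JS renderer only if needed.

    Both fetchers are callables taking a URL and returning HTML. `js_fetch`
    stands in for a browser-backed fetcher such as one built on Selenium.
    """
    html = static_fetch(url)
    if needs_js_rendering(html, min_chars):
        html = js_fetch(url)
    return html
```

A threshold like this will misclassify short but legitimate static pages, so it would likely need tuning or an explicit user setting to override it.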

@shashank-crypto

@Ruben1701 Yeah, that's going to be tough to handle, but I wanted to avoid using Selenium because of its scalability issues. But what bug were you having? Just this one?

Works for one URL and crawls the rest, but somehow doesn't save it into the file. I think it might have to do with the endpoint itself in the API.py, the Spoolfile to be more specific.

@Ruben1701
Contributor

@shashank-crypto yep, if you check the console you will see it returns all the different URLs it scrapes, but the file only contains the original URL's content. I'm kind of caught up in my graduation right now, so I haven't had a chance to look at it yet.
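
The PR code is not shown in this thread, so the cause of the bug is unknown. As a point of comparison only, here is a sketch of collecting every crawled page into a single `tempfile.SpooledTemporaryFile` and rewinding it before it is handed on. `write_pages_to_file` and the `--- url ---` separator are hypothetical; the assumption is that the crawl yields a mapping of URL to extracted text.

```python
from tempfile import SpooledTemporaryFile


def write_pages_to_file(pages, max_size=1024 * 1024):
    """Write every crawled page into one spooled temp file.

    `pages` maps URL -> extracted text. The file is kept in memory until
    it grows past `max_size` bytes, then spills to disk.
    """
    spooled = SpooledTemporaryFile(max_size=max_size, mode="w+b")
    for url, text in pages.items():
        # Write inside the loop so each page is appended, not just the first.
        spooled.write(f"--- {url} ---\n".encode("utf-8"))
        spooled.write(text.encode("utf-8"))
        spooled.write(b"\n\n")
    # Rewind so the next reader starts at the beginning of the content.
    spooled.seek(0)
    return spooled
```

Two things worth checking against a symptom like the one described are whether the write happens once per page rather than once per request, and whether the file position is reset before the file is read.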

@shashank-crypto

Sure @Ruben1701, I will try to work with your PR.

@Ruben1701
Contributor

@shashank-crypto do you have Discord? Join the server and hit me up on there. I'm at the office all day, but I will have it open!

@shashank-crypto

Which server? @Ruben1701

@Ruben1701
Contributor

> Which server? @Ruben1701

https://discord.gg/VTgSYEg7

@shashank-crypto

@shashank-crypto

@StanGirard I have raised PR #247 for this. I have tested it locally: it uploads the page contents and the functionality is working. I have tested it with the LangChain, AWS, and BeautifulSoup documentation pages. There are a few edge cases in collecting links and adding them to paths. I have covered the most common cases for now.
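
To make the link-collection edge cases concrete, here is a small sketch of resolving an `href` against the page it was found on. It does not reproduce the handling in PR #247; `normalize_link` is a hypothetical helper, and the choice to drop fragments and non-HTTP schemes is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(base_url, href):
    """Resolve href against base_url; return None for links to skip."""
    href = href.strip()
    if not href or href.startswith("#"):
        return None  # empty link or in-page anchor
    # urljoin handles relative paths, "../", and protocol-relative "//host".
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # mailto:, javascript:, tel:, and similar
    return absolute
```

For example, with the base page `https://docs.example.com/guide/intro.html`, the link `setup.html` resolves to `https://docs.example.com/guide/setup.html`, `../api/` resolves to `https://docs.example.com/api/`, and `mailto:` links are skipped.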

@gozineb gozineb removed the status in Quivr's Roadmap Jun 22, 2023
@gozineb gozineb changed the title from "Allow users to crawl an entire website" to "[Crawl] Allow users to crawl an entire website" Jun 29, 2023
@gozineb gozineb modified the milestones: V1, V3 Jul 3, 2023
@gozineb gozineb added the epic Used to tag the issue describing the whole epic label Jul 3, 2023
@github-actions
Contributor

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.

5 participants
@StanGirard @Ruben1701 @shashank-crypto @gozineb and others