[Crawl] Allow users to crawl an entire website #77
Comments
The endpoint and options already exist in the backend, but the code behind them hasn't been implemented yet :D
#108 adds a frontend input for a URL. There should be a setting to enable JS rendering and website crawling.
Starting on this!
@Ruben1701 @StanGirard Do you mind if I take over?
No, go ahead @shashank-crypto
@shashank-crypto if you check the pull requests there is a working version, just with a bug I can't seem to figure out!
@Ruben1701 I saw your PR, but I don't know if Selenium is the right answer for this. It's a heavy tool, plus using Chromium might take up unnecessary space. I'm thinking of this as more of a text-scraping tool.
@shashank-crypto it should only use Selenium if it's a website that loads its content using JavaScript
@Ruben1701 Yeah, that's going to be tough to handle, but I wanted to avoid using Selenium because of its scalability issues. What bug were you having? Just this one?
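One lightweight way to do what @Ruben1701 suggests (only fall back to Selenium for JavaScript-rendered sites) is to check how much visible text the raw HTML actually contains, since client-rendered pages usually ship a near-empty body. This is a rough stdlib-only sketch, not code from the PR; the function name and the 200-character threshold are illustrative guesses:

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: strip script/style blocks and markup, then check how
    much visible text remains. Very little text suggests the page
    renders its content client-side and needs a headless browser."""
    # Remove script/style/noscript blocks entirely, then all tags.
    body = re.sub(r"(?is)<(script|style|noscript)[^>]*>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", body)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars
```

In practice you would fetch the page once with a plain HTTP client, and only spin up Selenium/Chromium when this returns `True`, so the heavy dependency is paid for only on pages that need it.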
@shashank-crypto yep, if you check the console you will see it returns all the different URLs it scrapes, but the file only contains the original URL's content. Kinda caught up in my graduation right now, so I haven't had a chance to look at it yet.
Sure @Ruben1701, I will try to work with your PR. |
@shashank-crypto do you have Discord? Join the server and hit me up on there. I'm at the office all day but I'll have it open!
Which server? @Ruben1701
@StanGirard I have raised PR #247 for this. I have tested it locally: it's uploading the page contents and the functionality is working. I tested it with the LangChain, AWS, and BeautifulSoup documentation pages. There are a few edge cases around collecting links and adding them to paths; I have covered the most common cases for now.
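The "edge cases of collecting links" mentioned above mostly come down to normalizing raw `href` values: relative paths, fragments, non-HTTP schemes, and off-site links. A small stdlib sketch of that normalization step (the function name and filtering rules are illustrative assumptions, not the PR's actual code):

```python
from urllib.parse import urljoin, urldefrag, urlparse

def normalize_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw href values against the page URL, drop fragments,
    and keep only http(s) links on the same host so the crawl
    stays on-site. Order is preserved and duplicates are dropped."""
    base_host = urlparse(page_url).netloc
    out = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            if absolute not in out:
                out.append(absolute)
    return out
```

`urljoin` handles both relative paths and absolute URLs, and `urldefrag` prevents `page#section1` and `page#section2` from being crawled as two different pages.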
Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion. |
Add recursive crawling over a website that pulls in every page linked from it.
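The recursive crawl requested here (and the bug @Ruben1701 described, where only the original URL's content ended up in the file) can be sketched as a breadth-first walk that keeps every visited page's content, not just the first. This is a minimal illustration, not the PR's implementation; the `fetch` parameter is injected (it would be something like `lambda u: requests.get(u).text` in practice) and the link-extraction regex is deliberately naive:

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl from start_url, restricted to the same host.
    `fetch` is any callable mapping a URL to its HTML. Returns a dict
    of every visited URL to its content, so no page is dropped."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html  # keep EVERY page's content, not just the first
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `seen` set prevents infinite loops on pages that link back to each other, and `max_pages` bounds the crawl so a large site can't run away.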