
Writing a fast parallel webcrawler is hard, let's make it easier #10

Open
dlrobertson opened this issue Mar 5, 2017 · 3 comments

dlrobertson commented Mar 5, 2017

tl;dr

Life is hard, but writing a fast parallel webcrawler doesn't have to be. I'd like to make it as easy as possible to create a fast and parallel crawler. I'd like to essentially implement the Master-Worker model used in the wrapper classes of palantiri here.

My thoughts

The API I'm envisioning is something like

crawler = rasp.CoolCrawlerClass(num_threads=n,
                                engine=rasp.TorEngine,
                                log_level=logging.DEBUG)

@crawler.crawl(seed_list)
def f(website, crawler_handle):
    # do stuff with the website
    pass

Finding a flexible method for defining the terminating condition could be difficult. I'm tempted to leave that to the end user (hence the crawler_handle). A simple approach would be to essentially implement the function returned by the Crawler::crawl decorator as

url = self.url_queue.pop()
while url:
    website = self.engine.get(url)
    f(website, self.as_handle())
    url = self.url_queue.pop()

then rely on the user-defined function f to add any further sites worth visiting to the url_queue.
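
For concreteness, here's a minimal single-threaded sketch of what the crawl decorator could do, swapping the pseudocode's pop() for the standard-library queue module so that an empty queue ends the crawl. The names Crawler, engine, url_queue, and as_handle are just carried over from the snippets above, and whether decoration should kick off the crawl immediately is still an open question:

import queue

class Crawler(object):
    # Hypothetical sketch only; Crawler, engine, url_queue, and as_handle
    # are names borrowed from the snippets above, not a settled interface.

    def __init__(self, engine):
        self.engine = engine
        self.url_queue = queue.Queue()

    def as_handle(self):
        # For now the "handle" is the crawler itself, so the user callback
        # can push newly discovered URLs back onto url_queue.
        return self

    def crawl(self, seed_list):
        def decorator(f):
            for url in seed_list:
                self.url_queue.put(url)
            while True:
                try:
                    url = self.url_queue.get_nowait()
                except queue.Empty:
                    # Terminating condition: the queue ran dry.
                    break
                website = self.engine.get(url)
                f(website, self.as_handle())
            return f
        return decorator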

Side notes:

  • Debugging a parallel crawler is hard. A debug crawler class should be created that does everything in the main thread.
  • If we do use a queue and the absence of items as the terminating condition, we should probably implement it as a priority queue (see the sketch after this list).
  • Depending on the implementation of #9 (Add method to get functional representation of engine), this might not be necessary.
  • This should be discussed/hashed out a bit more before it is actually implemented/worked on.
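
On the priority-queue point, a minimal sketch of that terminating condition, using queue.PriorityQueue with an arbitrary (priority, url) scheme as a placeholder:

import queue

url_queue = queue.PriorityQueue()

# Lower numbers are fetched first; the priorities here are arbitrary.
url_queue.put((0, "http://example.com"))
url_queue.put((5, "http://example.com/archive"))

while True:
    try:
        priority, url = url_queue.get_nowait()
    except queue.Empty:
        # Absence of items is the terminating condition.
        break
    print("would fetch", url)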

coxjonc commented Mar 5, 2017

I wonder if we might be able to fork an existing project like Scrapy, which uses the Twisted library to run concurrent web crawlers. Scrapy uses Twisted's Deferred object to make asynchronous function calls that don't block program execution, which sounds like just what we need. Unless there's a particular reason why Scrapy falls short, I'm inclined to say we should use that and not reinvent the wheel here.
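
For reference, a minimal sketch of the Deferred callback-chaining style using plain Twisted (not Scrapy's actual internals); the URL is a placeholder:

from twisted.internet import reactor
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def print_length(body):
    # Runs only once the response body has arrived; nothing blocks meanwhile.
    print(len(body))

def report_error(failure):
    print(failure.getErrorMessage())

d = agent.request(b"GET", b"http://example.com/")
d.addCallback(readBody)
d.addCallbacks(print_length, report_error)
d.addBoth(lambda _: reactor.stop())

reactor.run()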

wdm0006 (Collaborator) commented Mar 5, 2017

So certainly there are other mature web scrapers, and there's probably some discussion to be had about the balance between picking the best tool for a particular requirement within a project vs. maintaining some internal projects for the sake of the mentorship/development side of the organization.

Insofar as rasp development is concerned: I see two situations here:

  1. If you have a bounded list of URLs you want to parse, but you want to do so as fast as possible, there are a bunch of ways to do that on top of #9 (Add method to get functional representation of engine). Toolz, concurrent.futures, twisted, joblib, dask, pyspark, multiprocessing, asyncio, rq, etc. would all support parallel and/or concurrent execution of the function(s) across the list (see the sketch after this list).
  2. If you don't know the list ahead of time (e.g. you start with example.com, then crawl all links on that page, all links on those pages, and so on), most of those packages will also work, but it may be a bit more involved. In this case the problem looks a lot more like streaming than a large parallel batch job.
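
A minimal sketch of case (1) with concurrent.futures over a fixed list; fetch_page here is just a stand-in for whatever functional engine interface #9 ends up exposing:

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def fetch_page(url):
    # Stand-in for the functional engine interface from #9.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

seed_list = ["http://example.com", "http://example.org"]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_page, url) for url in seed_list]
    for future in as_completed(futures):
        url, body = future.result()
        print(url, len(body))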

I think the initial development would look the same either way: expose a functional interface that the workers can use to parse URLs, and establish the Webpage object with more metadata. Then, unless we have a strong leaning towards (1) or (2), I'd try to figure out how to structure a generic crawler so that it fits either case, and build a reference batch-crawler and a reference streaming-crawler, each single-threaded as you mention. If all that works, there are, as mentioned, a bunch of options for more sophisticated execution.
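
For example, a Webpage object carrying more metadata might look something like this; the field names are placeholders, not a schema proposal:

from collections import namedtuple

# Hypothetical richer Webpage record; the actual rasp Webpage may differ.
Webpage = namedtuple(
    "Webpage",
    ["url", "html", "status_code", "fetched_at", "headers", "outbound_links"],
)

page = Webpage(
    url="http://example.com",
    html="<html>...</html>",
    status_code=200,
    fetched_at="2017-03-05T12:00:00Z",
    headers={"Content-Type": "text/html"},
    outbound_links=["http://example.com/about"],
)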

wdm0006 (Collaborator) commented Mar 17, 2017

I'm going to take a stab at a reference implementation of a crawler here, if nothing else to have something more concrete to discuss in the resultant PR, so don't let that stop anyone else from working on this as well.

omnunum added this to the v0.0.2 milestone Mar 20, 2017