
Writing a fast parallel webcrawler is hard, let's make it easier #10

Open
dlrobertson opened this issue Mar 5, 2017 · 3 comments

dlrobertson commented Mar 5, 2017

tl;dr

Life is hard, but writing a fast parallel webcrawler doesn't have to be. I'd like to make it as easy as possible to create a fast and parallel crawler. I'd like to essentially implement the Master-Worker model used in the wrapper classes of palantiri here.

My thoughts

The API I'm envisioning is something like

crawler = rasp.CoolCrawlerClass(num_threads=n,
                                engine=rasp.TorEngine,
                                log_level=logging.DEBUG)

@crawler.crawl(seed_list)
def f(website, crawler_handle):
    # do stuff with the website
    pass

Finding a flexible method for defining the terminating condition could be difficult. I'm tempted to leave that to the end user (hence the crawler_handle). A simple approach would be to essentially implement the function returned by the Crawler::crawl decorator as

url = self.url_queue.pop()
while url:
    website = self.engine.get(url)
    f(website, self.as_handle())
    url = self.url_queue.pop()

then rely on the user-defined function f to add any further sites worth visiting to the url_queue.
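
For concreteness, here's a minimal single-threaded sketch of what the crawl decorator could do, swapping the pseudocode's pop() for the standard-library queue module so that an empty queue ends the crawl. The names Crawler, engine, url_queue, and as_handle are just carried over from the snippets above, and whether decoration should kick off the crawl immediately is still an open question:

import queue

class Crawler(object):
    # Hypothetical sketch only; Crawler, engine, url_queue, and as_handle
    # are names borrowed from the snippets above, not a settled interface.

    def __init__(self, engine):
        self.engine = engine
        self.url_queue = queue.Queue()

    def as_handle(self):
        # For now the "handle" is the crawler itself, so the user callback
        # can push newly discovered URLs back onto url_queue.
        return self

    def crawl(self, seed_list):
        def decorator(f):
            for url in seed_list:
                self.url_queue.put(url)
            while True:
                try:
                    url = self.url_queue.get_nowait()
                except queue.Empty:
                    # Terminating condition: the queue ran dry.
                    break
                website = self.engine.get(url)
                f(website, self.as_handle())
            return f
        return decorator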

Side notes:

  • Debugging a parallel crawler is hard. A debug crawler class should be created that does everything in the main thread.
  • If we do use a queue and the absence of items as the terminating condition, we should probably implement it as a priority queue (see the sketch after this list).
  • Depending on the implementation of #9 (Add method to get functional representation of engine), this might not be necessary.
  • This should be discussed/hashed out a bit more before it is actually implemented/worked on.
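
On the priority-queue point, a minimal sketch of that terminating condition, using queue.PriorityQueue with an arbitrary (priority, url) scheme as a placeholder:

import queue

url_queue = queue.PriorityQueue()

# Lower numbers are fetched first; the priorities here are arbitrary.
url_queue.put((0, "http://example.com"))
url_queue.put((5, "http://example.com/archive"))

while True:
    try:
        priority, url = url_queue.get_nowait()
    except queue.Empty:
        # Absence of items is the terminating condition.
        break
    print("would fetch", url)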

coxjonc commented Mar 5, 2017

I wonder if we might be able to fork an existing project like Scrapy, which uses the Twisted library to run concurrent web crawlers. Scrapy uses Twisted's Deferred object to make asynchronous function calls that don't block program execution, which sounds like just what we need. Unless there's a particular reason why Scrapy falls short, I'm inclined to say we should use that and not reinvent the wheel here.
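
For reference, a minimal sketch of the Deferred callback-chaining style using plain Twisted (not Scrapy's actual internals); the URL is a placeholder:

from twisted.internet import reactor
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def print_length(body):
    # Runs only once the response body has arrived; nothing blocks meanwhile.
    print(len(body))

def report_error(failure):
    print(failure.getErrorMessage())

d = agent.request(b"GET", b"http://example.com/")
d.addCallback(readBody)
d.addCallbacks(print_length, report_error)
d.addBoth(lambda _: reactor.stop())

reactor.run()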

wdm0006 (Collaborator) commented Mar 5, 2017

So certainly there are other mature web scrapers, and there's probably some discussion to be had about the balance between picking the best tool for a particular requirement within a project vs. maintaining some internal projects for the sake of the mentorship/development side of the organization.

Insofar as rasp development is concerned: I see two situations here:

  1. If you have a bounded list of URLs you want to parse, but you want to do so as fast as possible, there are a bunch of ways to do that on top of #9 (Add method to get functional representation of engine). Toolz, concurrent.futures, twisted, joblib, dask, pyspark, multiprocessing, asyncio, rq, etc. would all support parallel and/or concurrent execution of the function(s) across the list (see the sketch after this list).
  2. If you don't know the list ahead of time (e.g. you start with example.com, then crawl all links on that page, all links on those pages, and so on), most of those packages will also work, but it may be a bit more involved. In this case the problem looks a lot more like streaming than a large parallel batch job.
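
A minimal sketch of case (1) with concurrent.futures over a fixed list; fetch_page here is just a stand-in for whatever functional engine interface #9 ends up exposing:

from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def fetch_page(url):
    # Stand-in for the functional engine interface from #9.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

seed_list = ["http://example.com", "http://example.org"]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_page, url) for url in seed_list]
    for future in as_completed(futures):
        url, body = future.result()
        print(url, len(body))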

I think the initial development would look the same either way: expose a functional interface that the workers can use to parse URLs, and establish the Webpage object with more metadata. Then, unless we have a strong leaning towards (1) or (2), I'd try to figure out how to structure a generic crawler so that it fits either case, and build a reference batch-crawler and a reference streaming-crawler, each single-threaded as you mention. If all that works, there are, as mentioned, a bunch of options for more sophisticated execution.
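
For example, a Webpage object carrying more metadata might look something like this; the field names are placeholders, not a schema proposal:

from collections import namedtuple

# Hypothetical richer Webpage record; the actual rasp Webpage may differ.
Webpage = namedtuple(
    "Webpage",
    ["url", "html", "status_code", "fetched_at", "headers", "outbound_links"],
)

page = Webpage(
    url="http://example.com",
    html="<html>...</html>",
    status_code=200,
    fetched_at="2017-03-05T12:00:00Z",
    headers={"Content-Type": "text/html"},
    outbound_links=["http://example.com/about"],
)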

wdm0006 (Collaborator) commented Mar 17, 2017

I'm going to take a stab at a reference implementation of a crawler here, if nothing else to have something more concrete to discuss in the resultant PR, so don't let that stop anyone else from working on this as well.

omnunum added this to the v0.0.2 milestone Mar 20, 2017