Life is hard, but writing a fast parallel web crawler doesn't have to be. I'd like to make it as easy as possible to create a fast, parallel crawler, essentially by implementing the Master-Worker model used in the wrapper classes of palantiri here.
My thoughts
The API I'm envisioning is something like
crawler = rasp.CoolCrawlerClass(num_threads=n,
                                engine=rasp.TorEngine,
                                log_level=logging.DEBUG)

@crawler.crawl(seed_list)
def f(website, crawler_handle):
    # do stuff with the website
Finding a flexible way to define the terminating condition could be difficult. I'm tempted to leave that to the end user (hence the crawler_handle). A simple approach would be to essentially implement the function returned by the Crawler::crawl decorator as
url = self.url_queue.pop()
while url:
    website = self.engine.get(url)
    f(website, self.as_handle())
    url = self.url_queue.pop()
then rely on the user to add any interesting sites they want to visit next to the url_queue inside the user-defined function f.
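For concreteness, here's a minimal single-threaded sketch of what that decorator could look like. SimpleEngine, enqueue, and as_handle are names made up for this sketch rather than an existing rasp API, and a real implementation would hand the work to worker threads instead of looping inline.

import urllib.request
from collections import deque

class SimpleEngine:
    """Stand-in for a rasp engine: fetch a URL and return the raw body."""
    def get(self, url):
        with urllib.request.urlopen(url) as resp:
            return resp.read()

class Crawler:
    """Sketch of a queue-driven crawl() decorator (single-threaded)."""
    def __init__(self, engine):
        self.engine = engine
        self.url_queue = deque()

    def as_handle(self):
        # Hand the user just enough to enqueue follow-up URLs.
        return self

    def enqueue(self, url):
        self.url_queue.append(url)

    def crawl(self, seed_list):
        self.url_queue.extend(seed_list)
        def decorator(f):
            # Run the crawl as soon as the callback is defined; terminate
            # when the queue runs dry. The user callback decides what else
            # gets enqueued via the handle.
            while self.url_queue:
                url = self.url_queue.popleft()
                website = self.engine.get(url)
                f(website, self.as_handle())
            return f
        return decorator

crawler = Crawler(SimpleEngine())

@crawler.crawl(["https://example.com"])
def f(website, crawler_handle):
    print(len(website), "bytes fetched")
    # crawler_handle.enqueue(...) would schedule more pages here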
Side notes:
Debugging a parallel crawler is hard. A debug crawler class should be created that does everything in the main thread.
If we do use a queue and the absence of items as the terminating condition, we should probably implement it as a priority queue.
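Python's standard library queue.PriorityQueue would cover that. A small sketch, with the priority scheme (crawl depth) chosen arbitrarily for illustration:

import queue

# Lower number = higher priority; here priority is just crawl depth,
# which is an arbitrary choice for illustration.
url_queue = queue.PriorityQueue()
url_queue.put((0, "https://example.com"))        # seed at depth 0
url_queue.put((1, "https://example.com/about"))  # discovered link at depth 1

while not url_queue.empty():
    depth, url = url_queue.get()
    print(f"visiting {url} (depth {depth})")

# Note: with multiple worker threads, empty() is racy, so "queue is empty"
# alone isn't a safe terminating condition; Queue.task_done()/join() or an
# explicit sentinel is a more robust way to detect that no work remains.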
I wonder if we might be able to fork an existing project like Scrapy, which uses the Twisted library to run concurrent web crawlers. Scrapy uses Twisted's Deferred object to make asynchronous function calls that don't block program execution, which sounds like just what we need. Unless there's a particular reason why Scrapy falls short, I'm inclined to say we should use that and not reinvent the wheel here.
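For comparison, the rough Scrapy equivalent of the API sketched above is a Spider subclass; following links is just a matter of yielding more requests from the parse callback. This is a minimal example, not a drop-in replacement for rasp's engines:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # "do stuff with the website"
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Add any interesting links back into the crawl.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Running it with scrapy runspider handles the scheduling, concurrency, and request deduplication that we'd otherwise have to build ourselves.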
So certainly there are other mature web scrapers, and there's probably some discussion to be had about the balance between picking the best tool for a particular requirement within a project vs. maintaining some internal projects for the sake of the mentorship/development aspect of the organization.
Insofar as rasp development is concerned: I see two situations here:
If you have a bounded list of URLs you want to parse, but you want to do that as fast as possible, there are a bunch of ways to do that with Add method to get functional representation of engine #9. Toolz, concurrent.futures, twisted, joblib, dask, pyspark, multiprocessing, asyncio, rq, etc. would all support parallel and/or concurrent execution of the function(s) across the list (see the sketch after these two cases).
If you don't know the list ahead of time (i.e. you start with example.com, then crawl all links on that page, and all links on those pages, etc.), most of those packages will also work, but it may be a bit more involved. In this case the problem looks a lot more like streaming than a large parallel batch job.
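Here's what the bounded-list case could look like with concurrent.futures; fetch_url is a hypothetical stand-in for the functional engine interface proposed in #9:

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_url(url):
    """Hypothetical functional engine interface: URL in, (url, body) out."""
    with urllib.request.urlopen(url) as resp:
        return url, resp.read()

seed_list = ["https://example.com", "https://example.org"]

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch_url, url) for url in seed_list]
    for future in as_completed(futures):
        url, body = future.result()
        print(url, len(body), "bytes")

Swapping ThreadPoolExecutor for ProcessPoolExecutor, or for the equivalent dask/joblib calls, wouldn't change the shape of the code.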
I think the initial development would look the same for either: expose a functional interface that the workers can use to parse URLs, and establish the Webpage object with more metadata. Then, unless we had a strong tendency towards (1) or (2), I'd try to figure out how to structure a generic crawler so that it could fit either, and make a reference batch-crawler and streaming-crawler, as you mention, single-threaded. If all that works, as mentioned, there are a bunch of options for more sophisticated execution.
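As a starting point, the functional interface plus a metadata-carrying Webpage object might look like the following; the field names here are guesses for discussion, not an agreed-on schema:

import urllib.request
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Webpage:
    """Hypothetical Webpage object: page content plus crawl metadata."""
    url: str
    content: bytes
    status: int
    fetched_at: datetime
    headers: dict = field(default_factory=dict)

def get_page(url):
    """Functional interface the workers could call: URL in, Webpage out."""
    with urllib.request.urlopen(url) as resp:
        return Webpage(
            url=resp.geturl(),
            content=resp.read(),
            status=resp.status,
            fetched_at=datetime.now(timezone.utc),
            headers=dict(resp.headers),
        )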
I'm going to take a stab at a reference implementation of a crawler here, if only to give us something concrete to discuss in the resultant PR, so don't let that stop anyone else from working on this as well.