GitHub - haywood/Web-Crawler: A web crawler implemented in Python

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
README		README
crawler.py		crawler.py
reader.py		reader.py

Repository files navigation

crawler.py

usage: crawler.py pages children timelimit

	Given a seed file called seeds, crawler.py will crawl the web, starting with the urls therein. Each unique url it finds will be added to its database. If a url is not unique, that url's inbound links count will be incremented by 1. The script requires MongoDB and PyMongo.

	pages the minimum number of pages to be crawled.
	children the maximum number of child processes to generate for page processing.
	timelimit an integer number of seconds. The timelimit is applied to the crawling phase itself and takes precedence over pages.