haywood/Web-Crawler

A web crawler implemented in Python

crawler.py

usage: crawler.py pages children timelimit

	Given a seed file called seeds, crawler.py crawls the web starting from the URLs it contains. Each unique URL it finds is added to its database; if a URL has already been seen, that URL's inbound-link count is incremented by 1. The script requires MongoDB and PyMongo.
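
	The repository page doesn't show the database code, but the described bookkeeping maps onto a small PyMongo routine. The following is a minimal sketch, not the script's actual implementation; the database name crawler, collection name urls, and field name inbound_links are assumptions.

		# Sketch of the described URL bookkeeping (assumed names:
		# database "crawler", collection "urls", field "inbound_links").
		from pymongo import MongoClient
		from pymongo.errors import DuplicateKeyError

		urls = MongoClient().crawler.urls  # default localhost:27017

		def read_seeds(path="seeds"):
			"""Return the seed URLs, one per line, skipping blank lines."""
			with open(path) as f:
				return [line.strip() for line in f if line.strip()]

		def record_url(url):
			"""Add a new URL, or bump the inbound-link count of a known one."""
			try:
				# First sighting: store the URL with a zero inbound-link count.
				urls.insert_one({"_id": url, "inbound_links": 0})
			except DuplicateKeyError:
				# Not unique: increment the existing URL's count by 1.
				urls.update_one({"_id": url}, {"$inc": {"inbound_links": 1}})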

	pages      the minimum number of pages to crawl.
	children   the maximum number of child processes to spawn for page processing.
	timelimit  an integer number of seconds; the time limit applies to the crawling phase itself and takes precedence over pages.
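
	Taken together, the usage line implies three positional integer arguments, with timelimit acting as a hard deadline that overrides the pages minimum. Below is a minimal sketch of that control flow under those assumptions; the fetch-and-parse step is a placeholder, and children is parsed but not exercised here.

		import sys
		import time

		def main():
			# Positional arguments from the usage line:
			#   pages      minimum number of pages to crawl
			#   children   maximum number of worker processes (unused in this sketch)
			#   timelimit  seconds allotted to the crawling phase
			pages, children, timelimit = (int(arg) for arg in sys.argv[1:4])

			with open("seeds") as f:  # seed URLs, one per line
				frontier = [line.strip() for line in f if line.strip()]

			deadline = time.monotonic() + timelimit
			crawled = 0
			while frontier and crawled < pages:
				if time.monotonic() >= deadline:
					break  # timelimit takes precedence over pages
				url = frontier.pop(0)
				# ... fetch url, record it, append its outbound links to frontier ...
				crawled += 1

		if __name__ == "__main__":
			main()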
