jbrough/crawler

README

I've changed the websites table to a view (it could be a materialised view), because crawling may be an ongoing process: new pages are added over time, and there is no certain way to know that we have comprehensively crawled the whole website.

Have we stopped crawling because we've visited everything, or because errors have limited the depth of the crawl? This implementation is not sophisticated enough to know.

By storing all links we can partition by the correlation_id assigned when a crawl is first seeded by rake crawl:vice, or by timestamp (e.g. counting all the unique URLs encountered in the last week may give more accurate numbers than the unique links encountered in the last run). This lets us see trends over time and check whether the crawler is at least behaving consistently.
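Both partitioning schemes boil down to the same aggregation. As a minimal sketch (the field names url, correlation_id, and created_at are assumptions, not the repo's actual schema), counting unique URLs per run versus per trailing window looks like:

```ruby
require "time"

# Hypothetical link rows as they might appear in the links table.
LINKS = [
  { url: "https://www.vice.com/a", correlation_id: "c1", created_at: Time.now - 8 * 86_400 },
  { url: "https://www.vice.com/a", correlation_id: "c2", created_at: Time.now - 3_600 },
  { url: "https://www.vice.com/b", correlation_id: "c2", created_at: Time.now - 7_200 },
]

# Unique URLs seen in a single crawl run (partition by correlation_id).
def unique_urls_for_run(links, correlation_id)
  links.select { |l| l[:correlation_id] == correlation_id }
       .map { |l| l[:url] }.uniq.size
end

# Unique URLs seen across all runs in a trailing window (partition by timestamp).
def unique_urls_since(links, since)
  links.select { |l| l[:created_at] >= since }
       .map { |l| l[:url] }.uniq.size
end

unique_urls_for_run(LINKS, "c2")                 # => 2
unique_urls_since(LINKS, Time.now - 7 * 86_400)  # => 2
```

The week-long window counts URLs from both recent runs while excluding the stale row, which is why it can give a steadier number than any single run.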

To seed the crawl queue with the first entry URL, and gather stats on the TLD from Alexa, run

rake crawl:vice

There are three resque worker queues:

crawl_links fetches a page and parses out the internal and external links, and adds these to another queue for the links to be persisted:

COUNT=5 QUEUE=crawl_links rake environment resque:work
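The core of that worker is splitting a page's hrefs into internal and external links. A sketch of just that step, using only the standard library (real parsing would pull hrefs out of the HTML with a parser such as Nokogiri; here they are passed in directly, and the function name is hypothetical):

```ruby
require "uri"

# Resolve each href against the page URL and split by host.
def partition_links(page_url, hrefs)
  base = URI.parse(page_url)
  internal, external = [], []
  hrefs.each do |href|
    begin
      resolved = URI.join(page_url, href)
    rescue URI::InvalidURIError
      next # skip malformed hrefs rather than crashing the worker
    end
    next unless %w[http https].include?(resolved.scheme) # ignore mailto:, javascript:, etc.
    (resolved.host == base.host ? internal : external) << resolved.to_s
  end
  [internal, external]
end

internal, external = partition_links(
  "https://www.vice.com/en/article/x",
  ["/en/section/news", "https://twitter.com/vice", "http://[bad"]
)
internal # => ["https://www.vice.com/en/section/news"]
external # => ["https://twitter.com/vice"]
```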

save_links adds links in the database (isolating remote fetching and local persistence):

COUNT=5 QUEUE=save_links rake environment resque:work
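Because the same URL will be enqueued many times across pages (and a failed job may be retried), persistence wants to be idempotent. A sketch of that property with an in-memory stand-in for the links table (the LinkStore class and its keying on [correlation_id, url] are assumptions, mimicking a unique index, not the repo's actual schema):

```ruby
require "set"

# In-memory stand-in for the links table.
class LinkStore
  def initialize
    @rows = Set.new
  end

  # Set#add? returns nil when the element is already present,
  # so a duplicate save is a no-op, like an INSERT ... ON CONFLICT DO NOTHING.
  def save(correlation_id, url)
    @rows.add?([correlation_id, url]) ? :inserted : :skipped
  end

  def count
    @rows.size
  end
end

store = LinkStore.new
store.save("c1", "https://www.vice.com/a") # => :inserted
store.save("c1", "https://www.vice.com/a") # => :skipped (duplicate)
store.count                                # => 1
```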

save_stats saves Alexa traffic data to the database:

COUNT=5 QUEUE=save_stats rake environment resque:work

NOTE: stats are fetched, parsed and added to the save_stats queue by the crawl:vice rake task, but ideally there'd also be a fetch_stats queue.

Once the Rails server is started, the Resque dashboard can be viewed at:

http://localhost:3000/resque/overview

Considerations:

  • there is insufficient error handling and no retry logic
  • should we observe robots.txt?
  • should we strip query parameters from links? dynamic links may present a 'spider trap', but removing them may break pagination and discovery of legitimate content
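On the last point, a middle ground between stripping all query parameters and keeping all of them is a whitelist: keep parameters known to matter (e.g. pagination) and drop the rest. A sketch, where the whitelist contents and function name are assumptions:

```ruby
require "uri"

# Parameters assumed to change the fetched content; everything else
# (tracking params, session ids) is treated as spider-trap noise.
KEEP_PARAMS = %w[page p].freeze

def normalise_url(url)
  uri = URI.parse(url)
  if uri.query
    kept = URI.decode_www_form(uri.query).select { |k, _| KEEP_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.fragment = nil # fragments never change the fetched document
  uri.to_s
end

normalise_url("https://www.vice.com/news?page=2&utm_source=x#top")
# => "https://www.vice.com/news?page=2"
```

Deduplicating the queue on the normalised URL would then tame dynamic links without breaking pagination.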
