go build
./web-crawler https://bbc.co.uk
github.com/stretchr/testify/assert
golang.org/x/net/html
The main component is a Crawler
that is constructed with a domain to visit and is responsible for crawling the domain specified.
Page
objects are crawled HTML pages in a domain which contain the links to other pages.
The only operation that can be performed on a Crawler
object is Crawl
. Crawl
starts the crawling process off, immediately returning a channel that emits the Page
objects as they are crawled. When that channel is closed by the Crawler
object then that signals that crawling has completed.
The URLs to Visit channel contains all the URLs that should be visited, potentially including URLs that have already been visited or are not on the domain that is being crawled.
The Visit Pages goroutine is responsible for processing the URLs to Visit channel. It checks if a URL has already been visited, also checking if the URL is for the domain that is being crawled. It launches a Visit Page goroutine with a single URL parameter for ech url that is required to be crawled. This means each url to visit and crawl is handled by a new goroutine.
The Visit Page goroutine performs a HTTP get on the url provided to it, processes the response and builds a Page
object containing the url and any links from the page.
The Pages Crawled Channel buffers all the Page
objets produced by Visit Page goroutines.
The Process Page goroutine process the Pages Crawled channel and sends all links from the page to the URLs To Visit Channel. It also has the responsibility of performing the Add
and Done
operations on the Complete Wait Group. The last operation performed is pushing the Page
object on the Results channel to be consumed by main
The Complte Wait Group acts as a way of allowing the crawler to know when all relevent urls have been visited on a domain. When a new url is found to crawl the Wait Group has an Add
operation performed on it. When a url is either disregarded or visited then the Done
operation is performed on it. When callers to the Wait
operation unblock then all relvant urls should have been visited.
The Wait For Completion goroutine blocks on the Complete Wait Group with a Wait
operation. When that Wait
operations unblocks it then closes the Results channel to signal to clients that the crawling has completed.