Copy a subtree of a web site, with smart page filters. Distinctive features:

- Crawl pages without saving them, in order to discover links to the pages you really want.
- Crawl linked pages to the maximum depth, but only save pages whose URLs/MIME types match certain filters.
- Stay on the same domain, or set of domains.
- Crawl from the Internet Wayback Machine instead of the live site, with fancy date filtering to get the page version you want (please be nice to the Archive!).
WORK IN PROGRESS: This program is not yet ready to use.
To install, test, and build with the standard Go tools:

```
go get -t github.com/jesand/webcp
go test github.com/jesand/webcp/...
go install github.com/jesand/webcp
```
See the usage notes:

```
webcp -h
```
Download a URL and its sub-pages into the current directory:

```
webcp <url> .
```
By default, the crawl fetches all linked pages up to a depth of 5 and waits 5 seconds between successive requests to the same domain.
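The per-domain delay is the crawler's politeness mechanism. As a rough sketch of the idea (illustrative only, not webcp's actual implementation), a crawler can remember the time of its last request to each host and sleep until the configured delay has elapsed before fetching from that host again:

```go
package main

import (
	"log"
	"net/url"
	"sync"
	"time"
)

// domainLimiter spaces out requests to the same host by a fixed delay.
// Illustrative sketch only; it is not code from the webcp crawler.
type domainLimiter struct {
	mu    sync.Mutex
	delay time.Duration
	last  map[string]time.Time
}

func newDomainLimiter(delay time.Duration) *domainLimiter {
	return &domainLimiter{delay: delay, last: make(map[string]time.Time)}
}

// Wait blocks until at least `delay` has passed since the previous
// request to the host of rawURL, then reserves the new request slot.
func (l *domainLimiter) Wait(rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	host := u.Hostname()

	l.mu.Lock()
	slot := time.Now()
	if earliest := l.last[host].Add(l.delay); earliest.After(slot) {
		slot = earliest
	}
	l.last[host] = slot
	l.mu.Unlock()

	time.Sleep(time.Until(slot))
	return nil
}

func main() {
	lim := newDomainLimiter(5 * time.Second)
	pages := []string{
		"https://example.com/a",
		"https://example.com/b", // waits about 5 seconds after /a
	}
	for _, p := range pages {
		if err := lim.Wait(p); err != nil {
			log.Fatal(err)
		}
		log.Println("fetching", p) // the actual HTTP GET is omitted here
	}
}
```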
If you have a large crawl that you might need to kill and later resume, provide a resume file:

```
webcp --resume=links.txt <url> .
```
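The exact contents of the resume file are defined by webcp. Conceptually, though, resuming a crawl just means checkpointing the frontier of links that still need to be visited so a later run can pick up where the last one stopped. A minimal sketch of that idea, with hypothetical helper names and a one-URL-per-line format that may differ from webcp's:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// saveFrontier writes the not-yet-crawled links to path, one URL per line.
// Hypothetical helper for illustration; webcp's resume format may differ.
func saveFrontier(path string, links []string) error {
	return os.WriteFile(path, []byte(strings.Join(links, "\n")+"\n"), 0o644)
}

// loadFrontier reads a previously saved frontier, skipping blank lines.
func loadFrontier(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var links []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			links = append(links, line)
		}
	}
	return links, sc.Err()
}

func main() {
	// Checkpoint a pending frontier, then reload it as a later run would.
	pending := []string{"https://example.com/a", "https://example.com/b"}
	if err := saveFrontier("links.txt", pending); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	restored, err := loadFrontier("links.txt")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("resuming with", len(restored), "links")
}
```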
The command line interface described here is just a thin wrapper around the `Crawler` type in the `crawl` package. You can easily use the crawler component directly in some other program. See the API reference on godoc for details.
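For orientation only, the outline below shows roughly what embedding a crawler in another program involves: constructing it with options such as depth, per-domain delay, and a save filter, then running it on a start URL. Every identifier here is a stand-in invented for illustration, not the real API of `github.com/jesand/webcp/crawl`; consult the godoc reference for the actual `Crawler` type.

```go
// Hypothetical sketch only: these types are placeholders, not the real
// github.com/jesand/webcp/crawl API.
package main

import (
	"fmt"
	"time"
)

// options collects the knobs the README describes: crawl depth, the
// per-domain delay, and a filter deciding which fetched pages to save.
type options struct {
	MaxDepth    int
	DomainDelay time.Duration
	Save        func(pageURL, mimeType string) bool
}

// crawler is a placeholder for the real Crawler component.
type crawler struct{ opts options }

// Run would walk the link graph from start, respecting opts.MaxDepth and
// opts.DomainDelay, and write pages accepted by opts.Save into outDir.
func (c *crawler) Run(start, outDir string) error {
	fmt.Printf("crawl %s into %s with %+v\n", start, outDir, c.opts)
	return nil // fetching and link extraction are omitted in this sketch
}

func main() {
	c := &crawler{opts: options{
		MaxDepth:    5,
		DomainDelay: 5 * time.Second,
		// Save only HTML documents; other pages are crawled just for links.
		Save: func(pageURL, mimeType string) bool { return mimeType == "text/html" },
	}}
	if err := c.Run("https://example.com/docs/", "."); err != nil {
		fmt.Println(err)
	}
}
```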