webcp

Copy a subtree of a web site, with smart page filters. Distinctive features:

  • Crawl pages without saving them, to discover links to the pages you really want.
  • Automatically crawl from the Internet Archive instead of the live site (please be nice to the Archive!).

WORK IN PROGRESS: This program is not yet ready to use.

Compiling

go get -t github.com/jesand/webcp
go test github.com/jesand/webcp/...
go install github.com/jesand/webcp

Usage Examples

See the usage notes:

webcp -h

Download a URL and its sub-pages into the current directory:

webcp <url> .

By default, the crawl fetches all linked pages up to a depth of 5 and waits 5 seconds between successive requests to the same domain.

For a large crawl that you may need to kill and resume later, provide a resume file:

webcp --resume=links.txt <url> .

API

The command-line interface described here is just a thin wrapper around the Crawler type in the crawl package, so you can use the crawler directly in your own programs; a sketch follows. See the API reference on godoc for details.
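
For example, a program might drive the crawler along these lines. This is a minimal sketch only: the field and method names here (Crawler.MaxDepth, Crawler.Delay, Crawl) are assumptions for illustration, not the documented crawl API, so check the godoc reference for the real signatures.

package main

import (
	"log"
	"time"

	// Import path assumed from the repository layout.
	"github.com/jesand/webcp/crawl"
)

func main() {
	// Hypothetical configuration mirroring the CLI defaults described
	// above: depth 5, and a 5-second delay between successive requests
	// to the same domain. All field names here are illustrative.
	crawler := crawl.Crawler{
		MaxDepth: 5,
		Delay:    5 * time.Second,
	}

	// Fetch the start URL and its sub-pages into the current directory.
	if err := crawler.Crawl("http://example.com/", "."); err != nil {
		log.Fatal(err)
	}
}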

Planned Enhancements

  • Crawl linked pages to the maximum depth, but only save pages whose URLs/MIME types match certain filters.
  • Stay on the same domain, or set of domains.
  • Crawl from the Internet Archive's Wayback Machine instead of from the live site, with fancy date filtering to get the page version you want.
