URL search: CDX server API
The CDX server API provides programmatic access to list, sort, and filter the preserved pages for a given URL.
The only required parameter of the cdx-server API is url. For example, http://arquivo.pt/wayback/-cdx?url=publico.pt
returns a list of captures for 'publico.pt'.
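As a minimal sketch of how such a query can be assembled, the Python snippet below builds the request URL from the required url parameter plus any optional ones. The helper name build_cdx_url is hypothetical and only illustrates the query format described here; it does not send a request.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this page.
CDX_ENDPOINT = "http://arquivo.pt/wayback/-cdx"

def build_cdx_url(url, **params):
    """Return a CDX query URL (hypothetical helper, not part of the API).

    `url` is the only required parameter; the rest are optional.
    """
    query = {"url": url}
    # Skip parameters left as None so they are not sent at all.
    query.update({k: v for k, v in params.items() if v is not None})
    return CDX_ENDPOINT + "?" + urlencode(query)

print(build_cdx_url("publico.pt"))
# -> http://arquivo.pt/wayback/-cdx?url=publico.pt
```

Because "from" is a reserved word in Python, a date range has to be passed as an unpacked dictionary, for example build_cdx_url("sapo.pt", **{"from": "2014", "to": "2014"}).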
Setting from= or to= will restrict the results to the given date/time range (inclusive).
Timestamps may have up to 14 digits (YYYYMMDDhhmmss). Shorter timestamps are padded to the lower bound for from= and to the upper bound for to=.
For example, http://arquivo.pt/wayback/-cdx?url=sapo.pt&from=2014&to=2014 returns the captures of sapo.pt that have a timestamp between 20140101000000 and 20141231235959.
The cdx-server supports the following matchType values:
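The padding can be pictured with the following sketch, which reproduces the 2014 example on the client side. The function pad_timestamp is hypothetical. The text does not state how the server pads months shorter than 31 days, so using the calendar's last day of the month is an assumption of this sketch.

```python
import calendar

def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp (YYYY[MM[DD[hh[mm[ss]]]]]) to 14 digits.

    Hypothetical client-side illustration; the server's exact rules may differ.
    """
    if not ts.isdigit() or len(ts) not in (4, 6, 8, 10, 12, 14):
        raise ValueError("expected 4, 6, 8, 10, 12 or 14 digits")
    if not upper:
        # Lower bound: first month, first day, midnight.
        return ts + "00000101000000"[len(ts):]
    year = int(ts[:4])
    month = int(ts[4:6]) if len(ts) >= 6 else 12
    # Assumption: the upper bound uses the real last day of the month.
    last_day = calendar.monthrange(year, month)[1]
    template = f"{year:04d}{month:02d}{last_day:02d}235959"
    return ts + template[len(ts):]

print(pad_timestamp("2014"))              # -> 20140101000000
print(pad_timestamp("2014", upper=True))  # -> 20141231235959
```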
exact -- default setting, will return captures that match the url exactly
prefix -- return captures that begin with a specified path, eg: http://sapo.pt/noticias/*
host -- return captures that belong to the given host (the path segment is ignored if specified)
domain -- return captures for the current host and all subdomains, eg. *.example.com
Instead of specifying a separate matchType parameter, wildcards may be used in the url:
- ?url=http://www.sapo.pt/noticias/* is equivalent to ?url=http://www.sapo.pt/noticias/&matchType=prefix
- ?url=*.sapo.pt is equivalent to ?url=sapo.pt&matchType=domain
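The two wildcard equivalences above can be sketched as a small Python function that maps a wildcard url to the explicit url and matchType pair. The name resolve_match_type is hypothetical, and the sketch only covers the two wildcard forms listed here; how the server handles other wildcard positions is not described in the text.

```python
def resolve_match_type(url):
    """Return (url, matchType) for a url that may carry a wildcard.

    Hypothetical helper mirroring the two equivalences listed above.
    """
    if url.startswith("*."):
        # Leading "*." means the host and all of its subdomains.
        return url[2:], "domain"
    if url.endswith("*"):
        # Trailing "*" means every capture under that path.
        return url[:-1], "prefix"
    # No wildcard: the default exact match applies.
    return url, "exact"

print(resolve_match_type("http://www.sapo.pt/noticias/*"))
# -> ('http://www.sapo.pt/noticias/', 'prefix')
print(resolve_match_type("*.sapo.pt"))
# -> ('sapo.pt', 'domain')
```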
Setting limit= caps the number of index lines returned. The limit must be a positive integer. If no limit is provided, all matching lines are returned, which may be slow. For example, http://arquivo.pt/wayback/-cdx?url=http://www.sapo.pt/noticias/&matchType=prefix&limit=1500 returns the first 1500 results.
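A client may want to reject an invalid limit before sending the request. The sketch below is one way to do that; limited_query is a hypothetical helper, and the percent-encoded query string it produces is assumed to be treated by the server the same way as the unencoded form shown in the example.

```python
from urllib.parse import urlencode

def limited_query(url, limit, match_type="exact"):
    """Build a query string with a validated limit (hypothetical helper)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return urlencode({"url": url, "matchType": match_type, "limit": limit})

print(limited_query("http://www.sapo.pt/noticias/", 1500, match_type="prefix"))
# -> url=http%3A%2F%2Fwww.sapo.pt%2Fnoticias%2F&matchType=prefix&limit=1500
```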
The sort param can be set as follows:
reverse -- sorts the matching captures in reverse order. It is only recommended for exact queries, as reversing a large result set may be very slow.
closest -- setting this option also requires setting closest=<timestamp>, where <timestamp> is the specific timestamp to sort by. This option only works correctly for exact queries and is useful for sorting captures by their time distance from a given timestamp.
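To make the closest ordering concrete, the following sketch sorts a list of 14-digit timestamps by their absolute time distance from a target timestamp. It is a client-side illustration of the ordering described above, not the server's implementation, and sort_by_closest is a hypothetical name.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamps, as used by the API

def sort_by_closest(timestamps, target):
    """Sort timestamps by absolute distance from `target` (illustrative only)."""
    t0 = datetime.strptime(target, TS_FORMAT)
    return sorted(
        timestamps,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - t0),
    )

captures = ["20140101000000", "20150101000000", "20140601000000"]
print(sort_by_closest(captures, "20140701000000"))
# -> ['20140601000000', '20140101000000', '20150101000000']
```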
Setting output=json returns each line as a JSON dictionary. The default format is text, which returns the native format of the underlying CDX index and may not be consistent. Using output=json is recommended for extensive analysis.
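Since output=json yields one JSON dictionary per line, a response body can be parsed line by line. The sketch below assumes that layout; the sample body and its field names (url, timestamp, status) are made up for illustration and may not match the fields the server actually returns.

```python
import json

def parse_cdx_json(body):
    """Parse a response body holding one JSON dictionary per line."""
    # Blank lines are skipped so a trailing newline does not cause an error.
    return [json.loads(line) for line in body.splitlines() if line.strip()]

# Hypothetical sample body; real field names may differ.
sample = (
    '{"url": "http://sapo.pt/", "timestamp": "20140101000000", "status": "200"}\n'
    '{"url": "http://sapo.pt/", "timestamp": "20140601000000", "status": "200"}\n'
)

for capture in parse_cdx_json(sample):
    print(capture["timestamp"], capture["url"])
```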