
URL search: CDX server API

Fernando-Melo edited this page Oct 31, 2017 · 51 revisions

API Reference

The CDX-server API provides programmatic access to list, sort, and filter the preserved pages (captures) of a given URL.

url

The only required parameter to the CDX-server API is url. For example, http://arquivo.pt/wayback/-cdx?url=publico.pt will return a list of captures for 'publico.pt'.
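As a minimal sketch of how a client might assemble such a query, the helper below builds the query string with Python's standard library. The function name build_cdx_query and the CDX_ENDPOINT constant are hypothetical names chosen for illustration; only the endpoint URL and the url parameter come from this page.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the examples on this page.
CDX_ENDPOINT = "http://arquivo.pt/wayback/-cdx"


def build_cdx_query(url, **params):
    """Return a CDX query URL; `url` is required, other parameters are optional."""
    query = {"url": url}
    # Skip parameters left as None so they do not appear in the query string.
    query.update({k: v for k, v in params.items() if v is not None})
    return CDX_ENDPOINT + "?" + urlencode(query)


print(build_cdx_query("publico.pt"))
```

Because from is a reserved word in Python, parameters with that name have to be passed as an unpacked dictionary, for example build_cdx_query("sapo.pt", **{"from": "2014", "to": "2014"}).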

from / to

Setting from= or to= will restrict the results to the given date/time range (inclusive).

Timestamps may have up to 14 digits (yyyyMMddhhmmss). Shorter timestamps are padded to the lower bound for from= and to the upper bound for to=.

For example, http://arquivo.pt/wayback/-cdx?url=sapo.pt&from=2014&to=2014 will return results of sapo.pt that have a timestamp between 20140101000000 and 20141231235959.
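The padding rule can be illustrated with a short sketch. This is not the server's implementation: pad_timestamp is a hypothetical helper, it accepts only whole fields (4, 6, 8, 10, 12, or 14 digits), and clamping the upper bound to the real last day of the month is an assumption, since this page only shows the year-only case.

```python
import calendar

LOWER = "00000101000000"  # yyyy, then Jan 01 00:00:00
UPPER = "99991231235959"  # yyyy, then Dec 31 23:59:59


def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp to 14 digits (lower bound by default)."""
    if not ts.isdigit() or len(ts) not in (4, 6, 8, 10, 12, 14):
        raise ValueError("expected 4 to 14 digits in whole fields: %r" % ts)
    template = UPPER if upper else LOWER
    padded = ts + template[len(ts):]
    if upper and len(ts) <= 6:
        # The day was filled in as 31; clamp it to the month's real last day
        # (assumption: the server may handle short months differently).
        year, month = int(padded[:4]), int(padded[4:6])
        last_day = calendar.monthrange(year, month)[1]
        padded = padded[:6] + "%02d" % last_day + padded[8:]
    return padded


print(pad_timestamp("2014"))              # 20140101000000
print(pad_timestamp("2014", upper=True))  # 20141231235959
```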

matchType

The cdx-server supports the following matchType values:

exact -- default setting, will return captures that match the url exactly

prefix -- return captures that begin with a specified path, eg: http://sapo.pt/noticias/*

host -- return captures for the given host (the path segment is ignored if specified)

domain -- return captures for the current host and all subdomains, eg. *.example.com

Instead of specifying a separate matchType parameter, wildcards may be used in the url, as in the prefix and domain examples above (a trailing * for prefix, a leading *. for domain).
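The relationship between wildcards and matchType can be sketched as follows. This assumes a trailing * means prefix and a leading *. means domain, which is how the examples in the list above read; the exact server-side rules are not stated on this page, and split_wildcard is a hypothetical helper.

```python
def split_wildcard(url):
    """Map a wildcard url to an (url, matchType) pair without wildcards."""
    if url.startswith("*."):
        # "*.example.com" -> the host and all of its subdomains
        return url[2:], "domain"
    if url.endswith("*"):
        # "http://sapo.pt/noticias/*" -> everything under that path
        return url[:-1], "prefix"
    # No wildcard: fall back to the default setting.
    return url, "exact"


print(split_wildcard("http://sapo.pt/noticias/*"))
print(split_wildcard("*.example.com"))
```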

limit

Setting limit= will restrict the number of index lines returned. The limit must be a positive integer. If no limit is provided, all matching lines are returned, which may be slow. For example, http://arquivo.pt/wayback/-cdx?url=http://www.sapo.pt/noticias/&matchType=prefix&limit=1500 will show the first 1500 results.
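A client may want to reject an invalid limit before sending the request. The sketch below does that check and then rebuilds the example query; limited_query is a hypothetical helper, and keeping ":" and "/" unescaped is only done here so the result reads like the example URL.

```python
from urllib.parse import urlencode


def limited_query(url, limit, match_type="exact"):
    """Build a CDX query URL with a validated, positive integer limit."""
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer, got %r" % (limit,))
    params = {"url": url, "matchType": match_type, "limit": limit}
    # safe=":/" leaves the embedded URL readable instead of percent-encoded.
    return "http://arquivo.pt/wayback/-cdx?" + urlencode(params, safe=":/")


print(limited_query("http://www.sapo.pt/noticias/", 1500, match_type="prefix"))
```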

sort

The sort param can be set as follows:

reverse -- will sort the matching captures in reverse order. It is recommended only for exact queries, as reversing a large match may be very slow.

closest -- setting this option also requires setting closest=<timestamp>, where <timestamp> is a specific timestamp to sort by. This option will only work correctly for exact queries and is useful for sorting captures based on time distance from a certain timestamp.
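The ordering that sort=closest describes can be illustrated client-side. The sketch below is not the server's implementation; it assumes full 14-digit timestamps and uses a hypothetical helper, sort_by_closest, to order them by absolute time distance from a target.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamps, e.g. 20140101000000


def sort_by_closest(timestamps, target):
    """Order 14-digit timestamps by absolute distance from `target`."""
    target_dt = datetime.strptime(target, TS_FORMAT)
    return sorted(
        timestamps,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - target_dt),
    )


captures = ["20100101000000", "20140601000000", "20190101000000"]
print(sort_by_closest(captures, "20150101000000"))
```

Passing reverse=True to sorted on the plain timestamp strings gives the newest-first ordering, which is one reading of the reverse option.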

output (JSON output)

Setting output=json will return each line as a proper JSON dictionary. The default format is text, which returns the native format of the underlying CDX index and may not be consistent. Using output=json is recommended for extensive analysis.
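Since each line is its own JSON dictionary, a response can be parsed line by line. In the sketch below, parse_cdx_json is a hypothetical helper, and the sample body with its "timestamp" and "url" keys is made up for illustration; this page does not list the actual field names.

```python
import json


def parse_cdx_json(body):
    """Parse a response body holding one JSON dictionary per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Illustrative sample only; real responses may use different fields.
sample = (
    '{"timestamp": "20140101000000", "url": "http://sapo.pt/"}\n'
    '{"timestamp": "20140615120000", "url": "http://sapo.pt/"}\n'
)

for record in parse_cdx_json(sample):
    print(record["timestamp"], record["url"])
```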