URL search: CDX server API
The CDX server API provides programmatic access to list, sort, and filter preserved pages for a given URL.
The only required parameter of the CDX server API is url. For example, http://arquivo.pt/wayback/-cdx?url=publico.pt
will return a list of captures for 'publico.pt'.
Setting from= or to= will restrict the results to the given date/time range (inclusive).
Timestamps may have up to 14 digits; shorter values are padded to the lower bound (for from=) or the upper bound (for to=).
Example: http://arquivo.pt/wayback/-cdx?url=sapo.pt&from=2014&to=2014 will return results for sapo.pt that have a timestamp between 20140101000000 and 20141231235959.
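The padding rule can be illustrated with a small client-side sketch. The helper name pad_timestamp and the template-based approach are assumptions made for illustration; the server's actual padding logic may differ in detail (for instance in how it treats month lengths).

```python
def pad_timestamp(ts: str, upper: bool = False) -> str:
    """Pad a partial timestamp (1 to 14 digits) to a full 14-digit value.

    Hypothetical helper: the lower bound fills the missing digits from
    "00000101000000", the upper bound from "99991231235959".
    """
    if not ts.isdigit() or not 1 <= len(ts) <= 14:
        raise ValueError("timestamp must consist of 1 to 14 digits")
    template = "99991231235959" if upper else "00000101000000"
    # Keep the digits that were given and borrow the rest from the template.
    # Note: an upper bound such as "20140231235959" is not a real date, but
    # it still works as a bound when timestamps are compared as strings.
    return ts + template[len(ts):]


print(pad_timestamp("2014"))              # lower bound, as used for from=2014
print(pad_timestamp("2014", upper=True))  # upper bound, as used for to=2014
```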
The CDX server supports the following matchType values:
exact -- default setting, will return captures that match the url exactly
prefix -- return captures that begin with a specified path, eg: http://sapo.pt/noticias/*
host -- return captures for the given host (the path segment is ignored if specified)
domain -- return captures for the current host and all subdomains, eg. *.example.com
Instead of specifying a separate matchType parameter, wildcards may be used in the url:
- ?url=http://www.sapo.pt/noticias/* is equivalent to ?url=http://www.sapo.pt/noticias/&matchType=prefix
- ?url=*.sapo.pt is equivalent to ?url=sapo.pt&matchType=domain
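The two wildcard equivalences above can be sketched as a small function that turns a wildcard url value into an explicit (url, matchType) pair. The function name split_wildcard is hypothetical, and the sketch covers only the two wildcard forms described here.

```python
from typing import Tuple


def split_wildcard(url: str) -> Tuple[str, str]:
    """Translate a wildcard url value into (url, matchType).

    Illustrative only: handles the leading "*." (domain) and
    trailing "*" (prefix) forms; anything else is treated as exact.
    """
    if url.startswith("*."):
        return url[2:], "domain"
    if url.endswith("*"):
        return url[:-1], "prefix"
    return url, "exact"


print(split_wildcard("http://www.sapo.pt/noticias/*"))
print(split_wildcard("*.sapo.pt"))
```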
Setting limit= will limit the number of index lines returned. Limit must be set to a positive integer. If no limit is provided, all the matching lines are returned, which may be slow.
Example: http://arquivo.pt/wayback/-cdx?url=http://www.sapo.pt/noticias/&matchType=prefix&limit=1500 will show the first 1500 results.
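A query URL like the one above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the function name build_cdx_url and its validation rules are assumptions for illustration, while the endpoint and parameter names come from the examples in this page.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://arquivo.pt/wayback/-cdx"
MATCH_TYPES = ("exact", "prefix", "host", "domain")


def build_cdx_url(url, match_type=None, limit=None):
    """Build a CDX query URL (hypothetical helper, no request is sent)."""
    params = [("url", url)]
    if match_type is not None:
        if match_type not in MATCH_TYPES:
            raise ValueError("unknown matchType: %r" % (match_type,))
        params.append(("matchType", match_type))
    if limit is not None:
        # The documentation requires limit to be a positive integer.
        if not isinstance(limit, int) or limit <= 0:
            raise ValueError("limit must be a positive integer")
        params.append(("limit", str(limit)))
    # Keep ":" and "/" unescaped so the result reads like the examples.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":/")


print(build_cdx_url("http://www.sapo.pt/noticias/", "prefix", 1500))
```

Setting a limit while experimenting is a sensible default, since an unlimited prefix or domain query may return a very large number of lines.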
The sort param can be set as follows:
reverse -- will sort the matching captures in reverse order. It is recommended only for exact queries, as reversing a large match may be very slow.
closest -- setting this option also requires setting closest=<timestamp>, where <timestamp> is the specific timestamp to sort by. This option only works correctly for exact queries and is useful for sorting captures by time distance from a given timestamp.
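To make the idea of "time distance" concrete, here is a client-side sketch that orders 14-digit timestamps by their distance from a target timestamp. It only illustrates the ordering; it is not the server's implementation, and the function name sort_by_closest is hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp layout


def sort_by_closest(timestamps, target):
    """Return timestamps ordered by absolute time distance from target."""
    target_dt = datetime.strptime(target, TS_FORMAT)
    return sorted(
        timestamps,
        key=lambda ts: abs(datetime.strptime(ts, TS_FORMAT) - target_dt),
    )


captures = ["20100101000000", "20140601000000", "20190101000000"]
print(sort_by_closest(captures, "20150101000000"))
```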
Setting output=json will return each line as a proper JSON dictionary. (The default format is text, which returns the native format of the underlying CDX index and may not be consistent.) Using output=json is recommended for extensive analysis.
Example: http://arquivo.pt/wayback/-cdx?url=publico.pt&output=json
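Since output=json returns one JSON dictionary per line, a response body can be parsed line by line. The sketch below assumes that layout; the sample body and its values are made up for illustration and do not come from a real response.

```python
import json


def parse_cdx_json(body: str):
    """Parse a response body holding one JSON dictionary per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Made-up sample data, shaped like the fields described in this page.
SAMPLE_BODY = (
    '{"urlkey": "pt,publico)/", "timestamp": "20140101000000", "status": "200"}\n'
    '{"urlkey": "pt,publico)/", "timestamp": "20150101000000", "status": "301"}\n'
)

for record in parse_cdx_json(SAMPLE_BODY):
    print(record["timestamp"], record["status"])
```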
The filter param can be specified multiple times to filter by specific fields in the cdx index. Field names correspond to the fields returned in the JSON output. Filters can be specified as follows:
?url=publico.pt/*&filter==mime:text/html&filter=!=status:200
Return captures from publico.pt/* where mime is text/html and http status is not 200.
The ! modifier before =status indicates negation. The = and ~ modifiers are optional and specify exact and regular-expression matches, respectively. The default (no modifier) is to test whether the query string is contained in the field value. Negation and the exact/regex modifiers may be combined, eg. filter=!~text/.*
The formal syntax is: filter=[!][=|~]<fieldname>:<expression>
with the following modifiers:
modifier(s) | example | description |
---|---|---|
(no modifier) | filter=mime:html | field "mime" contains string "html" |
= | filter==mime:text/html | exact match: field "mime" is "text/html" |
~ | filter=~mime:.*/html$ | regex match: expression matches beginning of field "mime" (cf. re.match) |
! | filter=!mime:html | field "mime" does not contain string "html" |
!= | filter=!=mime:text/html | field "mime" is not "text/html" |
!~ | filter=!~mime:.*/html | expression does not match beginning of field "mime" |
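The modifier semantics in the table can be mimicked on the client side, which may help when checking what a filter is expected to return. This is a sketch under the assumption that modifiers precede the field name, as in the examples above; parse_filter and matches are hypothetical names, not part of the CDX server.

```python
import re


def parse_filter(spec):
    """Split a filter value such as "!=status:200" into its parts."""
    negate = spec.startswith("!")
    if negate:
        spec = spec[1:]
    mode = "contains"  # default when no = or ~ modifier is given
    if spec.startswith("="):
        mode, spec = "exact", spec[1:]
    elif spec.startswith("~"):
        mode, spec = "regex", spec[1:]
    field, sep, expression = spec.partition(":")
    if not sep or not field:
        raise ValueError("filter must look like [!][=|~]<fieldname>:<expression>")
    return negate, mode, field, expression


def matches(spec, record):
    """Return True if the record (a dict of CDX fields) passes the filter."""
    negate, mode, field, expression = parse_filter(spec)
    value = str(record.get(field, ""))
    if mode == "exact":
        hit = value == expression
    elif mode == "regex":
        # re.match anchors at the beginning of the field value.
        hit = re.match(expression, value) is not None
    else:
        hit = expression in value
    return hit != negate


record = {"mime": "text/html", "status": "404"}
print(matches("=mime:text/html", record), matches("!=status:200", record))
```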
The fl param can be used to specify which fields to include in the output. The standard available fields are: urlkey, timestamp, url, mime, status, digest, length, offset, filename.
If a minimal cdx index is used, the mime and status fields may not be available. Additional fields may be introduced in the future, especially in the CDX JSON format.
Fields can be comma delimited, for example fl=urlkey,timestamp,filename will only include the urlkey, timestamp and filename in the output.
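The effect of fl can be reproduced on an already parsed record, which is handy when the same data is reused with different field selections. The helper name project_fields is hypothetical, and skipping unavailable fields is an assumption made here because mime and status may be missing from a minimal index.

```python
def project_fields(record, fl):
    """Keep only the comma-delimited fields named in fl, in that order."""
    wanted = [name.strip() for name in fl.split(",") if name.strip()]
    # Fields absent from the record are skipped rather than raising.
    return {name: record[name] for name in wanted if name in record}


# Made-up record for illustration.
record = {
    "urlkey": "pt,publico)/",
    "timestamp": "20140101000000",
    "mime": "text/html",
    "filename": "example.arc.gz",
}
print(project_fields(record, "urlkey,timestamp,filename"))
```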