Do not overwrite Date with Memento-Datetime value #548

Closed
ibnesayeed opened this issue Mar 27, 2020 · 8 comments

@ibnesayeed

In main page mementos, the value of the Memento-Datetime header overwrites the Date header. These headers have distinct semantics, and their values MUST NOT be the same except in the rare case when a memento is replayed within one second of its capture.

See: https://ws-dl.blogspot.com/2020/03/2020-03-26-memento-compliance-audit-of.html#1-3-main-page-memento

@ikreymer
Member

This is inaccurate. Memento-Datetime never overrides the Date header. The Date header is coming from the http response itself, you can verify it in the WARC.

@ibnesayeed
Author

No, this is not inaccurate, unless I am failing to understand something here. I ran the following command a moment ago (you can try it too). The Date header should report March 27, today's date, but it reports March 23 (exactly the same as the Memento-Datetime header).

$ curl -IL --http1.1 https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Location: https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <https://pywbtest.ws-dl.cs.odu.edu/example/https://example.com/>; rel="timegate", <https://pywbtest.ws-dl.cs.odu.edu/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <https://pywbtest.ws-dl.cs.odu.edu/example/20200323134145mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
X-Archive-Orig-Content-Encoding: gzip
X-Archive-Orig-Content-Length: 648
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
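
The comparison made by eye in the output above can also be automated. The following is a minimal sketch, not part of pywb or any Memento tooling: the helper names `parse_headers` and `date_equals_memento_datetime` are hypothetical, and the input is assumed to be the text printed by `curl -I`.

```python
from email.utils import parsedate_to_datetime


def parse_headers(raw: str) -> dict:
    """Parse 'Name: value' lines into a dict keyed by lowercase header name.

    The status line and any shell prompt line are skipped because they
    either contain no colon or have a space before the first colon.
    """
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep and " " not in name.strip():
            headers[name.strip().lower()] = value.strip()
    return headers


def date_equals_memento_datetime(raw: str) -> bool:
    """Return True when Date and Memento-Datetime denote the same instant."""
    headers = parse_headers(raw)
    date = parsedate_to_datetime(headers["date"])
    memento = parsedate_to_datetime(headers["memento-datetime"])
    return date == memento


# Trimmed copy of the response headers reported above.
sample = """HTTP/1.1 200 OK
Date: Mon, 23 Mar 2020 13:41:45 GMT
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
"""

print(date_equals_memento_datetime(sample))  # True
```

A result of True for a memento replayed days after its capture is the symptom described in this issue.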

@ibnesayeed
Author

And to make sure our reverse proxy is not interfering, here is the output from a local instance in the default mode:

$ curl -IL http://localhost:8080/example/20200323134145mp_/https://example.com/
HTTP/1.1 200 OK
X-Archive-Orig-Content-Encoding: gzip
Accept-Ranges: bytes
X-Archive-Orig-Age: 510746
X-Archive-Orig-Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Mon, 23 Mar 2020 13:41:45 GMT
X-Archive-Orig-Etag: "3147526947"
X-Archive-Orig-Expires: Mon, 30 Mar 2020 13:41:45 GMT
X-Archive-Orig-Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
X-Archive-Orig-Server: ECS (dcb/7EA4)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache: HIT
X-Archive-Orig-Content-Length: 648
Memento-Datetime: Mon, 23 Mar 2020 13:41:45 GMT
Link: <https://example.com/>; rel="original", <http://localhost:8080/example/https://example.com/>; rel="timegate", <http://localhost:8080/example/timemap/link/https://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/example/20200323134145mp_/https://example.com/>; rel="memento"; datetime="Mon, 23 Mar 2020 13:41:45 GMT"; collection="example"
Content-Location: http://localhost:8080/example/20200323134145mp_/https://example.com/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'
Content-Length: 0

@ikreymer
Member

pywb is not adding the Date header; it comes directly from the WARC record, i.e. from example.com itself. Perhaps it's confusing that it's not prefixed with X-Archive-Orig-, but pywb tries to be conservative in rewriting headers, doing so only if a header might affect how the browser interprets the response (e.g. all the cache headers). The Date header is harmless and therefore is not prefixed.

@ibnesayeed
Author

ibnesayeed commented Mar 27, 2020

The Date header is coming from the http response itself, you can verify it in the WARC.

Now I see what you mean. However, the Date header should report the time when the current response was originated, not replay the recorded Date header. The recorded value should be surfaced as X-Archive-Orig-Date instead.
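
To illustrate the suggestion, here is a minimal sketch of such a rewrite. It is not pywb's actual implementation: the function name `rewrite_date_header` and the plain-dict header representation are assumptions made for this example, and the prefix is the X-Archive-Orig- prefix already visible in the responses above.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

ORIG_PREFIX = "X-Archive-Orig-"


def rewrite_date_header(recorded: dict, now: datetime = None) -> dict:
    """Move the recorded Date under the archive prefix and emit a fresh Date.

    `recorded` holds the headers stored in the archived response.
    `now` is the replay time (timezone-aware, UTC); defaults to the current time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    replayed = {}
    for name, value in recorded.items():
        if name.lower() == "date":
            # Preserve the capture-time value without claiming it as our own.
            replayed[ORIG_PREFIX + "Date"] = value
        else:
            replayed[name] = value
    # Date now describes when this replay response was originated.
    replayed["Date"] = format_datetime(now, usegmt=True)
    return replayed


recorded = {
    "Content-Type": "text/html; charset=UTF-8",
    "Date": "Mon, 23 Mar 2020 13:41:45 GMT",
}
replay_time = datetime(2020, 3, 27, 12, 0, 0, tzinfo=timezone.utc)
for name, value in rewrite_date_header(recorded, replay_time).items():
    print(f"{name}: {value}")
```

With this approach the capture-time value stays available to clients, while Date keeps its usual meaning for the replay response.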

@ibnesayeed
Author

pywb is not adding the Date header; it comes directly from the WARC record, i.e. from example.com itself. Perhaps it's confusing that it's not prefixed with X-Archive-Orig-, but pywb tries to be conservative in rewriting headers, doing so only if a header might affect how the browser interprets the response (e.g. all the cache headers). The Date header is harmless and therefore is not prefixed.

In my opinion, it is not an accurate interpretation of the semantics of the Date header. The replay server is not telling the truth, which will affect the behavior when a memento is archived.

@phonedude

X-Archive-Orig-Date == Memento-Datetime

Date is when the "the message was originated"

https://tools.ietf.org/html/rfc7231#section-7.1.1.2

The "Date" header field represents the date and time at which the
message was originated, having the same semantics as the Origination
Date Field (orig-date) defined in Section 3.6.1 of [RFC5322].

https://tools.ietf.org/html/rfc5322#section-3.6.1

The origination date specifies the date and time at which the creator
of the message indicated that the message was complete and ready to
enter the mail delivery system.

If you make Date describe the archived message, then you have no way to communicate the Date of the response from the archive.

@ibnesayeed
Author

X-Archive-Orig-Date == Memento-Datetime

Considering some clock synchronization issues and network latency, it will more accurately be:

X-Archive-Orig-Date ≈ Memento-Datetime

And:

Date > Memento-Datetime

And in rare cases:

Date ≈ Memento-Datetime
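
These relations can be expressed as a small check. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the one-second default tolerance is borrowed from the "within one second" remark in the opening comment rather than from any specification.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

ONE_SECOND = timedelta(seconds=1)


def orig_date_consistent(orig_date: str, memento_datetime: str,
                         tolerance: timedelta = ONE_SECOND) -> bool:
    """X-Archive-Orig-Date ≈ Memento-Datetime, within the given tolerance."""
    delta = parsedate_to_datetime(orig_date) - parsedate_to_datetime(memento_datetime)
    return abs(delta) <= tolerance


def compare_date_to_memento(date: str, memento_datetime: str,
                            tolerance: timedelta = ONE_SECOND) -> str:
    """Classify the replay Date relative to Memento-Datetime."""
    delta = parsedate_to_datetime(date) - parsedate_to_datetime(memento_datetime)
    if abs(delta) <= tolerance:
        return "approximately equal"  # expected only right after capture
    if delta > tolerance:
        return "later"  # the normal case for a replayed memento
    return "earlier"  # a replay cannot precede its capture


memento = "Mon, 23 Mar 2020 13:41:45 GMT"
print(compare_date_to_memento("Fri, 27 Mar 2020 12:00:00 GMT", memento))  # later
print(compare_date_to_memento(memento, memento))  # approximately equal
```

A larger tolerance may be appropriate for the X-Archive-Orig-Date comparison if the clocks of the origin server and the crawler are known to drift.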

ikreymer added a commit that referenced this issue Apr 30, 2020
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7
ikreymer added a commit that referenced this issue May 1, 2020
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7

* ci: attempt to fix travis build for 27, 35