-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Atom/XML Serializer #239
Atom/XML Serializer #239
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.
Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index
. But that part may be outside the scope of this PR?
Are we indexing affiliations?
https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226
The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.
I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?
I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed
If it isn't supported in the classic API, then no need. NotImplementedError
looks like the right way to go to me.
One change from the XML output of the current "classic" API is that the xmlns attribute is not repeated on all Atom elements. This results in cleaner xml, smaller file sizes, and should still be standards compliant.
I like this. This is what I expect when looking at namespaced XML. This is what I'm seeing in your responses, which looks good and valid.
<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:arxiv="http://arxiv.org/schemas/atom" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns="http://www.w3.org/2005/Atom">
<id>http://api.arxiv.org/</id>
<title>arXiv Query: size: 50; terms: AND all=none; include_fields: ['version', 'title', 'href', 'canonical', 'paper_id', 'paper_id_v', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']</title>
<updated>2019-04-03T09:52:43.426666+00:00</updated>
<link href="https://api.arxiv.org/" type="application/atom+xml"/>
<generator uri="http://lkiesow.github.io/python-feedgen" version="0.7.0">python-feedgen</generator>
<opensearch:itemsPerPage>50</opensearch:itemsPerPage>
<opensearch:totalResults>3</opensearch:totalResults>
<opensearch:startIndex>0</opensearch:startIndex>
<entry>
<id>http://127.0.0.1:5000/1608.07086v2</id>
<title>Why $Ξ(1690)$ and $Ξ(2120)$ are so narrow?</title>
<updated>2019-04-03T09:52:43.427496+00:00</updated>
<link href="http://127.0.0.1:5000/1608.07086v2" rel="alternate" type="text/html"/>
<category term="nucl-th" scheme="http://arxiv.org://arxiv.org/schemas/atom"/>
<category term="hep-ph" scheme="http://arxiv.org://arxiv.org/schemas/atom"/>
<published>2018-02-06T09:54:00-05:00</published>
<arxiv:comment>New calculations and results, published version, PRD</arxiv:comment>
<arxiv:primary_category term="nucl-th"/>
<arxiv:journal_ref>Phys. Rev. D 97, 034005 (2018)</arxiv:journal_ref>
<arxiv:doi>10.1103/PhysRevD.97.034005</arxiv:doi>
Also
I'm getting Content-Type: text/html; charset=utf-8
header on the response; at some point will want that to be Content-Type: application/atom+xml; charset=UTF-8
. Might be out of scope, but just noticed it when I spun up the examples.
@@ -0,0 +1,279 @@ | |||
"""Classes derived from the Feedgen extension classes.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a line or two of context about Feedgen? E.g. at a high level, how does it work?
from datetime import datetime | ||
from feedgen.feed import FeedGenerator | ||
from pytz import utc | ||
from .atom_extensions import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, would be nice to avoid star imports. Creates some indirection for readers unfamiliar with the code
fg = FeedGenerator() | ||
fg.register_extension('opensearch', OpenSearchExtension) | ||
fg.register_extension("arxiv", ArxivExtension, ArxivEntryExtension, rss=False) | ||
fg.id("http://api.arxiv.org/") # TODO: review API ID generation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this where you would add the persistent query ID that you mentioned?
That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs -
I'll go through and find some examples that should have affiliations and use those for testing. I'm going to defer action on affiliations to ARXIVNG-2043.
This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query? Also the
As the
Definitely in scope, will fix. |
Co-Authored-By: JaimieMurdock <[email protected]>
We discussed this on 12 March, and the outcomes of that discussion were:
|
Gotcha. I think that your suggestion is a good one -- just return the full URI for the current query. We will need to make it clear in documentation that this does not preserve the result set at the time of the original query (although hopefully that would be obvious). |
Per RFC4287§4.2.14:
So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly. |
Can I take even more liberties with this and create something actually human-readable? Old:
New:
Proposed:
|
Sounds reasonable to me! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Feedgen stuff looks good! I'll try to remember to ping you if I work out a way to add the author affiliations and/or remove the generator element (or discover any problems with my code).
@Trumbore @JaimieMurdock In your atom_feed.remove(atom_feed.find('./generator')) You may want to make that a bit more robust by checking that |
The |
@erickpeirson Thanks for that suggestion, I’ll definitely try it out. That same idea may allow me to add the attributions for the authors. Feedgen doesn’t provide that ability, but I see how I could extend the ElementTree structure for an author.
From: Erick <[email protected]>
Sent: Thursday, April 4, 2019 11:43 AM
To: arXiv/arxiv-search <[email protected]>
Cc: Ben Trumbore <[email protected]>; Mention <[email protected]>
Subject: Re: [arXiv/arxiv-search] Atom/XML Serializer (#239)
@Trumbore<https://github.com/Trumbore> @JaimieMurdock<https://github.com/JaimieMurdock> In your extend_atom(self, atom_feed) method, atom_feed is an ElementTree root. So a minimal way to remove the generator element would be:
atom_feed.remove(atom_feed.find('./generator'))
You may want to make that a bit more robust by checking that find() actually returns an element. But that should get you started.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#239 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AGUxHp-97nO0XuZXVTUQ9dWrXrbf8NFmks5vdh1tgaJpZM4cZi-i>.
|
Based on the recent work in
arxiv-rss
, I implemented an Atom+XML serializer for the arxiv-search API.It adds to the work done by @Trumbore in arXiv/arxiv-feed#3:
<arxiv:affiliation>
tags has been added.Notes on implementation:
search.controllers.api
to add fields to support the Atom feed. It's too much fitting the data to the implementation, especially since we may want to add content negotiation later, in which case we won't want the extra fields, but also won't want to repeat the negotiation in the controller functions.search.process.transform._transform_author
does not do anything with affiliations. Are we indexing affiliations? If so, once this function is changed, then we'll gain support in the atom feeds. I've marked the spot with a# TODO:
arxiv-rss
does not implement any of the OpenSearch extensions. @Trumbore might find the plugin insearch.api.atom_extensions
to be useful.AtomSerializer.serialize_document()
yet. I'm debating whether it makes sense to have a single-entry feed, and whether to just cast the single document as aDocumentSet
and pass it into the existing serializer or whether it made sense to have a stripped downserialize_document
feed, like in theJSONSerializer
case.xmlns
attribute is not repeated on all Atom elements. This results in cleaner xml, smaller file sizes, and should still be standards compliant.Testing
Fairly straightforward workflow:
docker-compose build && docker-compose up
From the VPN:
Authorization
header for127.0.0.1:5000
via request.ly. The easiest way to generate a token is to use the generate_token.py script in arxiv-auth.$ cd arxiv-auth $ git pull $ pipenv install ./users $ JWT_SECRET=foosecret pipenv run python generate_token.py
Since we're pretty much only dealing with read permissions, it really doesn't matter if you use the defaults or something else in the script.
FLASK_APP=classic_api.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 JWT_SECRET=foosecret pipenv run flask run
Some sample queries: