Atom/XML Serializer #239

JaimieMurdock · 2019-04-03T03:36:00Z

Based on the recent work in arxiv-rss, I implemented an Atom+XML serializer for the arxiv-search API.

It adds to the work done by @Trumbore in arXiv/arxiv-feed#3:

support for <arxiv:affiliation> tags has been added.

Notes on implementation:

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed. It's too much fitting the data to the implementation, especially since we may want to add content negotiation later, in which case we won't want the extra fields, but also won't want to repeat the negotiation in the controller functions.
search.process.transform._transform_author does not do anything with affiliations. Are we indexing affiliations? If so, once this function is changed, then we'll gain support in the atom feeds. I've marked the spot with a # TODO:
arxiv-rss does not implement any of the OpenSearch extensions. @Trumbore might find the plugin in search.api.atom_extensions to be useful.
The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.
I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed, and whether to just cast the single document as a DocumentSet and pass it into the existing serializer or whether it made sense to have a stripped down serialize_document feed, like in the JSONSerializer case.
One change from the XML output of the current "classic" API is that the xmlns attribute is not repeated on all Atom elements. This results in cleaner xml, smaller file sizes, and should still be standards compliant.
Generating URLs for PDF links is not finalized.

Testing

Fairly straightforward workflow:

docker-compose build && docker-compose up

From the VPN:

pipenv install
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python create_index.py
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python bulk_index.py

Add a security token to the Authorization header for 127.0.0.1:5000 via request.ly. The easiest way to generate a token is to use the generate_token.py script in arxiv-auth.

$ cd arxiv-auth
$ git pull
$ pipenv install ./users
$ JWT_SECRET=foosecret pipenv run python generate_token.py

Since we're pretty much only dealing with read permissions, it really doesn't matter if you use the defaults or something else in the script.

Finally, to actually get the app going: FLASK_APP=classic_api.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 JWT_SECRET=foosecret pipenv run flask run

Some sample queries:

http://127.0.0.1:5000/query?search_query=all:none (expect 3 results)
http://127.0.0.1:5000/query?search_query=all:none+OR+ti:universes (expect 9 results)
http://127.0.0.1:5000/query?search_query=all:none+AND+ti:universes (expect 0 results)
http://127.0.0.1:5000/query?search_query=all:none&id_list=1710.08254 (expect 1 result)
http://127.0.0.1:5000/query?id_list=1710.08254v1,1610.07526v3 (expect 2 results)

…arch.process.transform

erickpeirson

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

Are we indexing affiliations?

https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed

If it isn't supported in the classic API, then no need. NotImplementedError looks like the right way to go to me.

One change from the XML output of the current "classic" API is that the xmlns attribute is not repeated on all Atom elements. This results in cleaner xml, smaller file sizes, and should still be standards compliant.

I like this. This is what I expect when looking at namespaced XML. This is what I'm seeing in your responses, which looks good and valid.

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:arxiv="http://arxiv.org/schemas/atom" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns="http://www.w3.org/2005/Atom">
    <id>http://api.arxiv.org/</id>
    <title>arXiv Query: size: 50; terms: AND all=none; include_fields: ['version', 'title', 'href', 'canonical', 'paper_id', 'paper_id_v', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']</title>
    <updated>2019-04-03T09:52:43.426666+00:00</updated>
    <link href="https://api.arxiv.org/" type="application/atom+xml"/>
    <generator uri="http://lkiesow.github.io/python-feedgen" version="0.7.0">python-feedgen</generator>
    <opensearch:itemsPerPage>50</opensearch:itemsPerPage>
    <opensearch:totalResults>3</opensearch:totalResults>
    <opensearch:startIndex>0</opensearch:startIndex>
    <entry>
        <id>http://127.0.0.1:5000/1608.07086v2</id>
        <title>Why $Ξ(1690)$ and $Ξ(2120)$ are so narrow?</title>
        <updated>2019-04-03T09:52:43.427496+00:00</updated>
        <link href="http://127.0.0.1:5000/1608.07086v2" rel="alternate" type="text/html"/>
        <category term="nucl-th" scheme="http://arxiv.org://arxiv.org/schemas/atom"/>
        <category term="hep-ph" scheme="http://arxiv.org://arxiv.org/schemas/atom"/>
        <published>2018-02-06T09:54:00-05:00</published>
        <arxiv:comment>New calculations and results, published version, PRD</arxiv:comment>
        <arxiv:primary_category term="nucl-th"/>
        <arxiv:journal_ref>Phys. Rev. D 97, 034005 (2018)</arxiv:journal_ref>
        <arxiv:doi>10.1103/PhysRevD.97.034005</arxiv:doi>

Also

I'm getting Content-Type: text/html; charset=utf-8 header on the response; at some point will want that to be Content-Type: application/atom+xml; charset=UTF-8. Might be out of scope, but just noticed it when I spun up the examples.

search/routes/api/atom_extensions.py

erickpeirson · 2019-04-03T09:24:06Z

search/routes/api/atom_extensions.py

@@ -0,0 +1,279 @@
+"""Classes derived from the Feedgen extension classes."""


Can you add a line or two of context about Feedgen? E.g. at a high level, how does it work?

erickpeirson · 2019-04-03T09:27:06Z

search/routes/api/serialize.py

+from datetime import datetime
+from feedgen.feed import FeedGenerator
+from pytz import utc
+from .atom_extensions import *


If possible, would be nice to avoid star imports. Creates some indirection for readers unfamiliar with the code

search/routes/api/serialize.py

erickpeirson · 2019-04-03T09:31:00Z

search/routes/api/serialize.py

+        fg = FeedGenerator()
+        fg.register_extension('opensearch', OpenSearchExtension)
+        fg.register_extension("arxiv", ArxivExtension, ArxivEntryExtension, rss=False)
+        fg.id("http://api.arxiv.org/") # TODO: review API ID generation


Is this where you would add the persistent query ID that you mentioned?

JaimieMurdock · 2019-04-03T13:22:43Z

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

Are we indexing affiliations?

https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226

I'll go through and find some examples that should have affiliations and use those for testing. I'm going to defer action on affiliations to ARXIVNG-2043.

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

      <title xmlns="http://www.w3.org/2005/Atom">ArXiv Query:
    search_query=all:electron&amp;id_list=&amp;start=0&amp;max_results=1</title>
      <id xmlns="http://www.w3.org/2005/Atom">http://arxiv.org/api/cHxbiOdZaP56ODnBPIenZhzg5f8</id>

I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed

If it isn't supported in the classic API, then no need. NotImplementedError looks like the right way to go to me.

As the id_list parameter also gets handled by search.controllers.api.classic_search, I'll leave that NotImplementedError in place.

I'm getting Content-Type: text/html; charset=utf-8 header on the response; at some point will want that to be Content-Type: application/atom+xml; charset=UTF-8. Might be out of scope, but just noticed it when I spun up the examples.

Definitely in scope, will fix.

Co-Authored-By: JaimieMurdock <[email protected]>

erickpeirson · 2019-04-03T13:31:39Z

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

We discussed this on 12 March, and the outcomes of that discussion were:

The current APIQuery model does not work for the classic API, because it does not support the boolean operations required by the classic API.
We will define a new ClassicAPIQuery that does not use FieldedSearchTerms and friends, but instead focuses on the nested query-string structure that can be easily translated from the API query and translated to the ElasticSearch query_string query.

erickpeirson · 2019-04-03T13:35:05Z

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Gotcha. I think that your suggestion is a good one -- just return the full URI for the current query. We will need to make it clear in documentation that this does not preserve the result set at the time of the original query (although hopefully that would be obvious).

erickpeirson · 2019-04-03T13:40:27Z

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

Per RFC4287§4.2.14:

The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.

So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.

JaimieMurdock · 2019-04-03T14:06:20Z

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

Per RFC4287§4.2.14:

The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.

So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.

Can I take even more liberties with this and create something actually human-readable?

Old:

ArXiv Query: search_query=all:electron&id_list=&start=0&max_results=1

New:

arXiv Query: size: 50; terms: AND all=none; OR title=universes; include_fields: ['paper_id_v', 'paper_id', 'href', 'canonical', 'version', 'title', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']

Proposed:

arXiv Search: all=none OR title=universes

erickpeirson · 2019-04-03T14:26:54Z

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

Per RFC4287§4.2.14:

The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.

So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.

Can I take even more liberties with this and create something actually human-readable?

Sounds reasonable to me!

Trumbore

The Feedgen stuff looks good! I'll try to remember to ping you if I work out a way to add the author affiliations and/or remove the generator element (or discover any problems with my code).

erickpeirson · 2019-04-04T15:42:36Z

@Trumbore @JaimieMurdock In your extend_atom(self, atom_feed) method, atom_feed is an ElementTree root. So a minimal way to remove the generator element would be:

atom_feed.remove(atom_feed.find('./generator'))

You may want to make that a bit more robust by checking that find() actually returns an element. But that should get you started.

mhl10 · 2019-04-04T16:11:34Z

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

The id field is part of the Atom spec and simply represents a universally unique identifier for the feed. I don't think we need generate the id value in exactly the same way as in classic, but we should support it. It could be something as simple as b64 encoding or md5sum or other form of deterministic encoding of the query params.

Trumbore · 2019-04-04T20:26:18Z

@erickpeirson Thanks for that suggestion, I’ll definitely try it out. That same idea may allow me to add the attributions for the authors. Feedgen doesn’t provide that ability, but I see how I could extend the ElementTree structure for an author. From: Erick <[email protected]> Sent: Thursday, April 4, 2019 11:43 AM To: arXiv/arxiv-search <[email protected]> Cc: Ben Trumbore <[email protected]>; Mention <[email protected]> Subject: Re: [arXiv/arxiv-search] Atom/XML Serializer (#239) @Trumbore<https://github.com/Trumbore> @JaimieMurdock<https://github.com/JaimieMurdock> In your extend_atom(self, atom_feed) method, atom_feed is an ElementTree root. So a minimal way to remove the generator element would be: atom_feed.remove(atom_feed.find('./generator')) You may want to make that a bit more robust by checking that find() actually returns an element. But that should get you started. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#239 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AGUxHp-97nO0XuZXVTUQ9dWrXrbf8NFmks5vdh1tgaJpZM4cZi-i>.

JaimieMurdock added 4 commits April 2, 2019 19:56

Atom serializer

6ee11de

Adding stubs for where arxiv:affiliation needs to be handled

afef5bb

Adding author affiliation handling to entries. Still no support in se…

a30916b

…arch.process.transform

Adding note about not supporting serialize_document

79efeab

JaimieMurdock self-assigned this Apr 3, 2019

JaimieMurdock requested review from mhl10, bbarker, erickpeirson and Trumbore April 3, 2019 03:36

erickpeirson reviewed Apr 3, 2019

View reviewed changes

Apply suggestions from code review

24ab5f9

Co-Authored-By: JaimieMurdock <[email protected]>

Trumbore approved these changes Apr 3, 2019

View reviewed changes

JaimieMurdock merged commit b1848d4 into task/ARXIVNG-1971 Jun 11, 2019

bmaltzan deleted the task/ARXIVNG-1516 branch April 1, 2021 17:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Atom/XML Serializer #239

Atom/XML Serializer #239

JaimieMurdock commented Apr 3, 2019 •

edited

Loading

erickpeirson left a comment

erickpeirson Apr 3, 2019

erickpeirson Apr 3, 2019

erickpeirson Apr 3, 2019

JaimieMurdock commented Apr 3, 2019

erickpeirson commented Apr 3, 2019 •

edited

Loading

erickpeirson commented Apr 3, 2019 •

edited

Loading

erickpeirson commented Apr 3, 2019

JaimieMurdock commented Apr 3, 2019

erickpeirson commented Apr 3, 2019

Trumbore left a comment

erickpeirson commented Apr 4, 2019

mhl10 commented Apr 4, 2019

Trumbore commented Apr 4, 2019 via email

		@@ -0,0 +1,279 @@
		"""Classes derived from the Feedgen extension classes."""

Atom/XML Serializer #239

Atom/XML Serializer #239

Conversation

JaimieMurdock commented Apr 3, 2019 • edited Loading

Testing

Some sample queries:

erickpeirson left a comment

Choose a reason for hiding this comment

Also

erickpeirson Apr 3, 2019

Choose a reason for hiding this comment

erickpeirson Apr 3, 2019

Choose a reason for hiding this comment

erickpeirson Apr 3, 2019

Choose a reason for hiding this comment

JaimieMurdock commented Apr 3, 2019

erickpeirson commented Apr 3, 2019 • edited Loading

erickpeirson commented Apr 3, 2019 • edited Loading

erickpeirson commented Apr 3, 2019

JaimieMurdock commented Apr 3, 2019

erickpeirson commented Apr 3, 2019

Trumbore left a comment

Choose a reason for hiding this comment

erickpeirson commented Apr 4, 2019

mhl10 commented Apr 4, 2019

Trumbore commented Apr 4, 2019 via email

JaimieMurdock commented Apr 3, 2019 •

edited

Loading

erickpeirson commented Apr 3, 2019 •

edited

Loading

erickpeirson commented Apr 3, 2019 •

edited

Loading