Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atom/XML Serializer #239

Merged
merged 5 commits into from
Jun 11, 2019
Merged

Atom/XML Serializer #239

merged 5 commits into from
Jun 11, 2019

Conversation

JaimieMurdock
Copy link
Contributor

@JaimieMurdock JaimieMurdock commented Apr 3, 2019

Based on the recent work in arxiv-rss, I implemented an Atom+XML serializer for the arxiv-search API.

It adds to the work done by @Trumbore in arXiv/arxiv-feed#3:

  • support for <arxiv:affiliation> tags has been added.

Notes on implementation:

  • I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed. It's too much fitting the data to the implementation, especially since we may want to add content negotiation later, in which case we won't want the extra fields, but also won't want to repeat the negotiation in the controller functions.
  • search.process.transform._transform_author does not do anything with affiliations. Are we indexing affiliations? If so, once this function is changed, then we'll gain support in the atom feeds. I've marked the spot with a # TODO:
  • arxiv-rss does not implement any of the OpenSearch extensions. @Trumbore might find the plugin in search.api.atom_extensions to be useful.
  • The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.
  • I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed, and whether to just cast the single document as a DocumentSet and pass it into the existing serializer or whether it made sense to have a stripped down serialize_document feed, like in the JSONSerializer case.
  • One change from the XML output of the current "classic" API is that the xmlns attribute is not repeated on all Atom elements. This results in cleaner xml, smaller file sizes, and should still be standards compliant.
  • Generating URLs for PDF links is not finalized.

Testing

Fairly straightforward workflow:

  • docker-compose build && docker-compose up

From the VPN:

pipenv install
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python create_index.py
FLASK_APP=app.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 pipenv run python bulk_index.py
  • Add a security token to the Authorization header for 127.0.0.1:5000 via request.ly. The easiest way to generate a token is to use the generate_token.py script in arxiv-auth.
$ cd arxiv-auth
$ git pull
$ pipenv install ./users
$ JWT_SECRET=foosecret pipenv run python generate_token.py

Since we're pretty much only dealing with read permissions, it really doesn't matter if you use the defaults or something else in the script.

  • Finally, to actually get the app going: FLASK_APP=classic_api.py FLASK_DEBUG=1 ELASTICSEARCH_HOST=127.0.0.1 JWT_SECRET=foosecret pipenv run flask run

Some sample queries:

Copy link
Contributor

@erickpeirson erickpeirson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

Are we indexing affiliations?

https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed

If it isn't supported in the classic API, then no need. NotImplementedError looks like the right way to go to me.

One change from the XML output of the current "classic" API is that the xmlns attribute is not repeated on all Atom elements. This results in cleaner xml, smaller file sizes, and should still be standards compliant.

I like this. This is what I expect when looking at namespaced XML. This is what I'm seeing in your responses, which looks good and valid.

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:arxiv="http://arxiv.org/schemas/atom" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns="http://www.w3.org/2005/Atom">
    <id>http://api.arxiv.org/</id>
    <title>arXiv Query: size: 50; terms: AND all=none; include_fields: ['version', 'title', 'href', 'canonical', 'paper_id', 'paper_id_v', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']</title>
    <updated>2019-04-03T09:52:43.426666+00:00</updated>
    <link href="https://api.arxiv.org/" type="application/atom+xml"/>
    <generator uri="http://lkiesow.github.io/python-feedgen" version="0.7.0">python-feedgen</generator>
    <opensearch:itemsPerPage>50</opensearch:itemsPerPage>
    <opensearch:totalResults>3</opensearch:totalResults>
    <opensearch:startIndex>0</opensearch:startIndex>
    <entry>
        <id>http://127.0.0.1:5000/1608.07086v2</id>
        <title>Why $Ξ(1690)$ and $Ξ(2120)$ are so narrow?</title>
        <updated>2019-04-03T09:52:43.427496+00:00</updated>
        <link href="http://127.0.0.1:5000/1608.07086v2" rel="alternate" type="text/html"/>
        <category term="nucl-th" scheme="http://arxiv.org://arxiv.org/schemas/atom"/>
        <category term="hep-ph" scheme="http://arxiv.org://arxiv.org/schemas/atom"/>
        <published>2018-02-06T09:54:00-05:00</published>
        <arxiv:comment>New calculations and results, published version, PRD</arxiv:comment>
        <arxiv:primary_category term="nucl-th"/>
        <arxiv:journal_ref>Phys. Rev. D 97, 034005 (2018)</arxiv:journal_ref>
        <arxiv:doi>10.1103/PhysRevD.97.034005</arxiv:doi>

Also

I'm getting Content-Type: text/html; charset=utf-8 header on the response; at some point will want that to be Content-Type: application/atom+xml; charset=UTF-8. Might be out of scope, but just noticed it when I spun up the examples.

search/routes/api/atom_extensions.py Show resolved Hide resolved
search/routes/api/atom_extensions.py Outdated Show resolved Hide resolved
@@ -0,0 +1,279 @@
"""Classes derived from the Feedgen extension classes."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a line or two of context about Feedgen? E.g. at a high level, how does it work?

from datetime import datetime
from feedgen.feed import FeedGenerator
from pytz import utc
from .atom_extensions import *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, would be nice to avoid star imports. Creates some indirection for readers unfamiliar with the code

search/routes/api/serialize.py Show resolved Hide resolved
fg = FeedGenerator()
fg.register_extension('opensearch', OpenSearchExtension)
fg.register_extension("arxiv", ArxivExtension, ArxivEntryExtension, rss=False)
fg.id("http://api.arxiv.org/") # TODO: review API ID generation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this where you would add the persistent query ID that you mentioned?

@JaimieMurdock
Copy link
Contributor Author

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

Are we indexing affiliations?

https://github.com/arXiv/arxiv-search/blob/master/mappings/DocumentMapping.json#L226

I'll go through and find some examples that should have affiliations and use those for testing. I'm going to defer action on affiliations to ARXIVNG-2043.

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

      <title xmlns="http://www.w3.org/2005/Atom">ArXiv Query:
    search_query=all:electron&amp;id_list=&amp;start=0&amp;max_results=1</title>
      <id xmlns="http://www.w3.org/2005/Atom">http://arxiv.org/api/cHxbiOdZaP56ODnBPIenZhzg5f8</id>

I didn't implement an AtomSerializer.serialize_document() yet. I'm debating whether it makes sense to have a single-entry feed

If it isn't supported in the classic API, then no need. NotImplementedError looks like the right way to go to me.

As the id_list parameter also gets handled by search.controllers.api.classic_search, I'll leave that NotImplementedError in place.

I'm getting Content-Type: text/html; charset=utf-8 header on the response; at some point will want that to be Content-Type: application/atom+xml; charset=UTF-8. Might be out of scope, but just noticed it when I spun up the examples.

Definitely in scope, will fix.

@erickpeirson
Copy link
Contributor

erickpeirson commented Apr 3, 2019

I'm not crazy about the hack in search.controllers.api to add fields to support the Atom feed.

Agreed. Per our discussion around #234 (review), my understanding was that you would implement a different domain class to describe classic API requests, that was fitted more closely to the query-string model, and that this would then get handled as a query_string query in search.services.index. But that part may be outside the scope of this PR?

That sounds in-scope for the overall project, and an architectural decision to discuss. I chose to use the same class to describe classic API requests, the primary reason being feature parity between old and new APIs - classic_search does the migration by just renaming search_query to query. This gives us, essentially for free, support for the new API's field-specific query params. It also means we only have one place to handle interfacing with search.services.index in controllers - after classic requests are "translated" to the new-style in the controller, both are using search.controllers.api.search to get result sets.

We discussed this on 12 March, and the outcomes of that discussion were:

  • The current APIQuery model does not work for the classic API, because it does not support the boolean operations required by the classic API.
  • We will define a new ClassicAPIQuery that does not use FieldedSearchTerms and friends, but instead focuses on the nested query-string structure that can be easily translated from the API query and translated to the ElasticSearch query_string query.

@erickpeirson
Copy link
Contributor

erickpeirson commented Apr 3, 2019

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

This is based on the example at https://arxiv.org/help/api - even running the example live doesn't result in a working URL though. Perhaps this should somehow be replaced with a url that reissues the query?

Gotcha. I think that your suggestion is a good one -- just return the full URI for the current query. We will need to make it clear in documentation that this does not preserve the result set at the time of the original query (although hopefully that would be obvious).

@erickpeirson
Copy link
Contributor

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

Per RFC4287§4.2.14:

The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.

So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.

@JaimieMurdock
Copy link
Contributor Author

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

Per RFC4287§4.2.14:

The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.

So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.

Can I take even more liberties with this and create something actually human-readable?

Old:

ArXiv Query: search_query=all:electron&id_list=&start=0&max_results=1

New:

arXiv Query: size: 50; terms: AND all=none; OR title=universes; include_fields: ['paper_id_v', 'paper_id', 'href', 'canonical', 'version', 'title', 'abstract', 'submitted_date', 'updated_date', 'comments', 'journal_ref', 'doi', 'primary_classification', 'secondary_classification', 'authors']

Proposed:

arXiv Search: all=none OR title=universes

@erickpeirson
Copy link
Contributor

Also the <title> tag in the new method shows the "exploded" parameters, rather than the raw query_string. Just want to make sure that where I go off-script gets another pair of eyes.

Per RFC4287§4.2.14:

The "atom:title" element is a Text construct that conveys a human-readable title for an entry or feed.

So as long as we are conveying the same information to a human reader, it's fine if it doesn't match the legacy API exactly.

Can I take even more liberties with this and create something actually human-readable?

Sounds reasonable to me!

Copy link
Collaborator

@Trumbore Trumbore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Feedgen stuff looks good! I'll try to remember to ping you if I work out a way to add the author affiliations and/or remove the generator element (or discover any problems with my code).

@erickpeirson
Copy link
Contributor

@Trumbore @JaimieMurdock In your extend_atom(self, atom_feed) method, atom_feed is an ElementTree root. So a minimal way to remove the generator element would be:

atom_feed.remove(atom_feed.find('./generator'))

You may want to make that a bit more robust by checking that find() actually returns an element. But that should get you started.

@mhl10
Copy link
Contributor

mhl10 commented Apr 4, 2019

The ID schema for the API results isn't yet solidified. In the previous arXiv API, each resultset was given a unique ID so it could be recalled.

I'm not familiar with that feature, but my initial reaction is that is sounds stateful and gross. @mhl10 would it rain fire if we said that we weren't going to support that feature anymore?

The id field is part of the Atom spec and simply represents a universally unique identifier for the feed. I don't think we need generate the id value in exactly the same way as in classic, but we should support it. It could be something as simple as b64 encoding or md5sum or other form of deterministic encoding of the query params.

@Trumbore
Copy link
Collaborator

Trumbore commented Apr 4, 2019 via email

@JaimieMurdock JaimieMurdock merged commit b1848d4 into task/ARXIVNG-1971 Jun 11, 2019
@bmaltzan bmaltzan deleted the task/ARXIVNG-1516 branch April 1, 2021 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants