Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/#258 search query and id list #269

Merged
merged 7 commits into from
Feb 20, 2020

Conversation

alefnula
Copy link
Collaborator

Implemented a new parser for classic api query using lark-parser (it's faster and more flexible and has nicer syntax than ply).

Rewrote all the tests and added tests for ElasticSearch query generation.

I still have to test this a little bit, but as far as I did until now everything works as it should be.

@alefnula alefnula requested a review from mhl10 February 14, 2020 18:58

# Disjunct query with an unary not.
phrase = (('au', 'del_maestro'), 'OR', ('ANDNOT', ('ti', 'checkerboard')))
phrase = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is much easier to read than before 👍


from search.domain.base import Phrase, Operator, Field, Term
class QueryTransformer(Transformer):
def string(self, tokens):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A class docstring would be helpful here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.


# Cast field to Field enum.
QUERY_PARSER = Lark(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nice!

# Filter id_list if necessary.
if query.id_list:
# Separate versioned and unversioned papers.
paper_ids = [id for id in query.id_list if "v" not in id]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unfortunately won't work for a subset of the old-style arXiv identifiers (e.g. solv-int/9704004). Something like this should be ok:

import re
r1 = re.compile(".*v[1-9]\d*$")
paper_ids = []
paper_id_vs = []
for id in query.id_list:
    paper_id_vs.append(id) if r1.search(id) else paper_ids.append(id)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@mhl10 mhl10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudos, and thanks! Although I didn't dive into the lark-parser documentation in great depth just yet, I think this is much clearer than the original code. Good call on using that over ply. I think this will be easier to maintain.

All the queries I tested worked as expected.

I made some very minor comments and pushed a tiny commit to remove the generator element.

@alefnula
Copy link
Collaborator Author

I've spent some time comparing the EBNF parsers since I got a little bit stuck with PLY (simple things were simple but then I got stuck with some of more complex stuff). And then I found LARK which is more actively developed, has more github stars, is pretty much faster and uses a lot less memory. I really don't know why I haven't heard of it earlier, but doing it in LARK was a breeze. So I think it's a win win situation.

There is only one thing that I want to fix. But I need a suggestion. What should an empty string parse to? all:""?

@mhl10
Copy link
Contributor

mhl10 commented Feb 17, 2020

There is only one thing that I want to fix. But I need a suggestion. What should an empty string parse to? all:""?

If search_query is empty or any of the field terms are empty I would just short-circuit the request to the search backend and return something like the legacy version does, with no entries and 200 status. It's a bit odd, but would be compatible I suppose.

@alefnula
Copy link
Collaborator Author

OK will do that.

@alefnula
Copy link
Collaborator Author

I just have to add the edge cases where no query string is provided and then this pull request is ready.

@alefnula alefnula force-pushed the feature/#258-search-query-and-id-list branch from 511c906 to fc700ef Compare February 20, 2020 10:30
@alefnula alefnula merged commit 342ff00 into develop Feb 20, 2020
@bmaltzan bmaltzan deleted the feature/#258-search-query-and-id-list branch April 1, 2021 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants