-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial boolean support ARXIVNG-1971 #234
Conversation
@@ -0,0 +1,30 @@ | |||
# Notes for Boolean search implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for capturing this information here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per discussion it seems like cramming this into the Fielded paradigm that we used for the UI is a square-peg-round-hole situation. We arrived at something like
Term = Tuple[str,str]
Expression = Union[Term, Tuple[Operator, Term]]
Phrase = Tuple[Union[Expression, Phrase], Operator, Union[Expression, Phrase]]
for a stripped-down domain representation, possible attached to a class like
class ClassicAPIQuery:
phrase: Phrase
order: Optional[str]
size: int
page_start: int
that supports pagination in whatever way the classic API requires
Query parsing with whitespace and multifield support. ARXIVNG-1897
Atom+XML OpenAPI Schema ARXIVNG-1518
Adding `query` parameter to match classic functionality. starting routes for classic API. ARXIVNG-1893 ARXIVNG-1898
…arch.process.transform
Co-Authored-By: JaimieMurdock <[email protected]>
Atom/XML Serializer
This PR depends on the earlier PR adding query support: #233
Launch the service using the same instructions as for the previous PR.
This adds support for boolean operators. For example:
Known Limitations
The fields must be different - you cannot filter twice on the same field as the same boolean operator will apply to all terms with the same field. For example, "all:none AND all:universe OR all:physics" will result in the following
FieldedSearchTerm
objects:FieldedSearchTerm('OR', 'all', 'none')
FieldedSearchTerm('OR', 'all', 'universe')
FieldedSearchTerm('OR', 'all', 'physics')
Even without parentheticals, that's probably not the intent of the query. At the very least we'd expect:
FieldedSearchTerm('AND', 'all', 'none')
FieldedSearchTerm('AND', 'all', 'universe')
FieldedSearchTerm('OR', 'all', 'physics')
Proper design is complicated by the fact that booleans are applied as operands on each term, rather than as a binary operator between two terms. This design, however, might better reflect the elastic search practices, which would generate a query using
+
and-
operands for the terms, which essentially defaults to anOR
, whereas we default to anAND
. This has the semantics of our query as+none +universe physics
, whereas the elasticsearch defaults would benone +universe physics
. https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_boolean_operatorsWe need to settle on a consistent convention across the API. I personally vote for preserving the
AND
default, because I think that's what people expect when searching our database. They want all the terms included unless they specify anOR
. This is the default in the current search UI as well.