Batch Search API
The idea is to make it possible to search for a large number of terms without hitting the rate limiter and with decent performance.
The solution is to use the Elasticsearch _msearch endpoint with count operations, so each individual query returns only its hit count, along with any aggregations that were requested.
Send a GET request to /batch, without any parameters, and with the data formatted as JSON in the body. You can see an example of the request in the search method in src/search.js.
Here's an example request:
{
  "aggs": {
    "approx_distinct_hash": {
      "cardinality": {
        "field": "sha1"
      }
    }
  },
  "query_strings": [
    "one",
    "two",
    "three",
    "(*!@%~98@*(R",
    "*"
  ],
  "collections": [
    "Code",
    "Test",
    "Enron"
  ]
}
The query_strings list should be filled in by the user (each line of input becomes a separate query string).
A single request can contain at most 100 queries. Requests with more than 100 queries will fail.
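As a rough illustration of the limit above, a client could split a long list of queries into chunks of at most 100 and build one request body per chunk. This is only a sketch: the helper names chunkQueries and buildBatchBody are hypothetical and not part of the API, and it assumes the aggs field may be omitted when no aggregation is wanted.

```javascript
// Hypothetical helpers; these names are illustrative and not part of the documented API.
const MAX_QUERIES_PER_REQUEST = 100; // limit stated in the text above

// Split a list of query strings into chunks that each fit in one /batch request.
function chunkQueries(queryStrings, size = MAX_QUERIES_PER_REQUEST) {
  const chunks = [];
  for (let i = 0; i < queryStrings.length; i += size) {
    chunks.push(queryStrings.slice(i, i + size));
  }
  return chunks;
}

// Build the JSON body for a single /batch request.
// Assumption: "aggs" is optional and can be left out of the body.
function buildBatchBody(queryStrings, collections, aggs) {
  if (queryStrings.length > MAX_QUERIES_PER_REQUEST) {
    throw new RangeError(`at most ${MAX_QUERIES_PER_REQUEST} queries per request`);
  }
  const body = { query_strings: queryStrings, collections };
  if (aggs !== undefined) {
    body.aggs = aggs;
  }
  return JSON.stringify(body);
}

// Example: user input with one query per line; blank lines are skipped.
const userInput = 'one\ntwo\n\nthree';
const queryLines = userInput.split('\n').filter((line) => line.trim() !== '');
const bodies = chunkQueries(queryLines).map((chunk) =>
  buildBatchBody(chunk, ['Code', 'Test', 'Enron'])
);
```

Each entry in bodies would then be sent as the body of its own request to /batch.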
The aggs field is attached to each of the queries sent, and its result is included in the response for each of those queries.
The response has a responses field that contains the results in the same order as the queries given.
For each response object, the following data is important:
- response.hits.total: the total number of hits for that query
- response.timed_out: set to true if the query timed out
- response._query_string: the query string you passed in (like "one")
The response._query_string field is filled in so the UI doesn't have to store the queries until the response is actually returned. The UI should extract the query string (such as "one" above) and use it to:
- show the result text
- link to /search?q=one
The example above also includes an aggregation that approximates the number of distinct documents (by hash).
This number varies from query to query. The approximate value is in response.aggregations.approx_distinct_hash.value.
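The fields described above could be gathered into one object per result for the UI to render. The following is a minimal sketch, not the code from src/search.js: the function name summarizeResponse and the shape of the returned object are assumptions, and it URL-encodes the query string when building the link.

```javascript
// Hypothetical helper; the name and the returned shape are illustrative only.
function summarizeResponse(response) {
  const query = response._query_string;
  return {
    // Text to show next to the result.
    query,
    // Link to the regular search page for this query, URL-encoded.
    link: `/search?q=${encodeURIComponent(query)}`,
    // Hit count, or null when the query failed and "hits" is missing.
    total: response.hits ? response.hits.total : null,
    // True only when the response reports a timeout.
    timedOut: response.timed_out === true,
    // Approximate distinct count, or null when the aggregation is absent.
    approxDistinct: response.aggregations?.approx_distinct_hash?.value ?? null,
  };
}

// Example using values from the first entry of the sample response.
const summary = summarizeResponse({
  _query_string: 'one',
  timed_out: false,
  hits: { total: 4034, max_score: 0, hits: [] },
  aggregations: { approx_distinct_hash: { value: 4051 } },
});
```

For a failed query, total and approxDistinct come back as null, so the UI can tell it apart from a query with zero hits.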
If one of the queries fails, the hits, timed_out and aggregations fields are not set on its response.
You will have to get the error message from response.error.root_cause[0].reason.
If response.error.root_cause is an empty list, you can get the error message from response.error.failed_shards[0].reason.reason. If response.error.failed_shards is also an empty list, that means the Elasticsearch setup is utterly broken and all hope is lost.
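The fallback order described above can be sketched as a small helper. The function name getErrorReason is hypothetical, and the last fallback to response.error.reason is an assumption based on the sample data, not something the API promises.

```javascript
// Hypothetical helper; returns null when the response carries no error.
function getErrorReason(response) {
  const error = response.error;
  if (!error) {
    return null;
  }

  // First choice: the reason given by the first root cause.
  const rootCause = error.root_cause || [];
  if (rootCause.length > 0) {
    return rootCause[0].reason;
  }

  // Second choice: the nested reason of the first failed shard.
  const failedShards = error.failed_shards || [];
  if (failedShards.length > 0) {
    return failedShards[0].reason.reason;
  }

  // Assumption: fall back to the top-level reason, if there is one.
  return error.reason || 'unknown error';
}
```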
A sample of the data returned by the request is below.
{
  "status" : "ok",
  "responses" : [
    {
      "timed_out" : false,
      "took" : 61,
      "_shards" : {
        "failed" : 0,
        "total" : 10,
        "successful" : 10
      },
      "_query_string" : "one",
      "aggregations" : {
        "approx_distinct_hash" : {
          "value" : 4051
        }
      },
      "hits" : {
        "max_score" : 0,
        "total" : 4034,
        "hits" : []
      }
    },
    {
      "hits" : {
        "hits" : [],
        "max_score" : 0,
        "total" : 2350
      },
      "aggregations" : {
        "approx_distinct_hash" : {
          "value" : 2350
        }
      },
      "_query_string" : "two",
      "took" : 64,
      "timed_out" : false,
      "_shards" : {
        "total" : 10,
        "successful" : 10,
        "failed" : 0
      }
    },
    {
      "_query_string" : "three",
      "aggregations" : {
        "approx_distinct_hash" : {
          "value" : 1224
        }
      },
      "_shards" : {
        "failed" : 0,
        "total" : 10,
        "successful" : 10
      },
      "timed_out" : false,
      "took" : 62,
      "hits" : {
        "total" : 1224,
        "max_score" : 0,
        "hits" : []
      }
    },
    {
      "_query_string" : "(*!@%~98@*(R",
      "error" : {
        "failed_shards" : [
          {
            "reason" : {
              "reason" : "For input string: \"98@\"",
              "type" : "number_format_exception"
            },
            "index" : "hoover-enron-pst",
            "shard" : 0,
            "node" : "7Qb3oVj7QBiBUTs3ZoL-og"
          }
        ],
        "reason" : "all shards failed",
        "root_cause" : [
          {
            "type" : "number_format_exception",
            "reason" : "For input string: \"98@\""
          }
        ],
        "grouped" : true,
        "type" : "search_phase_execution_exception",
        "phase" : "query"
      }
    },
    {
      "hits" : {
        "hits" : [],
        "total" : 22185,
        "max_score" : 0
      },
      "aggregations" : {
        "approx_distinct_hash" : {
          "value" : 21912
        }
      },
      "_query_string" : "*",
      "_shards" : {
        "successful" : 10,
        "total" : 10,
        "failed" : 0
      },
      "took" : 103,
      "timed_out" : false
    }
  ]
}