Gabriel Vîjială edited this page Nov 13, 2016 · 3 revisions

Hoover search batch API

The goal is to let clients search for a large number of terms with decent performance and without hitting the rate limiter.

The solution is to use the Elasticsearch _msearch endpoint with count operations, so the query will only return the hit count for each individual query, along with any aggregations that were requested.

The request

Send a GET request to /batch, without any parameters, and with the data formatted as JSON in the body. You can see an example of the request in src/search.js, in the search method.

Here's an example request:

{
    "aggs": {
        "approx_distinct_hash": {
            "cardinality": {
                "field": "sha1"
            }
        }
    },
    "query_strings": [
        "one",
        "two",
        "three",
        "(*!@%~98@*(R",
        "*"
    ],
    "collections": [
        "Code",
        "Test",
        "Enron"
    ]
}

The query_strings list is supplied by the user: each line the user enters becomes a separate entry in query_strings.

A single request can carry at most 100 queries. Requests with more than 100 queries will fail.

The aggs object is attached to each of the queries sent, and its result is included in the response for each query.
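The 100-query limit means a long list of terms has to be split across several requests. Below is a minimal sketch of that splitting. The helper name buildBatchBodies and the assumption that the input is newline-separated text are illustrative only and are not taken from src/search.js; treating aggs as optional is also an assumption.

```javascript
// Maximum number of queries accepted per /batch request (stated in the text above).
const MAX_QUERIES_PER_REQUEST = 100;

// Hypothetical helper: turn newline-separated user input into a list of
// request bodies, each carrying at most MAX_QUERIES_PER_REQUEST queries.
function buildBatchBodies(text, collections, aggs) {
  // One query per non-empty line.
  const queries = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const bodies = [];
  for (let i = 0; i < queries.length; i += MAX_QUERIES_PER_REQUEST) {
    const body = {
      query_strings: queries.slice(i, i + MAX_QUERIES_PER_REQUEST),
      collections: collections,
    };
    // Assumption: aggs is only included when the caller asks for aggregations.
    if (aggs) {
      body.aggs = aggs;
    }
    bodies.push(body);
  }
  return bodies;
}
```

Each returned object can be serialized with JSON.stringify and sent as the body of one /batch request.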

The response

The response has a responses field that has the results in the same order as the queries given.

For each response object, the following data is important:

  • response.hits.total: the total number of hits for that query
  • response.timed_out: true if the query timed out before completing
  • response._query_string: the query string you passed in (like "one")

The response._query_string field is filled out so the UI doesn't have to store the queries until the response is actually returned. The UI should extract the query string (such as "one" above) and use it to:

  • show the result text
  • link to /search?q=one
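The two uses of _query_string listed above can be sketched as follows. The helper name resultLink and the shape of its return value are hypothetical; URL-encoding the query is an assumption added here so that queries containing characters such as @ or % still produce a valid link.

```javascript
// Hypothetical helper: derive the display text and the search link
// for one entry of the responses list.
function resultLink(response) {
  const query = response._query_string;
  return {
    text: query,
    // encodeURIComponent escapes characters that are not safe in a query string.
    href: "/search?q=" + encodeURIComponent(query),
  };
}
```

For the first sample query, resultLink({ _query_string: "one" }).href is "/search?q=one".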

The example above also includes an aggregation to approximate the number of documents that are distinct (by hash). This number varies from query to query. The approximate value is in response.aggregations.approx_distinct_hash.value.

If one of the queries fails, the hits, timed_out and aggregations fields are not set on its response. You will have to get the error message from response.error.root_cause[0].reason.

If response.error.root_cause is an empty list, you can fall back to response.error.failed_shards[0].reason.reason. If response.error.failed_shards is also an empty list, that means that the Elasticsearch setup is utterly broken and all hope is lost.
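The fallback order for error messages can be sketched as below. The helper name errorMessage is hypothetical, and the last-resort use of response.error.reason (visible in the sample data as "all shards failed") is an assumption, not something this page prescribes.

```javascript
// Hypothetical helper: extract a human-readable error message from one
// entry of the responses list, or return null if the query succeeded.
function errorMessage(response) {
  const error = response.error;
  if (!error) {
    return null;
  }

  // First choice: the root cause reported by Elasticsearch.
  const rootCause = error.root_cause || [];
  if (rootCause.length > 0) {
    return rootCause[0].reason;
  }

  // Second choice: the reason reported by the first failed shard.
  const failedShards = error.failed_shards || [];
  if (failedShards.length > 0) {
    return failedShards[0].reason.reason;
  }

  // Assumption: fall back to the top-level reason, then to a generic label.
  return error.reason || "unknown error";
}
```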

A sample of the data returned by the request is below.

{
   "status" : "ok",
   "responses" : [
      {
         "timed_out" : false,
         "took" : 61,
         "_shards" : {
            "failed" : 0,
            "total" : 10,
            "successful" : 10
         },
         "_query_string" : "one",
         "aggregations" : {
            "approx_distinct_hash" : {
               "value" : 4051
            }
         },
         "hits" : {
            "max_score" : 0,
            "total" : 4034,
            "hits" : []
         }
      },
      {
         "hits" : {
            "hits" : [],
            "max_score" : 0,
            "total" : 2350
         },
         "aggregations" : {
            "approx_distinct_hash" : {
               "value" : 2350
            }
         },
         "_query_string" : "two",
         "took" : 64,
         "timed_out" : false,
         "_shards" : {
            "total" : 10,
            "successful" : 10,
            "failed" : 0
         }
      },
      {
         "_query_string" : "three",
         "aggregations" : {
            "approx_distinct_hash" : {
               "value" : 1224
            }
         },
         "_shards" : {
            "failed" : 0,
            "total" : 10,
            "successful" : 10
         },
         "timed_out" : false,
         "took" : 62,
         "hits" : {
            "total" : 1224,
            "max_score" : 0,
            "hits" : []
         }
      },
      {
         "_query_string" : "(*!@%~98@*(R",
         "error" : {
            "failed_shards" : [
               {
                  "reason" : {
                     "reason" : "For input string: \"98@\"",
                     "type" : "number_format_exception"
                  },
                  "index" : "hoover-enron-pst",
                  "shard" : 0,
                  "node" : "7Qb3oVj7QBiBUTs3ZoL-og"
               }
            ],
            "reason" : "all shards failed",
            "root_cause" : [
               {
                  "type" : "number_format_exception",
                  "reason" : "For input string: \"98@\""
               }
            ],
            "grouped" : true,
            "type" : "search_phase_execution_exception",
            "phase" : "query"
         }
      },
      {
         "hits" : {
            "hits" : [],
            "total" : 22185,
            "max_score" : 0
         },
         "aggregations" : {
            "approx_distinct_hash" : {
               "value" : 21912
            }
         },
         "_query_string" : "*",
         "_shards" : {
            "successful" : 10,
            "total" : 10,
            "failed" : 0
         },
         "took" : 103,
         "timed_out" : false
      }
   ]
}
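As an illustration of reading a payload like the one above, the sketch below flattens it into one row per query. The function name summarize and the row fields (query, failed, timedOut, hits, distinct) are hypothetical; the aggregation name approx_distinct_hash matches the example request and would differ for other aggs.

```javascript
// Hypothetical helper: turn the parsed /batch response into one row per query.
function summarize(batchResult) {
  return batchResult.responses.map((response) => {
    // Failed queries carry an error object and no hits or aggregations.
    if (response.error) {
      return { query: response._query_string, failed: true };
    }
    const aggs = response.aggregations || {};
    return {
      query: response._query_string,
      failed: false,
      timedOut: response.timed_out,
      hits: response.hits.total,
      // null when the request did not ask for the approx_distinct_hash aggregation.
      distinct: aggs.approx_distinct_hash ? aggs.approx_distinct_hash.value : null,
    };
  });
}
```

Because the responses keep the order of the submitted queries, the rows come out in the order the user entered them.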