The source_regex filter is designed to quickly perform arbitrary regular expression searches against the source of documents. It uses an nGram index to select only the documents that might match and then runs the regular expression against just those documents. In simple local testing that is an order of magnitude faster (50ms -> 5ms), and it ought to be much, much better on real, large document sets (minutes -> tens or hundreds of milliseconds).
It's capable of accelerating even somewhat obtuse regular expressions like /a(b+|c+)d/ and /(abc*)+de/ using the algorithm described here. For regular expressions that can't be accelerated by that method, the filter can still bound execution cost by limiting the number of documents against which a match attempt is made.
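As a rough sketch of the acceleration idea (hypothetical pseudocode, not the filter's literal behavior): any string matching /foo.*bar/ must contain the literal runs "foo" and "bar", which are themselves trigrams, so a cheap ngram query can prune the candidate set before the real regex runs.

# Hypothetical illustration of the pre-filtering step:
# the literal runs in /foo.*bar/ yield the required trigrams "foo" and "bar".
#
#   candidates = ngram_index("foo") AND ngram_index("bar")
#   matches    = every candidate whose source actually matches /foo.*bar/
#
# Only the (usually small) candidate set pays the cost of full regex matching.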
Analyze a field with trigrams like so:
curl -XDELETE http://localhost:9200/regex_test
curl -XPOST http://localhost:9200/regex_test -d '{
    "index": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "trigram": {
                    "type": "custom",
                    "tokenizer": "trigram",
                    "filter": ["lowercase"]
                }
            },
            "tokenizer": {
                "trigram": {
                    "type": "nGram",
                    "min_gram": "3",
                    "max_gram": "3"
                }
            }
        }
    }
}'
curl -XPOST http://localhost:9200/regex_test/test/_mapping -d '{
    "test": {
        "properties": {
            "test": {
                "type": "string",
                "fields": {
                    "trigrams": {
                        "type": "string",
                        "analyzer": "trigram",
                        "index_options": "docs"
                    }
                }
            }
        }
    }
}'
curl -XPOST http://localhost:9200/regex_test/test -d'{"test": "I can has test"}'
curl -XPOST http://localhost:9200/regex_test/test -d'{"test": "Yay"}'
curl -XPOST http://localhost:9200/regex_test/test -d'{"test": "WoW match STuFF"}'
curl -XPOST http://localhost:9200/regex_test/_refresh
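If you want to check what the trigram analyzer produces for a document, you can use the _analyze API (a sketch; the exact request and response format depend on your version):

curl -XGET 'http://localhost:9200/regex_test/_analyze?analyzer=trigram&pretty=true' -d 'I can has test'
# Expected tokens (abridged): "i c", " ca", "can", ..., "est". Note the
# lowercasing, which is what keeps the index usable for both case sensitive
# and case insensitive searches.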
Then send queries like so:
curl -XPOST http://localhost:9200/regex_test/test/_search?pretty=true -d '{
    "query": {
        "filtered": {
            "filter": {
                "source_regex": {
                    "field": "test",
                    "regex": "i ca..has",
                    "ngram_field": "test.trigrams"
                }
            }
        }
    }
}'
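Of the three documents indexed above, only "I can has test" should match: the filter is case insensitive by default, so "i ca..has" matches "i can has". An abridged response might look something like this (hypothetical output; the exact shape varies by version):

{
    "hits": {
        "total": 1,
        "hits": [
            {
                "_source": {
                    "test": "I can has test"
                }
            }
        ]
    }
}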
Parameters (an example that sets several of them follows this list):

regex
    The regular expression to process. Required.
field
    The field whose source is checked against the regex. Required.
load_from_source
    Load field's value from source. Defaults to false. Set it to true if field
    isn't in the source but is stored.
ngram_field
    The field containing field's text analyzed with the nGram analyzer. If not
    sent then the regular expression won't be accelerated with ngrams.
gram_size
    The number of characters in the ngram. Defaults to 3 because trigrams are
    cool.
max_expand
    Maximum range before outgoing automaton arcs are ignored. Roughly
    corresponds to the maximum number of characters in a character class
    ([abcd]) before it is treated as "." for purposes of acceleration.
    Defaults to 4.
max_states_traced
    Maximum number of automaton states that can be traced before the algorithm
    gives up, assumes the regex is too complex, and throws an error back to
    the user. Defaults to 10000, which handily covers all regexes I cared to
    test.
case_sensitive
    Whether the regular expression is case sensitive. Defaults to false. Note
    that acceleration is always case insensitive, which is why the trigram
    index in the example has the lowercase filter. That is important! Without
    it you can't switch freely between case sensitive and case insensitive
    matching.
locale
    Locale used for case conversions. Must match the locale used in the
    lowercase filter of the index. Defaults to Locale.ROOT.
max_determinized_states
    Limits the complexity explosion that comes from compiling Lucene Regular
    Expressions into DFAs. Defaults to 20,000 states. Increasing it allows
    more complex regexes the memory and time they need to compile. The default
    allows for reasonably complex regexes.
max_ngrams_extracted
    The number of ngrams extracted from the regex to accelerate it. If the
    regex contains more than that many ngrams the extras are ignored. Defaults
    to 100, which makes a lot of term filters but it's not too many. Without
    this limit even simple little regexes like /[abc]{20,80}/ would make
    thousands of term filters.
max_ngram_clauses
    Maximum number of boolean clauses generated by the accelerated ngram
    query. max_ngrams_extracted limits the number of ngrams generated, but
    depending on the regex complexity those ngrams may be combined into very
    large boolean clauses. This parameter prevents the accelerated ngram query
    from generating a boolean expression that is too large. Defaults to 1024
    (the BooleanQuery default). If the number of generated clauses exceeds the
    limit, a degraded boolean query may still be attempted.
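As an illustrative sketch (not taken from real usage), a query that spells out several of these tuning parameters might look like this; the values shown are just the documented defaults written explicitly:

curl -XPOST http://localhost:9200/regex_test/test/_search?pretty=true -d '{
    "query": {
        "filtered": {
            "filter": {
                "source_regex": {
                    "field": "test",
                    "regex": "i ca..has",
                    "ngram_field": "test.trigrams",
                    "gram_size": 3,
                    "max_expand": 4,
                    "max_states_traced": 10000,
                    "max_determinized_states": 20000,
                    "max_ngrams_extracted": 100,
                    "max_ngram_clauses": 1024,
                    "case_sensitive": false
                }
            }
        }
    }
}'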
The filter also supports the standard OpenSearch filter option:

_name
    An optional name for the filter, echoed back in each matching hit's
    matched_queries.
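For example (a sketch, assuming _name is passed inside the filter body as with builtin filters):

curl -XPOST http://localhost:9200/regex_test/test/_search?pretty=true -d '{
    "query": {
        "filtered": {
            "filter": {
                "source_regex": {
                    "field": "test",
                    "regex": "i ca..has",
                    "ngram_field": "test.trigrams",
                    "_name": "my_source_regex"
                }
            }
        }
    }
}'
# Matching hits should then carry "matched_queries": ["my_source_regex"].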