Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conditional token filter to elasticsearch #31958

Merged
merged 111 commits into from
Sep 5, 2018
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
111 commits
Select commit Hold shift + click to select a range
e5b20de
WIP
romseygeek Jun 29, 2018
da0fd1e
WIP
romseygeek Jul 2, 2018
df9bffc
WIP
romseygeek Jul 4, 2018
402ed36
WIP
romseygeek Jul 11, 2018
fb7c21d
WIP
romseygeek Jul 11, 2018
d8f0170
docs
romseygeek Jul 11, 2018
bcee3f0
tests
romseygeek Jul 11, 2018
bba5939
d'oh
romseygeek Jul 11, 2018
21cd02f
class name change in SPI
romseygeek Jul 11, 2018
4315682
docs
romseygeek Jul 11, 2018
52955df
Broekn
romseygeek Jul 13, 2018
dd139c7
nuke unit test
romseygeek Jul 17, 2018
1d5deff
Merge branch 'master' into scripted-analysis
romseygeek Jul 17, 2018
57a73f2
feedback
romseygeek Jul 17, 2018
609951b
Term -> Token; move ScriptContext into module
romseygeek Jul 17, 2018
801a704
Re-instate link in StringFunctionUtils javadocs
Jul 13, 2018
f923d9c
Docs: Change formatting of Cloud options
clintongormley Jul 13, 2018
a21fb82
Docs: Restyled cloud link in getting started
clintongormley Jul 13, 2018
62fea58
[Rollup] Use composite's missing_bucket (#31402)
polyfractal Jul 13, 2018
35a6774
Test: Fix a second case of bad watch creation
hub-cap Jul 13, 2018
9203e50
Remove deprecated AnalysisPlugin#requriesAnalysisSettings method (#32…
romseygeek Jul 13, 2018
849e690
Add second level of field collapsing (#31808)
mayya-sharipova Jul 13, 2018
de213a8
Mute ML AutodetectMemoryLimitIT#testTooManyPartitions on Windows (#32…
Jul 13, 2018
3bbc8c6
Watcher: cleanup ensureWatchExists use (#31926)
hub-cap Jul 13, 2018
4440df5
Add secure setting for watcher email password (#31620)
hub-cap Jul 13, 2018
2ab7db3
HLRC: Add xpack usage api (#31975)
rjernst Jul 13, 2018
2183fff
Adds a new auto-interval date histogram (#28993)
colings86 Jul 13, 2018
0e7a6b4
lazy snapshot repository initialization (#31606)
vladimirdolzhenko Jul 13, 2018
315999d
Watcher: Make settings reloadable (#31746)
hub-cap Jul 13, 2018
b959534
fix typo
Jul 13, 2018
94d3311
Clean Up Snapshot Create Rest API (#31779)
jdconrad Jul 13, 2018
2945c3a
[Rollup] Histo group config should support scaled_floats (#32048)
polyfractal Jul 13, 2018
e48de6a
Mute failing tests
Jul 13, 2018
ec470f2
Replace Ingest ScriptContext with Custom Interface (#32003)
original-brownbear Jul 13, 2018
f446f91
Add nio http transport to security plugin (#32018)
Tim-Brooks Jul 13, 2018
bd20e99
Fix compile issues introduced by merge (#32058)
Tim-Brooks Jul 14, 2018
71cd43b
SCRIPTING: Remove unused MultiSearchTemplateRequestBuilder (#32049)
original-brownbear Jul 14, 2018
e045ad6
Cleanup Duplication in `PainlessScriptEngine` (#31991)
original-brownbear Jul 14, 2018
dabbba1
Fix broken OpenLDAP Vagrant QA test
tvernum Jul 16, 2018
82e8fce
Turn off real-mem breaker in single node tests
danielmitterdorfer Jul 16, 2018
9d48815
Turn off real-mem breaker in REST tests
danielmitterdorfer Jul 16, 2018
040bc9d
[Test] Mute MlJobIT#testDeleteJobAfterMissingAliases
jimczi Jul 16, 2018
142d24a
Remove unused params from SSource and Walker (#31935)
Jul 16, 2018
ce8b3e3
[Tests] Fix failure due to changes exception message (#32036)
Jul 16, 2018
a1ad7a1
Fix BWC check after backport
jimczi Jul 16, 2018
bbe1b7c
Unmute field collapsing rest tests
jimczi Jul 16, 2018
391641c
Ensure only parent breaker trips in unit test
danielmitterdorfer Jul 16, 2018
6ec52fe
[Rollup] Fix duplicate field names in test (#32075)
jimczi Jul 16, 2018
ced669b
[TEST] Consistent algorithm usage (#32077)
jkakavas Jul 16, 2018
e38e69c
[Rollup] Replace RollupIT with a ESRestTestCase version (#31977)
polyfractal Jul 16, 2018
4a9fbe7
Scripting: Remove dead code from painless module (#32064)
original-brownbear Jul 16, 2018
5f130a2
Painless: Separate PainlessLookup into PainlessLookup and PainlessLoo…
jdconrad Jul 16, 2018
1dd0279
Use correct formatting for links (#29460)
dedemorton Jul 16, 2018
53f029b
Watcher: Store username on watch execution (#31873)
hub-cap Jul 16, 2018
e325526
Tweaked Elasticsearch Service links for SEO
debadair Jul 16, 2018
e514ad0
Tweaked Elasticsearch Service links for SEO
debadair Jul 16, 2018
50d8fa8
[test] turn on host io cache for opensuse (#32053)
andyb-elastic Jul 16, 2018
c0ffec7
DOCS: put LIMIT 10 to the SQL query (#32065)
ahmedakef Jul 16, 2018
ee4ef86
SQL: allow LEFT and RIGHT as function names (#32066)
costin Jul 16, 2018
9371e77
Revert "[test] disable packaging tests for suse boxes"
andyb-elastic Jul 16, 2018
780697f
[Rollup] Add new capabilities endpoint for concrete rollup indices (#…
polyfractal Jul 16, 2018
1106355
Switch non-x-pack to new style requests (#32106)
nik9000 Jul 16, 2018
7ce9926
Bypass highlight query terms extraction on empty fields (#32090)
jimczi Jul 16, 2018
8ff5735
Painless: Move and Rename Several Methods in the lookup package (#32105)
jdconrad Jul 16, 2018
97fbe49
Add Index UUID to `/_stats` Response (#31871)
original-brownbear Jul 17, 2018
ca2844f
[Test] Modify assert statement for ssl handshake (#32072)
bizybot Jul 17, 2018
8bad2c6
Add exclusion option to `keep_types` token filter (#32012)
Jul 17, 2018
89bce93
Fix put mappings java API documentation (#31955)
Jul 17, 2018
c2ee07b
Enable testing in FIPS140 JVM (#31666)
jkakavas Jul 17, 2018
24547e8
Check that client methods match API defined in the REST spec (#31825)
javanna Jul 17, 2018
5a383c2
Mute :qa:mixed-cluster indices.stats/10_index/Index - all’
davidkyle Jul 17, 2018
5bad3a8
Updates the build to gradle 4.9 (#32087)
alpar-t Jul 17, 2018
94330d8
Relax TermVectors API to work with textual fields other than TextFiel…
markharwood Jul 17, 2018
3d0854d
Handle TokenizerFactory TODOs (#32063)
original-brownbear Jul 17, 2018
70d2db3
Ensure to release translog snapshot in primary-replica resync (#32045)
dnhatn Jul 17, 2018
5afea06
[ML] Move analyzer dependencies out of categorization config (#32123)
droberts195 Jul 17, 2018
bb9fae0
[ML] Wait for aliases in multi-node tests (#32086)
davidkyle Jul 17, 2018
b31dc36
Docs: Fix missing example script quote (#32010)
aptxx Jul 17, 2018
a7c8e07
Re-disable packaging tests on suse boxes
andyb-elastic Jul 17, 2018
7490ec6
Remove empty @param from Javadoc
jkakavas Jul 17, 2018
a481ef6
Painless: Fix Bug with Duplicate PainlessClasses (#32110)
jdconrad Jul 17, 2018
346edfa
Build: Move shadow customizations into common code (#32014)
nik9000 Jul 17, 2018
cdf5c8a
Disable C2 from using AVX-512 on JDK 10 (#32138)
jasontedor Jul 17, 2018
a835503
Build: Make additional test deps of check (#32015)
rjernst Jul 17, 2018
ff7ff36
Painless: Add PainlessClassBuilder (#32141)
jdconrad Jul 17, 2018
101458b
Build: Skip jar tests if jar disabled
nik9000 Jul 17, 2018
88c4f6c
Switch distribution to new style Requests (#30595)
nik9000 Jul 18, 2018
8486f24
Remove versionType from translog (#31945)
dnhatn Jul 18, 2018
413d211
ESIndexLevelReplicationTestCase doesn't support replicated failures b…
bleskes Jul 17, 2018
fa16875
[DOCS] Update TLS on Docker for 6.3 (#32114)
Jul 18, 2018
688deeb
Fix `range` queries on `_type` field for singe type indices (#31756)
Jul 18, 2018
ace3771
Term -> Token; deps
romseygeek Jul 18, 2018
84ee20e
Fix CP for namingConventions when gradle home has spaces (#31914)
alpar-t Jul 18, 2018
947fa2e
Fix Java 11 javadoc compile problem
Jul 18, 2018
e67cede
A replica can be promoted and started in one cluster state update (#3…
bleskes Jul 18, 2018
8924ac3
Add EC2 credential test for repository-s3 (#31918)
vladimirdolzhenko Jul 18, 2018
b79dc6a
Add more contexts to painless execute api (#30511)
martijnvg Jul 18, 2018
d225048
use before instead of onOrBefore
martijnvg Jul 18, 2018
997ebe8
Improve docs for search preferences (#32159)
DaveCTurner Jul 18, 2018
f591095
Fix BwC Tests looking for UUID Pre 6.4 (#32158)
original-brownbear Jul 18, 2018
67a4dcb
Call setReferences() on custom referring tokenfilters in _analyze (#3…
romseygeek Jul 18, 2018
93ecf1d
Merge conflicts
romseygeek Jul 18, 2018
945fadf
more docs
romseygeek Jul 18, 2018
303de4f
tests for all script variables
romseygeek Jul 18, 2018
33b9da4
Merge branch 'master' into scripted-analysis
romseygeek Jul 18, 2018
b775f71
Merge branch 'master' into scripted-analysis
romseygeek Aug 21, 2018
1dff0f6
merge error
romseygeek Aug 21, 2018
546aa11
checkstyle
romseygeek Aug 21, 2018
701fbf2
headers
romseygeek Aug 21, 2018
396843f
Use actual painless syntax, not my own made-up syntax
romseygeek Aug 21, 2018
8bc3d65
Merge branch 'master' into scripted-analysis
romseygeek Sep 5, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/reference/analysis/tokenfilters.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,8 @@ include::tokenfilters/word-delimiter-graph-tokenfilter.asciidoc[]

include::tokenfilters/multiplexer-tokenfilter.asciidoc[]

include::tokenfilters/condition-tokenfilter.asciidoc[]

include::tokenfilters/stemmer-tokenfilter.asciidoc[]

include::tokenfilters/stemmer-override-tokenfilter.asciidoc[]
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
[[analysis-condition-tokenfilter]]
=== Conditional Token Filter

The conditional token filter takes a predicate script and a list of subfilters, and
only applies the subfilters to the current token if it matches the predicate.

[float]
=== Options
[horizontal]
filters:: a list of token filters to apply to the current token if the predicate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be explicit that these filters are chained. Maybe call it filter (no plural) for consistency with analyzers?

matches. These can be any token filters defined elsewhere in the index mappings.

script:: a predicate script that determines whether or not the filters will be applied
to the current token. Note that only inline scripts are supported

[float]
=== Settings example

You can set it up like:

[source,js]
--------------------------------------------------
PUT /condition_example
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : [ "my_condition" ]
}
},
"filter" : {
"my_condition" : {
"type" : "condition",
"filters" : [ "lowercase" ],
"script" : {
"source" : "return term.term.length() < 5" <1>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe you need the return here and I think it'd read a little better without it.

}
}
}
}
}
}
--------------------------------------------------
// CONSOLE

<1> This will only apply the lowercase filter to terms that are less than 5
characters in length

And test it like:

[source,js]
--------------------------------------------------
POST /condition_example/_analyze
{
"analyzer" : "my_analyzer",
"text" : "What Flapdoodle"
}
--------------------------------------------------
// CONSOLE
// TEST[continued]

And it'd respond:

[source,js]
--------------------------------------------------
{
"tokens": [
{
"token": "what", <1>
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "Flapdoodle", <2>
"start_offset": 5,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 0
}
]
}
--------------------------------------------------
// TESTRESPONSE
<1> The term `What` has been lowercased, because it is only 4 characters long
<2> The term `Flapdoodle` has been left in its original case, because it doesn't pass
the predicate
Original file line number Diff line number Diff line change
Expand Up @@ -111,9 +111,16 @@
import org.apache.lucene.analysis.tr.ApostropheFilter;
import org.apache.lucene.analysis.tr.TurkishAnalyzer;
import org.apache.lucene.analysis.util.ElisionFilter;
import org.apache.lucene.util.SetOnce;
import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.io.stream.NamedWriteableRegistry;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.logging.Loggers;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.NodeEnvironment;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.CharFilterFactory;
import org.elasticsearch.index.analysis.PreBuiltAnalyzerProviderFactory;
Expand All @@ -127,10 +134,15 @@
import org.elasticsearch.indices.analysis.PreBuiltCacheFactory.CachingStrategy;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.script.ScriptService;
import org.elasticsearch.threadpool.ThreadPool;
import org.elasticsearch.watcher.ResourceWatcherService;
import org.tartarus.snowball.ext.DutchStemmer;
import org.tartarus.snowball.ext.FrenchStemmer;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
Expand All @@ -141,6 +153,14 @@ public class CommonAnalysisPlugin extends Plugin implements AnalysisPlugin {

private static final DeprecationLogger DEPRECATION_LOGGER = new DeprecationLogger(Loggers.getLogger(CommonAnalysisPlugin.class));

private final SetOnce<ScriptService> scriptService = new SetOnce<>();

@Override
public Collection<Object> createComponents(Client client, ClusterService clusterService, ThreadPool threadPool, ResourceWatcherService resourceWatcherService, ScriptService scriptService, NamedXContentRegistry xContentRegistry, Environment environment, NodeEnvironment nodeEnvironment, NamedWriteableRegistry namedWriteableRegistry) {
this.scriptService.set(scriptService);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where're never very happy with using SetOnce like this. It gets the job done but it reaks guice's "everything depends on everything"-ness that we've worked so hard to remove over the years. Not that I have anything better though.

return Collections.emptyList();
}

@Override
public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> analyzers = new TreeMap<>();
Expand Down Expand Up @@ -202,6 +222,8 @@ public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
filters.put("classic", ClassicFilterFactory::new);
filters.put("czech_stem", CzechStemTokenFilterFactory::new);
filters.put("common_grams", requriesAnalysisSettings(CommonGramsTokenFilterFactory::new));
filters.put("condition",
requriesAnalysisSettings((i, e, n, s) -> new ScriptedConditionTokenFilterFactory(i, n, s, scriptService.get())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires has a typo?

Copy link
Contributor Author

@romseygeek romseygeek Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pre-existing typo; I'll open a separate PR to fix it

filters.put("decimal_digit", DecimalDigitFilterFactory::new);
filters.put("delimited_payload_filter", LegacyDelimitedPayloadTokenFilterFactory::new);
filters.put("delimited_payload", DelimitedPayloadTokenFilterFactory::new);
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
package org.elasticsearch.analysis.common;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ConditionalTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.ReferringFilterFactory;
import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.script.AnalysisPredicateScript;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptService;
import org.elasticsearch.script.ScriptType;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
* A factory for a conditional token filter that only applies child filters if the underlying token
* matches an {@link AnalysisPredicateScript}
*/
public class ScriptedConditionTokenFilterFactory extends AbstractTokenFilterFactory implements ReferringFilterFactory {

private final AnalysisPredicateScript.Factory factory;
private final List<TokenFilterFactory> filters = new ArrayList<>();
private final List<String> filterNames;

ScriptedConditionTokenFilterFactory(IndexSettings indexSettings, String name,
Settings settings, ScriptService scriptService) {
super(indexSettings, name, settings);

Settings scriptSettings = settings.getAsSettings("script");
Script script = Script.parse(scriptSettings);
if (script.getType() != ScriptType.INLINE) {
throw new IllegalArgumentException("Cannot use stored scripts in tokenfilter [" + name + "]");
}
this.factory = scriptService.compile(script, AnalysisPredicateScript.CONTEXT);

this.filterNames = settings.getAsList("filters");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we fail on an empty list?

}

@Override
public TokenStream create(TokenStream tokenStream) {
Function<TokenStream, TokenStream> filter = in -> {
for (TokenFilterFactory tff : filters) {
in = tff.create(in);
}
return in;
};
AnalysisPredicateScript script = factory.newInstance();
final AnalysisPredicateScript.Term term = new AnalysisPredicateScript.Term();
return new ConditionalTokenFilter(tokenStream, filter) {

CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

@Override
protected boolean shouldFilter() {
term.term = termAtt;
term.posInc = posIncAtt.getPositionIncrement();
term.posLen = posLenAtt.getPositionLength();
term.startOffset = offsetAtt.startOffset();
term.endOffset = offsetAtt.endOffset();
term.type = typeAtt.type();
term.isKeyword = keywordAtt.isKeyword();
return script.execute(term);
}
};
}

@Override
public void setReferences(Map<String, TokenFilterFactory> factories) {
for (String filter : filterNames) {
TokenFilterFactory tff = factories.get(filter);
if (tff == null) {
throw new IllegalArgumentException("ScriptedConditionTokenFilter [" + name() +
"] refers to undefined token filter [" + filter + "]");
}
filters.add(tff);
}
}

}
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
package org.elasticsearch.analysis.common;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.TestEnvironment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.IndexAnalyzers;
import org.elasticsearch.index.analysis.NamedAnalyzer;
import org.elasticsearch.indices.analysis.AnalysisModule;
import org.elasticsearch.script.AnalysisPredicateScript;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptService;
import org.elasticsearch.test.ESTokenStreamTestCase;
import org.elasticsearch.test.IndexSettingsModule;

import java.util.Collections;

public class ScriptedConditionTokenFilterTests extends ESTokenStreamTestCase {

public void testSimpleCondition() throws Exception {
Settings settings = Settings.builder()
.put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString())
.build();
Settings indexSettings = Settings.builder()
.put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT)
.put("index.analysis.filter.cond.type", "condition")
.put("index.analysis.filter.cond.script.source", "return \"two\".equals(term.term)")
.putList("index.analysis.filter.cond.filters", "uppercase")
.put("index.analysis.analyzer.myAnalyzer.type", "custom")
.put("index.analysis.analyzer.myAnalyzer.tokenizer", "standard")
.putList("index.analysis.analyzer.myAnalyzer.filter", "cond")
.build();
IndexSettings idxSettings = IndexSettingsModule.newIndexSettings("index", indexSettings);

AnalysisPredicateScript.Factory factory = () -> new AnalysisPredicateScript() {
@Override
public boolean execute(Term term) {
return "two".contentEquals(term.term);
}
};

ScriptService scriptService = new ScriptService(indexSettings, Collections.emptyMap(), Collections.emptyMap()){
@Override
public <FactoryType> FactoryType compile(Script script, ScriptContext<FactoryType> context) {
assertEquals(context, AnalysisPredicateScript.CONTEXT);
assertEquals(new Script("return \"two\".equals(term.term)"), script);
return (FactoryType) factory;
}
};

CommonAnalysisPlugin plugin = new CommonAnalysisPlugin();
plugin.createComponents(null, null, null, null, scriptService, null, null, null, null);
AnalysisModule module
= new AnalysisModule(TestEnvironment.newEnvironment(settings), Collections.singletonList(plugin));

IndexAnalyzers analyzers = module.getAnalysisRegistry().build(idxSettings);

try (NamedAnalyzer analyzer = analyzers.get("myAnalyzer")) {
assertNotNull(analyzer);
assertAnalyzesTo(analyzer, "one two three", new String[]{
"one", "TWO", "three"
});
}

}

}
Original file line number Diff line number Diff line change
Expand Up @@ -175,3 +175,13 @@ class org.elasticsearch.index.similarity.ScriptedSimilarity$Doc {
int getLength()
float getFreq()
}

class org.elasticsearch.script.AnalysisPredicateScript$Term {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be called Token rather than Term? It might be just me but I feel like it better carries the information that there are other attributes here like positions and offsets.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should look at creating a new context and adding this to the whitelist of that context. See `plugins/example/painless-whitelist' for some of that.

CharSequence term
int posInc
int posLen
int startOffset
int endOffset
String type
boolean isKeyword
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm exposing this information directly at the moment, but it might be better to expose getters instead so that consumers don't try and do things like change offsets or position increments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to exposing getters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should we expoe the absolute position rather than the position increment? This looks more useful to me for filtering. Or both?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Painless doesn't require the get part of the getter anyway so I'd prefer getters.

Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
package org.elasticsearch.painless;

import org.elasticsearch.painless.spi.Whitelist;
import org.elasticsearch.script.AnalysisPredicateScript;
import org.elasticsearch.script.ScriptContext;

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnalysisScriptTests extends ScriptTestCase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be in your plugin instead of painless if we can manage it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow - do you mean moving the analysis script context out of core and into the common analysis plugin, or do you mean creating an entirely new plugin just for this token filter (and any other script-based filters we come up with)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean I feel like it should live in the common analysis plugin and not in the server at all. Since it is a new thing I figure we should isolate it as much as we can so we know when stuff is using it. Not because it is dirty or anything, just because a smaller core is easier to reason about. And keeping this entirely within the analysis-common plugin helps to prove out the work that we've done for plugins extending painless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tricky part here will be unit testing, as ScriptTestCase is a painless-specific class. The painless extension module uses rest tests only, but I'd much rather have compiled unit tests here. Maybe we should build a separate painless-testing module, that can be used for tests elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we try to unit test this sort of thing without painless and use a mock script engine instead that just calls code on the test class. It is totally reasonable to have a few integration tests that do use painless though.


@Override
protected Map<ScriptContext<?>, List<Whitelist>> scriptContexts() {
Map<ScriptContext<?>, List<Whitelist>> contexts = new HashMap<>();
contexts.put(AnalysisPredicateScript.CONTEXT, Whitelist.BASE_WHITELISTS);
return contexts;
}

public void testAnalysisScript() {
AnalysisPredicateScript.Factory factory = scriptEngine.compile("test", "return \"one\".contentEquals(term.term)",
AnalysisPredicateScript.CONTEXT, Collections.emptyMap());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to use contentEquals is a bit trappy here as well, but I want to avoid creating Strings on every call to incrementToken().


AnalysisPredicateScript script = factory.newInstance();
AnalysisPredicateScript.Term term = new AnalysisPredicateScript.Term();
term.term = "one";
assertTrue(script.execute(term));
term.term = "two";
assertFalse(script.execute(term));
}
}
Loading