Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conditional token filter to elasticsearch #31958

Merged
merged 111 commits into from
Sep 5, 2018
Merged

Add conditional token filter to elasticsearch #31958

merged 111 commits into from
Sep 5, 2018

Conversation

romseygeek
Copy link
Contributor

Lucene has a ConditionTokenFilter which allows you to selectively apply tokenfilters, depending on the state of the current token in the tokenstream. This commit exposes this functionality in elasticsearch, adding a new AnalysisPredicateScript context.

@romseygeek romseygeek self-assigned this Jul 11, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@romseygeek
Copy link
Contributor Author

This is a first step adding the analysis script context, and illustrating how to use it with the conditional token filter. I'd also like to look at adding scriptable filtering filters (ie, 'ignore this token if it matches a predicate'), and possibly the ability to make changes to the token itself.

@romseygeek romseygeek requested review from nik9000 and jpountz July 11, 2018 11:10
int endOffset
String type
boolean isKeyword
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm exposing this information directly at the moment, but it might be better to expose getters instead so that consumers don't try and do things like change offsets or position increments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to exposing getters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should we expoe the absolute position rather than the position increment? This looks more useful to me for filtering. Or both?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Painless doesn't require the get part of the getter anyway so I'd prefer getters.


public void testAnalysisScript() {
AnalysisPredicateScript.Factory factory = scriptEngine.compile("test", "return \"one\".contentEquals(term.term)",
AnalysisPredicateScript.CONTEXT, Collections.emptyMap());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to use contentEquals is a bit trappy here as well, but I want to avoid creating Strings on every call to incrementToken().

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but like it overall.

int endOffset
String type
boolean isKeyword
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to exposing getters

int endOffset
String type
boolean isKeyword
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should we expoe the absolute position rather than the position increment? This looks more useful to me for filtering. Or both?

@@ -175,3 +175,13 @@ class org.elasticsearch.index.similarity.ScriptedSimilarity$Doc {
int getLength()
float getFreq()
}

class org.elasticsearch.script.AnalysisPredicateScript$Term {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be called Token rather than Term? It might be just me but I feel like it better carries the information that there are other attributes here like positions and offsets.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should look at creating a new context and adding this to the whitelist of that context. See `plugins/example/painless-whitelist' for some of that.

@@ -202,6 +222,8 @@
filters.put("classic", ClassicFilterFactory::new);
filters.put("czech_stem", CzechStemTokenFilterFactory::new);
filters.put("common_grams", requriesAnalysisSettings(CommonGramsTokenFilterFactory::new));
filters.put("condition",
requriesAnalysisSettings((i, e, n, s) -> new ScriptedConditionTokenFilterFactory(i, n, s, scriptService.get())));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires has a typo?

Copy link
Contributor Author

@romseygeek romseygeek Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pre-existing typo; I'll open a separate PR to fix it

[float]
=== Options
[horizontal]
filters:: a list of token filters to apply to the current token if the predicate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be explicit that these filters are chained. Maybe call it filter (no plural) for consistency with analyzers?

}
this.factory = scriptService.compile(script, AnalysisPredicateScript.CONTEXT);

this.filterNames = settings.getAsList("filters");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we fail on an empty list?

"type" : "condition",
"filters" : [ "lowercase" ],
"script" : {
"source" : "return term.term.length() < 5" <1>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe you need the return here and I think it'd read a little better without it.


@Override
public Collection<Object> createComponents(Client client, ClusterService clusterService, ThreadPool threadPool, ResourceWatcherService resourceWatcherService, ScriptService scriptService, NamedXContentRegistry xContentRegistry, Environment environment, NodeEnvironment nodeEnvironment, NamedWriteableRegistry namedWriteableRegistry) {
this.scriptService.set(scriptService);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where're never very happy with using SetOnce like this. It gets the job done but it reaks guice's "everything depends on everything"-ness that we've worked so hard to remove over the years. Not that I have anything better though.

@@ -175,3 +175,13 @@ class org.elasticsearch.index.similarity.ScriptedSimilarity$Doc {
int getLength()
float getFreq()
}

class org.elasticsearch.script.AnalysisPredicateScript$Term {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should look at creating a new context and adding this to the whitelist of that context. See `plugins/example/painless-whitelist' for some of that.

int endOffset
String type
boolean isKeyword
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Painless doesn't require the get part of the getter anyway so I'd prefer getters.

import java.util.List;
import java.util.Map;

public class AnalysisScriptTests extends ScriptTestCase {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be in your plugin instead of painless if we can manage it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow - do you mean moving the analysis script context out of core and into the common analysis plugin, or do you mean creating an entirely new plugin just for this token filter (and any other script-based filters we come up with)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean I feel like it should live in the common analysis plugin and not in the server at all. Since it is a new thing I figure we should isolate it as much as we can so we know when stuff is using it. Not because it is dirty or anything, just because a smaller core is easier to reason about. And keeping this entirely within the analysis-common plugin helps to prove out the work that we've done for plugins extending painless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tricky part here will be unit testing, as ScriptTestCase is a painless-specific class. The painless extension module uses rest tests only, but I'd much rather have compiled unit tests here. Maybe we should build a separate painless-testing module, that can be used for tests elsewhere?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we try to unit test this sort of thing without painless and use a mock script engine instead that just calls code on the test class. It is totally reasonable to have a few integration tests that do use painless though.

bleskes and others added 14 commits July 18, 2018 15:48
…2042)

When a replica is fully recovered (i.e., in `POST_RECOVERY` state) we send a request to the master
to start the shard. The master changes the state of the replica and publishes a cluster state to that
effect. In certain cases, that cluster state can be processed on the node hosting the replica
*together* with a cluster state that promotes that, now started, replica to a primary. This can
happen due to cluster state batched processing or if the master died after having committed the
cluster state that starts the shard but before publishing it to the node with the replica. If the master
also held the primary shard, the new master node will remove the primary (as it failed) and will also
immediately promote the replica (thinking it is started). 

Sadly our code in IndexShard didn't allow for this which caused [assertions](https://github.com/elastic/elasticsearch/blob/13917162ad5c59a96ccb4d6a81a5044546c45c22/server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java#L482) to be tripped in some of our tests runs.
Add EC2 credential test for repository-s3

Relates to #26913
This change adds two contexts the execute scripts against:

* SEARCH_SCRIPT: Allows to run scripts in a search script context.
This context is used in `function_score` query's script function,
script fields, script sorting and `terms_set` query.

* FILTER_SCRIPT: Allows to run scripts in a filter script context.
This context is used in the `script` query.

In both contexts a index name needs to be specified and a sample document.
The document is needed to create an in-memory index that the script can
access via the `doc[...]` and other notations. The index name is needed
because a mapping is needed to index the document.

Examples:

```
POST /_scripts/painless/_execute
{
  "script": {
    "source": "doc['field'].value.length()"
  },
  "context" : {
    "search_script": {
      "document": {
        "field": "four"
      },
      "index": "my-index"
    }
  }
}
```

Returns:

```
{
  "result": 4
}
```

POST /_scripts/painless/_execute
{
  "script": {
    "source": "doc['field'].value.length() <= params.max_length",
    "params": {
      "max_length": 4
    }
  },
  "context" : {
    "filter_script": {
      "document": {
        "field": "four"
      },
      "index": "my-index"
    }
  }
}

Returns:

```
{
  "result": true
}
```

Also changed PainlessExecuteAction.TransportAction to use TransportSingleShardAction
instead of HandledAction, because now in case score or filter contexts are used
the request needs to be redirected to a node that has an active IndexService
for the index being referenced (a node with a shard copy for that index).
Today it is unclear what guarantees are offered by the search preference
feature, and we claim a guarantee that is stronger than what we really offer:

> A custom value will be used to guarantee that the same shards will be used
> for the same custom value.

This commit clarifies this documentation.

Forward-port of #32098 to `master`.
* UUID field was added for #31791 and only went into 6.4 and 7.0
* Fixes #32119
…2157)

When building custom tokenfilters without an index in the _analyze endpoint,
we need to ensure that referring filters are correctly built by calling
their #setReferences() method

Fixes #32154
@romseygeek romseygeek added v6.5.0 and removed v6.4.0 labels Aug 21, 2018
@romseygeek
Copy link
Contributor Author

@nik9000 the fix for ScriptContexts in modules is now in, so I think this is ready for re-review.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}

@Override
@SuppressWarnings("rawtypes") // TODO ScriptPlugin needs to change this to pass precommit?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to leave this TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a backwards breaking change? ScriptPlugin#getContexts() needs to change return value from List<ScriptContext> to List<ScriptContext<?>>. Which ought to be done, but it's a separate change. I'll open an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opening an issue makes sense. Yeah, it is a separate thing.

@romseygeek romseygeek merged commit 6364427 into elastic:master Sep 5, 2018
@romseygeek romseygeek deleted the scripted-analysis branch September 5, 2018 13:52
romseygeek added a commit that referenced this pull request Sep 5, 2018
This allows tokenfilters to be applied selectively, depending on the status of
the current token in the tokenstream.  The filter takes a scripted predicate,
and only applies its subfilter when the predicate returns true.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 5, 2018
* master:
  Fix deprecated setting specializations (elastic#33412)
  HLRC: split cluster request converters (elastic#33400)
  HLRC: Add ML get influencers API (elastic#33389)
  Add conditional token filter to elasticsearch (elastic#31958)
  Build: Merge xpack checkstyle config into core (elastic#33399)
  Disable IndexRecoveryIT.testRerouteRecovery.
  INGEST: Implement Drop Processor (elastic#32278)
  [ML] Add field stats to log structure finder (elastic#33351)
  Add interval response parameter to AutoDateInterval histogram (elastic#33254)
  MINOR+CORE: Remove Dead Methods ClusterService (elastic#33346)
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.