feat(peliasAdmin): Remove word delimiter filter #392

Closed
wants to merge 1 commit into from

Conversation

orangejulius
Member

@orangejulius orangejulius commented Nov 7, 2019

The first error seen when trying to use our current schema with Elasticsearch 7 is:

[illegal_argument_exception] Token filter [word_delimiter] cannot be used to parse synonyms

The word delimiter token filter is only used in one place: the peliasAdmin analyzer.

Looking at the documentation, this token filter does a lot: splitting words, handling punctuation, and even some basic stemming.

It really feels like an extremely convoluted tool and at this point I have a suspicion it is something that Elasticsearch would deprecate in the future.

Furthermore, according to our integration tests, it seems one of the key reasons we used it was to tokenize on hyphens, which we have done using the peliasNameTokenizer since #375.
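To illustrate what "tokenize on hyphens" means here, the following is a minimal Python sketch. It is not the actual `peliasNameTokenizer` definition: the delimiter set (whitespace and hyphens) and the function name are assumptions chosen only for illustration.

```python
import re

# Assumption: split on runs of whitespace and hyphens only.
# The real peliasNameTokenizer pattern may include other characters.
DELIMITERS = re.compile(r"[\s\-]+")

def tokenize(text):
    """Split text on the delimiter pattern and drop empty tokens."""
    return [token for token in DELIMITERS.split(text) if token]

print(tokenize("Stratford-upon-Avon"))
```

If the tokenizer already splits hyphenated names this way, a later token filter that also splits on hyphens has nothing left to do for that case.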

Considering how complicated this token filter is, and how it's now being used with relatively little effect, it seems like something we should remove.

Connects pelias/pelias#831

@orangejulius
Member Author

While I've opened this PR now, I definitely don't want to merge this until ES6 work is pretty much wrapped up, to keep things simple.

@orangejulius orangejulius mentioned this pull request Nov 19, 2019
5 tasks
@missinglink
Member

missinglink commented Nov 19, 2019

I think this filter is a relic from the past and its function has long been lost to the ages.

There are two valid solutions to this error:

  • remove the filter
  • move it below the synonyms filter (down one)

I don't have any preconceived notions about which of those two is correct; however, I think it's important to do an archaeological dig to understand the original intent before we remove it entirely.
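The two options can be sketched as operations on an analyzer's ordered filter list. This is a rough Python illustration, not the real `peliasAdmin` definition: every filter name other than `word_delimiter` is a hypothetical placeholder, and the helper functions are invented for this example. As far as I understand it, the error arises because Elasticsearch 7 parses synonym rules using the filters that precede the synonym filter in the chain, so ordering matters.

```python
# Hypothetical filter chain; only "word_delimiter" comes from the PR.
# "example_synonyms" stands in for whatever synonym filter the analyzer uses.
CURRENT_CHAIN = ["lowercase", "word_delimiter", "example_synonyms", "trim"]

def remove_filter(chain, name):
    """Option 1: drop the filter from the chain entirely."""
    return [f for f in chain if f != name]

def move_down_one(chain, name):
    """Option 2: swap the filter with the one directly after it."""
    result = list(chain)
    i = result.index(name)
    if i < len(result) - 1:
        result[i], result[i + 1] = result[i + 1], result[i]
    return result

print(remove_filter(CURRENT_CHAIN, "word_delimiter"))
print(move_down_one(CURRENT_CHAIN, "word_delimiter"))
```

With option 2, `word_delimiter` no longer precedes the synonym filter, so it would not take part in parsing the synonym rules, but it would still run on tokens at index and query time.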

The first error seen when trying to use our current schema with
Elasticsearch 7 is:

```
[illegal_argument_exception] Token filter [word_delimiter] cannot be
used to parse synonyms
```

The [word delimiter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html)
token filter is only used in one place: the `peliasAdmin` analyzer.

Looking at the documentation for `word_delimiter`, it does _a lot_:
splitting words, handling punctuation, and even some basic stemming.

It really feels like an extremely broad tool and at this point feels
like something that Elasticsearch would deprecate in the future.

Furthermore, looking at our integration tests, it seems one of the key
reasons we used it was to tokenize on hyphens, which we have done using
the `peliasNameTokenizer` since
#375.

Considering how complicated this token filter is, and how it's now being
used with relatively little effect, it seems like something we can
remove.

Connects pelias/pelias#831
@orangejulius orangejulius force-pushed the remove-word_delimiter branch from 4b5dcb4 to 5701484 Compare May 20, 2020 14:23