Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hyphen in pelias peliasNameTokenizer and peliasStreetTokenizer #375

Merged
merged 1 commit into from
Sep 12, 2019

Conversation

Joxit
Copy link
Member

@Joxit Joxit commented Sep 9, 2019

👋 I did some awesome work for the Pelias project and would love for everyone to have a look at it and provide feedback ❤️.


Here's the reason for this change 🚀

Some streets are written with hyphen, such as 20 boulevard saint-germain, Paris, France. But when we write this street, we can omit the hyphen.

With the current schema, if we search 20 boulevard saint germain, paris, france we have no results. (This is the same issue for 51 Friedrich-Richter-Straße)


Here's what actually got changed 👏

I added the hyphen in peliasNameTokenizer and peliasStreetTokenizer to covert all analyzers.


Here's how others can test the changes 👀

geocode.earth jawg.io
20 boulevard saint germain, paris, france 20 boulevard saint germain, paris, france
51 Friedrich-Richter-Straße 51 Friedrich-Richter-Straße

orangejulius added a commit to pelias/docker that referenced this pull request Sep 9, 2019
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Sep 12, 2019
@orangejulius
Copy link
Member

Looks good in our testing and is definitely overdue for being merged :) Thanks for pushing it through!

@orangejulius orangejulius merged commit 013236a into master Sep 12, 2019
@Joxit Joxit deleted the joxit/hyphen branch September 12, 2019 17:58
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Sep 18, 2019
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Sep 18, 2019
orangejulius added a commit that referenced this pull request Nov 7, 2019
The first error seen when trying to use our current schema with
Elasticsearch 7 is:

```
[illegal_argument_exception] Token filter [word_delimiter] cannot be
used to parse synonyms
```

The [word delimiter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html)
token filter is only used in one place: the `peliasAdmin` analyzer.

Looking at the documentation for `word_delimiter`, it does _a lot_:
splitting words, handling punctuation, and even some basic stemming.

It really feels like an extremely broad tool and at this point feels
like something that Elasticsearch would deprecate in the future.

Furthermore, looking at our integration tests, it seems one of the key
reasons we used it was to tokenize on hyphens, which we have done using
the `peliasNameTokenizer` since
#375.

Considering how complicated this token filter is, and how it's now being
used with relatively little effect, it seems like something we can
remove.

Connects pelias/pelias#831
orangejulius added a commit that referenced this pull request May 20, 2020
The first error seen when trying to use our current schema with
Elasticsearch 7 is:

```
[illegal_argument_exception] Token filter [word_delimiter] cannot be
used to parse synonyms
```

The [word delimiter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html)
token filter is only used in one place: the `peliasAdmin` analyzer.

Looking at the documentation for `word_delimiter`, it does _a lot_:
splitting words, handling punctuation, and even some basic stemming.

It really feels like an extremely broad tool and at this point feels
like something that Elasticsearch would deprecate in the future.

Furthermore, looking at our integration tests, it seems one of the key
reasons we used it was to tokenize on hyphens, which we have done using
the `peliasNameTokenizer` since
#375.

Considering how complicated this token filter is, and how it's now being
used with relatively little effect, it seems like something we can
remove.

Connects pelias/pelias#831
@Joxit Joxit mentioned this pull request Jun 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants