Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update street synonyms #446

Closed
wants to merge 3 commits into from
Closed

update street synonyms #446

wants to merge 3 commits into from

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Jun 12, 2020

This WIP PR is to improve the synonyms specifically for USA streets.
Motivated by the following matching failures:

Boul === Boulevard
Crt === Court
Ter === Terrace

I've added the USPS synonyms list which should cover a large amount of USA street names.

I'd like to see the effect of this on a full-planet build before deciding if this is something we might consider doing more of, for other languages and classes of synonyms (such as Gray<>Grey and 1st<>First).

@missinglink
Copy link
Member Author

I'm still not sure how comfortable I am with some of these synonyms which come from libpostal:

eg:

l,lane
road,raod

@missinglink missinglink mentioned this pull request Jun 12, 2020
@missinglink missinglink force-pushed the updated_street_synonyms branch from d3e26a5 to eb55831 Compare June 12, 2020 16:40
@missinglink
Copy link
Member Author

rebased on top of #447

rise, ri
riverway, rvwy
riviera, rvra
road, rd, ro, r, roa, raod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe we leave out any single character synonyms for now?

And definitely typos should not be synonyms.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added another PR to remove a bunch of single char synonyms (although I'm not sure if they are such a big issue?)
Also removed some synonyms from the libpostal list which appeared to only be there for the purpose of spelling correction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a recent issue opened specifically suggesting adding a synonym for typos #443

Let's chat about this some more, for now I'm going to remove them.

@@ -68,7 +72,9 @@ function generate(){
"icu_folding",
"trim",
"custom_name",
"street_suffix",
"street_synonyms_en",
"street_synonyms_usps",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about having something with a regex like street_synonyms_* and will fetch all the files ?
This may help for #393 or if we need to add custom street_synonyms by lang.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, we should rethink these a bit and make them easier to customise.

Right now we have fairly limited synonyms coverage (ie. 'just the basics').
I'd like to explore what would happen if we added significantly more synonyms, I suspect that the impact would be very positive for search quality and hopefully doesn't affect performance much.

@Joxit could you please tell me a bit about how you're adding custom synonyms for your builds?

I'm guessing you're adding a bunch of French-specific and some other ones which you include in the custom_street.txt file?

Pending successful testing of this I would next like to expand on French, German and Spanish synonyms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit annoying but better since #375 and #407

I have a fork joxit/pelias-schema, when I start a new build, I fetch the upstream and add this commit.

For the record, I added cutom_street in peliasPhrase and peliasQuery in build-2019-09 (before pelias/parser release). peliasQuery is used for autocomplete where we can find streets names... And peliasPhrase because street_suffix is also present.

I add my synonyms in synonyms/custom_street.txt

bd, bld, blvd, boul, bvd, boulevard #only bld is missing in this PR
ave, av, avenue # OK with this PR
saint, st, street # missing
sainte, ste, suite # missing

This list was created from our common uses (when we test the pelias). I have always been afraid of slowing down queries with too many synonyms 😅.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you possibly know of an official source which publishes a French equivalent of this?
https://pe.usps.com/text/pub28/28apc_002.htm

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #449 to hopefully clean up your build process a little.

Ideally you shouldn't have to do any synonyms work in peliasQuery, I think you can remove that and you'll get a small perf benefit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this one : https://blog.bureaudeposte.net/rediger-une-adresse-aux-normes-postales/ for abbreviations

Name abbreviation
Allée ALL
Avenue AV
Boulevard BD
Carrefour CARR
Chaussée CHS
Chemin CHEM
Cours CRS
Faubourg FG
Immeuble IMM
Impasse IMP
Lieu-dit LD
Lotissement LOT
Montée MTE
Passage PAS
Place PL
Résidence RES
Route RTE
Ruelle RLE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next time I will check differences between synonyms in peliasQuery and #449

Oh, and saint/sainte aren't for street suffix/prefix but a workaround for street names... 😅

block, blk, blck
bluff, blf, bluf, bluffs, blfs
boardwalk, bwk, bwlk, board walk
boulevard, blvd, bd, bde, blv, bl, blvde, blvrd, boulavard, boul, boulv, bvd, boulevarde
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one will help #301 and the example bd saint-martin

@missinglink missinglink force-pushed the updated_street_synonyms branch from eb55831 to f0aa06b Compare June 15, 2020 11:07
…ynonyms which seem to only be present for the purpose of spelling correction
@missinglink
Copy link
Member Author

Okay, I'm going to run this as a full-planet build and see what effect it has on the acceptance tests.

I'm particularly interested to see if adding loads of synonyms is feasible from quality/perf perspective 🤷

@blackmad
Copy link
Contributor

Thanks for working on this, Peter!

@missinglink
Copy link
Member Author

Looks pretty good to me 🎉
https://www.diffchecker.com/qlBPvxAO

orangejulius added a commit to pelias/api that referenced this pull request Jun 16, 2020
This is an experimental change, but one that has good impact when paired
with pelias/schema#446.

Essentially, this change continues the trend we have started for the
`name.*` and `phrase.*` fields to generate synonyms _only_ at index
time.

It may be the case that variations on input vs data text (for example
`crt` vs `ct` vs `court`) may cause different synonyms to be generated
by the same analyzer.

Many queries, especially `match_phrase` queries, will require that _all_
of those generated synonym tokens must match. This is often not
desirable.
@missinglink
Copy link
Member Author

I've removed all the multi-word synonyms in 5852086, these are potentially dangerous, especially when used at query time (such as is the case right now for peliasStreet fields)

@orangejulius
Copy link
Member

One additional synonym we might want to add here is st/saint. This would fix pelias/pelias#737

We can also do it as a followup PR if that's easier, but let's make sure not to forget :)

missinglink pushed a commit to pelias/api that referenced this pull request Jun 23, 2020
This is an experimental change, but one that has good impact when paired
with pelias/schema#446.

Essentially, this change continues the trend we have started for the
`name.*` and `phrase.*` fields to generate synonyms _only_ at index
time.

It may be the case that variations on input vs data text (for example
`crt` vs `ct` vs `court`) may cause different synonyms to be generated
by the same analyzer.

Many queries, especially `match_phrase` queries, will require that _all_
of those generated synonym tokens must match. This is often not
desirable.
orangejulius added a commit to pelias/api that referenced this pull request Jun 25, 2020
This is an experimental change, but one that has good impact when paired
with pelias/schema#446.

Essentially, this change continues the trend we have started for the
`name.*` and `phrase.*` fields to generate synonyms _only_ at index
time.

It may be the case that variations on input vs data text (for example
`crt` vs `ct` vs `court`) may cause different synonyms to be generated
by the same analyzer.

Many queries, especially `match_phrase` queries, will require that _all_
of those generated synonym tokens must match. This is often not
desirable.
orangejulius added a commit to pelias/api that referenced this pull request Jun 26, 2020
This is an experimental change, but one that has good impact when paired
with pelias/schema#446.

Essentially, this change continues the trend we have started for the
`name.*` and `phrase.*` fields to generate synonyms _only_ at index
time.

It may be the case that variations on input vs data text (for example
`crt` vs `ct` vs `court`) may cause different synonyms to be generated
by the same analyzer.

Many queries, especially `match_phrase` queries, will require that _all_
of those generated synonym tokens must match. This is often not
desirable.
orangejulius added a commit to pelias/api that referenced this pull request Jun 29, 2020
This is an experimental change, but one that has good impact when paired
with pelias/schema#446.

Essentially, this change continues the trend we have started for the
`name.*` and `phrase.*` fields to generate synonyms _only_ at index
time.

It may be the case that variations on input vs data text (for example
`crt` vs `ct` vs `court`) may cause different synonyms to be generated
by the same analyzer.

Many queries, especially `match_phrase` queries, will require that _all_
of those generated synonym tokens must match. This is often not
desirable.
orangejulius added a commit to pelias/api that referenced this pull request Jun 29, 2020
This is an experimental change, but one that has good impact when paired
with pelias/schema#446.

Essentially, this change continues the trend we have started for the
`name.*` and `phrase.*` fields to generate synonyms _only_ at index
time.

It may be the case that variations on input vs data text (for example
`crt` vs `ct` vs `court`) may cause different synonyms to be generated
by the same analyzer.

Many queries, especially `match_phrase` queries, will require that _all_
of those generated synonym tokens must match. This is often not
desirable.
@orangejulius
Copy link
Member

This PR has been replaced by #453

@Joxit Joxit deleted the updated_street_synonyms branch July 13, 2020 21:23
bboure pushed a commit to tamaringoapp/api that referenced this pull request Sep 22, 2020
This is an experimental change, but one that has good impact when paired
with pelias/schema#446.

Essentially, this change continues the trend we have started for the
`name.*` and `phrase.*` fields to generate synonyms _only_ at index
time.

It may be the case that variations on input vs data text (for example
`crt` vs `ct` vs `court`) may cause different synonyms to be generated
by the same analyzer.

Many queries, especially `match_phrase` queries, will require that _all_
of those generated synonym tokens must match. This is often not
desirable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants