update street synonyms #446

missinglink · 2020-06-12T14:37:41Z

This WIP PR is to improve the synonyms specifically for USA streets.
Motivated by the following matching failures:

Boul === Boulevard
Crt === Court
Ter === Terrace

I've added the USPS synonyms list which should cover a large amount of USA street names.

I'd like to see the effect of this on a full-planet build before deciding if this is something we might consider doing more of, for other languages and classes of synonyms (such as Gray<>Grey and 1st<>First).

missinglink · 2020-06-12T14:46:02Z

I'm still not sure how comfortable I am with some of these synonyms which come from libpostal:

eg:

l,lane
road,raod

missinglink · 2020-06-12T16:40:56Z

rebased on top of #447

orangejulius · 2020-06-12T16:58:20Z

synonyms/street_synonyms_en.txt

+rise, ri
+riverway, rvwy
+riviera, rvra
+road, rd, ro, r, roa, raod


Yeah, maybe we leave out any single character synonyms for now?

And definitely typos should not be synonyms.

I've added another PR to remove a bunch of single char synonyms (although I'm not sure if they are such a big issue?)
Also removed some synonyms from the libpostal list which appeared to only be there for the purpose of spelling correction.

There is a recent issue opened specifically suggesting adding a synonym for typos #443

Let's chat about this some more, for now I'm going to remove them.

Joxit · 2020-06-13T08:25:08Z

settings.js

@@ -68,7 +72,9 @@ function generate(){
            "icu_folding",
            "trim",
            "custom_name",
-            "street_suffix",
+            "street_synonyms_en",
+            "street_synonyms_usps",


How about having something with a regex like street_synonyms_* and will fetch all the files ?
This may help for #393 or if we need to add custom street_synonyms by lang.

Yeah good point, we should rethink these a bit and make them easier to customise.

Right now we have fairly limited synonyms coverage (ie. 'just the basics').
I'd like to explore what would happen if we added significantly more synonyms, I suspect that the impact would be very positive for search quality and hopefully doesn't affect performance much.

@Joxit could you please tell me a bit about how you're adding custom synonyms for your builds?

I'm guessing you're adding a bunch of French-specific and some other ones which you include in the custom_street.txt file?

Pending successful testing of this I would next like to expand on French, German and Spanish synonyms.

This is a bit annoying but better since #375 and #407

I have a fork joxit/pelias-schema, when I start a new build, I fetch the upstream and add this commit.

For the record, I added cutom_street in peliasPhrase and peliasQuery in build-2019-09 (before pelias/parser release). peliasQuery is used for autocomplete where we can find streets names... And peliasPhrase because street_suffix is also present.

I add my synonyms in synonyms/custom_street.txt

bd, bld, blvd, boul, bvd, boulevard #only bld is missing in this PR ave, av, avenue # OK with this PR saint, st, street # missing sainte, ste, suite # missing

This list was created from our common uses (when we test the pelias). I have always been afraid of slowing down queries with too many synonyms 😅.

Do you possibly know of an official source which publishes a French equivalent of this?
https://pe.usps.com/text/pub28/28apc_002.htm

I've opened #449 to hopefully clean up your build process a little.

Ideally you shouldn't have to do any synonyms work in peliasQuery, I think you can remove that and you'll get a small perf benefit?

I found this one : https://blog.bureaudeposte.net/rediger-une-adresse-aux-normes-postales/ for abbreviations

Name abbreviation

Allée ALL

Avenue AV

Boulevard BD

Carrefour CARR

Chaussée CHS

Chemin CHEM

Cours CRS

Faubourg FG

Immeuble IMM

Impasse IMP

Lieu-dit LD

Lotissement LOT

Montée MTE

Passage PAS

Place PL

Résidence RES

Route RTE

Ruelle RLE

Next time I will check differences between synonyms in peliasQuery and #449

Oh, and saint/sainte aren't for street suffix/prefix but a workaround for street names... 😅

Joxit · 2020-06-13T08:34:46Z

synonyms/street_synonyms_en.txt

+block, blk, blck
+bluff, blf, bluf, bluffs, blfs
+boardwalk, bwk, bwlk, board walk
+boulevard, blvd, bd, bde, blv, bl, blvde, blvrd, boulavard, boul, boulv, bvd, boulevarde


This one will help #301 and the example bd saint-martin

…ynonyms which seem to only be present for the purpose of spelling correction

missinglink · 2020-06-15T12:55:18Z

Okay, I'm going to run this as a full-planet build and see what effect it has on the acceptance tests.

I'm particularly interested to see if adding loads of synonyms is feasible from quality/perf perspective 🤷

blackmad · 2020-06-15T14:29:30Z

Thanks for working on this, Peter!

missinglink · 2020-06-16T11:42:18Z

Looks pretty good to me 🎉
https://www.diffchecker.com/qlBPvxAO

This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.

missinglink · 2020-06-16T18:33:56Z

I've removed all the multi-word synonyms in 5852086, these are potentially dangerous, especially when used at query time (such as is the case right now for peliasStreet fields)

orangejulius · 2020-06-16T20:25:39Z

One additional synonym we might want to add here is st/saint. This would fix pelias/pelias#737

We can also do it as a followup PR if that's easier, but let's make sure not to forget :)

This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.

orangejulius · 2020-07-13T21:06:57Z

This PR has been replaced by #453

This is an experimental change, but one that has good impact when paired with pelias/schema#446. Essentially, this change continues the trend we have started for the `name.*` and `phrase.*` fields to generate synonyms _only_ at index time. It may be the case that variations on input vs data text (for example `crt` vs `ct` vs `court`) may cause different synonyms to be generated by the same analyzer. Many queries, especially `match_phrase` queries, will require that _all_ of those generated synonym tokens must match. This is often not desirable.

missinglink mentioned this pull request Jun 12, 2020

add synonyms linter #447

Merged

missinglink force-pushed the updated_street_synonyms branch from d3e26a5 to eb55831 Compare June 12, 2020 16:40

orangejulius reviewed Jun 12, 2020

View reviewed changes

Joxit reviewed Jun 13, 2020

View reviewed changes

feat(streets): update street synonyms

f0aa06b

missinglink force-pushed the updated_street_synonyms branch from eb55831 to f0aa06b Compare June 15, 2020 11:07

feat(streets): remove all single-char synonyms except directionals, s…

60cfb96

…ynonyms which seem to only be present for the purpose of spelling correction

missinglink mentioned this pull request Jun 15, 2020

"custom_street" synonyms should be applied to name.* and phrase.* indices #449

Merged

feat(streets): remove all multi-word synonyms

5852086

orangejulius mentioned this pull request Jun 17, 2020

Add Saint/St to street_synonyms_en #452

Closed

missinglink mentioned this pull request Jun 18, 2020

major synonyms upgrade #453

Merged

orangejulius closed this Jul 13, 2020

Joxit deleted the updated_street_synonyms branch July 13, 2020 21:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update street synonyms #446

update street synonyms #446

missinglink commented Jun 12, 2020 •

edited

Loading

missinglink commented Jun 12, 2020

missinglink commented Jun 12, 2020

orangejulius Jun 12, 2020

missinglink Jun 15, 2020

missinglink Jun 15, 2020

Joxit Jun 13, 2020

missinglink Jun 15, 2020

Joxit Jun 15, 2020

missinglink Jun 15, 2020

missinglink Jun 15, 2020

Joxit Jun 15, 2020

Joxit Jun 15, 2020

Joxit Jun 13, 2020

missinglink commented Jun 15, 2020

blackmad commented Jun 15, 2020

missinglink commented Jun 16, 2020

missinglink commented Jun 16, 2020

orangejulius commented Jun 16, 2020

orangejulius commented Jul 13, 2020

Name	abbreviation
Allée	ALL
Avenue	AV
Boulevard	BD
Carrefour	CARR
Chaussée	CHS
Chemin	CHEM
Cours	CRS
Faubourg	FG
Immeuble	IMM
Impasse	IMP
Lieu-dit	LD
Lotissement	LOT
Montée	MTE
Passage	PAS
Place	PL
Résidence	RES
Route	RTE
Ruelle	RLE

update street synonyms #446

update street synonyms #446

Conversation

missinglink commented Jun 12, 2020 • edited Loading

missinglink commented Jun 12, 2020

missinglink commented Jun 12, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

missinglink commented Jun 15, 2020

blackmad commented Jun 15, 2020

missinglink commented Jun 16, 2020

missinglink commented Jun 16, 2020

orangejulius commented Jun 16, 2020

orangejulius commented Jul 13, 2020

missinglink commented Jun 12, 2020 •

edited

Loading