Other similarities #73

Open
ierezell opened this issue Jul 5, 2021 · 2 comments
Labels
enhancement New feature or request

Comments


ierezell commented Jul 5, 2021

🚀 Feature

Replace a random word with a phonetically similar one.

Alternatively, replace a random word with one that shares its part of speech or lemma (an adjective with another adjective, or "run" with "ran" / "running", etc.).

Motivation

I'm training a transformer-based model to spell-check utterances (like a reversed AugLy).
For example: "Hello r u fin tdy" => "Hello are you fine today".

I realized that spelling errors quite often come from phonetically similar words.
Example (not a great one, but enough to illustrate the point): "I love jeans" vs "I love gins".

Also, augmenting by replacing a word with one of the same POS, or with another inflection of the same lemma, would help in the same direction (corrupting the sentences more thoroughly to train a better spell-checking model).

Having this kind of built-in augmentation would help build better models.

Pitch

Having a built-in augmenter that creates mistakes not only with Levenshtein-like distances but also with phonetics.
I've built mine using epitran for phonetics and spaCy for POS, but other frameworks exist.
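
To make the idea concrete, here is a minimal standard-library sketch of a phonetic replacement augmenter. It is not my epitran/spaCy implementation and not an AugLy API: instead of an IPA transliteration it uses a rough Soundex-like key (consonant classes, vowels dropped), and the names `phonetic_key`, `replace_with_similar_sound` and the `vocabulary` parameter are made up for illustration.

```python
import random

# Consonant classes borrowed from Soundex. Letters that are not listed
# (vowels, h, w, y) carry no code and are skipped.
_CLASSES = {
    "bfpv": "1",
    "cgjkqsxz": "2",
    "dt": "3",
    "l": "4",
    "mn": "5",
    "r": "6",
}
_CODE = {ch: digit for letters, digit in _CLASSES.items() for ch in letters}


def phonetic_key(word):
    """Rough sound key: consonant class digits, adjacent repeats collapsed.

    Unlike real Soundex, the first letter is encoded too, so that
    "jeans" and "gins" end up with the same key.
    """
    key = []
    prev = None
    for ch in word.lower():
        code = _CODE.get(ch)  # None for vowels, h, w, y and non-letters
        if code is not None and code != prev:
            key.append(code)
        prev = code
    return "".join(key)


def replace_with_similar_sound(text, vocabulary, rng=None):
    """Replace one random word of `text` with a sound-alike from `vocabulary`.

    Returns `text` unchanged when no word has a sound-alike.
    """
    rng = rng or random.Random()

    # Index the vocabulary by phonetic key (hypothetical word list).
    by_key = {}
    for entry in vocabulary:
        key = phonetic_key(entry)
        if key:  # ignore words made only of vowels
            by_key.setdefault(key, []).append(entry)

    words = text.split()
    candidates = []
    for i, word in enumerate(words):
        alternatives = [v for v in by_key.get(phonetic_key(word), [])
                        if v != word.lower()]
        if alternatives:
            candidates.append((i, alternatives))

    if not candidates:
        return text
    i, alternatives = rng.choice(candidates)
    words[i] = rng.choice(alternatives)
    return " ".join(words)


print(replace_with_similar_sound("I love jeans", ["gins", "table"]))
```

A key this coarse will also group words that do not really sound alike, so a real augmenter would probably want a proper grapheme-to-phoneme step and a distance over phonemes.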

Alternatives

Implement my own augmenter (done).
Use only text-based distances, which cannot match "jean" vs "gin", "cute" vs "beautiful", or "run" vs "running": these pairs are textually too different, yet they often appear in chats.
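
A quick way to see the limitation is to compute plain Levenshtein distances for the pairs above. The function below is a textbook dynamic-programming sketch written for this comment (the name `levenshtein` is mine, not an AugLy function).

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


for left, right in [("fin", "fine"), ("jean", "gin"), ("run", "running")]:
    print(left, right, levenshtein(left, right))
```

"fin" vs "fine" is one edit away, while "jean" vs "gin" needs 3 edits on a 4-letter word and "run" vs "running" needs 4, so a small edit-distance threshold would never propose them.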

@ierezell ierezell changed the title Phonetic similarity Other similarities Jul 5, 2021
@ierezell ierezell changed the title Other similarities Phonetics similarities Jul 5, 2021
@ierezell ierezell changed the title Phonetics similarities Other similarities Jul 5, 2021
jbitton (Contributor) commented Jul 9, 2021

Hi @ierezell! Thank you for all the awesome enhancements you're suggesting! This kind of augmentation is actually something we've talked about building internally, as these are very common misspellings that occur in the wild!

I'll take a look at the epitran library and see how we can support this!


BradKML commented Sep 1, 2022

Seconding this: it would be good to have similar-sounding words included. I don't know about phonetic hashing, though.

3 participants