Spanish

Author: Ma, Te ([email protected])

1. Text normalization

(1) The G2P models cannot recognize alien (foreign) words, so we remove the sentences that contain them. The removed sentences are listed in the file Spanish_alien_sentences.txt.

(2) Before creating the lexicon, we need to normalize the text. The text normalization code for Spanish is in the script text_norm.sh.
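The filtering in step (1) can be sketched as follows. This is a minimal illustration, not the code in text_norm.sh: it assumes the alien-sentence file holds one sentence per line, and the function name `remove_alien_sentences` and the sample sentences are hypothetical.

```python
def remove_alien_sentences(sentences, alien_sentences):
    """Return the sentences that do not appear in the alien-sentence list."""
    # Assumption: both inputs hold one sentence per entry; whitespace is trimmed
    # so that trailing newlines read from a file do not prevent a match.
    blocked = {s.strip() for s in alien_sentences}
    return [s for s in sentences if s.strip() not in blocked]


# Made-up sentences for illustration only.
corpus = ["hola mundo", "vamos al weekend", "buenos días"]
alien = ["vamos al weekend\n"]
kept = remove_alien_sentences(corpus, alien)
print(kept)
```

With real data, `alien` would be the lines read from Spanish_alien_sentences.txt and `corpus` the transcript sentences.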

2. Lexicon generation and correction

We use Phonetisaurus, an FST (Finite State Transducer) based G2P (Grapheme-to-Phoneme) toolkit, to create the pronunciation lexicon. The trained FSTs for use with Phonetisaurus are provided in LanguageNet.

Note that the above G2P procedure is not perfect. As noted in LanguageNet, "PERs range from 7% to 45%". The G2P-generated lexicon therefore needs to be corrected. The correction is based on the LanguageNet symbol table for Spanish, and its code is in the script lexicon.sh.

(1) We remove some special symbols such as accent symbols to enable sharing more phonemes between different languages.

| Removed symbols | Note |
| --- | --- |
| ː | Long vowel |
| ˈ | Accent (primary stress) |
| ʲ | Palatalization |

(2) A subtle issue is that the same IPA symbol may be encoded with different Unicode characters. To enforce consistency, the ASCII letter g (U+0067) produced by the G2P is replaced with the IPA symbol ɡ (U+0261).

| Phonemes from G2P | Phonemes corrected |
| --- | --- |
| g | ɡ |
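The two corrections above can be sketched in a few lines. This is an illustrative sketch, not the code in lexicon.sh: it assumes a pronunciation is available as a list of phoneme strings, and the function name `correct_pronunciation` and the sample input are hypothetical.

```python
# Symbols removed in step (1): ː ˈ ʲ, written as escapes to avoid look-alike characters.
REMOVED_SYMBOLS = ("\u02d0", "\u02c8", "\u02b2")
ASCII_G = "\u0067"  # the ordinary letter g
IPA_G = "\u0261"    # the IPA symbol ɡ


def correct_pronunciation(phonemes):
    """Strip the removed symbols and unify g/ɡ in one pronunciation."""
    corrected = []
    for phoneme in phonemes:
        for symbol in REMOVED_SYMBOLS:
            phoneme = phoneme.replace(symbol, "")
        phoneme = phoneme.replace(ASCII_G, IPA_G)
        if phoneme:  # drop tokens that consisted only of removed symbols
            corrected.append(phoneme)
    return corrected


# Made-up input, not taken from lexicon_es.txt.
example = correct_pronunciation(["\u02c8a", "g", "o\u02d0", "n\u02b2"])
print(" ".join(example))
```

Using explicit code points makes the g/ɡ distinction visible, since the two characters look nearly identical in most fonts.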

3. Check of phonemes

Strictly speaking, one phoneme might correspond to multiple phones (those phones are referred to as the allophones). Because our above procedure removes diacritics, the notion of phoneme in this work is a looser one.

The lexicon generated by the G2P procedure is named lexicon_es.txt. The set of IPA phonemes appearing in the lexicon is saved in phone_list.txt. We further check phone_list.txt by referring to two phoneme lists (the IPA symbol tables in LanguageNet and in Phoible) and by listening tests.
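As a rough illustration of how a phoneme set can be collected from a lexicon, the sketch below assumes each lexicon line is a word followed by whitespace-separated phonemes. That layout, and the function name `collect_phonemes`, are assumptions; the actual format of lexicon_es.txt is not shown here.

```python
def collect_phonemes(lexicon_lines):
    """Return the sorted set of phonemes used in the pronunciations."""
    phonemes = set()
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip empty lines and words without a pronunciation
        phonemes.update(fields[1:])  # fields[0] is the word itself
    return sorted(phonemes)


# Two entries in the assumed "word phoneme phoneme ..." layout.
lexicon = [
    "abraham a β ɾ a ɱ",
    "bendecir b e n d e θ i ɾ",
]
phone_list = collect_phonemes(lexicon)
print("\n".join(phone_list))
```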

Because the G2P procedure is not perfect, the G2P-generated phone_list.txt is not exactly the same as the ideal IPA symbol table in LanguageNet. Further, the IPA symbol table in LanguageNet may differ from the IPA symbol tables of other linguistic resources (e.g., Phoible), so we need to check. The inconsistencies found are recorded below. The lexicon itself is not modified, since correcting the whole lexicon requires non-trivial manual labor. The final lexicon is therefore not perfect and contains some noise.

Checking process

For each IPA phoneme in phone_list.txt, we listen to its sound as provided by Wikipedia. A word that contains this IPA phoneme is arbitrarily chosen from the lexicon, and we listen to its pronunciation in Google Translate. By comparing these two sounds, we can check the phoneme, as detailed below.
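Picking a word that contains a given phoneme can be sketched as below. The sketch assumes lexicon lines of the form "word phoneme phoneme ..." and simply returns the first match; the function name `find_example_word` is hypothetical, and the listening comparison itself remains a manual step.

```python
def find_example_word(lexicon_lines, phoneme):
    """Return (word, phonemes) for the first entry whose pronunciation contains phoneme."""
    for line in lexicon_lines:
        fields = line.split()
        # Compare whole tokens, so that searching for "t" does not match "t͡ʃ".
        if len(fields) >= 2 and phoneme in fields[1:]:
            return fields[0], fields[1:]
    return None  # no entry uses this phoneme


lexicon = [
    "abraham a β ɾ a ɱ",
    "bendecir b e n d e θ i ɾ",
]
match = find_example_word(lexicon, "θ")
print(match)
```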

Check whether there is any inconsistency between phone_list.txt, IPA symbol table in LanguageNet, and IPA symbol table in Phoible

Each phoneme in phone_list.txt should appear in both the IPA symbol table of the LanguageNet G2P and the IPA symbol tables in Phoible.

Check whether the G2P labeling is correct

The Wikipedia sound of the phoneme should match the sound at the corresponding position in the Google Translate pronunciation of the word that contains this IPA phoneme.

If either of the above two checks fails, the lexicon contains some errors and needs to be further corrected.
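The first check is a set-membership test and can be sketched as follows. The reference tables below are toy inventories invented for illustration (they are not the real LanguageNet or Phoible tables), and the function name `find_inconsistencies` is hypothetical. The second check relies on listening and is not automated here.

```python
def find_inconsistencies(phone_list, reference_tables):
    """Map each phoneme to the names of the reference tables that lack it."""
    report = {}
    for phoneme in phone_list:
        missing = [name for name, table in reference_tables.items()
                   if phoneme not in table]
        if missing:
            report[phoneme] = missing
    return report


# Toy inventories, NOT the real symbol tables.
tables = {
    "LanguageNet": {"a", "m", "s"},
    "inventory_A": {"a", "m", "s"},
    "inventory_B": {"a", "b", "m", "s"},
}
report = find_inconsistencies(["a", "b", "z"], tables)
for phoneme, missing in report.items():
    print(phoneme, "is missing from:", ", ".join(missing))
```

A phoneme that appears in the report is a candidate for the manual listening check rather than an automatic error.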

Checking result

The checking result is shown in the following table. Clicking the hyperlinks will download the sound files for your listening.

  • The first column shows the phonemes in phone_list.txt.
  • The second and third columns show a word and its G2P labeling. The word's G2P labeling contains the phoneme in the first column.
  • The last column contains the checking remarks.

| IPA symbol in phone_list.txt | Word | G2P labeling result | Note |
| --- | --- | --- | --- |
| ɱ | abraham | a β ɾ a ɱ | Incorrect G2P labeling. The phoneme /ɱ/ is not contained in any phoneme table of LanguageNet or Phoible, and the labeling should be corrected to /a β ɾ a a m/ after listening. |
| ɲ | argandoña | a ɾ ɡ a n d o ɲ a | |
| ɣ | achaguas | a c a ɣ w u a s | Incorrect G2P labeling. The phoneme /ɣ/ is not contained in the phoneme tables of LanguageNet, and needs to be corrected to /ɡ/ after listening. |
| ɡ | atxaga | a t͡ʃ a ɡ a | The phoneme /ɡ/ is not contained in SPA 164, but is contained in UZ 2210. |
| ɾ | abraham | a β ɾ a ɱ | |
| ʎ | fellowship | f e ʎ o w ʃ j p | The phoneme /ʎ/ is not contained in the phoneme tables of LanguageNet, but it sounds correct. |
| ʃ | fellowship | f e ʎ o w ʃ j p | The phoneme /ʃ/ is not contained in SPA 164, but is contained in EA 2308. |
| a | abraham | a β ɾ a ɱ | |
| b | bendecir | b e n d e θ i ɾ | The phoneme /b/ is not contained in SPA 164, but is contained in UZ 2210. |
| c | achaguas | a c a ɣ w u a s | Incorrect G2P labeling. The phoneme /c/ is not contained in any phoneme table of LanguageNet or Phoible, and needs to be corrected to /t͡ʃ/ after listening. |
| d | argandoña | a ɾ ɡ a n d o ɲ a | The phoneme /d/ is not contained in SPA 164, but is contained in UZ 2210. |
| ð | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | Incorrect G2P labeling. The phoneme /ð/ is not contained in the phoneme tables of LanguageNet, and needs to be corrected to /d/ after listening. |
| e | fellowship | f e ʎ o w ʃ j p | |
| f | fellowship | f e ʎ o w ʃ j p | |
| i | bendecir | b e n d e θ i ɾ | |
| j | fellowship | f e ʎ o w ʃ j p | Incorrect G2P labeling. The phoneme /j/ is not contained in the phoneme tables of LanguageNet, and needs to be corrected to /i/ after listening. |
| k | barrancabermeja | b a r a n k a β e ɾ m e x a | |
| l | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | |
| m | barrancabermeja | b a r a n k a β e ɾ m e x a | |
| n | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | |
| o | fellowship | f e ʎ o w ʃ j p | |
| p | fellowship | f e ʎ o w ʃ j p | |
| r | barrancabermeja | b a r a n k a β e ɾ m e x a | |
| s | achaguas | a c a ɣ w u a s | |
| t | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | |
| t͡ʃ | atxaga | a t͡ʃ a ɡ a | /tʃ/ and /t͡ʃ/ have the same pronunciation, so /tʃ/ is replaced with /t͡ʃ/. |
| u | achaguas | a c a ɣ w u a s | Incorrect G2P labeling. The letter sequence gua is pronounced as /ɡ w a/, so the labeling should be corrected to /a t͡ʃ a ɡ w a s/. |
| w | achaguas | a c a ɣ w u a s | |
| x | barrancabermeja | b a r a n k a β e ɾ m e x a | |
| z | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | Incorrect G2P labeling. The phoneme /z/ is not contained in any phoneme table of LanguageNet or Phoible, and the letter s is pronounced as /s/, so the phoneme /z/ needs to be corrected to /s/. |
| β | abraham | a β ɾ a ɱ | The phoneme /β/ is not contained in the phoneme tables of LanguageNet, but it sounds correct. |
| θ | bendecir | b e n d e θ i ɾ | Incorrect G2P labeling. The phoneme /θ/ is not contained in the phoneme tables of LanguageNet, and the letter c is pronounced as /s/, so the phoneme /θ/ needs to be corrected to /s/. |