Author: Ma, Te ([email protected])
(1) The G2P models cannot recognize alien words, so we remove the sentences that contain alien words. They are listed in the file Spanish_alien_sentences.txt.
(2) Before creating the lexicon, we need to normalize the text. The text normalization code for Spanish is in the script named text_norm.sh.
We use the FST (Finite State Transducer) based G2P (Grapheme-to-Phoneme) toolkit, Phonetisaurus, to create the pronunciation lexicon. The trained FSTs for use with Phonetisaurus are provided in LanguageNet.
Note that the above G2P procedure is not perfect. As noted in LanguageNet, "PERs range from 7% to 45%".
The G2P-generated lexicon needs to be corrected. The correction step is based on the LanguageNet symbol table for Spanish. The code for this lexicon correction step is in the script named lexicon.sh.
(1) We remove some special symbols such as accent symbols to enable sharing more phonemes between different languages.
Removed symbols | Note |
---|---|
ː | Long vowel |
ˈ | Accent (primary stress) |
ʲ | Palatalization |
(2) A subtle issue is that IPA symbols may be encoded in different forms. To enforce consistency, the phoneme /g/ (ASCII letter, U+0067) is corrected to /ɡ/ (IPA script g, U+0261).
Phonemes from G2P | Phonemes corrected |
---|---|
g | ɡ |
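The two correction rules above can be sketched in Python as follows. This is only an illustration, not the content of lexicon.sh: the function names are hypothetical, and the lexicon line format (a word, a tab, then space-separated phonemes) is an assumption.

```python
# Hypothetical sketch of the two correction rules; not the actual lexicon.sh.
# Symbols removed from every phoneme: U+02D0 (ː), U+02C8 (ˈ), U+02B2 (ʲ).
REMOVED_SYMBOLS = "\u02d0\u02c8\u02b2"

# Map ASCII "g" (U+0067) to IPA "ɡ" (U+0261) and delete the removed symbols.
_TABLE = str.maketrans("g", "\u0261", REMOVED_SYMBOLS)


def correct_pronunciation(phonemes: str) -> str:
    """Apply both rules to a space-separated phoneme string."""
    cleaned = (p.translate(_TABLE) for p in phonemes.split())
    # A token made only of removed symbols becomes empty and is dropped.
    return " ".join(p for p in cleaned if p)


def correct_lexicon_line(line: str) -> str:
    """Correct one lexicon line, assumed to be 'word<TAB>p1 p2 ...'."""
    word, _, pron = line.rstrip("\n").partition("\t")
    return f"{word}\t{correct_pronunciation(pron)}"
```

Working on whole phoneme tokens rather than on the raw line keeps the word column untouched, so a letter "g" in the spelling is never rewritten.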
Strictly speaking, one phoneme might correspond to multiple phones (those phones are referred to as allophones). Since the above procedure removes diacritics, the notion of phoneme in this work is a looser one.
The lexicon generated by the G2P procedure is named lexicon_es.txt. The set of IPA phonemes that appear in the lexicon is saved in phone_list.txt. We further check phone_list.txt by referring to the following two phoneme lists and by listening tests.
- IPA symbol table in LanguageNet, which, according to LanguageNet, contains all the phones in the language: https://github.com/uiuc-sst/g2ps/blob/master/Spanish/Spanish_wikipedia_symboltable.txt
- IPA symbol table in Phoible: https://phoible.org/languages/stan1288. For each language, there may exist multiple phoneme inventories, which are archived at the Phoible website. For Spanish, we choose the first one, SPA 164, as the main reference for phoneme checking.
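As an illustration of how a phoneme set such as phone_list.txt could be derived from the lexicon, here is a minimal Python sketch. The function name is hypothetical, and it assumes each lexicon line holds a word followed by whitespace-separated phonemes; the actual format of lexicon_es.txt may differ.

```python
def collect_phonemes(lexicon_lines):
    """Return the sorted set of phonemes used in the lexicon.

    Assumes each line is 'word p1 p2 ...' separated by whitespace;
    the first field is the word and is skipped.
    """
    phonemes = set()
    for line in lexicon_lines:
        fields = line.split()
        phonemes.update(fields[1:])  # empty lines contribute nothing
    return sorted(phonemes)
```

For example, `collect_phonemes(open("lexicon_es.txt", encoding="utf-8"))` would yield one entry per distinct phoneme, which can then be written out one per line.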
Because the G2P procedure is not perfect, the G2P-generated phone_list.txt is not exactly the same as the ideal IPA symbol table in LanguageNet. Further, the IPA symbol table in LanguageNet may also differ from the IPA symbol tables of other linguistic resources (e.g., Phoible). So we need to check phone_list.txt against both. The inconsistencies are recorded below. The lexicon itself is not modified, since a complete correction of the whole lexicon requires non-trivial manual labor. The final lexicon is therefore not perfect and contains some noise.
For each IPA phoneme in phone_list.txt, we listen to its sound obtained from Wikipedia.
A word that contains this IPA phoneme is arbitrarily chosen from the lexicon, and we listen to its pronunciation in Google Translate.
By comparing these two sounds, we can check the phoneme, as detailed below.
- Check whether there is any inconsistency between phone_list.txt, the IPA symbol table in LanguageNet, and the IPA symbol table in Phoible. A phoneme in phone_list.txt should appear in both the IPA symbol table in LanguageNet and the IPA symbol tables in Phoible.
- Check whether the Wikipedia sound of the phoneme matches the sound at the corresponding position in the Google Translate pronunciation of the word that contains this IPA phoneme.

If either of the above two checks fails, the lexicon contains some errors and needs to be further corrected.
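The listening check is manual, but the inventory check and the choice of an example word per phoneme are mechanical. The following Python sketch shows one way to do both. All names are hypothetical, the reference inventories are assumed to be already loaded as sets of IPA strings, and the lexicon is assumed to have a word followed by whitespace-separated phonemes on each line. "Arbitrarily chosen" is implemented here as "first word encountered".

```python
def check_inventories(phone_list, languagenet, phoible):
    """Return {phoneme: [names of the reference tables that lack it]}."""
    report = {}
    for phoneme in phone_list:
        missing = []
        if phoneme not in languagenet:
            missing.append("LanguageNet")
        if phoneme not in phoible:
            missing.append("Phoible")
        if missing:
            report[phoneme] = missing
    return report


def pick_example_words(lexicon_lines):
    """Map each phoneme to the first (word, labeling) pair that contains it."""
    examples = {}
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:  # skip empty lines and words without a labeling
            continue
        word, phones = fields[0], fields[1:]
        for phoneme in phones:
            # setdefault keeps the first word seen for each phoneme
            examples.setdefault(phoneme, (word, " ".join(phones)))
    return examples
```

Phonemes that appear in the report are the ones worth a remark; the example word for each phoneme is what would then be played back for the listening comparison.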
The checking result is shown in the following table. Clicking the hyperlinks will download the sound files for your listening.
- The first column shows the phonemes in phone_list.txt.
- The second and third columns show the word and its G2P labeling. The word's G2P labeling contains the phoneme in the first column.
- The last column contains some checking remarks.
IPA symbol in phone_list.txt | Word | G2P labeling result | Note |
---|---|---|---|
ɱ | abraham | a β ɾ a ɱ | Incorrect G2P labeling. The phoneme /ɱ/ is not contained in any phoneme table of LanguageNet or Phoible, and the phoneme labeling should be corrected to /a β ɾ a a m/ after listening. |
ɲ | argandoña | a ɾ ɡ a n d o ɲ a | |
ɣ | achaguas | a c a ɣ w u a s | Incorrect G2P labeling. The phoneme /ɣ/ is not contained in the phoneme tables of LanguageNet, and needs to be corrected to /ɡ/ after listening. |
ɡ | atxaga | a t͡ʃ a ɡ a | The phoneme /ɡ/ is not contained in SPA 164, but is contained in UZ 2210. |
ɾ | abraham | a β ɾ a ɱ | |
ʎ | fellowship | f e ʎ o w ʃ j p | The phoneme /ʎ/ is not contained in the phoneme tables of LanguageNet, but it sounds correct. |
ʃ | fellowship | f e ʎ o w ʃ j p | The phoneme /ʃ/ is not contained in SPA 164, but is contained in EA 2308. |
a | abraham | a β ɾ a ɱ | |
b | bendecir | b e n d e θ i ɾ | The phoneme /b/ is not contained in SPA 164, but is contained in UZ 2210. |
c | achaguas | a c a ɣ w u a s | Incorrect G2P labeling. The phoneme /c/ is not contained in any phoneme table of LanguageNet or Phoible, and needs to be corrected to /t͡ʃ/ after listening. |
d | argandoña | a ɾ ɡ a n d o ɲ a | The phoneme /d/ is not contained in SPA 164, but is contained in UZ 2210. |
ð | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | Incorrect G2P labeling. The phoneme /ð/ is not contained in the phoneme tables of LanguageNet, and needs to be corrected to /d/ after listening. |
e | fellowship | f e ʎ o w ʃ j p | |
f | fellowship | f e ʎ o w ʃ j p | |
i | bendecir | b e n d e θ i ɾ | |
j | fellowship | f e ʎ o w ʃ j p | Incorrect G2P labeling. The phoneme /j/ is not contained in the phoneme tables of LanguageNet, and needs to be corrected to /i/ after listening. |
k | barrancabermeja | b a r a n k a β e ɾ m e x a | |
l | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | |
m | barrancabermeja | b a r a n k a β e ɾ m e x a | |
n | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | |
o | fellowship | f e ʎ o w ʃ j p | |
p | fellowship | f e ʎ o w ʃ j p | |
r | barrancabermeja | b a r a n k a β e ɾ m e x a | |
s | achaguas | a c a ɣ w u a s | |
t | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | |
tʃ | atxaga | a t͡ʃ a ɡ a | /tʃ/ and /t͡ʃ/ have the same pronunciation, so /tʃ/ is replaced with /t͡ʃ/. |
u | achaguas | a c a ɣ w u a s | Incorrect G2P labeling. The letters gua are pronounced as /ɡ w a/, so the phoneme labeling should be corrected to /a t͡ʃ a ɡ w a s/. |
w | achaguas | a c a ɣ w u a s | |
x | barrancabermeja | b a r a n k a β e ɾ m e x a | |
z | interjurisdiccional | i n t e ɾ x u ɾ i z ð i k s i o n a l | Incorrect G2P labeling. The phoneme /z/ is not contained in any phoneme table of LanguageNet or Phoible, and the letter s is pronounced as /s/, so the phoneme /z/ needs to be corrected to /s/. |
β | abraham | a β ɾ a ɱ | The phoneme /β/ is not contained in the phoneme tables of LanguageNet, but it sounds correct. |
θ | bendecir | b e n d e θ i ɾ | Incorrect G2P labeling. The phoneme /θ/ is not contained in the phoneme tables of LanguageNet, and the letter c is pronounced as /s/, so the phoneme /θ/ needs to be corrected to /s/. |