Michael Hansen committed Jun 1, 2021 (1 parent: 1a7681e, commit c5ada76)
Showing 45 changed files with 20,224 additions and 24 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,8 +7,6 @@ __pycache__/ | |
dist/ | ||
/etc/ | ||
|
||
docs/build/ | ||
|
||
coverage.xml | ||
.coverage | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 90a8d147d7bb8b949280ed46e74eb2cb | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
gruut package | ||
============= | ||
|
||
Submodules | ||
---------- | ||
|
||
gruut.commands module | ||
--------------------- | ||
|
||
.. automodule:: gruut.commands | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.const module | ||
------------------ | ||
|
||
.. automodule:: gruut.const | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.g2p module | ||
---------------- | ||
|
||
.. automodule:: gruut.g2p | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.lang module | ||
----------------- | ||
|
||
.. automodule:: gruut.lang | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.lexicon2db module | ||
----------------------- | ||
|
||
.. automodule:: gruut.lexicon2db | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.phonemize module | ||
---------------------- | ||
|
||
.. automodule:: gruut.phonemize | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.pos module | ||
---------------- | ||
|
||
.. automodule:: gruut.pos | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.toksen module | ||
------------------- | ||
|
||
.. automodule:: gruut.toksen | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
gruut.utils module | ||
------------------ | ||
|
||
.. automodule:: gruut.utils | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
Module contents | ||
--------------- | ||
|
||
.. automodule:: gruut | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
.. gruut documentation master file | ||
gruut | ||
===== | ||
|
||
A tokenizer and `IPA <https://en.wikipedia.org/wiki/International_Phonetic_Alphabet>`_ phonemizer for multiple human languages. | ||
|
||
.. code-block:: python | ||
    from gruut import text_to_phonemes | ||
    text = 'He wound it around the wound, saying "I read it was $10 to read."' | ||
    for sent_idx, word, word_phonemes in text_to_phonemes(text, lang="en-us"): | ||
        print(word, *word_phonemes) | ||
Output:: | ||
|
||
he h ˈi | ||
wound w ˈaʊ n d | ||
it ˈɪ t | ||
around ɚ ˈaʊ n d | ||
the ð ə | ||
wound w ˈu n d | ||
, | | ||
saying s ˈeɪ ɪ ŋ | ||
i ˈaɪ | ||
read ɹ ˈɛ d | ||
it ˈɪ t | ||
was w ə z | ||
ten t ˈɛ n | ||
dollars d ˈɑ l ɚ z | ||
to t ə | ||
read ɹ ˈi d | ||
. ‖ | ||
|
||
|
||
Installation | ||
------------ | ||
|
||
To install gruut with U.S. English support only:: | ||
|
||
pip install gruut | ||
|
||
|
||
Additional languages can be added during installation. For example, with French and Italian support:: | ||
|
||
pip install gruut[fr,it] | ||
|
||
|
||
Supported Languages | ||
^^^^^^^^^^^^^^^^^^^ | ||
|
||
* Czech (``cs``) | ||
* German (``de``) | ||
* English (``en``) | ||
* Spanish (``es``) | ||
* Farsi/Persian (``fa``) | ||
* French (``fr``) | ||
* Italian (``it``) | ||
* Dutch (``nl``) | ||
* Russian (``ru``) | ||
* Swedish (``sv``) | ||
|
||
|
||
Usage | ||
----- | ||
|
||
gruut performs two main functions: tokenization and phonemization. | ||
The :py:func:`gruut.text_to_phonemes` function performs both steps for you. See the :py:class:`~gruut.TextToPhonemesReturn` enum for ways to adjust the ``return_format``. | ||
|
||
If you need more control, see the language-specific classes in :py:mod:`gruut.lang` as well as :py:class:`~gruut.toksen.RegexTokenizer` and :py:class:`~gruut.lang.SqlitePhonemizer`. | ||
|
||
Tokenization operates on text and does the following: | ||
|
||
* Splits text into words by whitespace | ||
* Expands user-defined abbreviations | ||
* Breaks apart words and sentences further by punctuation (periods, commas, etc.) | ||
* Drops empty/non-word tokens | ||
* Expands numbers into words (100 -> one hundred) | ||
* Applies upper/lower case filter | ||
* Predicts part of speech tags (see :py:mod:`gruut.pos`) | ||
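The tokenization steps above can be sketched in plain Python. This is a toy illustration, not gruut's implementation: the ``ABBREVIATIONS`` and ``NUMBER_WORDS`` tables and the ``tokenize`` function are hypothetical names made up for this example, the step order is only approximate, and part of speech tagging is left out.

```python
import re

# Hypothetical lookup tables, for illustration only (not part of gruut).
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}
NUMBER_WORDS = {"10": "ten", "100": "one hundred"}
SENTENCE_END = {".", "?", "!"}


def tokenize(text):
    """Toy tokenizer: returns a list of sentences, each a list of words."""
    sentences, current = [], []
    for raw in text.split():  # split text into words by whitespace
        raw = ABBREVIATIONS.get(raw.lower(), raw)  # expand abbreviations
        # Break the word apart further at punctuation marks
        for piece in re.findall(r"\w+|[^\w\s]", raw):
            if piece in SENTENCE_END:
                # Sentence-final punctuation closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
            elif piece in NUMBER_WORDS:
                # Expand numbers into words (100 -> one hundred)
                current.extend(NUMBER_WORDS[piece].split())
            elif re.fullmatch(r"\w+", piece):
                current.append(piece.lower())  # apply lower case filter
            # Any other punctuation is dropped as a non-word token
    if current:
        sentences.append(current)
    return sentences


print(tokenize("Dr. Smith paid $100, thanks!"))
```

A real tokenizer would handle arbitrary numbers, currencies and apostrophes; the sketch only shows how the listed steps compose.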
|
||
Once tokenized, phonemization predicts the phonetic pronunciation for each word by: | ||
|
||
* Looking up each word in an SQLite database | ||
* Guessing the pronunciation with a pre-trained model when the word is not in the database (see :py:mod:`gruut.g2p`) | ||
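The lookup-then-guess flow can be sketched with Python's standard ``sqlite3`` module. The table name, column names and the ``guess_pronunciation`` stand-in below are assumptions made for this example; they are not gruut's actual database schema or g2p model.

```python
import sqlite3


def build_lexicon():
    """Create a tiny in-memory lexicon (hypothetical schema, not gruut's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_phonemes (word TEXT, pron_order INTEGER, phonemes TEXT)"
    )
    conn.executemany(
        "INSERT INTO word_phonemes VALUES (?, ?, ?)",
        [
            ("the", 0, "ð ə"),
            ("wound", 0, "w ˈaʊ n d"),
            ("wound", 1, "w ˈu n d"),
        ],
    )
    return conn


def guess_pronunciation(word):
    """Stand-in for a trained g2p model: emits one symbol per letter."""
    return list(word)


def phonemize(conn, word):
    """Return every known pronunciation of a word, or a single guess."""
    rows = conn.execute(
        "SELECT phonemes FROM word_phonemes WHERE word = ? ORDER BY pron_order",
        (word,),
    ).fetchall()
    if rows:
        return [row[0].split() for row in rows]
    # Word is missing from the lexicon: fall back to the guesser
    return [guess_pronunciation(word)]


lexicon = build_lexicon()
print(phonemize(lexicon, "wound"))
print(phonemize(lexicon, "zzz"))
```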
|
||
In cases where more than one pronunciation is possible for a word, the "best" pronunciation is: | ||
|
||
* Specified by the user with word indexes enabled and a word of the form "word_N" where N is the 1-based pronunciation index | ||
* Whichever pronunciation has the most compatible :ref:`features`. | ||
* The first pronunciation | ||
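One way to read the three rules above is as a fallback chain, tried from top to bottom. The sketch below assumes that reading; the ``Pronunciation`` class, its ``preferred_features`` field and the ``best_pronunciation`` function are hypothetical names for this example and do not come from gruut.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Pronunciation:
    """Hypothetical container for one candidate pronunciation."""
    phonemes: list
    preferred_features: dict = field(default_factory=dict)


def best_pronunciation(token_text, token_features, candidates):
    """Pick one pronunciation from a non-empty candidate list."""
    # 1. Explicit user choice: "word_N", where N is a 1-based index
    match = re.fullmatch(r"(.+)_(\d+)", token_text)
    if match:
        index = int(match.group(2))
        if 1 <= index <= len(candidates):
            return candidates[index - 1]

    # 2. The candidate whose preferred features best match the token
    def score(candidate):
        return sum(
            1
            for name, value in candidate.preferred_features.items()
            if token_features.get(name) == value
        )

    best = max(candidates, key=score)  # ties resolve to the earliest one
    if score(best) > 0:
        return best

    # 3. Otherwise, the first pronunciation
    return candidates[0]


wind = [
    Pronunciation(["w", "ˈaɪ", "n", "d"], {"pos": "VB"}),
    Pronunciation(["w", "ˈɪ", "n", "d"], {"pos": "NN"}),
]
print(best_pronunciation("wind", {"pos": "NN"}, wind).phonemes)
```

The part of speech tag names (``VB``, ``NN``) are placeholders; any feature name and value pair works the same way in this sketch.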
|
||
|
||
.. _features: | ||
|
||
Features | ||
^^^^^^^^ | ||
|
||
gruut tokens can contain arbitrary features. For now, only part of speech tags are implemented for English and French. | ||
|
||
When determining the "best" pronunciation for a word, a phonemizer may consult these features. In English, for example, some word pronunciations in the lexicon contain "preferred" parts of speech. Words like "wind" may be pronounced differently depending on their use as a verb or noun. If a token "wind" is predicted to be a noun during tokenization, then the pronunciation "w ˈɪ n d" is selected instead of "w ˈaɪ n d". | ||
|
||
French uses part of speech tags differently. During the post-processing phase of phonemization, these features are used instead to add liaisons between words. For example, in the sentence "J’ai des petites oreilles.", "petites" will be pronounced "p ə t i t z" instead of "p ə t i t". | ||
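A post-processing pass of this kind might look roughly like the sketch below. The rule used here (append "z" when an adjective or determiner ending in "s" precedes a word starting with a vowel) is a deliberate simplification chosen for this example, and ``LIAISON_POS``, ``FRENCH_VOWELS`` and ``add_liaisons`` are hypothetical names; gruut's actual liaison rules may differ. The phonemes given for "oreilles" are illustrative only.

```python
# Simplified assumptions for illustration; not gruut's rule set.
FRENCH_VOWELS = set("aeiouyàâéèêëîïôùûœ")
LIAISON_POS = {"ADJ", "DET"}


def add_liaisons(words):
    """words: list of (text, pos, phonemes); returns (text, phonemes) pairs."""
    result = []
    for i, (text, pos, phonemes) in enumerate(words):
        phonemes = list(phonemes)  # copy so the input is left untouched
        next_text = words[i + 1][0] if i + 1 < len(words) else ""
        if (
            pos in LIAISON_POS
            and text.endswith("s")
            and next_text[:1].lower() in FRENCH_VOWELS
        ):
            phonemes.append("z")  # the normally silent "s" is voiced
        result.append((text, phonemes))
    return result


sentence = [
    ("des", "DET", ["d", "e"]),
    ("petites", "ADJ", ["p", "ə", "t", "i", "t"]),
    ("oreilles", "NOUN", ["ɔ", "ʁ", "ɛ", "j"]),
]
for text, phonemes in add_liaisons(sentence):
    print(text, *phonemes)
```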
|
||
.. toctree:: | ||
:maxdepth: 2 | ||
:caption: Contents: | ||
|
||
|
||
|
||
Indices and tables | ||
================== | ||
|
||
* :ref:`genindex` | ||
* :ref:`modindex` | ||
* :ref:`search` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
gruut | ||
===== | ||
|
||
.. toctree:: | ||
:maxdepth: 4 | ||
|
||
gruut |