[TTS] Add tutorial for training audio codecs #8723

rlangman · 2024-03-21T22:34:44Z

What does this PR do ?

Add tutorial about training audio codecs.

Collection: [TTS]

Changelog

Add tutorial showing how to train or fine-tune audio codec models.
Add config file for 22.05 kHz mel codec for documentation and easy fine-tuning.
Make logging config in different config files consistent, disabling all logging by default except for audio samples. Tutorial shows how to update values to enable different types of logging.

Jenkins CI

To run Jenkins, a NeMo user with write access must comment "jenkins" on the PR.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

tutorials/tts/Audio_Codec_Training.ipynb

nithinraok · 2024-03-25T15:46:25Z

tutorials/tts/Audio_Codec_Training.ipynb

+        "# The total number of training steps will be (epochs * steps_per_epoch)\n",
+        "epochs = 10\n",
+        "steps_per_epoch = 10\n",
+        "\n",


I feel this tutorial would be even more helpful with some information on the differences between Audio Codec and MelCodec, plus an explanation of the main parameters: how to switch from RVQ to FSQ and how to control the bitrate.

I am thinking that level of detail would be better left for a separate tutorial or primer document. Without better supporting documentation/papers, no one would understand it right now unless they are already very familiar with codec research.

What would the separate tutorial be?

First we are missing a high-level primer for the domain where we can explain with visuals/examples what high level codec concepts are, like bitrate, quantizers, etc. That is, the codec equivalent of https://github.com/NVIDIA/NeMo/blob/main/tutorials/00_NeMo_Primer.ipynb, https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb, https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_with_NeMo.ipynb

There might be some information we should include here; it's just hard to say without having the supporting documentation already in place. In general, I am not sure whether architecture overrides belong in a Jupyter notebook or in some other form of documentation, since we expect users to override the model definition primarily by modifying the .yaml file directly, not by overriding it in the CLI.

Would be happy to hear others' thoughts and opinions.
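
As a rough illustration of the two ways of changing a config mentioned above (editing the .yaml versus passing CLI overrides), the sketch below applies command-line style `key.path=value` overrides to a nested dictionary standing in for a parsed .yaml file. It is plain Python, not the Hydra/OmegaConf machinery that NeMo uses, and the helper names and keys (`trainer.max_epochs`, `model.sample_rate`, `model.log_audio`) are hypothetical placeholders rather than values taken from the codec configs.

```python
import copy


def _parse_scalar(text):
    """Best-effort conversion of an override value to bool, int, float, or str."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key.path=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = _parse_scalar(raw_value)
    return result


# Hypothetical stand-in for a parsed .yaml file.
base_config = {"trainer": {"max_epochs": 100}, "model": {"sample_rate": 16000}}

# Hypothetical CLI-style overrides.
overridden = apply_overrides(
    base_config, ["trainer.max_epochs=10", "model.log_audio=true"]
)
```

The original dictionary is left untouched, which mirrors the trade-off in the discussion: CLI overrides are convenient for one-off experiments, while a lasting architecture change is easier to track in the .yaml file itself.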

Having all primers would be nice, but we can include a bit more information in this tutorial.
It doesn't have to be exhaustive, but it should at least describe the config shown here.

I think we should at minimum include a block-scheme of the model and briefly describe the individual components.
For a block-scheme, we can re-use this.
Even one/two sentence descriptions of the blocks would be sufficient (what does it do, where is it implemented).

Have you chosen not to include details on bitrates, the types of codecs available, and how we can adjust the parameters to obtain various bitrates and codec types?

Other than this, everything LGTM!

I think additional information can be added later, when we have more than one codec available.
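
For readers following the bitrate question in this thread, here is a minimal sketch of the standard relation between quantizer settings and bitrate: bits per frame times frames per second. All numbers below (16 kHz audio, 320 samples per frame, 8 codebooks of 1024 entries, FSQ levels of 8, 5, 5, 5) are hypothetical and are not taken from the NeMo configs; the function names are likewise illustrative only.

```python
import math


def bits_per_frame_rvq(num_codebooks, codebook_size):
    """Each RVQ codebook contributes log2(codebook_size) bits per frame."""
    return num_codebooks * math.log2(codebook_size)


def bits_per_frame_fsq(num_groups, levels_per_dim):
    """An FSQ group can take prod(levels) distinct codes per frame."""
    codes_per_group = math.prod(levels_per_dim)
    return num_groups * math.log2(codes_per_group)


def bitrate_bps(sample_rate, samples_per_frame, bits_per_frame):
    """Bitrate in bits per second = frames per second * bits per frame."""
    frames_per_second = sample_rate / samples_per_frame
    return frames_per_second * bits_per_frame


# Hypothetical settings: 16 kHz audio, one frame per 320 samples (50 frames/s).
rvq_bps = bitrate_bps(16000, 320, bits_per_frame_rvq(8, 1024))
fsq_bps = bitrate_bps(16000, 320, bits_per_frame_fsq(8, [8, 5, 5, 5]))
```

Under these assumed settings the RVQ case works out to 4000 bits per second, and the FSQ case lands slightly lower because 8 * 5 * 5 * 5 = 1000 codes per group is just under 1024. Changing the number of codebooks or groups, the codebook size, or the frame rate are the usual levers for moving the bitrate.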

anteju

Thanks @rlangman!
Added a couple comments.

anteju · 2024-04-09T14:14:57Z

tutorials/tts/Audio_Codec_Training.ipynb

+        "\n",
+        "**Note that when training from scratch, the dataset in this tutorial is too small to get good audio quality.**"
+      ]
+    },


At the end, add next steps and references. In the simplest case, the next steps can point again to the available configurations, and perhaps to the implementations of different quantizers that a user can investigate. The references should point to documentation, the model API, a pretrained checkpoint, and publicly available papers.

Added an introduction and references.

anteju · 2024-04-09T14:20:07Z

tutorials/tts/Audio_Codec_Training.ipynb

+        "# The total number of training steps will be (epochs * steps_per_epoch)\n",
+        "epochs = 10\n",
+        "steps_per_epoch = 10\n",
+        "\n",


Having all primers would be nice, but we can include a bit more information in this tutorial.
It doesn't have to be exhaustive, but it should at least describe the config shown here.

I think we should at minimum include a block-scheme of the model and briefly describe the individual components.
For a block-scheme, we can re-use this.
Even one/two sentence descriptions of the blocks would be sufficient (what does it do, where is it implemented).

anteju · 2024-04-09T14:51:11Z

tutorials/tts/Audio_Codec_Training.ipynb

+        "*   **audio_codec_*.yaml**: Audio codec configurations optimized for various sampling rates.\n",
+        "*   **mel_codec_*.yaml**: A mel-spectrogram based codec designed to maximize the performance of TTS models.\n",
+        "*   **encodec_*.yaml**: A reproduction of the original [EnCodec](https://arxiv.org/abs/2210.13438) model setup.\n",
+        "\n"


Mention that for this tutorial we will use the configuration in audio_codec_16000.yaml, which is for a 16 kHz input signal.

anteju · 2024-04-16T16:50:35Z

tutorials/tts/images/audio_codec_diagram.png

I think we should not check this file into the repo.
The approach in the past was to attach the file to a GitHub release and then link to it in the notebook.
For example, something like this:

"<img src=\"https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/encmaskdecoder_model.png\" alt=\"encmaskdecoder_model\" style=\"width: 800px;\"/>

We already have this block scheme here:

https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/nemo_audio_codec.png
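
A minimal sketch of what linking a release-hosted image from a notebook looks like at the file level, assuming the standard notebook JSON layout for a markdown cell (`cell_type`, `metadata`, `source`). The helper name `image_markdown_cell` and the `alt` text are hypothetical; the URL is the release link quoted above.

```python
import json


def image_markdown_cell(url, alt, width_px=800):
    """Build a notebook markdown cell that embeds an image by URL."""
    img_tag = f'<img src="{url}" alt="{alt}" style="width: {width_px}px;"/>'
    return {"cell_type": "markdown", "metadata": {}, "source": [img_tag]}


cell = image_markdown_cell(
    "https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/nemo_audio_codec.png",
    "nemo_audio_codec",
)

# Serializing with json escapes the inner quotes, which is why the raw
# .ipynb text shows src=\"...\" instead of src="...".
cell_json = json.dumps(cell, indent=1)
```

Because only a URL string is stored in the notebook, no binary file is added to the repository.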

All TTS tutorial diagrams are currently organized in https://github.com/NVIDIA/NeMo/tree/main/tutorials/tts/images (a few of them were previously in the base directory https://github.com/NVIDIA/NeMo/tree/main/tutorials/tts before I moved them to a standalone directory).

Should all of these diagrams be retroactively attached to a future NeMo release?

No need to handle the existing ones.
However, the idea is we should avoid adding new binary files to the repo.

No images should ever be added to the NeMo GitHub repo. Attach them to GitHub releases only, and link to them via URL.

But they don't seem to load well when viewing the notebook on GitHub?

rlangman · 2024-04-22T21:58:09Z

jenkins

Signed-off-by: Ryan <[email protected]>

anteju

LGTM

nithinraok · 2024-04-23T18:16:25Z

tutorials/tts/Audio_Codec_Training.ipynb

+        "Neural audio codecs are deep learning models that compress audio into a low bitrate representation. The compact embedding space created by these models can be useful for various speech tasks, such as TTS and ASR.\n",
+        "\n",
+        "<div>\n",
+        "<img src=\"https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/nemo_audio_codec.png\" width=\"800\", height=\"400\"/>\n",


I cannot load the image in the browser. Is it the same on your end? The link itself is accessible, though.

I just searched for "https://github.com/NVIDIA/NeMo/releases" in NeMo, and it looks like none of the tutorial images linked from a GitHub release work in the browser, because GitHub only provides a download link and not a viewing link. So the images will only display properly in Jupyter and Colab.

@titu1994 what is the best solution here?

From this discussion and my testing, I don't think there is a way to make the image render in GitHub without pushing the image into the repo.

Beyond the image, other parts of the notebook also do not render properly in the GitHub browser view. Anyone who wants to view it accurately in a browser probably needs to use something like nbviewer.
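
For anyone who wants to try the nbviewer route, here is a small sketch that turns a GitHub notebook link into an nbviewer link. It assumes nbviewer's `/github/<owner>/<repo>/blob/<ref>/<path>` URL scheme, and the helper name `nbviewer_url` is hypothetical; the example path is this PR's notebook on `main`.

```python
from urllib.parse import urlsplit


def nbviewer_url(github_blob_url):
    """Map a github.com '/<owner>/<repo>/blob/<ref>/<path>' URL to nbviewer."""
    parts = urlsplit(github_blob_url)
    if parts.netloc != "github.com" or "/blob/" not in parts.path:
        raise ValueError(f"not a GitHub blob URL: {github_blob_url}")
    # nbviewer reuses the GitHub path unchanged under its /github prefix.
    return "https://nbviewer.org/github" + parts.path


viewer_link = nbviewer_url(
    "https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb"
)
```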

nithinraok · 2024-04-23T18:25:02Z

tutorials/tts/Audio_Codec_Training.ipynb

+        "# The total number of training steps will be (epochs * steps_per_epoch)\n",
+        "epochs = 10\n",
+        "steps_per_epoch = 10\n",
+        "\n",


Have you chosen not to include details on bitrates, the types of codecs available, and how we can adjust the parameters to obtain various bitrates and codec types?

* [TTS] Add tutorial for training audio codecs
  Signed-off-by: Ryan <[email protected]>
* [TTS] Update tutorial
  Signed-off-by: Ryan <[email protected]>
* [TTS] Add diagrams
  Signed-off-by: Ryan <[email protected]>
* [TTS] Add introduction and references
  Signed-off-by: Ryan <[email protected]>
* [TTS] Replace diagram with github release link
  Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

rlangman requested review from XuesongYang, KunalDhawan, nithinraok and anteju March 21, 2024 22:34

github-actions bot added the TTS label Mar 21, 2024

rlangman mentioned this pull request Mar 22, 2024

Dataset metadata for audio_codec #8726

Closed

rlangman force-pushed the codec_tutorial branch from 4a7d0c8 to 79ffef9 Compare March 22, 2024 22:52

nithinraok requested changes Mar 25, 2024

anteju reviewed Apr 9, 2024

rlangman force-pushed the codec_tutorial branch 2 times, most recently from 95c9dab to cb750aa Compare April 15, 2024 21:59

anteju reviewed Apr 16, 2024

rlangman force-pushed the codec_tutorial branch from f6836c5 to 0bd1387 Compare April 17, 2024 16:59

rlangman added 5 commits April 23, 2024 10:12

[TTS] Add tutorial for training audio codecs

cf94068

Signed-off-by: Ryan <[email protected]>

[TTS] Update tutorial

cd3a6be

Signed-off-by: Ryan <[email protected]>

[TTS] Add diagrams

840d8f3

Signed-off-by: Ryan <[email protected]>

[TTS] Add introduction and references

12cb51f

Signed-off-by: Ryan <[email protected]>

[TTS] Replace diagram with github release link

e21cf50

Signed-off-by: Ryan <[email protected]>

rlangman force-pushed the codec_tutorial branch from 0bd1387 to e21cf50 Compare April 23, 2024 17:12

rlangman requested a review from nithinraok April 23, 2024 18:08

anteju self-requested a review April 23, 2024 18:08

anteju approved these changes Apr 23, 2024

nithinraok reviewed Apr 23, 2024

nithinraok approved these changes May 2, 2024

rlangman merged commit f769ad5 into main May 2, 2024
128 checks passed

rlangman deleted the codec_tutorial branch May 2, 2024 19:04


