[TTS] Add tutorial for training audio codecs #8723

Merged
merged 5 commits into from
May 2, 2024

Conversation

rlangman
Collaborator

What does this PR do ?

Add tutorial about training audio codecs.

Collection: [TTS]

Changelog

  • Add tutorial showing how to train or fine-tune audio codec models.
  • Add config file for the 22.05 kHz mel codec, for documentation and easy fine-tuning.
  • Make logging config in different config files consistent, disabling all logging by default except for audio samples. Tutorial shows how to update values to enable different types of logging.

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

tutorials/tts/Audio_Codec_Training.ipynb
"# The total number of training steps will be (epochs * steps_per_epoch)\n",
"epochs = 10\n",
"steps_per_epoch = 10\n",
"\n",
Collaborator

I feel this tutorial would be even more helpful with some info on the differences between Audio Codec and MelCodec, plus an explanation of the main parameters: how to switch from RVQ to FSQ and how to control the bitrate.
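
For readers following this thread, here is a minimal sketch of how bitrate relates to quantizer settings in a generic neural codec. It assumes the usual relationship (frames per second times bits per frame) and is not taken from the tutorial or the NeMo configs. The function and parameter names are hypothetical, and the example numbers follow the 24 kHz setup described in the EnCodec paper rather than any NeMo model.

```python
import math


def frame_rate(sample_rate: int, samples_per_frame: int) -> float:
    """Frames per second for an encoder that emits one frame per samples_per_frame samples."""
    return sample_rate / samples_per_frame


def rvq_bitrate(sample_rate: int, samples_per_frame: int,
                num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for residual vector quantization (RVQ).

    Each frame stores one index per codebook, costing log2(codebook_size) bits.
    """
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate(sample_rate, samples_per_frame) * bits_per_frame


def fsq_bitrate(sample_rate: int, samples_per_frame: int,
                num_groups: int, levels: list[int]) -> float:
    """Bits per second for finite scalar quantization (FSQ).

    Each group has prod(levels) possible codes, i.e. sum(log2(level)) bits.
    """
    bits_per_group = sum(math.log2(level) for level in levels)
    return frame_rate(sample_rate, samples_per_frame) * num_groups * bits_per_group


# Hypothetical example values (assumptions, not NeMo defaults):
# 24 kHz audio, 320 samples per frame (75 frames/s), 8 codebooks of 1024 entries.
print(rvq_bitrate(24000, 320, 8, 1024))  # 6000.0 bits/s, i.e. 6 kbps

# FSQ with 8 groups of levels [8, 5, 5, 5] (1000 codes per group).
print(fsq_bitrate(24000, 320, 8, [8, 5, 5, 5]))
```

Under these assumptions, bitrate can be lowered by reducing the number of codebooks or groups, shrinking the codebook size or levels, or increasing the number of samples per frame.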

Collaborator Author

I am thinking that level of detail would be better left for a separate tutorial or primer document. Without better supporting documentation/papers, no one would understand it right now unless they are already very familiar with codec research.

Collaborator

What would the separate tutorial be?

Collaborator Author

First we are missing a high-level primer for the domain where we can explain with visuals/examples what high level codec concepts are, like bitrate, quantizers, etc. That is, the codec equivalent of https://github.com/NVIDIA/NeMo/blob/main/tutorials/00_NeMo_Primer.ipynb, https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb, https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_with_NeMo.ipynb

There might be some information we should include here; it's just hard to say without having the supporting documentation already in place. In general, I am not sure whether architecture overrides belong in a Jupyter notebook or in some other form of documentation, since we expect users to override the model definition primarily by modifying the .yaml file directly, not by overriding it in the CLI.

Would be happy to hear others' thoughts and opinions.
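
To make the trade-off in this comment concrete, here is a plain-Python stand-in for what a dotted command-line override does to a nested config. It is only a sketch: it is not Hydra's or OmegaConf's implementation, and the config keys (`model.quantizer.num_codebooks`, `trainer.max_epochs`) are hypothetical placeholders, not names from the NeMo codec configs.

```python
from copy import deepcopy


def apply_override(cfg: dict, override: str) -> dict:
    """Return a copy of cfg with one 'dotted.key=value' override applied."""
    key_path, raw_value = override.split("=", 1)
    keys = key_path.split(".")

    new_cfg = deepcopy(cfg)  # leave the caller's config untouched
    node = new_cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # walk down, creating levels if missing

    old_value = node.get(keys[-1])
    # Keep the numeric type of an existing entry; otherwise store the raw string.
    if isinstance(old_value, (int, float)) and not isinstance(old_value, bool):
        node[keys[-1]] = type(old_value)(raw_value)
    else:
        node[keys[-1]] = raw_value
    return new_cfg


# Hypothetical config standing in for a .yaml file loaded into a dict.
base_cfg = {
    "trainer": {"max_epochs": 10},
    "model": {"quantizer": {"num_codebooks": 8}},
}

# One override is easy to read on a command line...
cfg = apply_override(base_cfg, "model.quantizer.num_codebooks=4")
print(cfg["model"]["quantizer"]["num_codebooks"])  # 4

# ...but an architecture change usually touches many keys at once,
# which is why editing the .yaml file directly tends to be clearer.
```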

Collaborator

Having all primers would be nice, but we can include a bit more information in this tutorial.
It doesn't have to be exhaustive, but it should at least describe the config shown here.

I think we should at minimum include a block-scheme of the model and briefly describe the individual components.
For a block-scheme, we can re-use this.
Even one/two sentence descriptions of the blocks would be sufficient (what does it do, where is it implemented).

Collaborator

Have you chosen not to include details on bitrates, the types of codecs available, and how we can adjust the parameters to obtain various bitrates and codec types?

Collaborator

Other than this, everything LGTM!

Collaborator Author

I think additional information can be added later when we have more than 1 codec available

Collaborator

@anteju anteju left a comment

Thanks @rlangman!
Added a couple comments.

"\n",
"**Note that when training from scratch, the dataset in this tutorial is too small to get good audio quality.**"
]
},
Collaborator

At the end, add next steps (in the simplest case can point again to the available configurations, maybe implementations of different quantizers that a user can investigate) and references (point to documentation, model API, pretrained checkpoint, publicly-available papers).

Collaborator Author

Added an introduction and references.

"* **audio_codec_*.yaml**: Audio codec configurations optimized for various sampling rates.\n",
"* **mel_codec_*.yaml**: A mel-spectrogram based codec designed to maximize the performance of TTS models.\n",
"* **encodec_*.yaml**: A reproduction of the original [EnCodec](https://arxiv.org/abs/2210.13438) model setup.\n",
"\n"
Collaborator

Mention that for this tutorial we will use the configuration in audio_codec_16000.yaml, which targets a 16 kHz input signal.

Collaborator Author

Added

@rlangman rlangman force-pushed the codec_tutorial branch 2 times, most recently from 95c9dab to cb750aa Compare April 15, 2024 21:59
Collaborator

I think we should not check this file into the repo.
The approach in the past was to place the file in a GitHub release, and then link to it in the notebook.
For example, something like this:

"<img src=\"https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/encmaskdecoder_model.png\" alt=\"encmaskdecoder_model\" style=\"width: 800px;\"/>

We already have this block scheme here:

https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/nemo_audio_codec.png

Collaborator Author

All TTS tutorial diagrams are currently organized in https://github.com/NVIDIA/NeMo/tree/main/tutorials/tts/images (a few of them were previously in the base directory https://github.com/NVIDIA/NeMo/tree/main/tutorials/tts before I moved them to a standalone directory).

Should all of these diagrams be retroactively attached to a future NeMo release?

Collaborator

No need to handle the existing ones.
However, the idea is we should avoid adding new binary files to the repo.

Collaborator

No images should ever be added to the NeMo GitHub repo. Upload them to a GitHub release only, and link to them via URL.

Collaborator

But they don't seem to load well when viewing the notebook on GitHub?

@rlangman
Collaborator Author

jenkins

@rlangman rlangman requested a review from nithinraok April 23, 2024 18:08
@anteju anteju self-requested a review April 23, 2024 18:08
Collaborator

@anteju anteju left a comment

LGTM

"Neural audio codecs are deep learning models that compress audio into a low bitrate representation. The compact embedding space created by these models can be useful for various speech tasks, such as TTS and ASR.\n",
"\n",
"<div>\n",
"<img src=\"https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/nemo_audio_codec.png\" width=\"800\", height=\"400\"/>\n",
Collaborator

I cannot load the image in the browser; is it the same on your end? The link itself is accessible, though.

Collaborator Author

I just searched for "https://github.com/NVIDIA/NeMo/releases" in NeMo, and it looks like none of the tutorial images linked from GitHub releases work in the browser, because GitHub only provides a download link and not a viewing link. So the images will only display properly in Jupyter and Colab.

Collaborator

@titu1994 what is the best solution here?

Collaborator Author

From this discussion and my testing, I don't think there is a way to make the image render on GitHub without pushing the image into the repo.

Beyond the image, other parts of the notebook also do not render properly in the GitHub browser view. If someone wants to view it accurately in a browser, they probably need to use something like nbviewer.

@rlangman rlangman merged commit f769ad5 into main May 2, 2024
128 checks passed
@rlangman rlangman deleted the codec_tutorial branch May 2, 2024 19:04
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* [TTS] Add tutorial for training audio codecs

Signed-off-by: Ryan <[email protected]>

* [TTS] Update tutorial

Signed-off-by: Ryan <[email protected]>

* [TTS] Add diagrams

Signed-off-by: Ryan <[email protected]>

* [TTS] Add introduction and references

Signed-off-by: Ryan <[email protected]>

* [TTS] Replace diagram with github release link

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>
4 participants