[TTS] Add tutorial for training audio codecs #8723
"# The total number of training steps will be (epochs * steps_per_epoch)\n", | ||
"epochs = 10\n", | ||
"steps_per_epoch = 10\n", | ||
"\n", |
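For reference, the arithmetic in the comment of the hunk above works out as follows. This is only a minimal sketch; `total_training_steps` is a hypothetical helper name and is not part of the notebook.

```python
def total_training_steps(epochs: int, steps_per_epoch: int) -> int:
    """Return the total number of training steps for a fixed-length epoch schedule."""
    if epochs < 0 or steps_per_epoch < 0:
        raise ValueError("epochs and steps_per_epoch must be non-negative")
    return epochs * steps_per_epoch


# Values taken from the notebook cell quoted in the diff.
epochs = 10
steps_per_epoch = 10
total_steps = total_training_steps(epochs, steps_per_epoch)
print(total_steps)  # 100
```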
I feel this tutorial would be even more helpful with some information on the differences between Audio Codec and MelCodec, plus an explanation of the main parameters: how to switch from RVQ to FSQ and how to control the bitrate.
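To make the bitrate question concrete, here is a minimal sketch, assuming the usual relationship for quantized codecs: bitrate = frames per second × number of codebooks × bits per code. For FSQ, the implicit codebook size is taken to be the product of the per-dimension levels. All function names and the example numbers below are illustrative assumptions; they are not read from the NeMo configs.

```python
import math
from typing import Sequence


def frames_per_second(sample_rate: int, samples_per_frame: int) -> float:
    """Number of encoded frames the codec emits per second of audio."""
    return sample_rate / samples_per_frame


def rvq_bitrate(frame_rate: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second when each frame stores one index per residual codebook."""
    return frame_rate * num_codebooks * math.log2(codebook_size)


def fsq_bitrate(frame_rate: float, num_groups: int, levels: Sequence[int]) -> float:
    """Bits per second for FSQ; each group has prod(levels) possible codes."""
    return frame_rate * num_groups * math.log2(math.prod(levels))


# Hypothetical example: 16 kHz audio with one frame per 320 samples.
frame_rate = frames_per_second(sample_rate=16000, samples_per_frame=320)
rvq_bps = rvq_bitrate(frame_rate, num_codebooks=8, codebook_size=1024)
fsq_bps = fsq_bitrate(frame_rate, num_groups=8, levels=[8, 8, 4, 4])
print(frame_rate, rvq_bps, fsq_bps)  # 50.0 4000.0 4000.0
```

Under this model, lowering the bitrate means using fewer codebooks (or groups), smaller codebooks, or a larger hop between frames. Which YAML keys control these values in the NeMo configs is not covered by this sketch.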
I am thinking that level of detail would be better left for a separate tutorial or primer document. Without better supporting documentation/papers, no one would understand it right now unless they are already very familiar with codec research.
What would the separate tutorial be?
First we are missing a high-level primer for the domain where we can explain with visuals/examples what high level codec concepts are, like bitrate, quantizers, etc. That is, the codec equivalent of https://github.com/NVIDIA/NeMo/blob/main/tutorials/00_NeMo_Primer.ipynb, https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb, https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_with_NeMo.ipynb
There might be some information we should include here; it's just hard to say without having the supporting documentation already in place. In general, I am not sure whether architecture overrides belong in a Jupyter notebook or in some other form of documentation, since we expect users to override the model definition primarily by modifying the .yaml file directly, not by overriding it in the CLI.
Would be happy to hear others' thoughts and opinions.
Having all primers would be nice, but we can include a bit more information in this tutorial.
It doesn't have to be exhaustive, but it should at least describe the config shown here.
I think we should at minimum include a block-scheme of the model and briefly describe the individual components.
For a block-scheme, we can re-use this.
Even one/two sentence descriptions of the blocks would be sufficient (what does it do, where is it implemented).
Have you chosen not to include details on bitrates, the types of codecs available, and how we can adjust the parameters to obtain various bitrates and codec types?
Other than this, everything LGTM!
I think additional information can be added later when we have more than 1 codec available
Thanks @rlangman!
Added a couple comments.
"\n", | ||
"**Note that when training from scratch, the dataset in this tutorial is too small to get good audio quality.**" | ||
] | ||
}, |
At the end, add next steps (in the simplest case can point again to the available configurations, maybe implementations of different quantizers that a user can investigate) and references (point to documentation, model API, pretrained checkpoint, publicly-available papers).
Added an introduction and references.
"* **audio_codec_*.yaml**: Audio codec configurations optimized for various sampling rates.\n", | ||
"* **mel_codec_*.yaml**: A mel-spectrogram based codec designed to maximize the performance of TTS models.\n", | ||
"* **encodec_*.yaml**: A reproduction of the original [EnCodec](https://arxiv.org/abs/2210.13438) model setup.\n", | ||
"\n" |
Mention that this tutorial uses the configuration in audio_codec_16000.yaml, which is intended for a 16 kHz input signal.
Added
I think we should not check-in this file in the repo.
The approach in the past was to place a file in github release, and then link to it in the notebook.
For example, something like this:
"<img src=\"https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/encmaskdecoder_model.png\" alt=\"encmaskdecoder_model\" style=\"width: 800px;\"/>
We already have this block scheme here:
https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/nemo_audio_codec.png
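As a rough illustration of the approach described above, the following sketch builds the `<img>` markup for a notebook cell from a release asset URL instead of a file checked into the repo. The helpers `release_asset_url` and `img_tag` are hypothetical, and the URL pattern is inferred only from the two links in this comment.

```python
from html import escape


def release_asset_url(repo: str, tag: str, filename: str) -> str:
    """Build a download URL for a file attached to a GitHub release."""
    return f"https://github.com/{repo}/releases/download/{tag}/{filename}"


def img_tag(url: str, alt: str, width_px: int) -> str:
    """Return an HTML <img> element suitable for a notebook markdown cell."""
    return (
        f'<img src="{escape(url, quote=True)}" '
        f'alt="{escape(alt, quote=True)}" '
        f'style="width: {width_px}px;"/>'
    )


url = release_asset_url("NVIDIA/NeMo", "v1.22.0", "nemo_audio_codec.png")
tag_html = img_tag(url, alt="nemo_audio_codec", width_px=800)
print(tag_html)
```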
All TTS tutorial diagrams are currently organized in https://github.com/NVIDIA/NeMo/tree/main/tutorials/tts/images (a few of them were previously in the base directory https://github.com/NVIDIA/NeMo/tree/main/tutorials/tts before I moved them to a standalone directory).
Should all of these diagrams be retroactively attached to a future NeMo release?
No need to handle the existing ones.
However, the idea is we should avoid adding new binary files to the repo.
No images should ever be added to the NeMo GitHub repo. Attach them to a GitHub release and link to them via URL.
But they don't seem to load well when viewing the notebook on GitHub?
jenkins |
Signed-off-by: Ryan <[email protected]>
LGTM
"Neural audio codecs are deep learning models that compress audio into a low bitrate representation. The compact embedding space created by these models can be useful for various speech tasks, such as TTS and ASR.\n", | ||
"\n", | ||
"<div>\n", | ||
"<img src=\"https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/nemo_audio_codec.png\" width=\"800\", height=\"400\"/>\n", |
I cannot load the image in the browser; is it the same on your end? The link itself is accessible.
I just searched for "https://github.com/NVIDIA/NeMo/releases" in NeMo, and it looks like none of the tutorial images linked from GitHub releases render in the browser, because GitHub only provides a download link and not a viewing link. So the images will only display properly in Jupyter and Colab.
@titu1994 what is the best solution here?
From this discussion and my testing, I don't think there is a way to make the image render in GitHub without pushing the image into the repo.
Beyond the image, other parts of the notebook also do not render properly in the GitHub browser view. If someone wants to view it accurately in a browser, they probably need to use something like nbviewer.
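For anyone who wants to try the nbviewer route mentioned in this thread, here is a minimal sketch that converts a GitHub notebook link into an nbviewer link. It assumes nbviewer's `/github/<owner>/<repo>/blob/<ref>/<path>` URL scheme; `nbviewer_url` is a hypothetical helper, and the example uses the TTS primer notebook linked earlier in the conversation rather than the notebook added by this PR.

```python
from urllib.parse import urlparse


def nbviewer_url(github_blob_url: str) -> str:
    """Map a github.com notebook 'blob' URL to the corresponding nbviewer URL."""
    parsed = urlparse(github_blob_url)
    if parsed.netloc != "github.com" or "/blob/" not in parsed.path:
        raise ValueError(f"not a GitHub blob URL: {github_blob_url}")
    # nbviewer mirrors the GitHub path under its /github prefix.
    return "https://nbviewer.org/github" + parsed.path


primer = "https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb"
viewer_link = nbviewer_url(primer)
print(viewer_link)
```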
* [TTS] Add tutorial for training audio codecs
  Signed-off-by: Ryan <[email protected]>
* [TTS] Update tutorial
  Signed-off-by: Ryan <[email protected]>
* [TTS] Add diagrams
  Signed-off-by: Ryan <[email protected]>
* [TTS] Add introduction and references
  Signed-off-by: Ryan <[email protected]>
* [TTS] Replace diagram with github release link
  Signed-off-by: Ryan <[email protected]>
---------
Signed-off-by: Ryan <[email protected]>
What does this PR do ?
Add tutorial about training audio codecs.
Collection: [TTS]
Changelog
Jenkins CI
To run Jenkins, a NeMo User with write access must comment
jenkins
on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type: