
[TTS] Fix Global Style Tokens (GSTs) implementation in FastPitch #7777

Closed
anferico wants to merge 3 commits

Conversation

@anferico (Contributor) commented Oct 22, 2023

What does this PR do?

  1. Fix bug in nemo.collections.tts.modules.submodules.ConditionalInput.forward()
  2. Fix implementation of Global Style Tokens in FastPitch

Collection: TTS

Changelog

  • When using ConditionalInput in "concat" mode, concatenate inputs and conditioning along the feature dimension, not the batch dimension as before
    • Despite the misleading docstring in ConditionalInput.forward() (corrected in this PR), inputs and conditioning have shapes B x T x H and B x T x C, respectively (to be even more precise, conditioning has shape B x 1 x C before the call to conditioning.repeat() and B x T x C after). Consequently, concatenating the two along the batch dimension has 2 problems:
      • If H != C, the concatenation will fail
      • If H == C, the concatenation results in a 2B x T x C tensor
        • This causes an error downstream when calling self.concat_proj(), which is defined as:
          self.concat_proj = torch.nn.Linear(hidden_dim + condition_dim, hidden_dim)
        • self.concat_proj is meant to map B x T x (H+C) tensors to B x T x H tensors, so passing it a 2B x T x C tensor (where H == C) results in the following error:
          RuntimeError: mat1 and mat2 shapes cannot be multiplied ((2B*T) x C and (H+C) x H) (the actual error message contains concrete values; here I've substituted variables to make my point clearer)
    • Solution: concatenate along the feature dimension to obtain a B x T x (H+C) tensor (see the sketch after this list)
  • Fix the implementation of GSTs in FastPitch, specifically regarding the selection of the reference audio for a given training example
    • The current NeMo implementation selects the reference audio for a given training example as a random example from the same speaker. Note that this also forces the user to add speaker_id to sup_data_types even when they don't want to use speaker embeddings
    • However, the GSTs paper clearly states that:

      During training, the reference signal is ground-truth audio.

    • While this is enough reason to justify amending the current NeMo implementation, I have expressed my thoughts on what selecting the reference audio at random may lead to in #7420, an issue I opened over a month ago (which went totally unnoticed)
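
To make the shape issue concrete, here is a minimal, self-contained sketch (illustrative only, not the actual ConditionalInput code; the dimensions are arbitrary) showing why concatenating along the batch dimension fails while concatenating along the feature dimension works:

```python
import torch

B, T, H, C = 4, 50, 384, 256  # batch, time, hidden dim, conditioning dim

inputs = torch.randn(B, T, H)                 # B x T x H
conditioning = torch.randn(B, 1, C)           # B x 1 x C
conditioning = conditioning.repeat(1, T, 1)   # B x T x C after repeat

concat_proj = torch.nn.Linear(H + C, H)       # maps B x T x (H+C) -> B x T x H

# Old behaviour: concatenating along the batch dimension (dim=0) fails outright
# when H != C, and when H == C it produces a 2B x T x C tensor that concat_proj
# cannot consume ("mat1 and mat2 shapes cannot be multiplied").
try:
    torch.cat([inputs, conditioning], dim=0)
except RuntimeError as e:
    print("batch-dim concat failed:", e)

# Fixed behaviour: concatenate along the feature dimension (dim=-1) to obtain
# a B x T x (H+C) tensor, which concat_proj maps back to B x T x H.
fused = torch.cat([inputs, conditioning], dim=-1)  # B x T x (H+C)
out = concat_proj(fused)                           # B x T x H
print(out.shape)  # torch.Size([4, 50, 384])
```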

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

@blisc, @okuchaiev

Additional Information

@github-actions bot added the TTS label on Oct 22, 2023
@hsiehjackson (Collaborator)

Thanks for the contributions! Great catch on the "concat" mode bug. I have some questions regarding the selection of the reference audio. If possible, I would prefer that this PR include only the concat fix, unless you can provide more reasons or results showing why we should use ground-truth audio as the reference audio.

Fix the implementation of GSTs in FastPitch, specifically regarding the selection of the reference audio for a given training example

If people plan to use GSTs in FastPitch, then we expect them to use a multi-speaker dataset with a speaker ID for each sample. With speaker IDs, we can choose utterances from the same speaker as the reference audio, to make sure the model can only obtain speaker information and is prevented from copying other speech information from the ground-truth audio.

While this is enough reason to justify amending the current NeMo implementation, I have expressed my thoughts on what selecting the reference audio at random may lead to in #7420 I opened over a month ago (which went totally unnoticed)

In our experiments, if we use both the SpeakerLookup module and GSTs, the model will almost ignore the SpeakerLookup module. The two target different scenarios: if people want to use learned speakers at inference, the SpeakerLookup module is a good choice; if people want to use new speakers at inference, GSTs are a better choice, since the model learns how to extract speaker information from reference audio.

@rlangman (Collaborator)

Thanks for the PR!

The GST implementation here is not the one everyone is familiar with, which is intended to capture speaking-style information. Rather, it is based on a few later, lesser-known papers that try to use the same system to extract speaker embeddings from the reference spectrogram (to get speaker information, not speaking-style information). The purpose of these alternative methods is to train a speaker embedding extractor that makes it easier to fine-tune on new speakers with very little data.

It would be nice to support the original GST, though. Instead of requiring one or the other, I suggest adding a configuration option that lets users decide whether they want the reference to be the ground truth or a random other utterance from the same speaker.

@anferico (Contributor, Author)

@hsiehjackson

Thanks for the contributions! Great catch on the "concat" mode bug. I have some questions regarding the selection of the reference audio. If possible, I would prefer that this PR include only the concat fix, unless you can provide more reasons or results showing why we should use ground-truth audio as the reference audio.

Sure, that sounds fair 👍🏼 I guess I'll close this PR and open two separate ones: one for the concat bug and one for GSTs.

In our experiments, if we use both the SpeakerLookup module and GSTs, the model will almost ignore the SpeakerLookup module. The two target different scenarios: if people want to use learned speakers at inference, the SpeakerLookup module is a good choice; if people want to use new speakers at inference, GSTs are a better choice, since the model learns how to extract speaker information from reference audio.

I found something very similar in my experiments, except it was the style embeddings (and not the speaker embeddings) that were ignored (cf. #7420).


@rlangman

The GST implementation here is not the one everyone is familiar with, which is intended to capture speaking-style information. Rather, it is based on a few later, lesser-known papers that try to use the same system to extract speaker embeddings from the reference spectrogram (to get speaker information, not speaking-style information). The purpose of these alternative methods is to train a speaker embedding extractor that makes it easier to fine-tune on new speakers with very little data.

That's quite interesting; would you mind linking some of those papers?
By the way, I feel this implementation choice wasn't documented properly in NeMo. Given that speaker IDs are required by the GST module anyway for selecting reference audios, users might think it makes sense to use both the speaker lookup and the GST modules. However, as I pointed out in #7420, that can lead the model to learn to ignore style embeddings altogether, which completely defeats the purpose of having a GST module.
If your suggestion, assuming I understood correctly, is to use the GST module alone without the speaker lookup module, I think that should be made very clear to the end user.

It would be nice to support the original GST, though. Instead of requiring one or the other, I suggest adding a configuration option that lets users decide whether they want the reference to be the ground truth or a random other utterance from the same speaker.

That's a very good point; let me do exactly that in the new PR I'll open.
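
Just to make it concrete, here is a rough, purely illustrative sketch of the reference selection logic I have in mind (the names select_reference_audio, reference_audio_type, and samples_by_speaker are hypothetical placeholders, not the actual NeMo API; the real change will live in the dataset code in the new PR):

```python
import random

def select_reference_audio(sample, samples_by_speaker, reference_audio_type="ground-truth"):
    """Pick the reference audio for a single training example.

    reference_audio_type:
      - "ground-truth": use the example's own audio, as in the original GST paper.
      - "same-speaker": use a random other utterance from the same speaker, as in
        the speaker-embedding variants (requires speaker IDs in the dataset).
    """
    if reference_audio_type == "ground-truth":
        return sample["audio"]
    if reference_audio_type == "same-speaker":
        candidates = [
            s for s in samples_by_speaker[sample["speaker_id"]] if s is not sample
        ] or [sample]  # fall back to the sample itself if the speaker has a single utterance
        return random.choice(candidates)["audio"]
    raise ValueError(f"Unknown reference_audio_type: {reference_audio_type}")
```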

@anferico (Contributor, Author)

@hsiehjackson @rlangman I've opened 2 new PRs: #7785 and #7788. Let's continue the discussion there, and please feel free to close this PR.

@rlangman (Collaborator)

That's quite interesting; would you mind linking some of those papers?

The FastPitch setup is: https://arxiv.org/abs/2211.00585

The idea of using the GST system for speaker embeddings is originally from: https://www1.se.cuhk.edu.hk/~hccl/publications/pub/201909_INTERSPEECH_HuiLU.pdf
