
[TTS] Fix Global Style Tokens (GSTs) implementation in FastPitch #7777

Closed
anferico wants to merge 3 commits

Conversation

@anferico (Contributor) commented Oct 22, 2023

What does this PR do?

  1. Fix bug in nemo.collections.tts.modules.submodules.ConditionalInput.forward()
  2. Fix implementation of Global Style Tokens in FastPitch

Collection: TTS

Changelog

  • When using ConditionalInput in "concat" mode, concatenate inputs and conditioning along the feature dimension, not the batch dimension as before
    • Despite the misleading docstring in ConditionalInput.forward() (corrected in this PR), inputs and conditioning have shapes B x T x H and B x T x C, respectively (to be even more precise, conditioning has shape B x 1 x C before the call to conditioning.repeat() and B x T x C after). Consequently, concatenating the two along the batch dimension has 2 problems:
      • If H != C, the concatenation will fail
      • If H == C, the concatenation results in a 2B x T x C tensor
        • This causes an error downstream when calling self.concat_proj(), which is defined as:
          self.concat_proj = torch.nn.Linear(hidden_dim + condition_dim, hidden_dim)
        • self.concat_proj is meant to map B x T x (H+C) tensors to B x T x H tensors, so passing it a 2B x T x C tensor (where H == C) results in the following error:
          RuntimeError: mat1 and mat2 shapes cannot be multiplied ((2B*T) x C and (H+C) x H) (the actual error message contains concrete values; here I've substituted variables to make my point clearer)
    • Solution: concatenate along the feature dimension to obtain a B x T x (H+C) tensor (see the sketch after this list)
  • Fix the implementation of GSTs in FastPitch, specifically regarding the selection of the reference audio for a given training example
    • The current NeMo implementation selects the reference audio for a given training example as a random example from the same speaker. Note that this also forces the user to add speaker_id to sup_data_types even when they don't want to use speaker embeddings
    • However, the GSTs paper clearly states that:

      During training, the reference signal is ground-truth audio.

    • While this is enough reason to justify amending the current NeMo implementation, I have expressed my thoughts on what selecting the reference audio at random may lead to in #7420, an issue I opened over a month ago (which went totally unnoticed)
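
To make the shape issue concrete, here is a minimal, self-contained sketch (illustrative only, not the actual ConditionalInput code; the dimensions are arbitrary) showing why concatenating along the batch dimension fails while concatenating along the feature dimension works:

```python
import torch

B, T, H, C = 4, 50, 384, 256  # batch, time, hidden dim, conditioning dim

inputs = torch.randn(B, T, H)                 # B x T x H
conditioning = torch.randn(B, 1, C)           # B x 1 x C
conditioning = conditioning.repeat(1, T, 1)   # B x T x C after repeat

concat_proj = torch.nn.Linear(H + C, H)       # maps B x T x (H+C) -> B x T x H

# Old behaviour: concatenating along the batch dimension (dim=0) fails outright
# when H != C, and when H == C it produces a 2B x T x C tensor that concat_proj
# cannot consume ("mat1 and mat2 shapes cannot be multiplied").
try:
    torch.cat([inputs, conditioning], dim=0)
except RuntimeError as e:
    print("batch-dim concat failed:", e)

# Fixed behaviour: concatenate along the feature dimension (dim=-1) to obtain
# a B x T x (H+C) tensor, which concat_proj maps back to B x T x H.
fused = torch.cat([inputs, conditioning], dim=-1)  # B x T x (H+C)
out = concat_proj(fused)                           # B x T x H
print(out.shape)  # torch.Size([4, 50, 384])
```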

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

@blisc, @okuchaiev

Additional Information

@github-actions bot added the TTS label on Oct 22, 2023
@hsiehjackson (Collaborator)

Thanks for the contributions! Great catch on the "concat" mode bug. I have some questions regarding the selection of the reference audio. If possible, I would prefer that this PR include only the concat fix, unless you can provide more reasons or results showing why we should use ground-truth audio as the reference audio.

Fix the implementation of GSTs in FastPitch, specifically regarding the selection of the reference audio for a given training example

If people plan to use GSTs in FastPitch, then we expect them to use a multi-speaker dataset with a speaker ID for each sample. With speaker IDs, we can choose utterances from the same speaker as the reference audio, to make sure the model can only obtain speaker information and is prevented from copying other speech information from the ground-truth audio.

While this is enough reason to justify amending the current NeMo implementation, I have expressed my thoughts on what selecting the reference audio at random may lead to in #7420 I opened over a month ago (which went totally unnoticed)

In our experiments, if we use both the SpeakerLookup module and GSTs, the model will almost ignore the SpeakerLookup module. The two target different scenarios: if people want to use learned speakers at inference, the SpeakerLookup module is a good choice; if people want to use new speakers at inference, GSTs are a better choice, since the model learns how to extract speaker information from reference audio.

@rlangman (Collaborator)

Thanks for the PR!

The GST implementation here is not the one everyone is familiar with, which is intended to capture speaking-style information. Rather, it is based on a few later, lesser-known papers that try to use the same system to extract speaker embeddings from the reference spectrogram (to get speaker information, not speaking-style information). The purpose of these alternative methods is to train a speaker embedding extractor that makes it easier to fine-tune on new speakers with very little data.

It would be nice to support the original GST, though. Instead of requiring one or the other, I suggest adding a configuration option that lets users decide whether they want the reference to be the ground truth or a random other utterance from the same speaker.

@anferico (Contributor, Author)

@hsiehjackson

Thanks for the contributions! Great catch on the "concat" mode bug. I have some questions regarding the selection of the reference audio. If possible, I would prefer that this PR include only the concat fix, unless you can provide more reasons or results showing why we should use ground-truth audio as the reference audio.

Sure, that sounds fair 👍🏼 I guess I'll close this PR and open two separate ones: one for the concat bug and one for GSTs.

In our experiments, if we use both the SpeakerLookup module and GSTs, the model will almost ignore the SpeakerLookup module. The two target different scenarios: if people want to use learned speakers at inference, the SpeakerLookup module is a good choice; if people want to use new speakers at inference, GSTs are a better choice, since the model learns how to extract speaker information from reference audio.

I found something very similar in my experiments, except it was the style embeddings (and not the speaker embeddings) that were ignored (cf. #7420).


@rlangman

The GST implementation here is not the one everyone is familiar with, which is intended to capture speaking-style information. Rather, it is based on a few later, lesser-known papers that try to use the same system to extract speaker embeddings from the reference spectrogram (to get speaker information, not speaking-style information). The purpose of these alternative methods is to train a speaker embedding extractor that makes it easier to fine-tune on new speakers with very little data.

That's quite interesting; would you mind linking some of those papers?
By the way, I feel this implementation choice wasn't documented properly in NeMo. Given that speaker IDs are required by the GST module anyway for selecting reference audios, users might think it makes sense to use both the speaker lookup and the GST modules. However, as I pointed out in #7420, that can lead the model to learn to ignore style embeddings altogether, which completely defeats the purpose of having a GST module.
If your suggestion, assuming I understood correctly, is to use the GST module alone without the speaker lookup module, I think that should be made very clear to the end user.

It would be nice to support the original GST, though. Instead of requiring one or the other, I suggest adding a configuration option that lets users decide whether they want the reference to be the ground truth or a random other utterance from the same speaker.

That's a very good point; let me do exactly that in the new PR I'll open.
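
Just to make it concrete, here is a rough, purely illustrative sketch of the reference selection logic I have in mind (the names select_reference_audio, reference_audio_type, and samples_by_speaker are hypothetical placeholders, not the actual NeMo API; the real change will live in the dataset code in the new PR):

```python
import random

def select_reference_audio(sample, samples_by_speaker, reference_audio_type="ground-truth"):
    """Pick the reference audio for a single training example.

    reference_audio_type:
      - "ground-truth": use the example's own audio, as in the original GST paper.
      - "same-speaker": use a random other utterance from the same speaker, as in
        the speaker-embedding variants (requires speaker IDs in the dataset).
    """
    if reference_audio_type == "ground-truth":
        return sample["audio"]
    if reference_audio_type == "same-speaker":
        candidates = [
            s for s in samples_by_speaker[sample["speaker_id"]] if s is not sample
        ] or [sample]  # fall back to the sample itself if the speaker has a single utterance
        return random.choice(candidates)["audio"]
    raise ValueError(f"Unknown reference_audio_type: {reference_audio_type}")
```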

@anferico (Contributor, Author)

@hsiehjackson @rlangman I've opened 2 new PRs: #7785 and #7788. Let's continue the discussion there, and please feel free to close this PR.

@rlangman (Collaborator)

That's quite interesting; would you mind linking some of those papers?

The FastPitch setup is: https://arxiv.org/abs/2211.00585

The idea of using the GST system for speaker embeddings is originally from: https://www1.se.cuhk.edu.hk/~hccl/publications/pub/201909_INTERSPEECH_HuiLU.pdf
