[TTS] Fix Global Style Tokens (GSTs) implementation in FastPitch #7777
Conversation
Signed-off-by: anferico <[email protected]>
Thanks for the contributions! Great catch on the "concat" mode bug. I have some questions regarding the selection of the reference audio. If possible, I would prefer this PR to include only the fix for the concat bug, unless you can provide more reasons or results showing why we should use ground-truth audios as reference audios.
If people plan to use GSTs in FastPitch, then we expect them to use a multi-speaker dataset with a speaker ID for each sample. With speaker IDs, we can choose utterances from the same speaker as the reference audio, which ensures the model can only obtain speaker information and prevents it from copying other speech information from the ground-truth audio.
In our experiments, if we use both the SpeakerLookup module and GSTs, the model will almost entirely ignore the SpeakerLookup module. The two serve different scenarios: if people want to use learned speakers at inference, the SpeakerLookup module is a good choice; if people want to use new speakers at inference, GSTs are a better choice, since the model learns how to extract speaker information from reference audios.
Thanks for the PR! The GST implementation here is not the one everyone is familiar with, which is intended to capture speaking-style information. Rather, it is based on a few later, lesser-known papers that try to use the same system to extract speaker embeddings from the reference spectrogram (to get speaker information, not speaking-style information). The purpose of these alternate methods is to train a speaker-embedding extractor that makes it easier to fine-tune on new speakers with very little data. It would be nice to support the original GST, though. Instead of requiring one or the other, I suggest adding a configuration option which lets the user decide whether the reference should be the ground truth or a random other utterance from the same speaker.
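For illustration, here is a minimal sketch of the kind of configuration switch being suggested. The function and field names (`select_reference_audio`, `dataset_by_speaker`, `mode`) are hypothetical and not part of the NeMo API:

```python
import random

def select_reference_audio(sample, dataset_by_speaker, mode="ground_truth"):
    """Pick the reference audio fed to the GST reference encoder.

    mode="ground_truth": use the target utterance itself (original GST,
    captures the speaking style of the current sample).
    mode="same_speaker": use a random other utterance from the same speaker,
    so the reference encoder can only extract speaker information.
    """
    if mode == "ground_truth":
        return sample["audio"]
    candidates = [
        s for s in dataset_by_speaker[sample["speaker_id"]] if s["id"] != sample["id"]
    ]
    # Fall back to the ground-truth audio if the speaker has only one utterance.
    return random.choice(candidates)["audio"] if candidates else sample["audio"]
```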
Sure, that sounds fair 👍🏼 I guess I'll close this PR and open two separate ones, one for the concat bug and one for GSTs.
I found something very similar in my experiments, except it was the style embeddings (and not the speaker embeddings) that were ignored (cf. #7420).
That's quite interesting, would you mind linking some of those papers?
That's a very good point; let me do exactly that in the new PR I'll open.
@hsiehjackson @rlangman I've opened 2 new PRs: #7785 and #7788. Let's continue the discussion there, and please feel free to close this PR.
The FastPitch setup is from: https://arxiv.org/abs/2211.00585
The idea of using the GST system for speaker embeddings is originally from: https://www1.se.cuhk.edu.hk/~hccl/publications/pub/201909_INTERSPEECH_HuiLU.pdf
What does this PR do ?
Fixes a bug in `nemo.collections.tts.modules.submodules.ConditionalInput.forward()` and the way reference audios are selected for Global Style Tokens (GSTs) in FastPitch.
Collection: TTS
Changelog
- Fix `ConditionalInput`: in `"concat"` mode, concatenate `inputs` and `conditioning` along the feature dimension, not the batch dimension as before.
  - In `ConditionalInput.forward()` (corrected in this PR), `inputs` and `conditioning` have shapes `B x T x H` and `B x T x C`, respectively (to be even more precise, `conditioning` has shape `B x 1 x C` before the call to `conditioning.repeat()` and `B x T x C` after). Consequently, concatenating the two along the batch dimension has 2 problems:
    - If `H != C`, the concatenation fails outright.
    - If `H == C`, the concatenation results in a `2B x T x C` tensor, which is then passed to `self.concat_proj()`, defined as `self.concat_proj = torch.nn.Linear(hidden_dim + condition_dim, hidden_dim)`. Since `self.concat_proj` is meant to map `B x T x (H+C)` tensors to `B x T x H` tensors, passing a `2B x T x (H=C)` tensor to it results in the following error: `RuntimeError: mat1 and mat2 shapes cannot be multiplied ((2B*T)x(H=C) and (H+C)xH)` (the actual error message contains constant values, but here I have replaced them with variables to make my point clearer).
  - Concatenating along the feature dimension instead yields the expected `B x T x (H+C)` tensor (see the sketch after this list).
- Avoid requiring users to add `speaker_id` to `sup_data_types` even when they don't want to use speaker embeddings.
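To make the shape problem above concrete, here is a minimal, self-contained sketch in plain PyTorch (not the actual NeMo module; the dimension values are arbitrary):

```python
import torch

B, T, H, C = 4, 50, 384, 192  # batch, time, hidden dim, conditioning dim

inputs = torch.randn(B, T, H)                # B x T x H
conditioning = torch.randn(B, 1, C)          # B x 1 x C
conditioning = conditioning.repeat(1, T, 1)  # B x T x C after the repeat()

concat_proj = torch.nn.Linear(H + C, H)      # mirrors self.concat_proj

# Old behavior: torch.cat([inputs, conditioning], dim=0) fails here because
# H != C; even when H == C it yields a 2B x T x C tensor that concat_proj
# cannot consume, since it expects a trailing dimension of H + C.

# Fixed behavior: concatenate along the feature dimension.
fused = concat_proj(torch.cat([inputs, conditioning], dim=2))  # B x T x H
print(fused.shape)  # torch.Size([4, 50, 384])
```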
Before your PR is "Ready for review"
Pre checks:
PR Type:
Who can review?
@blisc, @okuchaiev
Additional Information