Speaker Verification: All Speakers Getting Perfect 1.000 Similarity Scores #1839
misterpathologist asked this question in Q&A
Environment
Issue Description
When I use pyannote/embedding for speaker verification, every speaker gets a perfect or near-perfect similarity score (1.000) against the reference sample. This happens even for clearly different speakers in a professionally produced audiobook (Dracula): all of the speakers have British accents, but their voices are distinct.
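For context, embedding-based verification usually comes down to comparing the cosine similarity of two vectors against a decision threshold. The minimal NumPy sketch below illustrates why a score of 1.000 for every speaker is a problem: any threshold below 1 would accept every speaker as the reference. The function names and the 0.5 threshold are hypothetical placeholders. They are not part of the pyannote API, and the threshold is not a tuned value.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(ref, test, threshold: float = 0.5) -> bool:
    """Accept `test` as the reference speaker if similarity >= threshold.

    The default threshold is a hypothetical placeholder, not a tuned value.
    """
    return cosine_similarity(ref, test) >= threshold

ref = np.array([1.0, 0.0, 0.0])
print(cosine_similarity(ref, np.array([2.0, 0.0, 0.0])))  # 1.0 (same direction)
print(cosine_similarity(ref, np.array([0.0, 1.0, 0.0])))  # 0.0 (orthogonal)
print(same_speaker(ref, np.array([0.0, 1.0, 0.0])))       # False
```

Cosine similarity ignores vector length. A score of 1.000 therefore means that the two embeddings point in the same direction, not merely that they are "close".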
Reproduction Steps
Current Behavior
Code
```python
# Complete minimal example to reproduce the issue
import torch
import torch.nn.functional as F
import torchaudio
from pyannote.audio import Model

# Load reference audio and downmix to mono: shape (1, num_samples)
reference_waveform, sample_rate = torchaudio.load("reference.flac")
reference_waveform = reference_waveform.mean(dim=0, keepdim=True)
# NOTE: sample_rate is never used below, so the audio is fed to the model at its native rate

# Set up the model (inference only)
device = torch.device("cuda")
embedding_model = Model.from_pretrained(
    "pyannote/embedding", use_auth_token="[REDACTED]"
).to(device)
embedding_model.eval()

with torch.no_grad():
    # Get reference embedding; input shape is (batch=1, channel=1, num_samples)
    reference_features = embedding_model(reference_waveform.unsqueeze(0).to(device))
    reference_features = F.normalize(reference_features, p=2, dim=1)

    # Process test audio (torchaudio.load returns (waveform, sample_rate))
    test_waveform, _ = torchaudio.load("test.flac")
    test_waveform = test_waveform.mean(dim=0, keepdim=True)
    speaker_embedding = embedding_model(test_waveform.unsqueeze(0).to(device))
    speaker_embedding = F.normalize(speaker_embedding, p=2, dim=1)

# Calculate cosine similarity (both embeddings are already L2-normalized)
similarity = F.cosine_similarity(reference_features, speaker_embedding, dim=1).mean()
print(f"Similarity: {similarity.item():.6f}")
```
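Both clips are passed to the model as single tensors. For the reference, that tensor is the 31,246,073-sample waveform reported under Debug Information, so one embedding has to summarize a long stretch of audio that may contain several voices. One way to probe whether this matters is to split each clip into short fixed-length windows and embed each window separately. The minimal NumPy sketch below shows only the windowing step. The `split_into_windows` helper is hypothetical, and the 3-second window is an arbitrary choice. Each row could then be converted with `torch.from_numpy(...)`, reshaped to `(1, 1, window_samples)`, and passed to `embedding_model` in the same way as the code above.

```python
import numpy as np

def split_into_windows(waveform: np.ndarray, sample_rate: int,
                       window_s: float = 3.0, hop_s: float = 3.0) -> np.ndarray:
    """Split a mono waveform of shape (num_samples,) into fixed-length windows.

    Returns an array of shape (num_windows, window_samples). A trailing
    remainder shorter than one window is dropped.
    """
    if waveform.ndim != 1:
        raise ValueError("expected a 1-D mono waveform")
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if len(waveform) < window:
        return np.empty((0, window), dtype=waveform.dtype)
    # Start offsets of every full window, spaced `hop` samples apart
    starts = range(0, len(waveform) - window + 1, hop)
    return np.stack([waveform[s:s + window] for s in starts])

# Example: 10 s of silence at 16 kHz -> three non-overlapping 3 s windows
windows = split_into_windows(np.zeros(160_000, dtype=np.float32), 16_000)
print(windows.shape)  # (3, 48000)
```

pyannote.audio's `Inference` class also has a sliding-window mode (`window="sliding"`) intended for this kind of use. It may be preferable to hand-rolled windowing, though I have not checked how it behaves with this particular model.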
Debug Information
Model Configuration
```python
print(embedding_model)
```
[Output of model architecture]
Tensor Shapes and Values
Reference waveform shape: [1, 31246073]
Reference embedding shape: [1, 512]
Test embedding shape: [1, 512]
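The reference shape shows how much audio the single reference embedding summarizes. The sample rate is not stated above, so the plain-Python sketch below evaluates a few common rates (`duration_seconds` is a hypothetical helper). At 16 kHz, 31,246,073 samples is about 32.5 minutes. At 44.1 kHz it is about 11.8 minutes.

```python
def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a waveform in seconds."""
    return num_samples / sample_rate

reference_samples = 31_246_073  # from "Reference waveform shape: [1, 31246073]"
for rate in (16_000, 22_050, 44_100, 48_000):
    seconds = duration_seconds(reference_samples, rate)
    print(f"{rate:>6} Hz: {seconds:8.1f} s ({seconds / 60:.1f} min)")
```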
Example similarity scores between different speakers:
Speaker A vs Reference: 1.000000
Speaker B vs Reference: 0.999998
Speaker C vs Reference: 1.000000
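One possible explanation worth ruling out is that all of the embeddings share a large common component, which would push every cosine score toward 1. This is an assumption; the thread does not establish it. The minimal NumPy sketch below illustrates a diagnostic for it: compare the raw cosine scores with the scores obtained after mean-centering, that is, after subtracting the average embedding of a set of segments (ideally many segments, not just a handful). The helper names and the synthetic data are hypothetical. Real embeddings could be taken from the model output with `.cpu().numpy()`.

```python
import numpy as np

def cosine_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities for row vectors of shape (n, dim)."""
    x = np.asarray(x, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def off_diagonal(sim: np.ndarray) -> np.ndarray:
    """Similarities between distinct rows (diagonal removed)."""
    return sim[~np.eye(sim.shape[0], dtype=bool)]

def mean_centered(x: np.ndarray) -> np.ndarray:
    """Subtract the average embedding of the set from every row."""
    x = np.asarray(x, dtype=np.float64)
    return x - x.mean(axis=0, keepdims=True)

# Synthetic data: a large shared offset plus a smaller per-row random part
rng = np.random.default_rng(0)
shared = np.full(512, 10.0)
embeddings = shared + rng.normal(size=(3, 512))

raw = off_diagonal(cosine_matrix(embeddings))
centered = off_diagonal(cosine_matrix(mean_centered(embeddings)))
print(f"raw:      min={raw.min():.3f} max={raw.max():.3f}")
print(f"centered: min={centered.min():.3f} max={centered.max():.3f}")
```

If the real scores stay at about 1.000 even after centering over many segments, the embeddings themselves are nearly identical. In that case the problem more likely lies in how the audio is fed to the model than in the similarity computation.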
Questions
Additional Notes