Speaker Verification: All Speakers Getting Perfect 1.000 Similarity Scores #1839

misterpathologist · 2025-02-10T20:26:17Z

Environment

pyannote.audio version: 3.1.1
torch version: 2.5.1+cu124
Platform: [your OS]
CUDA: Yes
GPU: [your GPU model]
Python version: [your version]
torchaudio version: [your version]

Issue Description

Using pyannote/embedding for speaker verification, every speaker gets a perfect or near-perfect similarity score (1.000) when compared to a reference sample. This happens even between obviously different speakers in a professionally produced audiobook (Dracula), where the speakers have distinct voices despite all being British.

Reproduction Steps

Load a 10-minute reference recording of the target speaker (FLAC format)
Load full audiobook (4 hours, FLAC format)
Extract embeddings using pyannote/embedding
Compare embeddings using cosine similarity
Result: ALL speakers match with 1.000 similarity

Current Behavior

Every speaker gets similarity scores of 0.999+ to 1.000
This happens consistently across different speakers
Reference and speaker embeddings both have shape [1, 512]
Even clearly different voices (male/female) get perfect matches
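
One way to narrow this down is to check whether the two embeddings are literally identical or only nearly collinear. The sketch below is a minimal, standard-library-only helper (the name `compare_embeddings` and the toy vectors are made up for illustration, not taken from pyannote). It reports the cosine similarity alongside the largest element-wise difference and the norms of the raw vectors. A cosine of 1.000 with a non-zero difference would suggest the vectors share a large common component rather than being the same output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compare_embeddings(ref, test):
    """Summarize how two raw (un-normalized) embeddings relate to each other."""
    return {
        "cosine": cosine(ref, test),
        "max_abs_diff": max(abs(a - b) for a, b in zip(ref, test)),
        "ref_norm": math.sqrt(sum(a * a for a in ref)),
        "test_norm": math.sqrt(sum(b * b for b in test)),
        "all_finite": all(math.isfinite(x) for x in list(ref) + list(test)),
    }


# Toy vectors (hypothetical, not real embeddings): a large shared offset
# dominates, so the cosine is close to 1 even though the vectors differ.
ref = [100.0, 100.0, 100.0, 100.5]
test = [100.0, 100.3, 99.8, 100.0]
report = compare_embeddings(ref, test)
print(report)
```

With the tensors from the code below, this could be called as `compare_embeddings(reference_features.flatten().tolist(), speaker_embedding.flatten().tolist())`, ideally on the values taken before `F.normalize` is applied.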

Code

python
# Complete minimal example to reproduce the issue
import torch
import torch.nn.functional as F
import torchaudio
from pyannote.audio import Model

# Load reference audio and downmix to mono
reference_waveform, sample_rate = torchaudio.load("reference.flac")
reference_waveform = reference_waveform.mean(dim=0, keepdim=True)

# Setup model
device = torch.device("cuda")
embedding_model = Model.from_pretrained("pyannote/embedding",
                                        use_auth_token='[REDACTED]').to(device)

# Get reference embedding (input shape: batch, channel, samples)
reference_features = embedding_model(reference_waveform.unsqueeze(0).to(device))
reference_features = F.normalize(reference_features, p=2, dim=1)

# Process test audio (the sample rate returned by load() is discarded here)
test_waveform, _ = torchaudio.load("test.flac")
test_waveform = test_waveform.mean(dim=0, keepdim=True)
speaker_embedding = embedding_model(test_waveform.unsqueeze(0).to(device))
speaker_embedding = F.normalize(speaker_embedding, p=2, dim=1)

# Calculate similarity
similarity = F.cosine_similarity(reference_features, speaker_embedding, dim=1).mean()
print(f"Similarity: {similarity.item():.6f}")
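
For comparison, the pyannote/embedding model card shows a different calling pattern: wrapping the model in `Inference` with `window="whole"` and passing file paths instead of raw tensors. The sketch below follows that pattern. The helper names `cosine_distance` and `compare_files` are hypothetical, and it assumes `Inference` returns a NumPy array per file, as the model card describes. It is untested against this audiobook and is not a confirmed fix.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def compare_files(reference_path, test_path, auth_token):
    """Embed two audio files as a whole and return their cosine distance.

    pyannote is imported locally so cosine_distance() stays usable without it.
    """
    from pyannote.audio import Inference, Model

    model = Model.from_pretrained("pyannote/embedding", use_auth_token=auth_token)
    # window="whole" yields one embedding per file instead of a sliding window
    inference = Inference(model, window="whole")
    reference = inference(reference_path)  # assumed: NumPy array
    test = inference(test_path)            # assumed: NumPy array
    return cosine_distance(reference.ravel().tolist(), test.ravel().tolist())
```

A call such as `compare_files("reference.flac", "test.flac", auth_token)` returns a distance where 0 means identical direction and larger values mean less similar, so different speakers would be expected to land clearly above 0.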

Debug Information

Model Configuration

print(embedding_model)
[Output of model architecture]

Tensor Shapes and Values

Reference waveform shape: [1, 31246073]
Reference embedding shape: [1, 512]
Test embedding shape: [1, 512]

Example similarity scores between different speakers:
Speaker A vs Reference: 1.000000
Speaker B vs Reference: 0.999998
Speaker C vs Reference: 1.000000
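
The reference waveform shape also allows a quick sanity check on the sample rate, which the code never inspects after loading. The arithmetic below uses only the numbers reported above. The 16 kHz target is an assumption (to my knowledge the pretrained pyannote models expect 16 kHz mono input, but this should be verified against the model card), and `resample_to_16k` is a hypothetical helper built on `torchaudio.functional.resample`.

```python
REFERENCE_SAMPLES = 31_246_073   # from "Reference waveform shape" above
CLAIMED_DURATION_S = 10 * 60     # "10-minute reference" from the reproduction steps

# Sample rate implied if the clip really is exactly 10 minutes long
implied_rate = REFERENCE_SAMPLES / CLAIMED_DURATION_S


def duration_minutes(num_samples, sample_rate):
    """Length in minutes of a clip with num_samples at sample_rate Hz."""
    return num_samples / sample_rate / 60


def resample_to_16k(waveform, sample_rate, target_rate=16_000):
    """Resample a (channel, samples) tensor; target_rate is an assumption."""
    import torchaudio.functional as AF  # local import: only needed when called

    if sample_rate == target_rate:
        return waveform
    return AF.resample(waveform, orig_freq=sample_rate, new_freq=target_rate)


print(f"Implied sample rate for a 10-minute clip: {implied_rate:.0f} Hz")
for rate in (16_000, 44_100, 48_000):
    minutes = duration_minutes(REFERENCE_SAMPLES, rate)
    print(f"At {rate} Hz the clip would last {minutes:.1f} minutes")
```

Either way, the numbers suggest the file is not a 10-minute clip at 16 kHz, so printing `sample_rate` right after `torchaudio.load` and resampling if needed seems worth trying before drawing conclusions about the model.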

Questions

Is this expected behavior with the current version?
Could the version mismatch warnings be causing this?
Are there recommended settings to get realistic similarity scores?
Should we be using a different approach for speaker verification?

Additional Notes

Using professional audiobook with high-quality audio
Multiple speakers are clearly different to human ears
Tried with different audio segments and speakers
Consistent 1.000 similarity across all tests
