PyTorch implementation of the speech embedding network and loss described here:
Also contains code to create embeddings usable as input to the speaker diarization model found at
The model was trained on the TIMIT speech corpus, found here:, or here,
- PyTorch 0.4.1
- Python 3.5+
- numpy 1.15.4
- librosa 0.6.1
The python WebRTC VAD found at is required to create run, but not to train the neural network.
Change the following config.yaml key to a glob pattern matching all .WAV files in your downloaded TIMIT dataset. The TIMIT .WAV files must be converted to the standard format (RIFF) for the script, but not for training the neural network.
unprocessed_data: './TIMIT/*/*/*/*.wav'
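To check what a pattern like the one above would pick up, and which of those files already carry a RIFF header, something along these lines may help. This is a minimal sketch using only the standard library; `is_riff_wav` and `find_wavs` are hypothetical helpers, not part of this repository, and the directory tree is a throwaway stand-in for a real TIMIT download.

```python
import glob
import os
import tempfile


def is_riff_wav(path):
    """Return True if the file starts with a RIFF/WAVE header."""
    with open(path, "rb") as f:
        header = f.read(12)
    return len(header) == 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE"


def find_wavs(pattern):
    """Expand a glob pattern and split the matches by header type."""
    riff, other = [], []
    for path in sorted(glob.glob(pattern)):
        (riff if is_riff_wav(path) else other).append(path)
    return riff, other


# Demo on a temporary tree (hypothetical layout, not real TIMIT data).
root = tempfile.mkdtemp()
speaker_dir = os.path.join(root, "TIMIT", "TRAIN", "DR1", "FCJF0")
os.makedirs(speaker_dir)
with open(os.path.join(speaker_dir, "a.wav"), "wb") as f:
    f.write(b"RIFF" + b"\x00" * 4 + b"WAVE")  # minimal RIFF/WAVE header
with open(os.path.join(speaker_dir, "b.wav"), "wb") as f:
    f.write(b"NIST_1A\n   1024\n")  # stand-in for a non-RIFF header

pattern = os.path.join(root, "TIMIT", "*", "*", "*", "*.wav")
riff_files, other_files = find_wavs(pattern)
print(len(riff_files), "RIFF file(s);", len(other_files), "need conversion")
```

Note that glob matching is case-sensitive on most Linux filesystems, so a `*.wav` pattern will not match files named `*.WAV` there.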
Run the preprocessing script:
Two folders, train_tisv and test_tisv, will be created. They hold .npy files, each a numpy ndarray of speaker utterances, split 90%/10% between training and testing.
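A 90%/10% split of this kind can be sketched as follows. This assumes the split is made over an ordered list of speakers, which is an assumption and not confirmed by the text above; `split_speakers` and the speaker IDs are hypothetical.

```python
def split_speakers(speakers, train_fraction=0.9):
    """Split an ordered list of speakers into train and test parts."""
    n_train = int(len(speakers) * train_fraction)
    return speakers[:n_train], speakers[n_train:]


# Hypothetical speaker IDs, for illustration only.
speakers = ["speaker_%03d" % i for i in range(20)]
train, test = split_speakers(speakers)
print(len(train), "training speakers,", len(test), "testing speakers")
```

Splitting by speaker rather than by utterance keeps the test speakers unseen during training, which is what a verification test needs.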
To train the speaker verification model, run:
with the following config.yaml key set to true:
training: !!bool "true"
For testing, set the key value to:
training: !!bool "false"
The log file and checkpoint save locations are controlled by the following values:
log_file: './speech_id_checkpoint/Stats'
checkpoint_dir: './speech_id_checkpoint'
Only TI-SV (text-independent speaker verification) is implemented.
EER across 10 epochs: 0.0377
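For readers unfamiliar with the metric, the equal error rate (EER) is the error rate at the decision threshold where the false acceptance rate and the false rejection rate are equal. The sketch below is a generic illustration with made-up scores, not the evaluation code of this repository; `equal_error_rate` and the threshold grid are assumptions.

```python
def equal_error_rate(scores, labels, thresholds):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    scores: similarity scores, higher means "same speaker"
    labels: True for genuine pairs, False for impostor pairs
    """
    genuine = [s for s, same in zip(scores, labels) if same]
    impostor = [s for s, same in zip(scores, labels) if not same]
    best = None
    for t in thresholds:
        # False acceptance: impostor pair scored at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection: genuine pair scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Made-up similarity scores, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [True, True, True, True, False, False, False, False]
thresholds = [i / 100 for i in range(101)]
eer, threshold = equal_error_rate(scores, labels, thresholds)
print("EER:", eer, "at threshold", threshold)
```

A lower EER is better; a value of 0.0377 means that, at the balancing threshold, roughly 3.8% of trials are misclassified in each direction.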
After training and testing the model, run to create the numpy files train_sequence.npy, train_cluster_ids.npy, test_sequence.npy, and test_cluster_ids.npy.
These files can be loaded and used to train the uis-rnn model found at