TensorFlow implementation of the Generalized End-to-End (GE2E) Loss for speaker verification, proposed by Google in 2017: https://arxiv.org/pdf/1710.10467.pdf
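For reference, the core of the GE2E softmax loss (Eq. 6 of the paper) can be sketched as below. This is a minimal TF 1.x sketch for orientation, not the code in this repo; it assumes a statically shaped `[N, M, D]` tensor of utterance embeddings (N speakers, M utterances each):

```python
import tensorflow as tf

def ge2e_softmax_loss(embeddings):
    """GE2E softmax loss (Eq. 6 in https://arxiv.org/abs/1710.10467).

    embeddings: float32 tensor of shape [N, M, D] with a statically known
    shape: N speakers, M utterances per speaker, D-dim embeddings.
    """
    n, m, _ = embeddings.shape.as_list()
    w = tf.Variable(10.0, name="ge2e_w")  # learnable scale (kept positive)
    b = tf.Variable(-5.0, name="ge2e_b")  # learnable bias

    def cos(x, y):
        x = tf.nn.l2_normalize(x, axis=-1)
        y = tf.nn.l2_normalize(y, axis=-1)
        return tf.reduce_sum(x * y, axis=-1)

    # Per-speaker centroids over all M utterances: [N, D]
    centroids = tf.reduce_mean(embeddings, axis=1)
    # Leave-one-out centroids for the true speaker (paper Eq. 8): [N, M, D]
    centroids_excl = (tf.reduce_sum(embeddings, axis=1, keepdims=True)
                      - embeddings) / (m - 1)

    # Cosine similarity of each utterance to each speaker centroid: [N, M, N]
    sim = cos(embeddings[:, :, None, :], centroids[None, None, :, :])
    # On the "own speaker" column, use the leave-one-out similarity instead.
    own = cos(embeddings, centroids_excl)           # [N, M]
    mask = tf.eye(n)[:, None, :]                    # [N, 1, N]
    sim = sim * (1.0 - mask) + own[:, :, None] * mask

    logits = tf.abs(w) * sim + b                    # scaled similarity matrix
    labels = tf.tile(tf.range(n)[:, None], [1, m])  # true speaker per utterance
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
```

Using the leave-one-out centroid on the true-speaker column is what the paper prescribes to keep training stable.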
Requirements: Python 3.7, tensorflow-gpu==1.14.0
We use our own dataset, drawn from game customer-service calls and sampled at 8 kHz. It covers about 30,000 speakers; we use VAD to segment the recordings into short utterances, each longer than 6 s.
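The segmentation code itself is not shown here; the idea can be sketched with the `webrtcvad` package (an assumption — the repo may use a different VAD), keeping only runs of voiced 8 kHz audio longer than 6 s:

```python
import webrtcvad

SAMPLE_RATE = 8000   # the dataset is sampled at 8 kHz
FRAME_MS = 30        # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def voiced_segments(pcm_bytes, aggressiveness=2, min_seconds=6.0):
    """Split raw PCM audio into runs of consecutive voiced frames,
    keeping only runs longer than `min_seconds` (the repo keeps > 6 s).
    Long runs could additionally be chopped into shorter pieces."""
    vad = webrtcvad.Vad(aggressiveness)
    run, out = [], []
    for i in range(len(pcm_bytes) // FRAME_BYTES):
        frame = pcm_bytes[i * FRAME_BYTES:(i + 1) * FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            run.append(frame)
        else:
            if len(run) * FRAME_MS / 1000.0 >= min_seconds:
                out.append(b"".join(run))
            run = []
    if len(run) * FRAME_MS / 1000.0 >= min_seconds:
        out.append(b"".join(run))
    return out
```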
Preprocess the data with `dataset/generate_meta.py` to produce the meta file that training consumes.
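The flags of `generate_meta.py` are not documented here; a typical invocation might look like the following, where `--audio_dir` and `--output` are hypothetical flag names (check the script for the real ones). Its output is the meta file passed to training via `--train_meta_files`:

```bash
# Flag names below are illustrative only; see dataset/generate_meta.py.
python dataset/generate_meta.py \
  --audio_dir /data/audio_8k \
  --output /tmp/meta.txt
```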
Then launch training:

```bash
python train.py \
  --train_meta_files /tmp/meta.txt \
  --epochs 10000 \
  --max_keep_model 100 \
  --batch_size 100 \
  --utterances_per_speaker 10 \
  --lr_decay_type constant \
  --process_name sf_wj \
  --gpu_devices 0,1,2,3,4 \
  --thread_num 10 \
  --buffer_size 10 \
  --log_dir /tmp/svf/log_wj \
  --checkpoint /tmp/svf/checkpoint_wj
```
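Note that `--batch_size 100` together with `--utterances_per_speaker 10` mirrors the N speakers × M utterances batch layout from the GE2E paper. A rough sketch of how such a batch could be assembled, assuming `batch_size` counts speakers and that the meta file maps each speaker to its utterance paths (both assumptions, not the repo's documented behavior):

```python
import random

def sample_ge2e_batch(meta, n_speakers=100, m_utts=10):
    """Draw one GE2E batch: N speakers, M utterances each.

    `meta` is assumed to map speaker id -> list of utterance paths
    (the actual meta.txt format lives in dataset/generate_meta.py).
    Every sampled speaker must have at least `m_utts` utterances.
    """
    speakers = random.sample(sorted(meta), n_speakers)
    batch = [random.sample(meta[spk], m_utts) for spk in speakers]
    return speakers, batch  # batch[j][i] = utterance i of speaker j
```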
To evaluate a trained model, run these in order:

1. `eval_test/v2/prepare_eval_data/test_data_utterance_vector.py` extracts an embedding vector for each test utterance.
2. `eval_test/v2/eval_eer.py` computes the equal error rate (EER) from those vectors; see the sketch after this list.
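`eval_eer.py` is the authoritative scorer; for orientation, the EER of a set of verification trials can be computed as below (an sklearn-based sketch, not the repo's code; trials are assumed to be scored by cosine similarity between an enrollment centroid and a test embedding):

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(enroll_centroid, test_embedding):
    """Cosine similarity between an enrollment centroid and a test utterance."""
    a = enroll_centroid / np.linalg.norm(enroll_centroid)
    b = test_embedding / np.linalg.norm(test_embedding)
    return float(np.dot(a, b))

def compute_eer(labels, scores):
    """EER from trial labels (1 = same speaker, 0 = different) and scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[i] + fnr[i]) / 2.0
```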
Results, on test sets where every utterance is at least 5 s of audio:

| test set | EER |
|---|---|
| 240 speakers not found in the training data, each with at least 10 clips of ≥ 5 s audio, sampled at 8 kHz | 0.018 |
| AISHELL test data, resampled to 8 kHz | 0.04 |
| AISHELL training data, resampled to 8 kHz | 0.025 |
| MAGICDATA Mandarin Chinese test data, resampled to 8 kHz | 0.04 |
Related GE2E encoder implementation: https://github.com/CorentinJ/Real-Time-Voice-Cloning/tree/master/encoder