awesome-audio-datasets

🎵 A curated collection of open-source audio datasets for Speech and Music.

This repository gives researchers and developers the key facts about publicly available audio datasets: size, language coverage, collection method, and intended use. Each entry includes a short description and summary statistics to help you choose the right dataset for your needs.

Quick Reference Table

Dataset	Lang.	Data Source	Hours	Primary Use Cases
Common Voice	Multilingual	In-the-wild	Varies	ASR
LJSpeech	En	Recording	24	TTS
LibriSpeech	En	Recording	1,000	ASR
LibriTTS	En	Recording	584	TTS
Libri-Light	En	Recording	60,000+	ASR
GigaSpeech	En	In-the-wild	10,000	ASR
VCTK	En	Recording	44	TTS
VoxCeleb	En	In-the-wild	2,000+	Speaker Recognition
WenetSpeech4TTS	Zh	In-the-wild	12,800	TTS
Aishell-1	Zh	Recording	178	ASR
Aishell-2	Zh	Recording	1,000	ASR
Aishell-3	Zh	Recording	85	TTS
Emilia	En/Zh/De/Fr/Ja/Ko	In-the-wild	101,653	TTS
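
The table can also be queried programmatically. The following is a minimal sketch, assuming a few rows copied from the table above into a tab-separated string. The helper names `parse_hours` and `find_datasets` are illustrative and not part of this repository. Treating `Varies` as an unknown size that passes any filter, and `Multilingual` as matching any language, are assumptions of this sketch.

```python
import csv
import io

# A few rows copied from the Quick Reference Table (tab-separated).
TABLE = """Dataset\tLang.\tData Source\tHours\tPrimary Use Cases
Common Voice\tMultilingual\tIn-the-wild\tVaries\tASR
LJSpeech\tEn\tRecording\t24\tTTS
LibriSpeech\tEn\tRecording\t1,000\tASR
LibriTTS\tEn\tRecording\t584\tTTS
Libri-Light\tEn\tRecording\t60,000+\tASR
VCTK\tEn\tRecording\t44\tTTS
Emilia\tEn/Zh/De/Fr/Ja/Ko\tIn-the-wild\t101,653\tTTS
"""


def parse_hours(text):
    """Turn '1,000' or '60,000+' into an int; return None for 'Varies'."""
    digits = text.replace(",", "").rstrip("+")
    return int(digits) if digits.isdigit() else None


def find_datasets(table, lang, use_case, min_hours=0):
    """Names of datasets covering `lang` for `use_case` with at least `min_hours`."""
    rows = csv.DictReader(io.StringIO(table), delimiter="\t")
    matches = []
    for row in rows:
        langs = row["Lang."].split("/")
        hours = parse_hours(row["Hours"])
        # Assumption: a 'Multilingual' dataset matches any requested language.
        lang_ok = lang in langs or langs == ["Multilingual"]
        # Assumption: a dataset of unknown size is not filtered out.
        size_ok = hours is None or hours >= min_hours
        if lang_ok and size_ok and use_case in row["Primary Use Cases"]:
            matches.append(row["Dataset"])
    return matches


print(find_datasets(TABLE, "En", "TTS", min_hours=100))
```

With the rows above, the query for English TTS datasets of at least 100 hours returns LibriTTS and Emilia.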

Detailed Dataset Information

Common Voice

Release Date: 2017 (continuously updated)
Primary Use: Automatic Speech Recognition (ASR)
Description: An open-source speech dataset initiative by Mozilla that covers many languages. The recordings are contributed by volunteers worldwide and carry speaker metadata such as age, gender, and accent.
Distribution: Varies by version and language.

LJSpeech

Release Date: 2017
Primary Use: Text-to-Speech (TTS)
Description: A single-speaker English dataset in which one female speaker reads passages from seven non-fiction books, sourced from LibriVox recordings.
Distribution: Approximately 24 hours in total, across 13,100 short audio clips.

LibriSpeech

Release Date: 2015
Primary Use: ASR
Description: Derived from LibriVox audiobooks, comprising approximately 1,000 hours of read English speech.
Distribution: Three training splits (train-clean-100, train-clean-360, train-other-500) plus dev and test sets, each in a "clean" and an "other" variant.

LibriTTS

Release Date: 2019
Primary Use: TTS
Description: A reprocessed version of LibriSpeech materials specifically designed for speech synthesis tasks.
Distribution: Approximately 584 hours of English speech.

Libri-Light

Release Date: 2019
Primary Use: ASR
Description: A large-scale unlabeled speech dataset based on LibriVox.
Distribution:

Subset	Hours
small	577h
medium	5,193h
large	51,934h

VCTK

Release Date: 2019 (version 0.92)
Primary Use: TTS
Description: Audio recordings from 110 English speakers with various accents, each reading approximately 400 sentences.

GigaSpeech

Release Date: 2021
Primary Use: ASR
Description: A large-scale English speech dataset collected from audiobooks, podcasts, and YouTube.
Distribution:

Subset	Audiobook	Podcast	YouTube	Total
XL	2,655h	3,499h	3,846h	10,000h
L	650h	875h	975h	2,500h
M	260h	350h	390h	1,000h
S	65h	87.5h	97.5h	250h
XS	2.6h	3.5h	3.9h	10h
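
As a quick consistency check on the table above, the sketch below confirms that the per-source hours add up to each listed total and computes the share of each source per subset. The figures are copied from the table. The name `source_shares` is illustrative only.

```python
# Hours per source, copied from the GigaSpeech subset table above.
SUBSETS = {
    "XL": {"Audiobook": 2655, "Podcast": 3499, "YouTube": 3846, "Total": 10000},
    "L": {"Audiobook": 650, "Podcast": 875, "YouTube": 975, "Total": 2500},
    "M": {"Audiobook": 260, "Podcast": 350, "YouTube": 390, "Total": 1000},
    "S": {"Audiobook": 65, "Podcast": 87.5, "YouTube": 97.5, "Total": 250},
    "XS": {"Audiobook": 2.6, "Podcast": 3.5, "YouTube": 3.9, "Total": 10},
}


def source_shares(row):
    """Return each source's percentage of the subset, rounded to one decimal."""
    sources = {name: hours for name, hours in row.items() if name != "Total"}
    total = sum(sources.values())
    return {name: round(100 * hours / total, 1) for name, hours in sources.items()}


for subset, row in SUBSETS.items():
    parts = sum(hours for name, hours in row.items() if name != "Total")
    # The per-source hours should add up to the listed total.
    assert abs(parts - row["Total"]) < 1e-6, subset
    print(subset, source_shares(row))
```

On these numbers, the L, M, S, and XS subsets all come out at 26% audiobook, 35% podcast, and 39% YouTube, and XL stays within about 0.6 percentage points of that mix.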

VoxCeleb

Release Date: 2017 (VoxCeleb1), 2018 (VoxCeleb2)
Primary Use: Speaker Recognition and Verification
Description: Large-scale speaker datasets extracted from YouTube videos, featuring speech in various real-world environments.
Distribution:

Version	Hours	Speakers	Segments
VoxCeleb1	352	1,251	153,516
VoxCeleb2	2,442	6,112	1,128,246
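
The table above is enough to derive a few per-segment and per-speaker averages. The following is a rough sketch using only the figures listed. Because the hour counts are rounded, the results are approximate. The name `derived_stats` is illustrative only.

```python
# (hours, speakers, segments) copied from the VoxCeleb table above.
VOXCELEB = {
    "VoxCeleb1": (352, 1251, 153516),
    "VoxCeleb2": (2442, 6112, 1128246),
}


def derived_stats(hours, speakers, segments):
    """Average segment length in seconds and average segments per speaker."""
    return {
        "avg_segment_s": hours * 3600 / segments,
        "segments_per_speaker": segments / speakers,
    }


for name, row in VOXCELEB.items():
    stats = derived_stats(*row)
    print(
        f"{name}: {stats['avg_segment_s']:.1f} s per segment, "
        f"{stats['segments_per_speaker']:.0f} segments per speaker"
    )
```

With these figures, both versions work out to roughly 8 seconds per segment, while VoxCeleb2 has noticeably more segments per speaker than VoxCeleb1.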

WenetSpeech4TTS

Release Date: 2024
Primary Use: TTS
Description: Derived from the WenetSpeech dataset and reprocessed for TTS, with a focus on reducing noise and distortion and on keeping segments semantically complete. Training subsets are graded by DNSMOS quality score.
Distribution:

Training Subset	DNSMOS Threshold	Hours	Avg. Duration (s)
Premium	4.0	945	8.3
Standard	3.8	4,056	7.5
Basic	3.6	7,226	6.6
Rest	<3.6	5,574	N/A
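
The thresholds in the table above can be read as a simple grading rule. The sketch below assigns a clip to the highest tier whose threshold its DNSMOS score meets. It assumes that a score exactly on a threshold is included in that tier, which the table does not state, and the name `tier_for` is illustrative only.

```python
# DNSMOS thresholds copied from the table above, highest tier first.
TIERS = [("Premium", 4.0), ("Standard", 3.8), ("Basic", 3.6)]


def tier_for(dnsmos):
    """Return the highest tier whose threshold the score meets, else 'Rest'."""
    for name, threshold in TIERS:
        if dnsmos >= threshold:  # assumption: the boundary itself is included
            return name
    return "Rest"


for score in (4.2, 3.9, 3.6, 3.1):
    print(score, tier_for(score))
```

If the subsets are nested, which the growing hour counts suggest but the table does not state, a Premium clip would also count toward Standard and Basic.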

Aishell Series

Aishell-1

Release Date: 2017
Primary Use: ASR
Description: 178 hours of audio recorded by 400 speakers from different regions of China, covering 11 domains including smart home, autonomous driving, and industrial production.

Aishell-2

Release Date: 2018
Primary Use: ASR
Description: 1,000 hours of audio recorded by 1,991 speakers, covering 12 domains including wake words, voice control, smart home, autonomous driving, and industrial production.

Aishell-3

Release Date: 2020
Primary Use: TTS
Description: 85 hours of high-fidelity audio recorded by 218 speakers in a quiet indoor environment.

Emilia

Release Date: 2024
Primary Use: TTS
Description: A large-scale multilingual dataset primarily focused on English and Chinese, with additional coverage of German, French, Japanese, and Korean.
Distribution:

Language	Duration (hours)
English	46,828
Chinese	49,922
German	1,590
French	1,381
Japanese	1,715
Korean	217

Contributing

Feel free to submit pull requests to add new datasets or update existing information. Please ensure that any added dataset is publicly available and includes detailed information about its contents and usage rights.
