π΅ A curated collection of open-source audio datasets for Speech and Music.
This repository aims to provide researchers and developers with comprehensive information about publicly available audio datasets, focusing on their size, language coverage, collection methods, and intended use cases. Each dataset is accompanied by detailed descriptions and statistical information to help you choose the right dataset for your specific needs.
Dataset | Lang. | Data Source | Hours | Primary Use Cases |
---|---|---|---|---|
Common Voice | Multilingual | In-the-wild | Varies | ASR |
LJSpeech | En | Recording | 24 | TTS |
LibriSpeech | En | Recording | 1,000 | ASR |
LibriTTS | En | Recording | 584 | TTS |
Libri-Light | En | Recording | 60,000+ | ASR |
GigaSpeech | En | In-the-wild | 10,000 | ASR |
VCTK | En | Recording | 44 | TTS |
VoxCeleb | En | In-the-wild | 2,000+ | Speaker Recognition |
WenetSpeech4TTS | Zh | In-the-wild | 12,483 | TTS |
Aishell | Zh | Recording | 1,000 | ASR |
Emilia | En/Zh/De/Fr/Ja/Ko | In-the-wild | 101,653 | TTS |
Release Date: 2017 (continuously updated)
Primary Use: Automatic Speech Recognition (ASR)
Description: An open-source speech dataset initiative by Mozilla supporting multiple languages. The data comes from global volunteer contributions and includes rich speaker metadata including age, gender, and accent information.
Distribution: Varies by version and language.
Release Date: 2017
Primary Use: Text-to-Speech (TTS)
Description: A single female speaker English speech dataset with high-quality, professional studio recordings.
Distribution: 24 hours total, containing 13,100 audio clips.
Release Date: 2015
Primary Use: ASR
Description: Derived from LibriVox audiobooks, comprising approximately 1,000 hours of English reading speech.
Distribution: Multiple splits available for different use cases.
Release Date: 2019
Primary Use: TTS
Description: A reprocessed version of LibriSpeech materials specifically designed for speech synthesis tasks.
Distribution: Approximately 584 hours of English speech.
Release Date: 2019
Primary Use: ASR
Description: A large-scale unlabeled speech dataset based on LibriVox.
Distribution:
Subset | Hours |
---|---|
small | 577h |
medium | 5,193h |
large | 51,934h |
Release Date: 2019
Primary Use: TTS
Description: Audio recordings from 100 English speakers with various accents, each reading approximately 400 sentences.
Release Date: 2021
Primary Use: ASR
Description: Large-scale English speech dataset collected from podcasts, YouTube, and audiobooks.
Distribution:
Subset | Audiobook | Podcast | YouTube | Total |
---|---|---|---|---|
XL | 2,655h | 3,499h | 3,846h | 10,000h |
L | 650h | 875h | 975h | 2,500h |
M | 260h | 350h | 390h | 1,000h |
S | 65h | 87.5h | 97.5h | 250h |
XS | 2.6h | 3.5h | 3.9h | 10h |
Release Date: 2017 (VoxCeleb1), 2018 (VoxCeleb2)
Primary Use: Speaker Recognition and Verification
Description: Large-scale speaker datasets extracted from YouTube videos, featuring speech in various real-world environments.
Distribution:
Version | Hours | Speakers | Segments |
---|---|---|---|
VoxCeleb1 | 352 | 1,251 | 153,516 |
VoxCeleb2 | 2,442 | 6,112 | 1,128,246 |
Release Date: 2024
Primary Use: TTS
Description: Extended from WenetSpeech dataset, focused on addressing noise, distortion, and semantic completeness issues.
Distribution:
Training Subsets | DNSMOS Threshold | Hours | Avg. Duration (s) |
---|---|---|---|
Premium | 4.0 | 945 | 8.3 |
Standard | 3.8 | 4056 | 7.5 |
Basic | 3.6 | 7226 | 6.6 |
Rest | <3.6 | 5574 | N/A |
Release Date: 2017
Primary Use: ASR
Description: 178 hours of audio recorded by 400 speakers from different regions of China, covering 11 domains including smart home, autonomous driving, and industrial production.
Release Date: 2018
Primary Use: ASR
Description: 1,000 hours of audio recorded by 1,991 speakers, covering 12 domains including wake words, voice control, smart home, autonomous driving, and industrial production.
Release Date: 2020
Primary Use: TTS
Description: 85 hours of high-fidelity audio recorded by 218 speakers in a quiet indoor environment.
Primary Use: TTS
Description: A large-scale multilingual dataset primarily focused on English and Chinese, with additional coverage of German, French, Japanese, and Korean.
Distribution:
Language | Duration (hours) |
---|---|
English | 46,828 |
Chinese | 49,922 |
German | 1,590 |
French | 1,381 |
Japanese | 1,715 |
Korean | 217 |
Feel free to submit pull requests to add new datasets or update existing information. Please ensure that any added dataset is publicly available and includes detailed information about its contents and usage rights.