- Models and Methods
- Evaluation
- Survey Papers
- Multimodality Models (language + audio + other modalities)
- Adversarial Attacks
AudioLLM (Audio Large Language Model): a model that takes audio inputs (speech, non-speech sound, music, etc.) and reasons over them with an LLM. The output can be multimodal (e.g., text only, speech only, or speech+text).
This repository is a curated collection of research papers focused on the development, implementation, and evaluation of language models for audio data. Our goal is to provide researchers and practitioners with a comprehensive resource to explore the latest advancements in AudioLLMs. Contributions and suggestions for new papers are highly encouraged!
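To make the AudioLLM pattern above concrete, here is a toy sketch of the typical interface: an audio encoder turns a waveform into frame embeddings, and the LLM reasons over those frames together with a text prompt. Every class and name below is an illustrative stand-in, not the API of any model in this list.

```python
class AudioEncoder:
    """Stand-in encoder: chunks a waveform into fixed-size frames."""
    FRAME_LEN = 160  # samples per frame (10 ms at 16 kHz)

    def encode(self, waveform: list[float]) -> list[list[float]]:
        n = len(waveform) // self.FRAME_LEN
        return [waveform[i * self.FRAME_LEN:(i + 1) * self.FRAME_LEN]
                for i in range(n)]


class AudioLLM:
    """Stand-in AudioLLM: (audio, text prompt) -> text answer."""

    def __init__(self, encoder: AudioEncoder):
        self.encoder = encoder

    def generate(self, waveform: list[float], prompt: str) -> str:
        frames = self.encoder.encode(waveform)
        # A real model would project the frames into the LLM's embedding
        # space and decode autoregressively; here we only report shapes.
        return f"[{len(frames)} audio frames] {prompt} -> <answer>"


model = AudioLLM(AudioEncoder())
one_second = [0.0] * 16_000  # 1 s of silence at 16 kHz
print(model.generate(one_second, "What sound is this?"))
# prints "[100 audio frames] What sound is this? -> <answer>"
```

The models below differ mainly in how the encoder output is aligned with the LLM (adapters, Q-Formers, discrete audio tokens) and in whether the output side is text, speech, or both.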
## Models and Methods

【Date】【Name】【Affiliations】【Paper】【Link】
- 【2024-12】-【MERaLiON-AudioLLM】-【I2R, A*STAR, Singapore】
- 【2024-09】-【MoWE-Audio】-【A*STAR】
  - MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
  - Paper
- 【2024-11】-【NTU, Taiwan】
  - Building a Taiwanese Mandarin Spoken Language Model: A First Attempt
  - Paper
- 【2024-10】-【SPIRIT LM】-【Meta】
- 【2024-10】-【DiVA】-【Georgia Tech, Stanford】
- 【2024-10】-【SpeechEmotionLlama】-【MIT, Meta】
  - Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
  - Paper
- 【2024-09】-【Moshi】-【Kyutai】
  - Moshi: a speech-text foundation model for real-time dialogue
  - Paper
- 【2024-09】-【DeSTA2】-【NTU Taiwan】
  - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
  - Paper
- 【2024-09】-【LLaMA-Omni】-【CAS】
  - LLaMA-Omni: Seamless Speech Interaction with Large Language Models
  - Paper
- 【2024-09】-【Ultravox】-【fixie-ai】
- 【2024-09】-【AudioBERT】-【Postech】
  - AudioBERT: Audio Knowledge Augmented Language Model
  - Paper
- 【2024-09】-【-】-【Tsinghua SIGS】
  - Comparing Discrete and Continuous Space LLMs for Speech Recognition
  - Paper
- 【2024-08】-【Mini-Omni】-【Tsinghua】
  - Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
  - Paper
- 【2024-08】-【MooER】-【Moore Threads】
  - MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
  - Paper
- 【2024-07】-【GAMA】-【UMD】
  - GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
  - Paper
- 【2024-07】-【LLaST】-【CUHK-SZ】
  - LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
  - Paper
- 【2024-07】-【CompA】-【University of Maryland】
- 【2024-07】-【Qwen2-Audio】-【Alibaba】
  - Qwen2-Audio Technical Report
  - Paper
- 【2024-07】-【FunAudioLLM】-【Alibaba】
- 【2024-07】-【NTU-Taiwan, Meta】
  - Investigating Decoder-only Large Language Models for Speech-to-text Translation
  - Paper
- 【2024-06】-【Speech ReaLLM】-【Meta】
  - Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
  - Paper
- 【2024-06】-【DeSTA】-【NTU-Taiwan, Nvidia】
  - DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
  - Paper
- 【2024-05】-【Audio Flamingo】-【Nvidia】
  - Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
  - Paper
- 【2024-04】-【SALMONN】-【Tsinghua】
- 【2024-03】-【WavLLM】-【CUHK】
  - WavLLM: Towards Robust and Adaptive Speech Large Language Model
  - Paper
- 【2024-02】-【SLAM-LLM】-【SJTU】
  - An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
  - Paper
- 【2024-01】-【Pengi】-【Microsoft】
  - Pengi: An Audio Language Model for Audio Tasks
  - Paper
- 【2023-12】-【Qwen-Audio】-【Alibaba】
- 【2023-10】-【UniAudio】-【CUHK】
- 【2023-09】-【LLaSM】-【LinkSoul.AI】
  - LLaSM: Large Language and Speech Model
  - Paper
- 【2023-09】-【Segment-level Q-Former】-【Tsinghua】
  - Connecting Speech Encoder and Large Language Model for ASR
  - Paper
- 【2023-07】-【Meta】
  - Prompting Large Language Models with Speech Recognition Abilities
  - Paper
- 【2023-05】-【SpeechGPT】-【Fudan】
- 【2023-04】-【AudioGPT】-【Zhejiang Uni】
  - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  - Paper
## Evaluation

【Date】【Name】【Affiliations】【Paper】【Link】
- 【2024-06】-【AudioBench】-【A*STAR, Singapore】
  - AudioBench: A Universal Benchmark for Audio Large Language Models
  - Paper / LeaderBoard
- 【2024-12】-【ADU-Bench】-【Tsinghua, Oxford】
  - Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
  - Paper
- 【2024-10】-【VoiceBench】-【NUS】
  - VoiceBench: Benchmarking LLM-Based Voice Assistants
  - Paper
- 【2024-09】-【Salmon】-【Hebrew University of Jerusalem】
- 【2024-08】-【MuChoMusic】-【UPF, QMUL, UMG】
  - MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
  - Paper
- 【2024-07】-【AudioEntailment】-【CMU, Microsoft】
  - Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
  - Paper
- 【2024-06】-【SD-Eval】-【CUHK, Bytedance】
  - SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
  - Paper
- 【2024-06】-【Audio Hallucination】-【NTU-Taiwan】
  - Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
  - Paper
- 【2024-05】-【AIR-Bench】-【ZJU, Alibaba】
  - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
  - Paper
- 【2023-09】-【Dynamic-SUPERB】-【NTU-Taiwan, etc.】
  - Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
  - Paper
## Survey Papers

- 【2024-11】-【Zhejiang University】
  - WavChat: A Survey of Spoken Dialogue Models
  - Paper
- 【2024-10】-【CUHK, Tencent】
  - Recent Advances in Speech Language Models: A Survey
  - Paper
- 【2024-10】-【SJTU, AISpeech】
  - A Survey on Speech Large Language Models
  - Paper
## Multimodality Models (language + audio + other modalities)

This section lists multimodal models that can process audio (speech, non-speech sound, music, audio scenes, etc.) together with text inputs.
| Date | Model | Key Affiliations | Paper | Link |
|---|---|---|---|---|
| 2024-09 | EMOVA | HKUST | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | Paper / Demo |
| 2023-11 | CoDi-2 | UC Berkeley | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Paper / Code / Demo |
| 2023-06 | Macaw-LLM | Tencent | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | Paper / Code |
## Adversarial Attacks

| Date | Name | Key Affiliations | Paper | Link |
|---|---|---|---|---|
| 2024-05 | VoiceJailbreak | CISPA | Voice Jailbreak Attacks Against GPT-4o | Paper |
- TODO: convert the remaining tables to the structured text format for a more consistent display