- Models and Methods
- Evaluation
- Survey Papers
- Multimodality Models (language + audio + other modalities)
- Adversarial Attacks
AudioLLM (Audio Large Language Model): a model that takes audio inputs (speech, non-speech sound, music, etc.) and reasons over them with an LLM. The output can be multimodal (e.g., text only, speech only, or speech+text).
This repository is a curated collection of research papers focused on the development, implementation, and evaluation of language models for audio data. Our goal is to provide researchers and practitioners with a comprehensive resource to explore the latest advancements in AudioLLMs. Contributions and suggestions for new papers are highly encouraged!
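To make the AudioLLM pattern above concrete, here is a toy sketch of the typical interface: an audio encoder turns a waveform into frame embeddings, and the LLM reasons over those frames together with a text prompt. Every class and name below is an illustrative stand-in, not the API of any model in this list.

```python
class AudioEncoder:
    """Stand-in encoder: chunks a waveform into fixed-size frames."""
    FRAME_LEN = 160  # samples per frame (10 ms at 16 kHz)

    def encode(self, waveform: list[float]) -> list[list[float]]:
        n = len(waveform) // self.FRAME_LEN
        return [waveform[i * self.FRAME_LEN:(i + 1) * self.FRAME_LEN]
                for i in range(n)]


class AudioLLM:
    """Stand-in AudioLLM: (audio, text prompt) -> text answer."""

    def __init__(self, encoder: AudioEncoder):
        self.encoder = encoder

    def generate(self, waveform: list[float], prompt: str) -> str:
        frames = self.encoder.encode(waveform)
        # A real model would project the frames into the LLM's embedding
        # space and decode autoregressively; here we only report shapes.
        return f"[{len(frames)} audio frames] {prompt} -> <answer>"


model = AudioLLM(AudioEncoder())
one_second = [0.0] * 16_000  # 1 s of silence at 16 kHz
print(model.generate(one_second, "What sound is this?"))
# prints "[100 audio frames] What sound is this? -> <answer>"
```

The models below differ mainly in how the encoder output is aligned with the LLM (adapters, Q-Formers, discrete audio tokens) and in whether the output side is text, speech, or both.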
## Models and Methods

【Date】【Name】【Affiliations】【Paper】【Link】
- 【2024-12】-【MERaLiON-AudioLLM】-【I2R, A*STAR, Singapore】
- 【2024-09】-【MoWE-Audio】-【A*STAR】
  - MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
  - Paper
- 【2024-11】-【NTU, Taiwan】
  - Building a Taiwanese Mandarin Spoken Language Model: A First Attempt
  - Paper
- 【2024-10】-【SPIRIT LM】-【Meta】
- 【2024-10】-【DiVA】-【Georgia Tech, Stanford】
- 【2024-10】-【SpeechEmotionLlama】-【MIT, Meta】
  - Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
  - Paper
- 【2024-09】-【Moshi】-【Kyutai】
  - Moshi: a speech-text foundation model for real-time dialogue
  - Paper
- 【2024-09】-【DeSTA2】-【NTU Taiwan】
  - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
  - Paper
- 【2024-09】-【LLaMA-Omni】-【CAS】
  - LLaMA-Omni: Seamless Speech Interaction with Large Language Models
  - Paper
- 【2024-09】-【Ultravox】-【fixie-ai】
- 【2024-09】-【AudioBERT】-【Postech】
  - AudioBERT: Audio Knowledge Augmented Language Model
  - Paper
- 【2024-09】-【-】-【Tsinghua SIGS】
  - Comparing Discrete and Continuous Space LLMs for Speech Recognition
  - Paper
- 【2024-08】-【Mini-Omni】-【Tsinghua】
  - Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
  - Paper
- 【2024-08】-【MooER】-【Moore Threads】
  - MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
  - Paper
- 【2024-07】-【GAMA】-【UMD】
  - GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
  - Paper
- 【2024-07】-【LLaST】-【CUHK-SZ】
  - LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
  - Paper
- 【2024-07】-【CompA】-【University of Maryland】
- 【2024-07】-【Qwen2-Audio】-【Alibaba】
  - Qwen2-Audio Technical Report
  - Paper
- 【2024-07】-【FunAudioLLM】-【Alibaba】
- 【2024-07】-【NTU-Taiwan, Meta】
  - Investigating Decoder-only Large Language Models for Speech-to-text Translation
  - Paper
- 【2024-06】-【Speech ReaLLM】-【Meta】
  - Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
  - Paper
- 【2024-06】-【DeSTA】-【NTU-Taiwan, Nvidia】
  - DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
  - Paper
- 【2024-05】-【Audio Flamingo】-【Nvidia】
  - Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
  - Paper
- 【2024-04】-【SALMONN】-【Tsinghua】
- 【2024-03】-【WavLLM】-【CUHK】
  - WavLLM: Towards Robust and Adaptive Speech Large Language Model
  - Paper
- 【2024-02】-【SLAM-LLM】-【SJTU】
  - An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
  - Paper
- 【2024-01】-【Pengi】-【Microsoft】
  - Pengi: An Audio Language Model for Audio Tasks
  - Paper
- 【2023-12】-【Qwen-Audio】-【Alibaba】
- 【2023-10】-【UniAudio】-【CUHK】
- 【2023-09】-【LLaSM】-【LinkSoul.AI】
  - LLaSM: Large Language and Speech Model
  - Paper
- 【2023-09】-【Segment-level Q-Former】-【Tsinghua】
  - Connecting Speech Encoder and Large Language Model for ASR
  - Paper
- 【2023-07】-【Meta】
  - Prompting Large Language Models with Speech Recognition Abilities
  - Paper
- 【2023-05】-【SpeechGPT】-【Fudan】
- 【2023-04】-【AudioGPT】-【Zhejiang Uni】
  - AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
  - Paper
## Evaluation

【Date】【Name】【Affiliations】【Paper】【Link】
- 【2024-06】-【AudioBench】-【A*STAR, Singapore】
  - AudioBench: A Universal Benchmark for Audio Large Language Models
  - Paper / LeaderBoard
- 【2024-12】-【ADU-Bench】-【Tsinghua, Oxford】
  - Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
  - Paper
- 【2024-10】-【VoiceBench】-【NUS】
  - VoiceBench: Benchmarking LLM-Based Voice Assistants
  - Paper
- 【2024-09】-【Salmon】-【Hebrew University of Jerusalem】
- 【2024-08】-【MuChoMusic】-【UPF, QMUL, UMG】
  - MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
  - Paper
- 【2024-07】-【AudioEntailment】-【CMU, Microsoft】
  - Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
  - Paper
- 【2024-06】-【SD-Eval】-【CUHK, Bytedance】
  - SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
  - Paper
- 【2024-06】-【Audio Hallucination】-【NTU-Taiwan】
  - Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
  - Paper
- 【2024-05】-【AIR-Bench】-【ZJU, Alibaba】
  - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
  - Paper
- 【2023-09】-【Dynamic-SUPERB】-【NTU-Taiwan, etc.】
  - Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
  - Paper
## Survey Papers

- 【2024-11】-【Zhejiang University】
  - WavChat: A Survey of Spoken Dialogue Models
  - Paper
- 【2024-10】-【CUHK, Tencent】
  - Recent Advances in Speech Language Models: A Survey
  - Paper
- 【2024-10】-【SJTU, AISpeech】
  - A Survey on Speech Large Language Models
  - Paper
## Multimodality Models (language + audio + other modalities)

This section lists multimodal models that can process audio (speech, non-speech sound, music, audio scenes, etc.) together with text inputs.
| Date | Model | Key Affiliations | Paper | Link |
|---|---|---|---|---|
| 2024-09 | EMOVA | HKUST | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions | Paper / Demo |
| 2023-11 | CoDi-2 | UC Berkeley | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | Paper / Code / Demo |
| 2023-06 | Macaw-LLM | Tencent | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | Paper / Code |
## Adversarial Attacks

| Date | Name | Key Affiliations | Paper | Link |
|---|---|---|---|---|
| 2024-05 | VoiceJailbreak | CISPA | Voice Jailbreak Attacks Against GPT-4o | Paper |
- TODO: convert the remaining tables to the structured text format for a more consistent display