Awesome-Audio-Large-Language-Models

🌟🌟🌟 Found interesting work, or want your work to be included? Raise an Issue or open a Pull Request! :)

Table of Contents

  • Introduction
  • Models and Methods
  • Evaluation
  • Survey Papers
  • Multimodality Models (language + audio + other modalities)
  • Adversarial Attacks
  • TODO

Introduction

AudioLLM: an Audio Large Language Model, i.e. a model that takes audio inputs (speech, non-speech sound, music, etc.) and reasons over them with an LLM. The output can be multimodal (e.g. text only, speech only, or speech + text).

This repository is a curated collection of research papers focused on the development, implementation, and evaluation of language models for audio data. Our goal is to provide researchers and practitioners with a comprehensive resource to explore the latest advancements in AudioLLMs. Contributions and suggestions for new papers are highly encouraged!
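Most of the models collected below share a common recipe: a pretrained audio encoder produces frame-level features, a lightweight projector maps those features into the LLM's embedding space, and the LLM decodes text (and sometimes speech) conditioned on the projected audio. The PyTorch sketch below is purely schematic; every module is a toy stand-in rather than the architecture of any specific paper in this list:

```python
# Schematic sketch of the common AudioLLM recipe: audio encoder -> projector -> LLM.
# All modules here are illustrative stand-ins, not any paper's actual architecture.
import torch
import torch.nn as nn

class AudioLLMSketch(nn.Module):
    def __init__(self, audio_dim=512, llm_dim=2048, vocab_size=32000):
        super().__init__()
        self.audio_encoder = nn.GRU(80, audio_dim, batch_first=True)  # stand-in for a Whisper/BEATs-style encoder
        self.projector = nn.Linear(audio_dim, llm_dim)                # aligns audio features with the LLM embedding space
        self.embed = nn.Embedding(vocab_size, llm_dim)                # LLM token embeddings
        self.llm = nn.TransformerEncoder(                             # stand-in for a decoder-only LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, mel, prompt_ids):
        audio_feats, _ = self.audio_encoder(mel)           # (B, T_audio, audio_dim)
        audio_tokens = self.projector(audio_feats)         # (B, T_audio, llm_dim)
        text_tokens = self.embed(prompt_ids)               # (B, T_text, llm_dim)
        x = torch.cat([audio_tokens, text_tokens], dim=1)  # prepend audio "tokens" to the text prompt
        return self.lm_head(self.llm(x))                   # next-token logits

logits = AudioLLMSketch()(torch.randn(1, 100, 80), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 108, 32000])
```

The papers below differ mainly in which encoder they use, whether the projector output is continuous or discretized into audio tokens, and whether the LLM is frozen or fine-tuned.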


Models and Methods

【Date】【Name】【Affiliations】【Paper】【Link】
  • 【2024-12】-【MERaLiON-AudioLLM】-【I2R, A*STAR, Singapore】

    • MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
    • Paper / HF Model
  • 【2024-11】-【-】-【NTU-Taiwan】

    • Building a Taiwanese Mandarin Spoken Language Model: A First Attempt
    • Paper
  • 【2024-10】-【SPIRIT LM】-【Meta】

    • SPIRIT LM: Interleaved Spoken and Written Language Model
    • Paper / Project / GitHub
  • 【2024-10】-【DiVA】-【Georgia Tech, Stanford】

    • Distilling an End-to-End Voice Assistant Without Instruction Training Data
    • Paper / Project
  • 【2024-10】-【SpeechEmotionLlama】-【MIT, Meta】

    • Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech
    • Paper
  • 【2024-09】-【MoWE-Audio】-【A*STAR】

    • MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders
    • Paper
  • 【2024-09】-【Moshi】-【Kyutai】

    • Moshi: a speech-text foundation model for real-time dialogue
    • Paper / GitHub
  • 【2024-09】-【DeSTA2】-【NTU-Taiwan】

    • Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
    • Paper / GitHub
  • 【2024-09】-【LLaMA-Omni】-【CAS】

    • LLaMA-Omni: Seamless Speech Interaction with Large Language Models
    • Paper / GitHub
  • 【2024-09】-【Ultravox】-【fixie-ai】

    • Open-source project (no accompanying paper)
    • GitHub
  • 【2024-09】-【AudioBERT】-【Postech】

    • AudioBERT: Audio Knowledge Augmented Language Model
    • Paper / GitHub
  • 【2024-09】-【-】-【Tsinghua SIGS】

    • Comparing Discrete and Continuous Space LLMs for Speech Recognition
    • Paper
  • 【2024-08】-【Mini-Omni】-【Tsinghua】

    • Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
    • Paper / GitHub
  • 【2024-08】-【MooER】-【Moore Threads】

    • MooER: LLM-based Speech Recognition and Translation Models from Moore Threads
    • Paper / GitHub
  • 【2024-07】-【GAMA】-【UMD】

    • GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
    • Paper / GitHub
  • 【2024-07】-【LLaST】-【CUHK-SZ】

    • LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
    • Paper / GitHub
  • 【2024-07】-【CompA】-【UMD】

    • CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
    • Paper / GitHub / Project
  • 【2024-07】-【Qwen2-Audio】-【Alibaba】

    • Qwen2-Audio Technical Report
    • Paper / GitHub
  • 【2024-07】-【FunAudioLLM】-【Alibaba】

    • FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
    • Paper / GitHub / Demo
  • 【2024-07】-【-】-【NTU-Taiwan, Meta】

    • Investigating Decoder-only Large Language Models for Speech-to-text Translation
    • Paper
  • 【2024-06】-【Speech ReaLLM】-【Meta】

    • Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time
    • Paper
  • 【2024-06】-【DeSTA】-【NTU-Taiwan, Nvidia】

    • DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
    • Paper / GitHub
  • 【2024-05】-【Audio Flamingo】-【Nvidia】

    • Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
    • Paper / GitHub
  • 【2024-04】-【SALMONN】-【Tsinghua】

    • SALMONN: Towards Generic Hearing Abilities for Large Language Models
    • Paper / GitHub / Demo
  • 【2024-03】-【WavLLM】-【CUHK】

    • WavLLM: Towards Robust and Adaptive Speech Large Language Model
    • Paper / GitHub
  • 【2024-02】-【SLAM-LLM】-【SJTU】

    • An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
    • Paper / GitHub
  • 【2024-01】-【Pengi】-【Microsoft】

    • Pengi: An Audio Language Model for Audio Tasks
    • Paper / GitHub
  • 【2023-12】-【Qwen-Audio】-【Alibaba】

    • Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
    • Paper / GitHub / Demo
  • 【2023-10】-【UniAudio】-【CUHK】

    • UniAudio: An Audio Foundation Model Toward Universal Audio Generation
    • Paper / GitHub / Demo
  • 【2023-09】-【LLaSM】-【LinkSoul.AI】

    • LLaSM: Large Language and Speech Model
    • Paper / GitHub
  • 【2023-09】-【Segment-level Q-Former】-【Tsinghua】

    • Connecting Speech Encoder and Large Language Model for ASR
    • Paper
  • 【2023-07】-【-】-【Meta】

    • Prompting Large Language Models with Speech Recognition Abilities
    • Paper
  • 【2023-05】-【SpeechGPT】-【Fudan】

    • SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
    • Paper / GitHub / Demo
  • 【2023-04】-【AudioGPT】-【Zhejiang University】

    • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
    • Paper / GitHub
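Several of the models above (e.g. Qwen2-Audio, SALMONN, LLaSM, MERaLiON-AudioLLM) publish checkpoints on Hugging Face. As a rough illustration of how such a checkpoint is typically queried, here is a sketch based on the Qwen2-Audio model card. It assumes a recent transformers release with Qwen2-Audio support; processor argument names may differ between models and versions, so treat this as illustrative and check the corresponding model card:

```python
# Sketch: querying an AudioLLM checkpoint from Hugging Face (Qwen2-Audio shown).
# Assumes transformers >= 4.45 and librosa; argument names follow the Qwen2-Audio
# model card at the time of writing and may differ for other AudioLLMs.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Build a chat prompt that interleaves an audio clip with a text question.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},  # hypothetical local file
        {"type": "text", "text": "What sound is this, and what likely produced it?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the waveform at the sampling rate the audio encoder expects (16 kHz for Whisper-style encoders).
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the echoed prompt.
response = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(response)
```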

Evaluation

【Date】【Name】【Affiliations】【Paper】【Link】
  • 【2024-06】-【AudioBench】-【A*STAR, Singapore】

    • AudioBench: A Universal Benchmark for Audio Large Language Models
    • Paper / Leaderboard / GitHub
  • 【2024-12】-【ADU-Bench】-【Tsinghua, Oxford】

    • Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
    • Paper
  • 【2024-10】-【VoiceBench】-【NUS】

    • VoiceBench: Benchmarking LLM-Based Voice Assistants
    • Paper / GitHub
  • 【2024-09】-【Salmon】-【Hebrew University of Jerusalem】

    • A Suite for Acoustic Language Model Evaluation
    • Paper / Code
  • 【2024-08】-【MuChoMusic】-【UPF, QMUL, UMG】

    • MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models
    • Paper / GitHub
  • 【2024-07】-【AudioEntailment】-【CMU, Microsoft】

    • Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
    • Paper / GitHub
  • 【2024-06】-【SD-Eval】-【CUHK, Bytedance】

    • SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
    • Paper / GitHub
  • 【2024-06】-【Audio Hallucination】-【NTU-Taiwan】

    • Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
    • Paper / GitHub
  • 【2024-05】-【AIR-Bench】-【ZJU, Alibaba】

    • AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
    • Paper / GitHub
  • 【2023-09】-【Dynamic-SUPERB】-【NTU-Taiwan, etc.】

    • Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech
    • Paper / GitHub

Survey Papers

  • 【2024-11】-【Zhejiang University】

    • WavChat: A Survey of Spoken Dialogue Models
    • Paper
  • 【2024-10】-【CUHK, Tencent】

    • Recent Advances in Speech Language Models: A Survey
    • Paper
  • 【2024-10】-【SJTU, AISpeech】

    • A Survey on Speech Large Language Models
    • Paper

Multimodality Models (language + audio + other modalities)

Multimodal models that can process audio (speech, non-speech sound, music, audio scenes, etc.) together with text and other modalities.

| Date    | Model     | Key Affiliations | Paper                                                                                    | Link                |
|---------|-----------|------------------|------------------------------------------------------------------------------------------|---------------------|
| 2024-09 | EMOVA     | HKUST            | EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions              | Paper / Demo        |
| 2023-11 | CoDi-2    | UC Berkeley      | CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation                    | Paper / Code / Demo |
| 2023-06 | Macaw-LLM | Tencent          | Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration   | Paper / Code        |

Adversarial Attacks

| Date    | Name           | Key Affiliations | Paper                                  | Link  |
|---------|----------------|------------------|----------------------------------------|-------|
| 2024-05 | VoiceJailbreak | CISPA            | Voice Jailbreak Attacks Against GPT-4o | Paper |

TODO

  • Convert the remaining tables to the structured text list format used above
