Slides for AAAI 25 Tutorial TQ08: KV Cache Compression for Efficient Long Context LLM Inference: Challenges, Trade-Offs, and Opportunities. #6

henryzhongsc commented Feb 27, 2025

We really appreciate the great turnout and engagement during our AAAI 25 tutorial! Given that it was a highly technical talk, we’re genuinely glad that the audience received our message well and asked so many great questions.


As promised, here are the slides: https://github.com/henryzhongsc/longctx_bench/blob/main/visualization/slides/aaai25_tutorial_tq08.pdf. This version is slightly different from the one we used in the talk — mostly just removing some joke slides to maintain a more serious online presence and fixing a few typos (since the one we used somehow wasn’t the final version).


I did my best to include as many citations as possible, but during the talk I also mentioned several works on the fly. Some were brought up spontaneously, so I can't recall all of them, and others I might have forgotten to mention but are nevertheless good reads. Here are some relevant ones:

  • State of GPT | BRK216HFS | Andrej Karpathy

A great source for understanding the stages of LLM training, along with an excellent introduction to LLM training basics.

  • Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis

Covers KV cache challenges in long context scenarios.
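For a rough sense of the problem, here is a back-of-the-envelope sketch of how KV cache memory grows with context length. The model shape (32 layers, 32 KV heads, head dim 128, fp16) is an illustrative assumption, not a figure taken from the paper.

```python
# Back-of-the-envelope KV cache size for a decoder-only transformer.
# Shape numbers below are illustrative (roughly 7B-class), not from the paper.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # 2x for keys and values; one cache entry per layer, head, and token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Example: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes), 128k-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # ~62.5 GiB for a single sequence
```

At that scale the cache alone dwarfs the weights of many models, which is the motivation for the compression methods below.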

  • SnapKV: LLM Knows What You are Looking for Before Generation
  • PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
  • RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
  • DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
  • Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

Mentioned in the context of more modern token-dropping-style methods that remain NIAH (needle-in-a-haystack) capable.
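As a rough illustration of the token-dropping idea these works build on, here is a minimal sketch that scores cached tokens by the attention they receive from a recent observation window and keeps the top-k per head. The function name and tensor shapes are assumptions for illustration, not the actual SnapKV/PyramidKV implementations.

```python
import torch

def evict_kv(keys, values, queries_obs, keep=1024):
    """Simplified attention-score-based KV eviction (SnapKV-flavored sketch).

    keys, values: [num_heads, seq_len, head_dim] cached K/V for one layer.
    queries_obs:  [num_heads, obs_len, head_dim] queries from a recent
                  observation window, used to score older tokens.
    Returns compressed keys/values with at most `keep` tokens per head.
    """
    num_heads, seq_len, head_dim = keys.shape
    if seq_len <= keep:
        return keys, values

    # Attention weights of the observation queries over all cached tokens.
    scores = torch.softmax(
        queries_obs @ keys.transpose(-1, -2) / head_dim**0.5, dim=-1
    )  # [num_heads, obs_len, seq_len]

    # Aggregate importance per cached token, then keep the top-k per head.
    importance = scores.sum(dim=1)                       # [num_heads, seq_len]
    idx = importance.topk(keep, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)     # [num_heads, keep, head_dim]
    return keys.gather(1, idx), values.gather(1, idx)
```

The head-level methods (RazorAttention, DuoAttention, Not All Heads Matter) refine this by deciding per head whether full retrieval-style caching is needed at all.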

  • GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
  • MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
  • TurboAttention: Efficient Attention Approximation For High Throughput LLMs
  • MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache

These pair nicely with KIVI.
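For context, KIVI's key observation is that keys quantize better per channel and values per token. Below is a minimal sketch of asymmetric min-max quantization along a chosen axis to illustrate that idea; the helper names and the 2-bit setting are illustrative assumptions, not code from KIVI, GEAR, or MiniKV.

```python
import torch

def quantize(x, bits=2, dim=-1):
    """Asymmetric min-max quantization along `dim` (toy sketch, not KIVI's code)."""
    levels = 2**bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / levels
    q = ((x - x_min) / scale).round().clamp(0, levels)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    return q * scale + x_min

# Example for one head: per-channel keys vs. per-token values.
keys = torch.randn(4096, 128)      # [tokens, head_dim]
values = torch.randn(4096, 128)
k_q, k_s, k_z = quantize(keys, bits=2, dim=0)    # reduce over tokens -> one scale per channel
v_q, v_s, v_z = quantize(values, bits=2, dim=1)  # reduce over channels -> one scale per token
```

GEAR, TurboAttention, and MiniKV add error-correction, attention approximation, and layer-discriminative bit allocation on top of this basic low-bit recipe.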

An in-depth tutorial on linear attention; the latter explores how linear attention performs at scale.
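As a quick refresher on the mechanism, here is a minimal sketch of causal linear attention with an elu(x)+1 feature map, showing the constant-size recurrent state that replaces a growing KV cache. This is an illustrative toy, not any particular paper's formulation.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Causal linear attention with an elu(x)+1 feature map (toy sketch).

    q, k: [seq_len, head_dim]; v: [seq_len, head_dim].
    The running state (S, z) has fixed size, so memory does not grow with
    sequence length -- the property that makes this attractive for long context.
    """
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    d_k, d_v = q.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)   # running sum of phi(k) v^T
    z = torch.zeros(d_k)        # running sum of phi(k)
    out = []
    for t in range(q.shape[0]):
        S = S + torch.outer(phi_k[t], v[t])
        z = z + phi_k[t]
        out.append(phi_q[t] @ S / (phi_q[t] @ z + 1e-6))
    return torch.stack(out)
```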

  • A Controlled Study on Long Context Extension and Generalization in LLMs
  • Rectified Rotary Position Embeddings (ReRoPE) https://github.com/bojone/rerope
  • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Related to questions about positional embeddings and their robustness to different types of modifications.
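On the positional-embedding side, the Self-Extend idea is roughly to keep exact relative positions within a local window and compress more distant ones by integer division so they stay inside the range seen during pretraining. Here is a minimal sketch of just that index remapping; the window and group sizes are illustrative assumptions, and the paper's actual two-pass attention is omitted.

```python
import torch

def self_extend_rel_positions(seq_len, neighbor_window=512, group_size=4):
    """Sketch of Self-Extend-style relative-position remapping (illustrative only)."""
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]          # [seq_len, seq_len] relative distances
    # Beyond the local window, compress distances by the group size so the
    # largest remapped position stays close to the pretraining context length.
    grouped = neighbor_window + (rel - neighbor_window) // group_size
    return torch.where(rel <= neighbor_window, rel, grouped)
```

ReRoPE and the controlled-study paper probe the same question from different angles: how much can RoPE-style positions be rescaled, truncated, or regrouped before long-context quality degrades.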
