Slides for AAAI 25 Tutorial TQ08: KV Cache Compression for Efficient Long Context LLM Inference: Challenges, Trade-Offs, and Opportunities. #6

henryzhongsc commented Feb 27, 2025

We really appreciate the great turnout and engagement during our AAAI 25 tutorial! Given that it was a highly technical talk, we’re genuinely glad that the audience received our message well and asked so many great questions.


As promised, here are the slides: https://github.com/henryzhongsc/longctx_bench/blob/main/visualization/slides/aaai25_tutorial_tq08.pdf. This version is slightly different from the one we used in the talk — mostly just removing some joke slides to maintain a more serious online presence and fixing a few typos (since the one we used somehow wasn’t the final version).


I did my best to include as many citations as possible, but during the talk I also mentioned several works on the fly. Some were brought up spontaneously, so I can't recall all of them, and others I might have forgotten to mention but are nevertheless good reads. Here are some relevant ones:

  • State of GPT | BRK216HFS | Andrej Karpathy

A great source for understanding the stages of LLM training, along with an excellent introduction to LLM training basics.

  • Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis

Covers KV cache challenges in long context scenarios.
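For a rough sense of the problem, here is a back-of-the-envelope sketch of how KV cache memory grows with context length. The model shape (32 layers, 32 KV heads, head dim 128, fp16) is an illustrative assumption, not a figure taken from the paper.

```python
# Back-of-the-envelope KV cache size for a decoder-only transformer.
# Shape numbers below are illustrative (roughly 7B-class), not from the paper.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # 2x for keys and values; one cache entry per layer, head, and token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Example: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes), 128k-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # ~62.5 GiB for a single sequence
```

At that scale the cache alone dwarfs the weights of many models, which is the motivation for the compression methods below.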

  • SnapKV: LLM Knows What You are Looking for Before Generation
  • PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
  • RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
  • DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
  • Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

Mentioned in the context of more modern token-dropping-style methods that remain NIAH (needle-in-a-haystack) capable.
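As a rough illustration of the token-dropping idea these works build on, here is a minimal sketch that scores cached tokens by the attention they receive from a recent observation window and keeps the top-k per head. The function name and tensor shapes are assumptions for illustration, not the actual SnapKV/PyramidKV implementations.

```python
import torch

def evict_kv(keys, values, queries_obs, keep=1024):
    """Simplified attention-score-based KV eviction (SnapKV-flavored sketch).

    keys, values: [num_heads, seq_len, head_dim] cached K/V for one layer.
    queries_obs:  [num_heads, obs_len, head_dim] queries from a recent
                  observation window, used to score older tokens.
    Returns compressed keys/values with at most `keep` tokens per head.
    """
    num_heads, seq_len, head_dim = keys.shape
    if seq_len <= keep:
        return keys, values

    # Attention weights of the observation queries over all cached tokens.
    scores = torch.softmax(
        queries_obs @ keys.transpose(-1, -2) / head_dim**0.5, dim=-1
    )  # [num_heads, obs_len, seq_len]

    # Aggregate importance per cached token, then keep the top-k per head.
    importance = scores.sum(dim=1)                       # [num_heads, seq_len]
    idx = importance.topk(keep, dim=-1).indices.sort(dim=-1).values
    idx = idx.unsqueeze(-1).expand(-1, -1, head_dim)     # [num_heads, keep, head_dim]
    return keys.gather(1, idx), values.gather(1, idx)
```

The head-level methods (RazorAttention, DuoAttention, Not All Heads Matter) refine this by deciding per head whether full retrieval-style caching is needed at all.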

  • GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
  • MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
  • TurboAttention: Efficient Attention Approximation For High Throughput LLMs
  • MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache

These pair nicely with KIVI.
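For context, KIVI's key observation is that keys quantize better per channel and values per token. Below is a minimal sketch of asymmetric min-max quantization along a chosen axis to illustrate that idea; the helper names and the 2-bit setting are illustrative assumptions, not code from KIVI, GEAR, or MiniKV.

```python
import torch

def quantize(x, bits=2, dim=-1):
    """Asymmetric min-max quantization along `dim` (toy sketch, not KIVI's code)."""
    levels = 2**bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / levels
    q = ((x - x_min) / scale).round().clamp(0, levels)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    return q * scale + x_min

# Example for one head: per-channel keys vs. per-token values.
keys = torch.randn(4096, 128)      # [tokens, head_dim]
values = torch.randn(4096, 128)
k_q, k_s, k_z = quantize(keys, bits=2, dim=0)    # reduce over tokens -> one scale per channel
v_q, v_s, v_z = quantize(values, bits=2, dim=1)  # reduce over channels -> one scale per token
```

GEAR, TurboAttention, and MiniKV add error-correction, attention approximation, and layer-discriminative bit allocation on top of this basic low-bit recipe.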

An in-depth tutorial on linear attention; the latter explores how linear attention performs at scale.
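As a quick refresher on the mechanism, here is a minimal sketch of causal linear attention with an elu(x)+1 feature map, showing the constant-size recurrent state that replaces a growing KV cache. This is an illustrative toy, not any particular paper's formulation.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Causal linear attention with an elu(x)+1 feature map (toy sketch).

    q, k: [seq_len, head_dim]; v: [seq_len, head_dim].
    The running state (S, z) has fixed size, so memory does not grow with
    sequence length -- the property that makes this attractive for long context.
    """
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    d_k, d_v = q.shape[-1], v.shape[-1]
    S = torch.zeros(d_k, d_v)   # running sum of phi(k) v^T
    z = torch.zeros(d_k)        # running sum of phi(k)
    out = []
    for t in range(q.shape[0]):
        S = S + torch.outer(phi_k[t], v[t])
        z = z + phi_k[t]
        out.append(phi_q[t] @ S / (phi_q[t] @ z + 1e-6))
    return torch.stack(out)
```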

  • A Controlled Study on Long Context Extension and Generalization in LLMs
  • Rectified Rotary Position Embeddings (ReRoPE) https://github.com/bojone/rerope
  • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Related to questions about positional embeddings and their robustness to different types of modifications.
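On the positional-embedding side, the Self-Extend idea is roughly to keep exact relative positions within a local window and compress more distant ones by integer division so they stay inside the range seen during pretraining. Here is a minimal sketch of just that index remapping; the window and group sizes are illustrative assumptions, and the paper's actual two-pass attention is omitted.

```python
import torch

def self_extend_rel_positions(seq_len, neighbor_window=512, group_size=4):
    """Sketch of Self-Extend-style relative-position remapping (illustrative only)."""
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]          # [seq_len, seq_len] relative distances
    # Beyond the local window, compress distances by the group size so the
    # largest remapped position stays close to the pretraining context length.
    grouped = neighbor_window + (rel - neighbor_window) // group_size
    return torch.where(rel <= neighbor_window, rel, grouped)
```

ReRoPE and the controlled-study paper probe the same question from different angles: how much can RoPE-style positions be rescaled, truncated, or regrouped before long-context quality degrades.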
