📢 New Benchmark Released (2025-02-18): "Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models" [PDF][Dataset], proposing NumericBench to assess LLMs' numerical reasoning! 🚀
A Survey on Large Language Model Acceleration based on KV Cache Management [PDF]
Haoyang Li 1, Yiming Li 2, Anxin Tian 2, Tianhao Tang 2, Zhanchao Xu 4, Xuejia Chen 4, Nicole Hu 3, Wei Dong 5, Qing Li 1, Lei Chen 2
1Hong Kong Polytechnic University, 2Hong Kong University of Science and Technology, 3The Chinese University of Hong Kong, 4Huazhong University of Science and Technology, 5Nanyang Technological University.
- This repository is dedicated to recording KV Cache Management papers for LLM acceleration. The survey will be updated regularly. If you find this survey helpful for your work, please consider citing it.
@article{li2024surveylargelanguagemodel,
title={A Survey on Large Language Model Acceleration based on KV Cache Management},
author={Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen},
journal={arXiv preprint arXiv:2412.19442},
year={2024}
}
- If you would like your paper to be included in this survey and repository, or to suggest any modifications, please feel free to send an email to ([email protected]) or open an issue with your paper's title, category, and a brief summary highlighting its key techniques. Thank you!
- Awesome-KV-Cache-Management
- Token-level Optimization
- Model-level Optimization
- System-level Optimization
- Datasets and Benchmarks
Static KV Cache Selection (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | Static KV Cache Selection | ICLR | Link | |
2024 | SnapKV: LLM Knows What You are Looking for Before Generation | Static KV Cache Selection | NeurIPS | Link | Link |
2024 | In-context KV-Cache Eviction for LLMs via Attention-Gate | Static KV Cache Selection | arXiv | Link |
Dynamic Selection with Permanent Eviction (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | Dynamic Selection with Permanent Eviction | MLSys | Link | |
2024 | BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference | Dynamic Selection with Permanent Eviction | arXiv | Link | Link |
2024 | NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | Dynamic Selection with Permanent Eviction | ACL | Link | Link |
2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Dynamic Selection with Permanent Eviction | NeurIPS | Link | Link |
2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | Dynamic Selection with Permanent Eviction | NeurIPS | Link |
Dynamic Selection without Permanent Eviction (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
2024 | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Dynamic Selection without Permanent Eviction | ICML | Link | Link |
2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | |
2024 | Squeezed Attention: Accelerating Long Context Length LLM Inference | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
2024 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Dynamic Selection without Permanent Eviction | arXiv | Link | Link |
2024 | Human-like Episodic Memory for Infinite Context LLMs | Dynamic Selection without Permanent Eviction | arXiv | Link | |
2024 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Dynamic Selection without Permanent Eviction | arXiv | Link |
Layer-wise Budget Allocation (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling | Layer-wise Budget Allocation | arXiv | Link | Link |
2024 | PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | Layer-wise Budget Allocation | Findings | Link | Link |
2024 | DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs | Layer-wise Budget Allocation | ICLR sub. | Link | |
2024 | PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation | Layer-wise Budget Allocation | arXiv | Link | Link |
2024 | SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction | Layer-wise Budget Allocation | arXiv | Link | Link |
Head-wise Budget Allocation (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | Head-wise Budget Allocation | arXiv | Link | |
2024 | Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective | Head-wise Budget Allocation | ICLR sub. | Link | |
2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Head-wise Budget Allocation | arXiv | Link | |
2024 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Head-wise Budget Allocation | arXiv | Link | |
2024 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Head-wise Budget Allocation | arXiv | Link | Link |
2024 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Head-wise Budget Allocation | arXiv | Link | Link |
Intra-layer Merging (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Compressed Context Memory for Online Language Model Interaction | Intra-layer Merging | ICLR | Link | Link |
2024 | LoMA: Lossless Compressed Memory Attention | Intra-layer Merging | arXiv | Link | |
2024 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Intra-layer Merging | ICML | Link | Link |
2024 | CaM: Cache Merging for Memory-efficient LLMs Inference | Intra-layer Merging | ICML | Link | Link |
2024 | D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | Intra-layer Merging | arXiv | Link | |
2024 | AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning | Intra-layer Merging | arXiv | Link | Link |
2024 | LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference | Intra-layer Merging | EMNLP | Link | Link |
2024 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Intra-layer Merging | arXiv | Link | |
2024 | CHAI: Clustered Head Attention for Efficient LLM Inference | Intra-layer Merging | arXiv | Link |
Cross-layer Merging (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | Cross-layer Merging | arXiv | Link | Link |
2024 | KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cross-Layer Sharing | Cross-layer Merging | arXiv | Link | Link |
Fixed-precision Quantization (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Fixed-precision Quantization | arXiv | Link | Link |
2024 | PQCache: Product Quantization-based KVCache for Long Context LLM Inference | Fixed-precision Quantization | arXiv | Link | |
2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Fixed-precision Quantization | ICML | Link | Link |
2022 | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | Fixed-precision Quantization | NeurIPS | Link | Link |
Mixed-precision Quantization (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Mixed-precision Quantization | arXiv | Link | Link |
2024 | IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact | Mixed-precision Quantization | arXiv | Link | Link |
2024 | SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models | Mixed-precision Quantization | arXiv | Link | Link |
2024 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Mixed-precision Quantization | arXiv | Link | Link |
2024 | WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More | Mixed-precision Quantization | arXiv | Link | |
2024 | GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM | Mixed-precision Quantization | arXiv | Link | Link |
2024 | No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | Mixed-precision Quantization | arXiv | Link | |
2024 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | Mixed-precision Quantization | arXiv | Link | |
2024 | ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification | Mixed-precision Quantization | arXiv | Link | Link |
2024 | PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs | Mixed-precision Quantization | arXiv | Link | Link |
2024 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Mixed-precision Quantization | arXiv | Link |
Outlier Redistribution (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Massive Activations in Large Language Models | Outlier Redistribution | arXiv | Link | Link |
2024 | QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | Outlier Redistribution | arXiv | Link | Link |
2024 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Outlier Redistribution | arXiv | Link | Link |
2024 | SpinQuant: LLM Quantization with Learned Rotations | Outlier Redistribution | arXiv | Link | Link |
2024 | DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs | Outlier Redistribution | NeurIPS | Link | Link |
2024 | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | Outlier Redistribution | ICML | Link | Link |
2024 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling | Outlier Redistribution | EMNLP | Link | Link |
2024 | AffineQuant: Affine Transformation Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
2024 | FlatQuant: Flatness Matters for LLM Quantization | Outlier Redistribution | arXiv | Link | Link |
2024 | AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration | Outlier Redistribution | MLSys | Link | Link |
2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models | Outlier Redistribution | arXiv | Link | Link |
2023 | Training Transformers with 4-bit Integers | Outlier Redistribution | NeurIPS | Link | Link |
Singular Value Decomposition (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Effectively Compress KV Heads for LLM | Singular Value Decomposition | arXiv | Link | |
2024 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Singular Value Decomposition | arXiv | Link | Link |
2024 | Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference | Singular Value Decomposition | arXiv | Link | |
2024 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Singular Value Decomposition | arXiv | Link | |
2024 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Singular Value Decomposition | arXiv | Link | Link |
2024 | Palu: Compressing KV-Cache with Low-Rank Projection | Singular Value Decomposition | arXiv | Link | Link |
Tensor Decomposition (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression | Tensor Decomposition | ACL | Link | Link |
Learned Low-rank Approximation (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference | Learned Low-rank Approximation | arXiv | Link | Link |
2024 | MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection | Learned Low-rank Approximation | arXiv | Link |
Intra-Layer Grouping (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2019 | Fast Transformer Decoding: One Write-Head is All You Need | Intra-Layer Grouping | arXiv | Link | |
2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Intra-Layer Grouping | EMNLP | Link | Link |
2024 | Optimised Grouped-Query Attention Mechanism for Transformers | Intra-Layer Grouping | ICML | Link | |
2024 | Weighted Grouped Query Attention in Transformers | Intra-Layer Grouping | arXiv | Link | |
2024 | QCQA: Quality and Capacity-aware grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Non-official Link |
2024 | Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention | Intra-Layer Grouping | arXiv | Link | Link |
2023 | GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | Intra-Layer Grouping | NeurIPS | Link |
Cross-Layer Sharing (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Reducing Transformer Key-Value Cache Size with Cross-Layer Attention | Cross-Layer Sharing | arXiv | Link | Non-official Link |
2024 | Layer-Condensed KV Cache for Efficient Inference of Large Language Models | Cross-Layer Sharing | ACL | Link | Link |
2024 | Beyond KV Caching: Shared Attention for Efficient LLMs | Cross-Layer Sharing | arXiv | Link | Link |
2024 | MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding | Cross-Layer Sharing | arXiv | Link | Link |
2024 | Cross-layer Attention Sharing for Large Language Models | Cross-Layer Sharing | arXiv | Link | |
2024 | A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference | Cross-Layer Sharing | arXiv | Link | |
2024 | Lossless KV Cache Compression to 2% | Cross-Layer Sharing | arXiv | Link | |
2024 | DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion | Cross-Layer Sharing | NeurIPS | Link | |
2024 | Value Residual Learning For Alleviating Attention Concentration In Transformers | Cross-Layer Sharing | arXiv | Link | Link |
Enhanced Attention (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model | Enhanced Attention | arXiv | Link | Link |
2022 | Transformer Quality in Linear Time | Enhanced Attention | ICML | Link | |
2024 | Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Enhanced Attention | arXiv | Link |
Augmented Architecture (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | You Only Cache Once: Decoder-Decoder Architectures for Language Models | Augmented Architecture | arXiv | Link | Link |
2024 | Long-Context Language Modeling with Parallel Context Encoding | Augmented Architecture | ACL | Link | Link |
2024 | XC-CACHE: Cross-Attending to Cached Context for Efficient LLM Inference | Augmented Architecture | Findings | Link | |
2024 | Block Transformer: Global-to-Local Language Modeling for Fast Inference | Augmented Architecture | arXiv | Link | Link |
Adaptive Sequence Processing Architecture (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2023 | RWKV: Reinventing RNNs for the Transformer Era | Adaptive Sequence Processing Architecture | Findings | Link | Link |
2024 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
2023 | Retentive Network: A Successor to Transformer for Large Language Models | Adaptive Sequence Processing Architecture | arXiv | Link | Link |
2024 | MCSD: An Efficient Language Model with Diverse Fusion | Adaptive Sequence Processing Architecture | arXiv | Link |
Hybrid Architecture (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | MixCon: A Hybrid Architecture for Efficient and Adaptive Sequence Modeling | Hybrid Architecture | IOS Press | Link | |
2024 | GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression | Hybrid Architecture | arXiv | Link | Link |
2024 | RecurFormer: Not All Transformer Heads Need Self-Attention | Hybrid Architecture | arXiv | Link |
Architectural Design (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Architectural Design | arXiv | Link | Link |
2024 | Unifying KV Cache Compression for Large Language Models with LeanKV | Architectural Design | arXiv | Link | |
2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Architectural Design | SOSP | Link | Link |
Prefix-aware Design (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Prefix-aware Design | ACL | Link | Link |
2024 | MemServe: Flexible MemPool for Building Disaggregated LLM Serving with Caching | Prefix-aware Design | arXiv | Link | |
Prefix-aware Scheduling (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Prefix-aware Scheduling | arXiv | Link | |
2024 | SGLang: Efficient Execution of Structured Language Model Programs | Prefix-aware Scheduling | NeurIPS | Link | Link |
Preemptive and Fairness-oriented Scheduling (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Fast Distributed Inference Serving for Large Language Models | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Preemptive and Fairness-oriented Scheduling | arXiv | Link | |
Layer-specific and Hierarchical Scheduling (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Layer-specific and Hierarchical Scheduling | arXiv | Link | Link |
2024 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Layer-specific and Hierarchical Scheduling | USENIX ATC | Link | |
2024 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Layer-specific and Hierarchical Scheduling | ISCA | Link | |
2024 | Fast Inference for Augmented Large Language Models | Layer-specific and Hierarchical Scheduling | arXiv | Link |
Single/Multi-GPU Design (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Single/Multi-GPU Design | arXiv | Link | Link |
2024 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Single/Multi-GPU Design | arXiv | Link | |
2024 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Single/Multi-GPU Design | OSDI | Link | Link |
2024 | Multi-Bin Batching for Increasing LLM Inference Throughput | Single/Multi-GPU Design | arXiv | Link | |
2024 | Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters | Single/Multi-GPU Design | arXiv | Link | Link |
2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Single/Multi-GPU Design | SOSP | Link | Link |
2022 | Orca: A Distributed Serving System for Transformer-Based Generative Models | Single/Multi-GPU Design | OSDI | Link |
I/O-based Design (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | I/O-based Design | arXiv | Link | Link |
2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | I/O-based Design | arXiv | Link | |
2024 | Fast State Restoration in LLM Serving with HCache | I/O-based Design | arXiv | Link | |
2024 | Compute Or Load KV Cache? Why Not Both? | I/O-based Design | arXiv | Link | |
2024 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | I/O-based Design | arXiv | Link | |
2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | I/O-based Design | NeurIPS | Link | Link |
Heterogeneous Design (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Heterogeneous Design | arXiv | Link | |
2024 | FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous Design | arXiv | Link | |
2024 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Heterogeneous Design | arXiv | Link | |
2024 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Heterogeneous Design | arXiv | Link | |
2024 | Fast Distributed Inference Serving for Large Language Models | Heterogeneous Design | arXiv | Link | |
2024 | Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation | Heterogeneous Design | arXiv | Link | |
2023 | Stateful Large Language Model Serving with Pensieve | Heterogeneous Design | arXiv | Link |
SSD-based Design (To Top👆🏻)
Year | Title | Type | Venue | Paper | Code |
---|---|---|---|---|---|
2024 | InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | SSD-based Design | arXiv | Link | |
2023 | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | SSD-based Design | ICML | Link | Link |
Please refer to our paper for detailed information on this section.