MagicEnc is a lightweight, Huggingface-compatible package for long-context LLMs. Its primary purpose is to avoid CUDA OOM when encoding long contexts. With automatic layer-wise iterative encoding, MagicEnc can encode long contexts within 24GB of VRAM. The result of encoding, i.e. the prefilled KV cache, is moved to CPU RAM during encoding for later use. Note that MagicEnc produces exact results and does not perform any approximation. Only batch size = 1 is supported.
| Model | Context | Encode Speed (tokens/s) |
|---|---|---|
| meta-llama/Meta-Llama-3.1-8B | 128k | 2470 |
| gradientai/Llama-3-8B-Instruct-Gradient-1048k | 256k | 1400 |
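The idea behind layer-wise iterative encoding can be sketched as follows: only one decoder layer's weights are resident on the GPU at a time, the full-context hidden states are pushed through that layer, and the layer's KV tensors are offloaded to CPU RAM before the next layer is processed. The snippet below is a minimal illustration of this pattern, not MagicEnc's actual implementation; the per-layer interface returning `(hidden, (k, v))` is an assumption made for illustration.

```python
import torch

@torch.no_grad()
def layerwise_prefill(layers, hidden, device="cuda"):
    """Sketch of layer-wise iterative encoding.

    layers: decoder blocks kept in CPU RAM; assumed to return (hidden, (k, v)).
    hidden: embedded input of shape [1, seq_len, hidden_dim], already on `device`.
    """
    kv_cache_cpu = []
    for layer in layers:
        layer.to(device)                         # swap one layer's weights onto the GPU
        hidden, (k, v) = layer(hidden)           # prefill the whole context for this layer
        kv_cache_cpu.append((k.cpu(), v.cpu()))  # offload this layer's KV cache to CPU RAM
        layer.to("cpu")                          # evict the weights before the next layer
        torch.cuda.empty_cache()
    return hidden, kv_cache_cpu
```

Because only one layer's weights and one layer's KV tensors are on the GPU at any moment, peak VRAM stays bounded regardless of the total number of layers.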
For the meta-llama/Meta-Llama-3.1-8B(-Instruct) models, we can swap the KV cache and model parameters onto the GPU during decoding so that the GQA attention is computed on the GPU. With the Huggingface implementation, we can reach 6.5 tokens/s for meta-llama/Meta-Llama-3.1-8B(-Instruct) with a 124k context.
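At decode time the procedure is reversed: for each new token, a layer's parameters and its CPU-resident KV cache are streamed onto the GPU so the GQA attention runs there, and the new token's key/value pair is appended back on the CPU side. This is again only a sketch under the same assumed per-layer interface (including the hypothetical `past_kv` argument), not the package's real API.

```python
import torch

@torch.no_grad()
def decode_step(layers, kv_cache_cpu, hidden, device="cuda"):
    """Sketch of one decoding step with CPU/GPU swapping.

    hidden: embedding of the latest token, shape [1, 1, hidden_dim], on `device`.
    kv_cache_cpu: per-layer (k, v) tensors produced by layerwise_prefill.
    Each layer is assumed to return the new token's (k, v) only.
    """
    for i, layer in enumerate(layers):
        layer.to(device)                                      # swap in this layer's weights
        k = kv_cache_cpu[i][0].to(device, non_blocking=True)  # swap in its KV cache
        v = kv_cache_cpu[i][1].to(device, non_blocking=True)
        hidden, (k_new, v_new) = layer(hidden, past_kv=(k, v))  # GQA attention on the GPU
        kv_cache_cpu[i] = (                                   # append the new token's KV on CPU
            torch.cat([kv_cache_cpu[i][0], k_new.cpu()], dim=-2),
            torch.cat([kv_cache_cpu[i][1], v_new.cpu()], dim=-2),
        )
        layer.to("cpu")
    return hidden, kv_cache_cpu
```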