# Token Recycling ♻️

(Unofficial) implementation of the self-speculative LLM decoding method described in *Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling*.

- 🚀 **Fast**: ~2x speedup over the vanilla autoregressive baseline on Spec-Bench with a single A100 (≈2.5 mean accepted tokens per step).
- 🎮 **Plug and Play**: no training and no architecture changes.
- 🔮 **Self-Speculative**: no draft model needed.
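
The core idea: every verification forward pass already computes logits for each position, so the top-k continuations of each token are "recycled" into a vocab-sized adjacency matrix, which then drafts candidate tokens essentially for free on the next step. Below is a minimal sketch of that loop, assuming a simplified linear draft chain instead of the paper's static draft tree; `logits_fn`, `TOP_K`, and `DRAFT_LEN` are illustrative stand-ins, not this repo's actual API.

```python
import torch

VOCAB, TOP_K, DRAFT_LEN = 32000, 8, 6

# Adjacency matrix: row t caches the top-k candidate successors of token t,
# refreshed for free from logits the model already computes while verifying.
adj = torch.zeros(VOCAB, TOP_K, dtype=torch.long)

def draft(last_token: int) -> list[int]:
    """Draft a chain by repeatedly taking the best recycled successor."""
    chain, tok = [], last_token
    for _ in range(DRAFT_LEN):
        tok = int(adj[tok, 0])  # column 0 = strongest successor seen so far
        chain.append(tok)
    return chain

def verify_and_update(seq: list[int], chain: list[int], logits_fn) -> list[int]:
    """Score seq + chain in one forward pass, accept the longest matching
    prefix, and recycle every position's top-k logits into the matrix."""
    logits = logits_fn(seq + chain)            # shape [len(seq)+len(chain), VOCAB]
    topk = logits.topk(TOP_K, dim=-1).indices  # candidates to recycle
    for pos, tok in enumerate(seq + chain):
        adj[tok] = topk[pos]                   # row for this token -> its successors
    accepted = []
    for i, tok in enumerate(chain):
        if tok != int(logits[len(seq) - 1 + i].argmax()):
            break                              # draft diverged from the greedy choice
        accepted.append(tok)
    # Every step still gains at least one token: the model's own greedy correction.
    accepted.append(int(logits[len(seq) - 1 + len(accepted)].argmax()))
    return accepted
```

In the paper, the draft is a static tree of candidates verified with a tree attention mask rather than a single chain, which is what pushes the mean accepted tokens per step higher.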

## Installation

```bash
pip install -r requirements.txt
```

## Usage

```bash
python -m src.cli
```

or

```python
from src.token_recycling import TokenRecycling

# Wrap a Hugging Face causal LM checkpoint; no extra weights are needed.
model = TokenRecycling.from_pretrained("HuggingFaceTB/SmolLM2-135M")
output = model.generate("Your prompt here")
```

## Benchmarks

- Benchmark: Spec-Bench
- Device: a single NVIDIA A100 GPU (40 GB) with 30 CPU cores
- Testing environment: PyTorch 2.5.1 under CUDA 12.4
- Experimental settings: greedy decoding, FP16 precision, batch size = 1
- Numbers come from a single run (not the average of 3 runs used by the official leaderboard)
- "Cold Start" means the Token Recycling adjacency matrix was reset for each prompt (see the sketch below)
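
Illustratively, the only difference between the two Recycling rows below is whether the adjacency matrix survives across prompts. A hypothetical harness (`run_token_recycling`, `prompts`, and the flag name are assumptions for illustration, not this repo's API):

```python
import torch

VOCAB, TOP_K = 32000, 8
adj = torch.zeros(VOCAB, TOP_K, dtype=torch.long)  # recycled top-k matrix

def run_token_recycling(prompt: str) -> None:
    """Stand-in for a full Token Recycling generation pass (assumed helper)."""
    ...

prompts = ["first prompt", "second prompt"]
cold_start = True  # flip to False for the warm (default) rows in the table

for prompt in prompts:
    if cold_start:
        adj.zero_()  # Cold Start: discard recycled statistics before each prompt
    run_token_recycling(prompt)  # warm runs keep adj learned on earlier prompts
```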

### Vicuna-7B-v1.3

> [!NOTE]
> This table only includes methods that don't require extra parameters. Methods with extra trained components, such as EAGLE and Hydra, score higher (by 0.01x-0.21x); refer to the official Spec-Bench Leaderboard.

| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Recycling | 2.24x | 1.87x | 2.08x | 1.99x | 2.50x | 1.80x | 2.67 | 2.08x |
| Recycling (Cold Start) | 2.07x | 1.30x | 2.23x | 1.70x | 2.30x | 1.95x | 2.55 | 1.93x |
| PLD | 1.56x | 1.00x | 2.54x | 1.13x | 1.55x | 1.80x | 1.75 | 1.60x |
| Lookahead | 1.45x | 1.13x | 1.31x | 1.20x | 1.50x | 1.16x | 1.64 | 1.30x |