# Papers-Reading

## Algorithm

- Karp's 21 NP-complete problems
- A Simple Near-Linear Pseudopolynomial Time Randomized Algorithm for Subset Sum
- Palindromic Tree: https://arxiv.org/abs/1506.04862
- An Introduction to Quantum Computing, Without the Physics: https://arxiv.org/abs/1708.03684
- The Berlekamp-Massey Algorithm revisited: http://hlombardi.free.fr/publis/BMAvar.pdf
- Video Stabilization Algorithm (FuSta): https://github.com/alex04072000/FuSta

## LLM

| Date | Paper | Key Words |
|------|-------|-----------|
| 2017.6.12 | Attention Is All You Need | Transformer & Attention |
| 2022.5.27 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Flash Attention |
| 2022.8.15 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | LLM.int8 |
| 2023.7.18 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Flash Attention 2 |
| 2024.3.19 | When Do We Not Need Larger Vision Models? | Scaling on Scales |
| 2024.7.10 | PaliGemma: A versatile 3B VLM for transfer | Google small VLM: PaliGemma |
| 2024.7.12 | FlashAttention-3 | Flash Attention 3, optimized for Hopper GPUs (e.g. H100) |
| 2024.7.28 | Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights | Advertising with Multimodal |
| 2024.8.22 | NanoFlow: Towards Optimal Large Language Model Serving Throughput | A novel serving framework: NanoFlow |
| 2024.10.3 | SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | Sage Attention |
| 2024.11.17 | SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization | Sage Attention 2 |
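Since most entries in the LLM table build on the attention mechanism, here is a minimal NumPy sketch of the scaled dot-product attention from "Attention Is All You Need" as a common reference point. The function name, toy shapes, and random inputs are illustrative assumptions; this naive version materializes the full attention matrix, which is exactly the memory traffic that the FlashAttention papers avoid by computing the same result tile by tile in on-chip SRAM.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Vanilla attention: softmax(Q K^T / sqrt(d_k)) V.

    Illustrative sketch only; FlashAttention 1/2/3 produce the same
    output without ever storing the full (n, n) score matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (n, d_v) attention output

# Toy usage (hypothetical sizes): 4 positions, head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```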