
Anyscale - San Francisco - https://www.hongpeng-guo.com/
Stars
verl: Volcano Engine Reinforcement Learning for LLMs
Official Repo for Open-Reasoner-Zero
FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
PyTorch per step fault tolerance (actively under development)
A curated list of reinforcement learning with human feedback resources (continually updated)
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
Janus-Series: Unified Multimodal Understanding and Generation Models
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
A Python implementation of global optimization with Gaussian processes (a minimal usage sketch follows this list).
A self-learning tutorial for CUDA High Performance Programming.
The Amazon S3 Connector for PyTorch delivers high throughput for PyTorch training jobs that access and store data in Amazon S3.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
How to optimize some algorithms in CUDA.
My learning notes and code for ML SYS (machine learning systems).
🧑🏫 60+ Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, sophia, ...), ga…
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also …
A highly optimized LLM inference acceleration engine for Llama and its variants.
Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL
10x Faster Long-Context LLM By Smart KV Cache Optimizations
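
The Gaussian-process item in the list above describes a global-optimization library; as a rough, non-authoritative illustration of that technique, here is a minimal sketch assuming the bayes_opt package (the toy objective black_box, its bounds, and the iteration counts are illustrative assumptions, not taken from the starred repository):

```python
# Minimal sketch of global optimization with Gaussian processes,
# assuming the bayes_opt package (pip install bayesian-optimization).
from bayes_opt import BayesianOptimization

# Toy black-box objective (assumption for illustration): maximum at (x=0, y=1).
def black_box(x, y):
    return -x ** 2 - (y - 1) ** 2 + 1

optimizer = BayesianOptimization(
    f=black_box,                           # function to maximize
    pbounds={"x": (-2, 2), "y": (-3, 3)},  # search bounds per parameter
    random_state=1,
)

# A few random probes, then GP-guided acquisition steps.
optimizer.maximize(init_points=2, n_iter=10)
print(optimizer.max)  # best parameters and target value found
```

The Gaussian-process surrogate is fit to the points probed so far, and an acquisition function picks the next point to evaluate, which is why this approach suits expensive black-box objectives.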
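
The item above on smart KV-cache optimizations for long-context LLMs presupposes the KV cache mechanism itself; the following is a generic toy sketch of that mechanism in PyTorch (random tensors stand in for a real model's projections, and this is not the starred repository's method):

```python
# Generic illustration of a KV cache for autoregressive attention.
import torch

def attend(q, k_cache, v_cache):
    # q: (1, d); caches: (t, d). Scaled dot-product attention over
    # everything generated so far.
    scores = q @ k_cache.T / k_cache.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache

d = 8
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
for step in range(4):
    # New token's query/key/value (random stand-ins for a real model's projections).
    q, k, v = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
    # Append to the cache instead of recomputing K/V for the whole prefix.
    k_cache = torch.cat([k_cache, k])
    v_cache = torch.cat([v_cache, v])
    out = attend(q, k_cache, v_cache)
print(out.shape)  # (1, d)
```

Appending to the cache avoids recomputing the prefix's key/value projections at every step; long-context optimizations like the starred project typically work on shrinking or reorganizing exactly this cache.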