From 74c0a595247f7898ec566cb03225599092c21e34 Mon Sep 17 00:00:00 2001 From: Yusong Gao Date: Fri, 2 Aug 2024 23:32:52 +0800 Subject: [PATCH 01/10] add llm.cpp link to notable forks in readme --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index a3536e0bf..a1c545df8 100644 --- a/README.md +++ b/README.md @@ -215,6 +215,9 @@ Lastly, I will be a lot more sensitive to complexity in the root folder of the p - WebGPU C++ - [gpu.cpp](https://github.com/AnswerDotAI/gpu.cpp) by @[austinvhuang](https://github.com/austinvhuang): a library for portable GPU compute in C++ using native WebGPU. Aims to be a general-purpose library, but also porting llm.c kernels to WGSL. + +- C++ + - [llm.cpp](https://github.com/GaoYusong/llm.cpp) by @[GaoYusong](https://github.com/GaoYusong): a port of this project featuring a C++ single-header [tinytorch.hpp](https://github.com/GaoYusong/llm.cpp/blob/main/tinytorch.hpp) library - Go - [llm.go](https://github.com/joshcarp/llm.go) by @[joshcarp](https://github.com/joshcarp): a Go port of this project From ee0a92f9159350dbe0bbf8ab59734d1bec0e2b6e Mon Sep 17 00:00:00 2001 From: zhangpiu Date: Sat, 10 Aug 2024 17:45:21 +0800 Subject: [PATCH 02/10] Add llm.cpp fork --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index b6c6128dd..b91666e61 100644 --- a/README.md +++ b/README.md @@ -184,6 +184,9 @@ Lastly, I will be a lot more sensitive to complexity in the root folder of the p - [llm.cpp](https://github.com/gevtushenko/llm.c) by @[gevtushenko](https://github.com/gevtushenko): a port of this project using the [CUDA C++ Core Libraries](https://github.com/NVIDIA/cccl) - A presentation this fork was covered in [this lecture](https://www.youtube.com/watch?v=WiB_3Csfj_Q) in the [CUDA MODE Discord Server](https://discord.gg/cudamode) +- C++/CUDA + - [llm.cpp](https://github.com/zhangpiu/llm.cpp/tree/master/llmcpp) by @[zhangpiu](https://github.com/zhangpiu): a port of this project using the [Eigen](https://gitlab.com/libeigen/eigen), supporting CPU/CUDA. + - Go - [llm.go](https://github.com/joshcarp/llm.go) by @[joshcarp](https://github.com/joshcarp): a Go port of this project From 25a302fde69041df358a6e413b7153b1d38dd5f9 Mon Sep 17 00:00:00 2001 From: Biao Zhang Date: Sat, 10 Aug 2024 17:53:12 +0800 Subject: [PATCH 03/10] Update README.md --- README.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/README.md b/README.md index 797cd832a..5c1827547 100644 --- a/README.md +++ b/README.md @@ -213,14 +213,12 @@ Lastly, I will be a lot more sensitive to complexity in the root folder of the p - [llm.cpp](https://github.com/gevtushenko/llm.c) by @[gevtushenko](https://github.com/gevtushenko): a port of this project using the [CUDA C++ Core Libraries](https://github.com/NVIDIA/cccl) - A presentation this fork was covered in [this lecture](https://www.youtube.com/watch?v=WiB_3Csfj_Q) in the [CUDA MODE Discord Server](https://discord.gg/cudamode) - - C++/CUDA - [llm.cpp](https://github.com/zhangpiu/llm.cpp/tree/master/llmcpp) by @[zhangpiu](https://github.com/zhangpiu): a port of this project using the [Eigen](https://gitlab.com/libeigen/eigen), supporting CPU/CUDA. - WebGPU C++ - [gpu.cpp](https://github.com/AnswerDotAI/gpu.cpp) by @[austinvhuang](https://github.com/austinvhuang): a library for portable GPU compute in C++ using native WebGPU. Aims to be a general-purpose library, but also porting llm.c kernels to WGSL. 
- - Go - [llm.go](https://github.com/joshcarp/llm.go) by @[joshcarp](https://github.com/joshcarp): a Go port of this project From a1278731a3e9ba06a71bc3de6a4454a9f2f4d1a5 Mon Sep 17 00:00:00 2001 From: Aleksa Gordic Date: Sat, 10 Aug 2024 17:43:50 +0200 Subject: [PATCH 04/10] Minor refactor --- train_llama3.py | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/train_llama3.py b/train_llama3.py index 31596c306..364b5f66f 100644 --- a/train_llama3.py +++ b/train_llama3.py @@ -16,17 +16,17 @@ TODO: add the actual commands """ +import argparse import os import math import glob import inspect from contextlib import nullcontext from dataclasses import dataclass -import json from pathlib import Path +import time from typing import ( AbstractSet, - Callable, Collection, Dict, Iterator, @@ -402,7 +402,7 @@ def unpermute(w, n_heads, dim1, dim2): def from_pretrained_llama3_hf(cls, model_id): """Loads pretrained LLaMA model weights from HuggingFace""" from transformers import AutoModelForCausalLM, AutoTokenizer - assert model_id == "meta-llama/Meta-Llama-3.1-8B", "Only the 8B-bae model is supported for now" + assert model_id == "meta-llama/Meta-Llama-3.1-8B", "Only the 8B-base model is supported for now" model_args = LlamaConfig() model = AutoModelForCausalLM.from_pretrained(model_id) @@ -956,16 +956,14 @@ def print0(*args, **kwargs): print(*args, **kwargs) if __name__ == "__main__": - import time - import argparse print0(f"Running pytorch {torch.version.__version__}") # default settings will overfit a tiny batch of data # and save model weights and debug state to disk on the first iteration parser = argparse.ArgumentParser() parser.add_argument("--use_hf", type=int, default=1, help="use HuggingFace (default) or use Meta's model") - parser.add_argument("--ckpt_dir", type=str, default=None, help="path to llama3 model checkpoint") - parser.add_argument("--tokenizer_path", type=str, default=None, help="path to llama3 tokenizer") + parser.add_argument("--ckpt_dir", type=str, default=None, help="path to llama3 model checkpoint (needed if use_hf=0)") + parser.add_argument("--tokenizer_path", type=str, default=None, help="path to llama3 tokenizer (needed if use_hf=0)") # file system input / output parser.add_argument("--input_bin", type=str, default="dev/data/tinyshakespeare/tiny_shakespeare_val.bin", help="input .bin to train on") parser.add_argument("--input_val_bin", type=str, default="", help="input .bin to eval validation loss on") @@ -1049,9 +1047,9 @@ def print0(*args, **kwargs): device = "cuda" elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): device = "mps" - print(f"using device: {device}") device_type = 'cuda' if 'cuda' in device else 'cpu' - assert device_type in {'cuda'} # we need to load LLaMA as bf16 on CUDA + assert device_type in {'cuda'}, "GPU required to run LLaMA 3" # we need to load LLaMA as bf16 on CUDA + print(f"using device: {device}") # calculate gradient accumulation from the desired total batch size and the current run configuration tokens_per_fwdbwd = B * T * ddp_world_size @@ -1079,11 +1077,11 @@ def print0(*args, **kwargs): FLASH = args.flash # init the model - assert args.ckpt_dir is not None and os.path.exists(args.ckpt_dir), f"llama3 ckpt dir {args.ckpt_dir} does not exist" - assert args.tokenizer_path is not None and os.path.exists(args.tokenizer_path), f"llama3 tokenizer path {args.tokenizer_path} does not exist" if args.use_hf: model = LLaMA.from_pretrained_llama3_hf(args.model) else: # use Meta's 
checkpoint + assert args.ckpt_dir is not None and os.path.exists(args.ckpt_dir), f"llama3 ckpt dir {args.ckpt_dir} does not exist" + assert args.tokenizer_path is not None and os.path.exists(args.tokenizer_path), f"llama3 tokenizer path {args.tokenizer_path} does not exist" model = LLaMA.from_pretrained_llama3_meta(args.ckpt_dir, args.tokenizer_path) model.train() From 92cc4ebf36e8dc7e6dca8bfc3e3ff066825117a1 Mon Sep 17 00:00:00 2001 From: Aleksa Gordic Date: Sat, 10 Aug 2024 18:04:59 +0200 Subject: [PATCH 05/10] Add msg to asserts --- train_llama3.py | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/train_llama3.py b/train_llama3.py index 364b5f66f..e1d09819c 100644 --- a/train_llama3.py +++ b/train_llama3.py @@ -500,18 +500,18 @@ def generate( """ bsz = len(prompt_tokens) - assert bsz <= self.config.max_gen_batch_size, (bsz, self.config.max_gen_batch_size) + assert bsz <= self.config.max_gen_batch_size, f"Batch size {bsz} exceeds the maximum generation batch size {self.config.max_gen_batch_size}" device = next(self.parameters()).device min_prompt_len = min(len(t) for t in prompt_tokens) max_prompt_len = max(len(t) for t in prompt_tokens) - assert max_prompt_len <= self.config.block_size + assert max_prompt_len <= self.config.block_size, f"Prompt length {max_prompt_len} exceeds the maximum block size {self.config.block_size}" total_len = min(self.config.block_size, max_gen_len + max_prompt_len) pad_id = self.tokenizer.pad_id tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device=device) - for k, t in enumerate(prompt_tokens): - tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device=device) + for idx, t in enumerate(prompt_tokens): + tokens[idx, : len(t)] = torch.tensor(t, dtype=torch.long, device=device) if logprobs: token_logprobs = torch.zeros_like(tokens, dtype=torch.float) @@ -549,9 +549,7 @@ def generate( reduction="none", ignore_index=pad_id, ) - eos_reached |= (~input_text_mask[:, cur_pos]) & ( - torch.isin(next_token, stop_tokens) - ) + eos_reached |= ~input_text_mask[:, cur_pos] & torch.isin(next_token, stop_tokens) prev_pos = cur_pos if all(eos_reached): break From a0c666bfbb9fd22818da0626b957babdbc765861 Mon Sep 17 00:00:00 2001 From: Aleksa Gordic Date: Sat, 10 Aug 2024 18:24:51 +0200 Subject: [PATCH 06/10] Remove logprobs - like andrej in nano llama 3 --- train_llama3.py | 33 ++++----------------------------- 1 file changed, 4 insertions(+), 29 deletions(-) diff --git a/train_llama3.py b/train_llama3.py index e1d09819c..a4c366324 100644 --- a/train_llama3.py +++ b/train_llama3.py @@ -477,7 +477,6 @@ def generate( max_gen_len: int, temperature: float = 0.6, top_p: float = 0.9, - logprobs: bool = False, echo: bool = False, ) -> Tuple[List[List[int]], Optional[List[List[float]]]]: """ @@ -488,15 +487,13 @@ def generate( max_gen_len (int): Maximum length of the generated text sequence. temperature (float, optional): Temperature value for controlling randomness in sampling. Defaults to 0.6. top_p (float, optional): Top-p probability threshold for nucleus sampling. Defaults to 0.9. - logprobs (bool, optional): Flag indicating whether to compute token log probabilities. Defaults to False. echo (bool, optional): Flag indicating whether to include prompt tokens in the generated output. Defaults to False. Returns: - Tuple[List[List[int]], Optional[List[List[float]]]]: A tuple containing generated token sequences and, if logprobs is True, corresponding token log probabilities. 
+ Tuple[List[List[int]], Optional[List[List[float]]]]: A tuple containing generated token sequences. Note: This method uses the provided prompts as a basis for generating text. It employs nucleus sampling to produce text with controlled randomness. - If logprobs is True, token log probabilities are computed for each generated token. """ bsz = len(prompt_tokens) @@ -512,8 +509,6 @@ def generate( tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device=device) for idx, t in enumerate(prompt_tokens): tokens[idx, : len(t)] = torch.tensor(t, dtype=torch.long, device=device) - if logprobs: - token_logprobs = torch.zeros_like(tokens, dtype=torch.float) prev_pos = 0 eos_reached = torch.tensor([False] * bsz, device=device) @@ -521,12 +516,6 @@ def generate( if min_prompt_len == total_len: logits, _ = self.forward(tokens, start_pos=prev_pos) - token_logprobs = -F.cross_entropy( - input=logits.transpose(1, 2), - target=tokens, - reduction="none", - ignore_index=pad_id, - ) stop_tokens = torch.tensor(list(self.tokenizer.stop_tokens)).to(device) @@ -542,39 +531,25 @@ def generate( # only replace token if prompt has already been generated next_token = torch.where(input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token) tokens[:, cur_pos] = next_token - if logprobs: - token_logprobs[:, prev_pos + 1 : cur_pos + 1] = -F.cross_entropy( - input=logits.transpose(1, 2), - target=tokens[:, prev_pos + 1 : cur_pos + 1], - reduction="none", - ignore_index=pad_id, - ) eos_reached |= ~input_text_mask[:, cur_pos] & torch.isin(next_token, stop_tokens) prev_pos = cur_pos if all(eos_reached): break - if logprobs: - token_logprobs = token_logprobs.tolist() - out_tokens, out_logprobs = [], [] + out_tokens = [] for i, toks in enumerate(tokens.tolist()): # cut to max gen len start = 0 if echo else len(prompt_tokens[i]) toks = toks[start : len(prompt_tokens[i]) + max_gen_len] - probs = None - if logprobs: - probs = token_logprobs[i][start : len(prompt_tokens[i]) + max_gen_len] # cut to after eos tok if any for stop_token in self.tokenizer.stop_tokens: try: eos_idx = toks.index(stop_token) toks = toks[:eos_idx] - probs = probs[:eos_idx] if logprobs else None except ValueError: pass out_tokens.append(toks) - out_logprobs.append(probs) - return (out_tokens, out_logprobs if logprobs else None) + return out_tokens # ----------------------------------------------------------------------------- # sampling utils @@ -1194,7 +1169,7 @@ def get_lr(it): else: # Meta prompt_tokens = [model.tokenizer.encode(x, bos=True, eos=False) for x in prompts] - generation_tokens, _ = model.generate(prompt_tokens, max_gen_len=64, temperature=0.6, top_p=0.9, logprobs=False, echo=False) + generation_tokens = model.generate(prompt_tokens, max_gen_len=64, temperature=0.6, top_p=0.9, echo=False) results = [{"generation": model.tokenizer.decode(t)} for t in generation_tokens] for prompt, result in zip(prompts, results): print(prompt, end="") From b0bc864320517cdfc4edc74b1974c2673219056d Mon Sep 17 00:00:00 2001 From: Aleksa Gordic Date: Sat, 10 Aug 2024 18:51:56 +0200 Subject: [PATCH 07/10] Refactor flash logic --- train_llama3.py | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/train_llama3.py b/train_llama3.py index a4c366324..b8d860920 100644 --- a/train_llama3.py +++ b/train_llama3.py @@ -55,9 +55,6 @@ # ----------------------------------------------------------------------------- # PyTorch nn.Module definitions for the LLaMA 3.x model -# using a global to toggle flash-attention -FLASH = 0 - # Used 
in Grouped Query Attention (GQA), broadcasts the key and value tensors def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor: """torch.repeat_interleave(x, dim=2, repeats=n_rep)""" @@ -157,6 +154,7 @@ def __init__(self, config): self.n_rep = self.n_head // self.n_kv_head self.hd = config.n_embd // config.n_head self.use_kv = config.use_kv + self.flash = config.flash self.c_attn = nn.Linear(config.n_embd, (config.n_head + 2 * config.n_kv_head) * self.hd, bias=False) # key, query, value projections self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False) # output projection @@ -186,9 +184,12 @@ def forward(self, x, freqs_cis=None, start_pos=None, mask=None): q, k, v = map(lambda t: t.transpose(1, 2), (q, k, v)) # (B, NH, T, HD) - if FLASH: + if self.flash: # flashattention - y = F.scaled_dot_product_attention(q, k, v, mask) + # if T == 1 no need to mask, otherwise the function complains + # scaled_dot_product_attention expects a mask where value of True indicates that the element should take part in attention + # our mask is the opposite, so we need to invert it + y = F.scaled_dot_product_attention(q, k, v, mask == 0 if T > 1 else None) else: # manual implementation of attention # this materializes the large (T,T) matrix for all the queries and keys @@ -257,6 +258,7 @@ class LlamaConfig: use_scaled_rope: bool = True max_gen_batch_size: int = 4 use_kv: bool = True + flash: bool = False # use flashattention? def __init__(self, **kwargs): for k, v in kwargs.items(): @@ -966,7 +968,6 @@ def print0(*args, **kwargs): # memory management parser.add_argument("--device", type=str, default="", help="by default we autodetect, or set it here") parser.add_argument("--compile", type=int, default=0, help="torch.compile the model") - parser.add_argument("--flash", type=int, default=0, help="use flash attention") parser.add_argument("--dtype", type=str, default="bfloat16", help="float32|float16|bfloat16") parser.add_argument("--zero_stage", type=int, default=0, help="zero redundancy optimizer stage (0/1/2/3)") # python -> C bridge @@ -1045,10 +1046,6 @@ def print0(*args, **kwargs): if args.tensorcores: torch.set_float32_matmul_precision('high') - # turn on/off flash attention - assert args.flash in {0, 1} - FLASH = args.flash - # init the model if args.use_hf: model = LLaMA.from_pretrained_llama3_hf(args.model) From bd8c6045be3e108f1c221a53012631643b85e873 Mon Sep 17 00:00:00 2001 From: Andrej Karpathy Date: Fri, 13 Sep 2024 19:04:08 +0000 Subject: [PATCH 08/10] change default params: use tinyshakespeare and decrease LR --- train_llama3.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/train_llama3.py b/train_llama3.py index 2c64039fc..f9daafde0 100644 --- a/train_llama3.py +++ b/train_llama3.py @@ -943,7 +943,7 @@ def print0(*args, **kwargs): parser.add_argument("--ckpt_dir", type=str, default=None, help="path to llama3 model checkpoint (needed if use_hf=0)") parser.add_argument("--tokenizer_path", type=str, default=None, help="path to llama3 tokenizer (needed if use_hf=0)") # file system input / output - parser.add_argument("--input_bin", type=str, default="dev/data/tinystories/TinyStories_val.bin", help="input .bin to train on") + parser.add_argument("--input_bin", type=str, default="dev/data/tinyshakespeare/tiny_shakespeare_val.bin", help="input .bin to train on") parser.add_argument("--input_val_bin", type=str, default="", help="input .bin to eval validation loss on") parser.add_argument("--output_dir", type=str, default="", help="output directory to which to write 
logs and checkpoints") parser.add_argument("--model", type=str, default="meta-llama/Meta-Llama-3.1-8B", help="chose the llama model") @@ -955,7 +955,7 @@ def print0(*args, **kwargs): parser.add_argument("--num_iterations", type=int, default=10, help="number of iterations to run") parser.add_argument("--inference_only", type=int, default=0, help="only run inference") # optimization - parser.add_argument("--learning_rate", type=float, default=1e-4, help="learning rate warmup iterations") + parser.add_argument("--learning_rate", type=float, default=1e-5, help="learning rate warmup iterations") parser.add_argument("--warmup_iters", type=int, default=0, help="learning rate warmup iterations") parser.add_argument("--learning_rate_decay_frac", type=float, default=1.0, help="learning rate warmup iterations") parser.add_argument("--weight_decay", type=float, default=0.0, help="weight decay") From 5b2e3180fbb5668a0d3c8cecd07b0732ebad330a Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Mon, 23 Sep 2024 17:09:45 -0700 Subject: [PATCH 09/10] cuda mode -> gpu mode This is a documentation only change. Hoping this is OK to merge. See this tweet for more context on why we made this change https://x.com/jeremyphoward/status/1838341110344880637 --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 02cfc13d6..2f8b9194b 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # llm.c -LLMs in simple, pure C/CUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the [GPT-2](https://github.com/openai/gpt-2) and [GPT-3](https://arxiv.org/abs/2005.14165) miniseries, along with a parallel PyTorch reference implementation in [train_gpt2.py](train_gpt2.py). You'll recognize this file as a slightly tweaked [nanoGPT](https://github.com/karpathy/nanoGPT), an earlier project of mine. Currently, llm.c is a bit faster than PyTorch Nightly (by about 7%). In addition to the bleeding edge mainline code in [train_gpt2.cu](train_gpt2.cu), we have a simple reference CPU fp32 implementation in ~1,000 lines of clean code in one file [train_gpt2.c](train_gpt2.c). I'd like this repo to only maintain C and CUDA code. Ports to other languages or repos are very welcome, but should be done in separate repos, and I am happy to link to them below in the "notable forks" section. Developer coordination happens in the [Discussions](https://github.com/karpathy/llm.c/discussions) and on Discord, either the `#llmc` channel on the [Zero to Hero](https://discord.gg/3zy8kqD9Cp) channel, or on `#llmdotc` on CUDA MODE Discord. +LLMs in simple, pure C/CUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the [GPT-2](https://github.com/openai/gpt-2) and [GPT-3](https://arxiv.org/abs/2005.14165) miniseries, along with a parallel PyTorch reference implementation in [train_gpt2.py](train_gpt2.py). You'll recognize this file as a slightly tweaked [nanoGPT](https://github.com/karpathy/nanoGPT), an earlier project of mine. Currently, llm.c is a bit faster than PyTorch Nightly (by about 7%). In addition to the bleeding edge mainline code in [train_gpt2.cu](train_gpt2.cu), we have a simple reference CPU fp32 implementation in ~1,000 lines of clean code in one file [train_gpt2.c](train_gpt2.c). I'd like this repo to only maintain C and CUDA code. 
Ports to other languages or repos are very welcome, but should be done in separate repos, and I am happy to link to them below in the "notable forks" section. Developer coordination happens in the [Discussions](https://github.com/karpathy/llm.c/discussions) and on Discord, either the `#llmc` channel on the [Zero to Hero](https://discord.gg/3zy8kqD9Cp) channel, or on `#llmdotc` on GPU MODE Discord. ## quick start @@ -211,7 +211,7 @@ Lastly, I will be a lot more sensitive to complexity in the root folder of the p - CUDA C++ - [llm.cpp](https://github.com/gevtushenko/llm.c) by @[gevtushenko](https://github.com/gevtushenko): a port of this project using the [CUDA C++ Core Libraries](https://github.com/NVIDIA/cccl) - - A presentation this fork was covered in [this lecture](https://www.youtube.com/watch?v=WiB_3Csfj_Q) in the [CUDA MODE Discord Server](https://discord.gg/cudamode) + - A presentation this fork was covered in [this lecture](https://www.youtube.com/watch?v=WiB_3Csfj_Q) in the [GPU MODE Discord Server](https://discord.gg/cudamode) - C++/CUDA - [llm.cpp](https://github.com/zhangpiu/llm.cpp/tree/master/llmcpp) by @[zhangpiu](https://github.com/zhangpiu): a port of this project using the [Eigen](https://gitlab.com/libeigen/eigen), supporting CPU/CUDA. From 315b8d1f626885db31a895510cb299cb282616cc Mon Sep 17 00:00:00 2001 From: Mark Saroufim Date: Tue, 24 Sep 2024 13:08:11 -0700 Subject: [PATCH 10/10] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2f8b9194b..d2d107821 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # llm.c -LLMs in simple, pure C/CUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the [GPT-2](https://github.com/openai/gpt-2) and [GPT-3](https://arxiv.org/abs/2005.14165) miniseries, along with a parallel PyTorch reference implementation in [train_gpt2.py](train_gpt2.py). You'll recognize this file as a slightly tweaked [nanoGPT](https://github.com/karpathy/nanoGPT), an earlier project of mine. Currently, llm.c is a bit faster than PyTorch Nightly (by about 7%). In addition to the bleeding edge mainline code in [train_gpt2.cu](train_gpt2.cu), we have a simple reference CPU fp32 implementation in ~1,000 lines of clean code in one file [train_gpt2.c](train_gpt2.c). I'd like this repo to only maintain C and CUDA code. Ports to other languages or repos are very welcome, but should be done in separate repos, and I am happy to link to them below in the "notable forks" section. Developer coordination happens in the [Discussions](https://github.com/karpathy/llm.c/discussions) and on Discord, either the `#llmc` channel on the [Zero to Hero](https://discord.gg/3zy8kqD9Cp) channel, or on `#llmdotc` on GPU MODE Discord. +LLMs in simple, pure C/CUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the [GPT-2](https://github.com/openai/gpt-2) and [GPT-3](https://arxiv.org/abs/2005.14165) miniseries, along with a parallel PyTorch reference implementation in [train_gpt2.py](train_gpt2.py). You'll recognize this file as a slightly tweaked [nanoGPT](https://github.com/karpathy/nanoGPT), an earlier project of mine. Currently, llm.c is a bit faster than PyTorch Nightly (by about 7%). 
In addition to the bleeding edge mainline code in [train_gpt2.cu](train_gpt2.cu), we have a simple reference CPU fp32 implementation in ~1,000 lines of clean code in one file [train_gpt2.c](train_gpt2.c). I'd like this repo to only maintain C and CUDA code. Ports to other languages or repos are very welcome, but should be done in separate repos, and I am happy to link to them below in the "notable forks" section. Developer coordination happens in the [Discussions](https://github.com/karpathy/llm.c/discussions) and on Discord, either the `#llmc` channel on the [Zero to Hero](https://discord.gg/3zy8kqD9Cp) channel, or on `#llmdotc` on [GPU MODE](https://discord.gg/gpumode) Discord. ## quick start
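The flash branch introduced in PATCH 07/10 ("Refactor flash logic") hinges on one convention mismatch: `F.scaled_dot_product_attention` interprets a boolean `attn_mask` as "True = take part in attention", while the additive causal mask built in `train_llama3.py` marks blocked positions with `-inf`, hence the `mask == 0` inversion before the call. The following is a minimal standalone sketch of that equivalence — shapes and variable names here are illustrative, not code from the patch:

```python
# Sketch (assumed shapes/names): check that inverting the additive -inf causal mask into a
# boolean "True = attend" mask gives the same result as the manual attention path.
import torch
import torch.nn.functional as F

B, NH, T, HD = 1, 2, 4, 8                      # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, NH, T, HD) for _ in range(3))

# additive causal mask: 0 where attention is allowed, -inf above the diagonal
add_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

# boolean mask expected by scaled_dot_product_attention: True where attention is allowed
bool_mask = add_mask == 0

y_flash = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)

# manual reference, mirroring the non-flash branch
att = (q @ k.transpose(-2, -1)) / (HD ** 0.5) + add_mask
y_manual = F.softmax(att, dim=-1) @ v

print(torch.allclose(y_flash, y_manual, atol=1e-5))  # should print True
```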