Add LLaMA 3 Python support #725
Conversation
# -----------------------------------------------------------------------------
# LLaMA building blocks

class RMSNorm(torch.nn.Module):
Add a comment about why we're not using nn.RMSNorm maybe?
Just a tiny bit different numerics compared to LLaMA's reference implementation (tested). I'll leave a comment; we can swap in nn.RMSNorm later tbh.
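(For context, a minimal sketch of a reference-style RMSNorm of the kind being discussed; the eps default and the exact dtype handling below are illustrative assumptions, not necessarily what train_llama3.py does.)

```python
import torch
import torch.nn as nn

class RMSNormSketch(nn.Module):
    # Sketch of a Meta-style RMSNorm: normalize in float32, cast back, then scale.
    # Where exactly the cast back happens is what gives slightly different
    # numerics versus nn.RMSNorm.
    def __init__(self, dim, eps=1e-5):  # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # x / sqrt(mean(x^2) + eps), computed in float32 for stability
        xf = x.float()
        norm = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm.type_as(x) * self.weight
```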
train_llama3.py (outdated)
self.c_fc = nn.Linear(config.n_embd, hidden_dim, bias=False)
self.c_fc2 = nn.Linear(config.n_embd, hidden_dim, bias=False)
self.c_proj = nn.Linear(hidden_dim, config.n_embd, bias=False)
self.c_proj.LLMC_RESIDUAL_SCALE_FLAG = 1
Unused, in two places.
return logits, loss

@staticmethod
def adapt_llama_state_dict_keys(checkpoint, config: LlamaConfig):
Maybe a small comment/docs on these defs?
Wasn't sure what to add that would be useful; it should be fairly obvious (?)
I don't think so. Why adapt them?
Because their LLaMA class has different variable names compared to ours (we derive our naming from GPT-2) (?)
kk, will add it, but tbh it feels redundant, since on a first skim people can see we're renaming keys in the checkpoint dict.
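(As a hedged illustration of the renaming being discussed, something like the sketch below; the specific key names are examples and may not match the PR's actual mapping.)

```python
def adapt_llama_state_dict_keys_sketch(checkpoint):
    # Illustrative only: map Meta-style LLaMA checkpoint keys onto the
    # GPT-2-derived names this codebase uses. A real mapping would also
    # cover the per-layer attention/MLP weights.
    renames = {
        "tok_embeddings.weight": "transformer.wte.weight",  # token embedding
        "norm.weight": "transformer.ln_f.weight",           # final RMSNorm
        "output.weight": "lm_head.weight",                   # unembedding
    }
    return {renames.get(k, k): v for k, v in checkpoint.items()}
```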
self.eom_id: int = self.special_tokens["<|eom_id|>"]
self.python_tag_id = self.special_tokens["<|python_tag|>"]
self.pad_id: int = self.special_tokens["<|finetune_right_pad_id|>"]
self.stop_tokens = [
I fixed this in llama31. These stop tokens are incorrect for the model.
I fixed it, just in a different place, by adding the right EOS token; see from_pretrained_llama3_hf and from_pretrained_llama3_meta.
We can refactor later once we support the chat model?
Ohh you override it there. Hmm I think leaving this here is a bit dangerous and possibly confusing, just as setting class attributes is. Maybe we take it as an arg in our code here?
I fixed it using nano llama 3's solution.
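(For illustration, one hedged way to make the stop tokens explicit rather than hard-coding a single list on the tokenizer; the helper name and the base-vs-chat split below are assumptions.)

```python
def get_stop_tokens(special_tokens, chat_model=False):
    # Sketch: the base model should stop on <|end_of_text|>; the instruct/chat
    # model additionally stops on <|eot_id|> / <|eom_id|>. Taking this as an
    # argument avoids a hard-coded list that only suits one model variant.
    stop = [special_tokens["<|end_of_text|>"]]
    if chat_model:
        stop += [special_tokens["<|eot_id|>"], special_tokens["<|eom_id|>"]]
    return stop
```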
# -----------------------------------------------------------------------------
# Our own simple Distributed Data Loader

def _peek_data_shard(filename):
This code is all messed up and outdated atm.
We need data encoded with the LLaMA 3 tokenizer; as written, this actively introduces bugs if someone tries to run it, since it reads GPT-2-tokenized data as uint16.
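(A hedged sketch of the kind of header check that would catch this; the magic number, version, and 256-int32 header layout below are assumptions for illustration, not necessarily the repo's actual on-disk format.)

```python
import numpy as np

def _peek_llama3_data_shard(filename):
    # Illustrative header: 256 int32s, [0]=magic, [1]=version, [2]=token count.
    # A LLaMA 3 vocab (~128K ids) does not fit in uint16, so shards would have
    # to store uint32 tokens, unlike the GPT-2 uint16 shards this loader expects.
    with open(filename, "rb") as f:
        header = np.frombuffer(f.read(256 * 4), dtype=np.int32)
    assert header[0] == 20240801, "magic mismatch: not a LLaMA 3 token shard?"  # assumed magic
    assert header[1] == 7, "unsupported shard version"                          # assumed version
    return int(header[2])  # number of uint32 tokens following the header
```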
# 2) https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/api/model.py
# 3) https://github.com/meta-llama/llama3/blob/11817d47e1ba7a4959b025eb1ca308572e0e3963/llama/generation.py

Example launches to only benchmark the speed of bfloat16 compiled GPU training:
Delete the launch commands that are incorrect atm.
xk: torch.Tensor,
freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
Can come later in a PR, but let's delete the use of complex numbers. Meta's use of complex here was a clear mistake: it created a lot of complexity for no good reason, and iirc it broke torch.compile for me once earlier. In their latest code (for llamachat) Meta fixed this and they're now using a fully real-valued impl.
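(For reference, a minimal real-valued rotary-embedding sketch of the kind being suggested; the function name and the interleaved pairing convention are illustrative.)

```python
import torch

def apply_rotary_emb_real(x, cos, sin):
    # x: (B, T, n_head, head_dim); cos, sin: (T, head_dim // 2)
    # Rotate adjacent channel pairs directly, no complex dtypes involved.
    x1, x2 = x.float().reshape(*x.shape[:-1], -1, 2).unbind(-1)  # (B, T, n_head, head_dim//2) each
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]      # broadcast over batch and heads
    y1 = x1 * cos - x2 * sin
    y2 = x1 * sin + x2 * cos
    return torch.stack((y1, y2), dim=-1).flatten(-2).type_as(x)  # back to (B, T, n_head, head_dim)
```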
train_llama3.py (outdated)
self.cache_k = torch.zeros((config.max_gen_batch_size, config.block_size, config.n_kv_head, self.hd))
self.cache_v = torch.zeros((config.max_gen_batch_size, config.block_size, config.n_kv_head, self.hd))

def forward(self, x, freqs_cis=None, start_pos=None, mask: Optional[torch.Tensor] = None):
Random thought: I think type hints are dumb; if you delete them in places, I will basically always accept that change.
I think they are OK for cases where the type is not obvious.
And I always strictly prefer comments, because e.g. with Tensors the important thing is not that it's a tensor, but what its shape is, what the dtype is, etc.
Kind of agree, esp. because they don't help catch errors; types are not enforced like in C.
deleted the mask annotation everywhere
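(A tiny illustration of the shape-comment style being preferred over type hints here; the function and shapes are made up for the example.)

```python
import torch

def apply_causal_mask(att, mask):
    # att:  (B, n_head, T, T) raw attention scores
    # mask: (T, T) additive mask, 0 on allowed positions and -inf elsewhere
    # returns: (B, n_head, T, T) masked scores, ready for softmax
    return att + mask
```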
train_llama3.py (outdated)
next_token = next_token.reshape(-1)
# only replace token if prompt has already been generated
next_token = torch.where(
black pollution
Add LLaMA 3 support in our Python code, acting as a reference.
The code currently supports only inference and is equivalent to nano llama 3.