Improving memory efficiency further 🚀 #30860
Comments
That's actually something we should really do, in light of #29943, which has this block: transformers/src/transformers/models/jamba/modeling_jamba.py, lines 1657 to 1662 at e2ecd86
(clone is missing)
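For context, a paraphrased sketch of the `num_logits_to_keep` pattern that the referenced Jamba lines revolve around (paraphrased from memory, not an exact quote of the lines at e2ecd86):

```python
# Paraphrased sketch, not the exact Jamba code: project only the last
# `num_logits_to_keep` hidden states through the LM head, so the huge
# (seq_len, vocab_size) logits tensor is never materialized for the full prompt.
if num_logits_to_keep is None:
    logits = self.lm_head(hidden_states)
else:
    logits = self.lm_head(hidden_states[..., -num_logits_to_keep:, :])
logits = logits.float()
```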
This is true except in assisted generation, where we want the logits for all candidate tokens 😛 But we can generalize to "we only ever want as many logits as input tokens".
👉 regarding keeping all the logits at prefill time: in our
👉 regarding casting the logits with
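A minimal illustration of that generalization (hypothetical variable names, not the actual `generate()` code): in every decoding scheme, only as many logit rows as tokens fed to the current forward pass are ever consumed.

```python
# Hypothetical sketch: regular decoding feeds 1 token per step; assisted
# generation feeds the candidate tokens plus one, which is still far fewer
# than the full sequence length.
num_new_tokens = model_inputs["input_ids"].shape[1]      # tokens in this forward pass
needed_logits = outputs.logits[:, -num_new_tokens:, :]   # everything else can be dropped
```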
Yeah I think it should be okay. Our tests are gonna fail but I would want a bench result to see if the break is worth it!
Great! I'll open a PR soon and will provide benchmarks.
@Cyrilvallez this issue is complete, correct?
Indeed! Closing it
## Summary
The analogous `logits.float()` calls were moved in the Hugging Face modeling source code to be inside the `if labels is not None` block, to avoid upcasting logits unless they are being used in a loss calculation; this avoids a memory spike during inference if the model is in lower precision.

* https://github.com/huggingface/transformers/blob/37ea04013b34b39c01b51aeaacd8d56f2c62a7eb/src/transformers/models/llama/modeling_llama.py#L1211-L1212
* https://github.com/huggingface/transformers/blob/37ea04013b34b39c01b51aeaacd8d56f2c62a7eb/src/transformers/models/mixtral/modeling_mixtral.py#L1329-L1330
* https://github.com/huggingface/transformers/blob/37ea04013b34b39c01b51aeaacd8d56f2c62a7eb/src/transformers/models/phi3/modeling_phi3.py#L1303-L1304
* https://github.com/huggingface/transformers/blob/37ea04013b34b39c01b51aeaacd8d56f2c62a7eb/src/transformers/models/qwen2/modeling_qwen2.py#L1206-L1207

Some of your models already have this change:

* https://github.com/linkedin/Liger-Kernel/blob/ff6650bbcef5d31b7522694cbeb73a21169460e9/src/liger_kernel/transformers/model/mistral.py#L114-L116
* https://github.com/linkedin/Liger-Kernel/blob/ff6650bbcef5d31b7522694cbeb73a21169460e9/src/liger_kernel/transformers/model/gemma.py#L114-L116

See also:

* huggingface/transformers#30860

## Testing Done
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
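A minimal sketch of the pattern described above, as a self-contained helper (hypothetical function, simplified relative to the actual modeling code):

```python
import torch
import torch.nn.functional as F

def lm_head_with_optional_loss(hidden_states, lm_head, labels=None):
    """Sketch of the 'upcast only when a loss is needed' pattern
    (hypothetical helper, not the upstream transformers code)."""
    logits = lm_head(hidden_states)  # stays in the model's dtype (e.g. bf16)
    loss = None
    if labels is not None:
        # Upcast to fp32 only for the loss computation, where precision matters.
        # Pure inference never materializes an extra fp32 copy of the logits.
        logits = logits.float()
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    return logits, loss
```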
### Feature request
Removing the line `logits = logits.float()` in most `ModelForCausalLM`. This would save a lot of memory for models with a large vocabulary size; on Llama3, it divides the memory peak by more than 2.

### Motivation
This is in relation to my work in #30536.
I noticed that almost all `ModelForCausalLM` contain the following line in their `forward`: `logits = logits.float()`.

Now, since most models are used in (b)float16, or even quantized, that line will almost always double the memory footprint of the logits. As the vocabulary size can be quite big (e.g. Llama3), this results in a lot of memory being used.
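To put rough numbers on this (my own arithmetic, not figures from the issue): for Llama3's vocabulary of 128 256 entries, a single sequence of 2048 tokens already gives a sizeable logits tensor, and the fp32 copy adds twice that again.

```python
# Back-of-the-envelope sizes for the logits tensor (batch 1, 2048 tokens,
# Llama3 vocabulary). Illustrative arithmetic only.
vocab_size, seq_len = 128_256, 2048
bf16_logits = seq_len * vocab_size * 2 / 2**30   # ~0.49 GiB in bfloat16
fp32_logits = seq_len * vocab_size * 4 / 2**30   # ~0.98 GiB after .float()
print(f"bf16: {bf16_logits:.2f} GiB, fp32 copy: {fp32_logits:.2f} GiB")
```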
I suspect that it was originally introduced so that later manipulations of the logits (processors, warpers...) can be applied without losing too much precision. However, in `generate()` we only ever use the last token's logits, not the whole logits matrix, so this is a huge waste of memory.

### Your contribution
If the casting of the logits to float is indeed only there to avoid losing precision in these manipulations, I propose to cast only the last token to `float` in each decoding strategy function. So, instead of doing `logits = logits.float()` in `forward()`, do the cast in each decoding strategy function, on the last token vector only, which is negligible in terms of memory overhead (see the sketch below).
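A minimal sketch of what the cast inside a decoding strategy could look like (hypothetical variable names; the exact placement inside `generate()` is up to the implementation):

```python
# Inside a decoding-strategy loop (e.g. greedy search), hypothetical sketch:
outputs = self(**model_inputs, return_dict=True)

# Cast only the last position to fp32 before applying logits processors.
# Shape (batch, vocab_size): tiny compared to the full (batch, seq, vocab) tensor.
next_token_logits = outputs.logits[:, -1, :].float()
next_token_scores = logits_processor(input_ids, next_token_logits)
```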
As an example of the potential memory gains, running this very simple code snippet on Llama3 8B (vocabulary size 128256):
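A minimal sketch of such a measurement, assuming the `meta-llama/Meta-Llama-3-8B` checkpoint, a single CUDA device, and `torch.cuda.max_memory_allocated` for the peak (a reconstruction, not the author's exact snippet):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reconstruction sketch: load the model in bfloat16, generate a few hundred
# tokens, and report the peak GPU memory used during generation.
model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
torch.cuda.reset_peak_memory_stats()
model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```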
gives:


That is, the memory footprint is more than halved. This is because the vocabulary size is so large that computing the logits from the hidden states is actually more costly than computing the hidden states themselves. Thus, when casting to `float()`, we more than double the memory requirements (double for the new logits, plus the overhead while actually copying).

Of course, other models usually have a smaller vocabulary size, so they will not benefit as much, but the memory peak will still decrease by a non-negligible portion for all applicable models (see below for Mistral, ~30% memory gain). And Llama3, which I believe is the hottest open-source model at the moment, will be much more efficient.
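To see why the logits dominate (my own arithmetic, assuming Llama3's hidden size of 4096 and vocabulary of 128 256): each position's logits vector is roughly 31× larger than its final hidden state, so the logits tensor dwarfs the activations it is computed from.

```python
# Per-token sizes in bfloat16 for Llama3-8B (illustrative arithmetic only).
hidden_size, vocab_size = 4096, 128_256
hidden_bytes = hidden_size * 2   # ~8 KiB per token of final hidden state
logits_bytes = vocab_size * 2    # ~250 KiB per token of logits
print(f"logits are ~{logits_bytes / hidden_bytes:.0f}x larger per token")  # ~31x
```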
mistral_ratio_example.pdf
Of course, if this casting to float is done for something else that I overlooked, this may not be applicable. Otherwise, I would be happy to make the change.
@ArthurZucker @gante
Cheers,
Cyril