
Add llama implementation with no tensor parallel linears #1561

Merged
merged 5 commits into sgl-project:main on Oct 5, 2024

Conversation

Contributor

@jerryzh168 jerryzh168 commented Oct 3, 2024

Summary:
Demo that llama with normal linears + quantized model + tensor parallelism works

  • verified correctness against the original llama3 model
  • supported --json-model-override-args in the bench_latency script

Next: add PyTorch-native tensor parallelism test code for int8 weight-only quantization in torchao. Diff from the current llama model def: https://gist.github.com/jerryzh168/692ff83735d4ca298c1aad2424b2c225
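For reference, the core change is that the model definition uses plain torch.nn.Linear layers instead of vLLM-style tensor-parallel linears (QKVParallelLinear / MergedColumnParallelLinear / RowParallelLinear), so torchao quantization and PyTorch-native tensor parallelism can wrap them directly. A minimal sketch of the MLP shape, with illustrative names and layout rather than the exact PR code:

# sketch only: a plain-linear llama MLP, analogous to the PR's model def
import torch
from torch import nn

class TorchNativeLlamaMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # ordinary linears instead of MergedColumnParallelLinear / RowParallelLinear
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))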

Test Plan:

Using --json-model-override-args to override the model architecture:

python3 -m sglang.bench_latency --correct --model meta-llama/Meta-Llama-3-8B --json-model-override-args '{"architectures": ["TorchNativeLlamaForCausalLM"]}'
Init nccl begin.
Load weight begin. avail mem=94.48 GB
INFO 10-04 15:00:53 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.46it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  2.26it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  2.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  3.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.75it/s]

Load weight end. type=TorchNativeLlamaForCausalLM, dtype=torch.bfloat16, avail mem=79.41 GB
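For context on what --json-model-override-args does: the JSON string is parsed and each key is applied on top of the checkpoint's HF config before the model class is resolved. A rough sketch of that behavior (not the exact SGLang code path):

import json
from transformers import AutoConfig

# the same override string passed via --json-model-override-args
override = json.loads('{"architectures": ["TorchNativeLlamaForCausalLM"]}')

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
for key, value in override.items():
    setattr(config, key, value)

# config.architectures now selects TorchNativeLlamaForCausalLM
# instead of the default LlamaForCausalLM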

Performance check:

python3 -m sglang.bench_latency --model jerryzh168/llama3-8B --batch-size 1 --input 128 --output 8
python3 -m sglang.bench_latency --model jerryzh168/llama3-8B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128

max_total_num_tokens=631444
Warmup ...
Prefill. latency: 0.09536 s, throughput:   1342.32 token/s
Decode.  latency: 0.00538 s, throughput:    185.80 token/s
Decode.  latency: 0.00476 s, throughput:    209.91 token/s
Decode.  latency: 0.00466 s, throughput:    214.38 token/s
Decode.  median latency: 0.00476 s, median throughput:    209.91 token/s
Total. latency:  0.110 s, throughput:   1198.18 token/s
Benchmark ...
Prefill. latency: 0.06534 s, throughput:   1958.93 token/s
Decode.  latency: 0.00502 s, throughput:    199.16 token/s
Decode.  latency: 0.00476 s, throughput:    210.03 token/s
Decode.  latency: 0.00469 s, throughput:    213.19 token/s
Decode.  latency: 0.00466 s, throughput:    214.77 token/s
Decode.  latency: 0.00466 s, throughput:    214.74 token/s
Decode.  median latency: 0.00469 s, median throughput:    213.19 token/s
Total. latency:  0.098 s, throughput:   1381.24 token/s
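The second command above applies torchao int4 weight-only quantization at group size 128. A minimal sketch of the equivalent torchao call on a plain linear, assuming torchao's quantize_ API; exact import paths may differ across torchao versions:

import torch
from torch import nn
from torchao.quantization import quantize_, int4_weight_only

# int4 weight-only kernels expect bf16 CUDA weights
model = nn.Sequential(nn.Linear(4096, 14336, bias=False)).to(torch.bfloat16).cuda()

# swaps every nn.Linear weight in-place for an int4 grouped-quantized tensor;
# this is why the model def needs plain nn.Linear rather than tensor-parallel linears
quantize_(model, int4_weight_only(group_size=128))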

Accuracy check:


# python3 scripts/playground/reference_hf.py --model meta-llama/Meta-Llama-3-8B
========== Prompt 0 ==========
prefill logits (final) tensor([ 5.0195,  3.0801,  0.7422,  ..., -7.4805, -7.4805, -7.4805],
       device='cuda:0')
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. The city is situated
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 1 ==========
prefill logits (final) tensor([ 5.2109,  4.2344,  1.8408,  ..., -7.5195, -7.5195, -7.5195],
       device='cuda:0')
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 2 ==========
prefill logits (final) tensor([ 9.5391,  3.1914,  0.8188,  ..., -7.0469, -7.0469, -7.0469],
       device='cuda:0')
<|begin_of_text|>Today is a sunny day and I like to go out and enjoy the sun. I am going to the beach with my

# python3 scripts/playground/reference_hf.py --model jerryzh168/llama3-8B
========== Prompt 0 ==========
prefill logits (final) tensor([ 5.0195,  3.0801,  0.7422,  ..., -7.4805, -7.4805, -7.4805],
       device='cuda:0')
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. The city is situated
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 1 ==========
prefill logits (final) tensor([ 5.2109,  4.2344,  1.8408,  ..., -7.5195, -7.5195, -7.5195],
       device='cuda:0')
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

========== Prompt 2 ==========
prefill logits (final) tensor([ 9.5391,  3.1914,  0.8188,  ..., -7.0469, -7.0469, -7.0469],
       device='cuda:0')
<|begin_of_text|>Today is a sunny day and I like to go out and enjoy the sun. I am going to the beach with my


# python3 -m sglang.bench_latency --correct --model meta-llama/Meta-Llama-3-8B
max_total_num_tokens=557684

input_ids=[[128000, 791, 6864, 315, 9822, 374], [128000, 791, 6864, 315, 279, 3723, 17262, 316, 374], [128000, 15724, 374, 264, 40798, 1938, 323, 358, 1093]]

prefill logits (first half): tensor([[ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 2.2969,  2.9531,  2.1406,  ..., -8.3750, -8.3750, -8.3750]],
       device='cuda:0')

prefill logits (final): tensor([[ 5.0312,  3.1094,  0.7500,  ..., -7.4375, -7.4375, -7.4375],
        [ 5.2188,  4.2188,  1.8359,  ..., -7.5312, -7.5312, -7.5312],
        [ 9.5000,  3.1406,  0.7891,  ..., -7.0938, -7.0938, -7.0938]],
       device='cuda:0')

========== Prompt 0 ==========
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. It is the largest

========== Prompt 1 ==========
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the

========== Prompt 2 ==========
<|begin_of_text|>Today is a sunny day and I like to go out for a walk. I am going to the park. I am


# python3 -m sglang.bench_latency --correct --model jerryzh168/llama3-8B
Load weight end. type=TorchNativeLlamaForCausalLM, dtype=torch.bfloat16, avail mem=79.41 GB
Memory pool end. avail mem=11.16 GB
Capture cuda graph begin. This can take up to several minutes.
max_total_num_tokens=557684

input_ids=[[128000, 791, 6864, 315, 9822, 374], [128000, 791, 6864, 315, 279, 3723, 17262, 316, 374], [128000, 15724, 374, 264, 40798, 1938, 323, 358, 1093]]

prefill logits (first half): tensor([[ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 1.9609,  2.1094, -1.2500,  ..., -5.5000, -5.5000, -5.5000],
        [ 2.2969,  2.9531,  2.1406,  ..., -8.3750, -8.3750, -8.3750]],
       device='cuda:0')

prefill logits (final): tensor([[ 5.0312,  3.1094,  0.7500,  ..., -7.4375, -7.4375, -7.4375],
        [ 5.2188,  4.2188,  1.8359,  ..., -7.5312, -7.5312, -7.5312],
        [ 9.5000,  3.1406,  0.7891,  ..., -7.0938, -7.0938, -7.0938]],
       device='cuda:0')

========== Prompt 0 ==========
<|begin_of_text|>The capital of France is Paris. It is located in the north of the country. Paris is the largest

========== Prompt 1 ==========
<|begin_of_text|>The capital of the United Kindom is London. It is the largest city in the UK and the largest city in the

========== Prompt 2 ==========
<|begin_of_text|>Today is a sunny day and I like to go out for a walk. I am going to the park. I am
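The runs above match: the HF reference and SGLang produce near-identical prefill logits (up to bf16 rounding) and closely matching greedy continuations, with small late-decode divergences expected from rounding. A small sketch of how such a comparison could be automated; ref_logits and sgl_logits are hypothetical tensors captured from the two runs:

import torch

def logits_match(ref_logits: torch.Tensor, sgl_logits: torch.Tensor,
                 atol: float = 0.1) -> bool:
    # bf16 execution order differs between the two stacks, so allow a
    # loose absolute tolerance rather than exact equality
    return torch.allclose(ref_logits.float(), sgl_logits.float(), atol=atol)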

Reviewers:

Subscribers:

Tasks:

Tags:

Contributor

@merrymercy merrymercy left a comment

  1. Is `TorchNativeLlamaForCausalLM` a better name?
  2. Did you test the correctness?
    - Get the reference output by `python3 scripts/playground/reference_hf.py --model [new model]`
    - Get the SGLang output by `python3 -m sglang.bench_latency --correct --model [new model]`
  3. Maybe we can add some arguments that allow using this model implementation without using a new checkpoint. We have some arguments like

         # Model override args
         parser.add_argument(
             "--json-model-override-args",
             type=str,
             help="A dictionary in JSON string format used to override default model configurations.",
             default=ServerArgs.json_model_override_args,
         )

     to override the model configs. I am not sure whether it works.

Summary:
Trying to demo llama with normal linear + quantized model + tensor parallelism works

Test Plan:
TODO

Reviewers:

Subscribers:

Tasks:

Tags:
@jerryzh168 jerryzh168 requested a review from merrymercy October 5, 2024 00:00
@merrymercy merrymercy merged commit 9b0926c into sgl-project:main Oct 5, 2024
2 of 11 checks passed
@merrymercy
Contributor

@jerryzh168 Thanks! It is merged.
