SEA-LION GPTQ quantization method

1. Purpose

This repository provides a guide and a collection of scripts to help with the quantization and inference of the SEA-LION 7B Instruct model developed by AI Singapore. The goal is to further democratise access to SEA-LION by allowing it, thanks to quantization, to run on consumer-grade hardware (e.g. common GPUs such as the Nvidia GTX and RTX series).

The 4-bit, 128 group size quantized model can be found here.

2. Quantization

The main work is done by the AutoGPTQ library. As of March 2024, however, the library does not directly support GPTQ quantization for the MPT architecture that SEA-LION is based on. Instead, a specific fork of the library is used (the MPT branch created by LaaZa). For convenience, we have forked LaaZa's patch, which can be found here.

In the quantize.py file, please change the values of the following two variables to the appropriate paths for your system.

# quantize.py

# ...
base_model_path = "path/to/base"    # path to the original SEA-LION 7B Instruct files
quant_model_path = "path/to/quant"  # path where the quantized model will be saved
#...

The class AutoGPTQForCausalLM is very similar to AutoModelForCausalLM from Hugging Face's Transformers library, except that you have to pass in a quantization config.

# quantize.py

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# ...

tokenizer = AutoTokenizer.from_pretrained(
        base_model_path,
        trust_remote_code=True,
        device_map = "cuda"
        )

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128
        )

model = AutoGPTQForCausalLM.from_pretrained( # will be loaded to CPU by default
        base_model_path,
        quantize_config,
        trust_remote_code=True,
        )

# ...

model.to("cuda:0") # load model to GPU

# ...

The GPTQ algorithm requires some input data. Due to the multilingual nature of SEA-LION, we used data from each language available in SEA-LION.

# quantize.py

# ...

import random

import torch
from datasets import load_dataset

# ...

seed = 0
random.seed(seed)

quantize_dataset = []
n_samples = 128 # from paper
seqlen = 2048
chunk_size = 100 # arbitrary value, to make sure there is enough data to reach a sequence length of 2048

# ...

paths = [] # JSON data files, one per language available in SEA-LION
data = load_dataset("json", data_files=paths, split="train")

for _ in range(n_samples):
    i = random.randint(0, data.num_rows - chunk_size - 1)
    chunk = "".join(data["text"][i:i+chunk_size])
    token_data = tokenizer(chunk, return_tensors="pt")
    inp = token_data.input_ids[:, :seqlen]
    attention_mask = torch.ones_like(inp)
    quantize_dataset.append({"input_ids": inp, "attention_mask": attention_mask})

# ...

Finally, we can quantize and save our model.

# quantize.py

# ...

model.quantize(quantize_dataset, batch_size=10)
model.save_quantized(
        quant_model_path,
        use_safetensors=True
        )

3. Inference

To prepare your folder for inference, please make sure that it contains all of the base files from the SEA-LION 7B Instruct model, but with the original .safetensors files replaced by the new .safetensors files you generated during quantization.
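
The repository does not include a helper script for this step, but as a rough illustration, the sketch below (a hypothetical prepare_inference_folder.py with placeholder paths, not part of this repository) shows one way to assemble such a folder using only the Python standard library.

# prepare_inference_folder.py (hypothetical helper, not part of this repository)

import shutil
from pathlib import Path

base_model_path = Path("path/to/base")       # original SEA-LION 7B Instruct files
quant_model_path = Path("path/to/quant")     # output folder of model.save_quantized(...)
inference_path = Path("path/to/inference")   # folder to use as model_path in inference.py

inference_path.mkdir(parents=True, exist_ok=True)

# Copy the config, tokenizer and custom code files, skipping the FP16 weights.
for f in base_model_path.iterdir():
    if f.is_file() and f.suffix != ".safetensors":
        shutil.copy2(f, inference_path / f.name)

# Copy in the quantized .safetensors weights produced by quantize.py.
for f in quant_model_path.glob("*.safetensors"):
    shutil.copy2(f, inference_path / f.name)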

Go into inference.py and change the following variable to the appropriate model path for your system.

# inference.py

# ...
model_path = "path/to/model"
# ...

Create your tokenizer, quantization config and model:

# inference.py

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

# ...

tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        trust_remote_code=True,
        device_map = "cuda",
        )

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128
        )

model = AutoGPTQForCausalLM.from_quantized( # will be loaded to GPU
        model_path,
        device = "cuda:0",
        quantize_config = quantize_config,
        torch_dtype=torch.float16,
        trust_remote_code = True
        )

generation_kwargs = {
        "do_sample": False,  # set to true if temperature is not 0
        "temperature": None,
        "max_new_tokens": 256,
        "top_k": 50,
        "top_p": 0.7,
        "repetition_penalty": 1.2,
        "eos_token_id": tokenizer.eos.token_id
        }

# ...

Create your prompt:

# inference.py

# ...

prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
# The example prompt is in Indonesian: "What is the sentiment of the following
# sentence? Sentence: This book is very boring. Answer:"
prompt_in = """Apa sentimen dari kalimat berikut ini?
Kalimat: Buku ini sangat membosankan.
Jawaban: """

full_prompt = prompt_template.format(human_prompt=prompt_in)

# ...

Tokenize your prompt and pass it into your model to generate your response!

# inference.py

# ...

tokens = tokenizer(full_prompt, return_tensors="pt")

input_ids = tokens["input_ids"].to("cuda:0") # move tokenized input to GPU

# Remove sampling-related kwargs when decoding greedily
if not generation_kwargs["do_sample"]:
    generation_kwargs.pop("temperature")
    generation_kwargs.pop("top_k")
    generation_kwargs.pop("top_p")

output = model.generate(
        input_ids=input_ids,
        **generation_kwargs
        )

print(tokenizer.decode(output[0], skip_special_tokens=True))

4. Benchmark

Model                                        | ARC   | HellaSwag | MMLU  | TruthfulQA | Average
SEA-LION 7B Instruct (FP16)                  | 40.78 | 68.20     | 27.12 | 36.29      | 43.10
SEA-LION 7B Instruct (4-bit, 128 group size) | 39.93 | 67.32     | 27.11 | 36.32      | 42.67

The evaluations were run with the same n-shot values as Hugging Face's Open LLM Leaderboard, but using version 0.4.1 of EleutherAI's Language Model Evaluation Harness; a sketch of how such a run might be launched is shown after the table below.

Task                        | n-shot
ARC (arc_challenge)         | 25
HellaSwag (hellaswag)       | 10
MMLU (mmlu)                 | 5
TruthfulQA (truthfulqa_mc2) | 0
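
The evaluation script itself is not part of this repository; the snippet below is only a sketch of how one such run could be launched from Python with version 0.4.1 of the harness. The hf backend, the model_args string and the model path are illustrative assumptions, not the authors' exact setup.

# evaluate.py (hypothetical, not part of this repository)

import lm_eval

# One task per call; task names and n-shot values follow the table above.
results = lm_eval.simple_evaluate(
        model="hf",  # assumed Hugging Face model backend
        model_args="pretrained=path/to/model,trust_remote_code=True",
        tasks=["arc_challenge"],
        num_fewshot=25,
        batch_size=1,
        )

print(results["results"])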

5. Work In Progress (WIP)

  • Inference time comparisons on A100
  • Inference time of the quantized model on GTX 1070 (8GB)
  • Inference time of the quantized model on RTX 3080 (10GB)

6. Acknowledgements

Thank you to the AI Singapore team for their guidance and resources, with special thanks to:

  • Ng Boon Cheong Raymond
  • Teng Walter
  • Siow Bryan