# Llama2 7b mmlu stdcase #211

Merged: 5 commits, Aug 25, 2023
69 changes: 69 additions & 0 deletions inference/benchmarks/llama2_7b_mmlu/README.md
@@ -0,0 +1,69 @@
### 1. Inference Dataset

* Download: `https://huggingface.co/datasets/Stevross/mmlu/tree/main`
  1. Download data.tar from the repository
  2. Extract the .tar archive back into a directory
  3. Place the extracted data directory at config.data_dir/config.mmlu_dir
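The three steps above can be sketched in Python (a minimal sketch; the helper name and directory layout are assumptions based on this README and parameters.yaml):

```python
import os
import tarfile

def extract_mmlu(archive_path, data_dir, mmlu_dir="mmlu_dataset/data"):
    # Restore data.tar so that the extracted "data" directory ends up at
    # data_dir/mmlu_dir (config.data_dir/config.mmlu_dir in this benchmark).
    target_parent = os.path.join(data_dir, os.path.dirname(mmlu_dir))
    os.makedirs(target_parent, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(target_parent)
    return os.path.join(data_dir, mmlu_dir)
```

`mmlu_dir` defaults to the value in parameters.yaml; substitute your actual `config.data_dir`.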

### 2. Model and Weights

* Model implementation
  * pytorch: transformers.LlamaForCausalLM
* Weight loading
  * pytorch: LlamaForCausalLM.from_pretrained(config.data_dir/config.weight_dir)
* Obtaining the weights
  1. Fill out the request form to apply to Meta AI for the Llama 2 model weights and accept the license agreement
  2. Download the llama2-7b weights (note: not the chat variant)
  3. Convert the weights to Hugging Face format with the conversion script provided by Hugging Face, and save them at config.data_dir/config.weight_dir

### 3. Hardware/Software Configuration and Run Information

#### 3.1 Nvidia A100

- ##### Hardware environment
  - Machine and accelerator model: NVIDIA_A100-SXM4-40GB
  - Inter-node network type and bandwidth: InfiniBand, 200Gb/s

- ##### Software environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 5.4.0-113-generic
  - Accelerator driver version: 470.129.06
  - Docker version: 20.10.16
  - Training framework version: pytorch-2.1.0a0+4136153
  - Dependency versions:
    - cuda: 12.1

- Inference toolkit
  - Inductor (torch._dynamo) pytorch-2.1.0a0+4136153

- ##### Optimization strategy

  - None

- ##### Parallelism strategy

  - None

### 4. Run Results (Llama2_7b_MMLU)

* Metric list

| Metric | Value index | Notes |
| ------------------ | ----------------- | ----------------------------------------------------------- |
| Precision | precision | fp32 or fp16 |
| Device memory usage | mem | commonly called "VRAM"; unit: GiB |
| End-to-end time | e2e_time | total time, including Perf initialization etc. |
| Validation throughput (whole) | p_val_whole | validated sequences divided by total validation time |
| Validation throughput (compute) | p_val_core | excludes IO time |
| Inference throughput (whole) | p_infer_whole | inferred sequences divided by total inference time |
| **Inference throughput (compute)** | **\*p_infer_core** | excludes IO time |
| **Accelerator utilization** | **\*MFU** | model FLOPs utilization |
| Inference result | acc (inference/validation) | MMLU answer accuracy |

* Metric values

| Inference toolkit | precision | e2e_time | p_val_whole | p_val_core | p_infer_whole | \*p_infer_core | \*MFU | acc | mem |
| ----------- | --------- | ---- | ---- | -------- | ----------- | ---------- | ------------- | ------------ | ----------- |
| inductor | fp16 | 2558 | 8596.9 | 8630.3 | 9230.8 | 10052.2 | 45.1% | 45.8%/45.8% | 28.0/40.0 |
| inductor | fp32 | 4143 | 5455.3 | 5469.4 | 5675.7 | 5951.8 | 53.4% | 45.8%/45.8% | 35.0/40.0 |
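The starred \*MFU values can be reproduced from \*p_infer_core and the per-token FLOPs in configurations.yaml (a sketch; the A100 peak figures of 312 TFLOPS for fp16 and 156 TFLOPS for TF32 are assumptions, not stated in this PR):

```python
def mfu(tokens_per_second, flops_per_token, peak_flops):
    # model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s
    return tokens_per_second * flops_per_token / peak_flops

flops_per_token = 2 * 7e9  # "flops: 2*7e9" in configurations.yaml

print(round(mfu(10052.2, flops_per_token, 312e12), 3))  # fp16 row: 0.451
print(round(mfu(5951.8, flops_per_token, 156e12), 3))   # fp32 (TF32) row: 0.534
```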
5 changes: 5 additions & 0 deletions inference/benchmarks/llama2_7b_mmlu/pytorch/__init__.py
@@ -0,0 +1,5 @@
from .dataloader import build_dataloader
from .model import create_model
from .export import export_model
from .evaluator import evaluator
from .forward import model_forward, engine_forward
144 changes: 144 additions & 0 deletions inference/benchmarks/llama2_7b_mmlu/pytorch/dataloader.py
@@ -0,0 +1,144 @@
import os
import pandas as pd
from transformers import AutoTokenizer
import torch
from torch.utils.data import DataLoader, Dataset
from loguru import logger

TASKS = [
'abstract_algebra',
'anatomy',
'astronomy',
'business_ethics',
'clinical_knowledge',
'college_biology',
'college_chemistry',
'college_computer_science',
'college_mathematics',
'college_medicine',
'college_physics',
'computer_security',
'conceptual_physics',
'econometrics',
'electrical_engineering',
'elementary_mathematics',
'formal_logic',
'global_facts',
'high_school_biology',
'high_school_chemistry',
'high_school_computer_science',
'high_school_european_history',
'high_school_geography',
'high_school_government_and_politics',
'high_school_macroeconomics',
'high_school_mathematics',
'high_school_microeconomics',
'high_school_physics',
'high_school_psychology',
'high_school_statistics',
'high_school_us_history',
'high_school_world_history',
'human_aging',
'human_sexuality',
'international_law',
'jurisprudence',
'logical_fallacies',
'machine_learning',
'management',
'marketing',
'medical_genetics',
'miscellaneous',
'moral_disputes',
'moral_scenarios',
'nutrition',
'philosophy',
'prehistory',
'professional_accounting',
'professional_law',
'professional_medicine',
'professional_psychology',
'public_relations',
'security_studies',
'sociology',
'us_foreign_policy',
'virology',
'world_religions'
]
choices = ["A", "B", "C", "D"]

def format_subject(subject):
    # "college_biology" -> " college biology" (leading space is intentional)
    return " " + " ".join(subject.split("_"))


def gen_prompt(train_df, subject, k=-1):
prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(format_subject(subject))
if k == -1:
k = train_df.shape[0]
for i in range(k):
prompt += format_example(train_df, i)
return prompt


def format_example(df, idx, include_answer=True):
prompt = df.iloc[idx, 0]
k = df.shape[1] - 2
for j in range(k):
prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j+1])
prompt += "\nAnswer:"
if include_answer:
prompt += " {}\n\n".format(df.iloc[idx, k + 1])
return prompt


class mmlu(Dataset):

def __init__(self, config):
self.tokenizer = AutoTokenizer.from_pretrained(os.path.join(config.data_dir, config.weight_dir))
self.records = []
self.length = 0

for task in TASKS:

            logger.debug("Loading {}-shot {}".format(config.few_shots, task))

dev_df = pd.read_csv(os.path.join(config.data_dir, config.mmlu_dir, "dev", task + "_dev.csv"), header=None)[:config.few_shots]
test_df = pd.read_csv(os.path.join(config.data_dir, config.mmlu_dir, "test", task + "_test.csv"), header=None)

for i in range(test_df.shape[0]):
k = config.few_shots
prompt_end = format_example(test_df, i, include_answer=False)
train_prompt = gen_prompt(dev_df, task, k)
prompt = train_prompt + prompt_end
                # drop few-shot examples until the prompt fits the 2048-token context
                while len(self.tokenizer.tokenize(prompt)) + 1 > 2048:
                    prompt_split = prompt.split("\n\n")
                    prompt_split.pop(1)
                    prompt = "\n\n".join(prompt_split)
label = test_df.iloc[i, test_df.shape[1]-1]
token_prompt = self.tokenizer(prompt, return_tensors="pt")
token_label = self.tokenizer([label], return_tensors="pt")
self.records.append({"prompt":token_prompt, "answer":token_label.input_ids})
self.length += 1


def __len__(self):
return self.length

def __getitem__(self, idx):
return self.records[idx]


def build_dataloader(config):
dataset = mmlu(config)
assert config.batch_size == 1
loader = DataLoader(dataset,
batch_size=config.batch_size,
shuffle=False,
drop_last=False,
num_workers=config.num_workers,
pin_memory=True)

return loader
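As an illustration of the prompt layout that `format_example` produces (the one-row DataFrame below is made-up toy data, not MMLU content; the function body mirrors the helper above):

```python
import pandas as pd

choices = ["A", "B", "C", "D"]

def format_example(df, idx, include_answer=True):
    # question, then the four lettered options, then the answer line
    prompt = df.iloc[idx, 0]
    k = df.shape[1] - 2
    for j in range(k):
        prompt += "\n{}. {}".format(choices[j], df.iloc[idx, j + 1])
    prompt += "\nAnswer:"
    if include_answer:
        prompt += " {}\n\n".format(df.iloc[idx, k + 1])
    return prompt

# columns: question, options A-D, gold answer letter
df = pd.DataFrame([["What is 2+2?", "3", "4", "5", "6", "B"]])
print(format_example(df, 0))
```

With `include_answer=False` (as used for the question under test) the prompt ends at "Answer:", so the model's next token is scored as its answer.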
11 changes: 11 additions & 0 deletions inference/benchmarks/llama2_7b_mmlu/pytorch/evaluator.py
@@ -0,0 +1,11 @@
import torch


def evaluator(pred, y):
    # y holds the tokenized gold answer; index 1 skips the BOS token,
    # leaving the token id of the answer letter (A/B/C/D)
    gt = float(y[0][0][1])
    # logits at the last position predict the answer token
    predict = pred[:, -1, :]
    answer = float(torch.argmax(predict, dim=1))
    return 1 if answer == gt else 0
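The scoring rule can be illustrated without torch: take the vocabulary logits at the last position, argmax over them, and compare against the gold answer's token id (a pure-Python sketch with made-up toy numbers):

```python
def score_last_token(last_logits, gold_token_id):
    # mirrors evaluator(): greedy argmax at the final position,
    # scored 1 if it matches the gold answer token id, else 0
    pred_id = max(range(len(last_logits)), key=lambda i: last_logits[i])
    return 1 if pred_id == gold_token_id else 0

# toy vocabulary of size 5; the gold answer is token id 3
print(score_last_token([0.1, -2.0, 0.5, 4.2, 1.0], 3))  # 1
print(score_last_token([5.0, 0.0, 0.5, 4.2, 1.0], 3))   # 0
```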
9 changes: 9 additions & 0 deletions inference/benchmarks/llama2_7b_mmlu/pytorch/export.py
@@ -0,0 +1,9 @@
import torch
import os


def export_model(model, config):
if config.exist_onnx_path is not None:
return config.exist_onnx_path

return None
117 changes: 117 additions & 0 deletions inference/benchmarks/llama2_7b_mmlu/pytorch/forward.py
@@ -0,0 +1,117 @@
from loguru import logger
import torch
import numpy as np
import time
from tools import torch_sync


def cal_perf(config, tokens, duration, core_time, str_prefix):
model_forward_perf = config.repeat * tokens / duration
logger.info(str_prefix + "(" + config.framework + ") Perf: " +
str(model_forward_perf) + " tps")
model_forward_core_perf = config.repeat * tokens / core_time
logger.info(str_prefix + "(" + config.framework + ") core Perf: " +
str(model_forward_core_perf) + " tps")
return round(model_forward_perf, 3), round(model_forward_core_perf, 3)


def model_forward(model, dataloader, evaluator, config):
if config.no_validation:
return None, None, None
start = time.time()
core_time = 0.0

token_cnt = 0
correct = 0
whole = 0

for times in range(config.repeat):

logger.debug("Repeat: " + str(times + 1))

for step, item in enumerate(dataloader):
if step % config.log_freq == 0:
logger.debug("Step: " + str(step) + " / " +
str(len(dataloader)))

tokens = item["prompt"].input_ids.cuda()[0]

with torch.no_grad():

torch_sync(config)
core_time_start = time.time()

y = model(tokens)

torch_sync(config)
core_time += time.time() - core_time_start

token_cnt += len(tokens[0])

pred = y[0]
r = evaluator(pred, item["answer"])

correct += r
whole += 1

    logger.info("MMLU " + str(config.few_shots) + "-shot Acc: " + str(correct / whole))

duration = time.time() - start
model_forward_perf, model_forward_core_perf = cal_perf(
config, token_cnt, duration, core_time, "Validation")

return model_forward_perf, model_forward_core_perf, round(correct / whole, 3)


def engine_forward(model, dataloader, evaluator, config):
if config.no_validation:
return None, None, None
start = time.time()
core_time = 0.0
foo_time = 0.0

token_cnt = 0
correct = 0
whole = 0

for times in range(config.repeat):

logger.debug("Repeat: " + str(times + 1))

for step, item in enumerate(dataloader):
if step % config.log_freq == 0:
logger.debug("Step: " + str(step) + " / " +
str(len(dataloader)))

tokens = item["prompt"].input_ids[0]
model_inputs = [tokens]

with torch.no_grad():

torch_sync(config)
core_time_start = time.time()

y = model(model_inputs)

torch_sync(config)
core_time += time.time() - core_time_start

                # the engine returns (outputs, overhead time); the overhead is
                # subtracted from core_time in the cal_perf call below
                foo_time += y[1]
                model_outputs = y[0]

token_cnt += len(tokens[0])

y = model_outputs[0]
pred = y[0]
r = evaluator(pred, item["answer"])

correct += r
whole += 1

    logger.info("MMLU " + str(config.few_shots) + "-shot Acc: " + str(correct / whole))

duration = time.time() - start
model_forward_perf, model_forward_core_perf = cal_perf(
config, token_cnt, duration, core_time - foo_time, "Inference")

return model_forward_perf, model_forward_core_perf, round(correct / whole, 3)
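The whole-run vs. compute-only split that `cal_perf` reports can be checked with toy numbers (made up for illustration):

```python
def perf(repeat, tokens, duration, core_time):
    # same arithmetic as cal_perf: tokens/s over the whole run,
    # and tokens/s counting compute time only (IO excluded)
    return repeat * tokens / duration, repeat * tokens / core_time

whole, core = perf(repeat=1, tokens=1000, duration=0.5, core_time=0.4)
print(whole, core)  # 2000.0 2500.0
```

Since `core_time <= duration`, the core throughput is always at least the whole-run throughput, matching the p_*_core vs. p_*_whole columns in the README.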
11 changes: 11 additions & 0 deletions inference/benchmarks/llama2_7b_mmlu/pytorch/model.py
@@ -0,0 +1,11 @@
from transformers import LlamaForCausalLM


def create_model(config):
model = LlamaForCausalLM.from_pretrained(config.data_dir + "/" +
config.weight_dir).eval().cuda().float()

if config.fp16:
model.half()

return model
@@ -0,0 +1 @@
transformers
16 changes: 16 additions & 0 deletions inference/configs/llama2_7b_mmlu/configurations.yaml
@@ -0,0 +1,16 @@
batch_size: 1
# flops for 1 item (e.g. 1 sequence, 1 image)
# Attention! For a transformer decoder like bert, 1 token costs 2*params flops, so use 2*length*params, e.g. 2*512*0.33e9
# format: a_1*a_2*...*a_n, e.g. 2*512*0.33e9 (bert) or 4.12e9 (resnet50)
flops: 2*7e9
fp16: true
compiler: inductor
num_workers: 8
log_freq: 100
repeat: 1
# skip validation (also skips create_model and onnx export); requires exist_onnx_path != null
no_validation: false
# set a real onnx_path to reuse an existing one, or set it to anything non-null to skip exporting onnx manually (e.g. torch-tensorrt)
exist_onnx_path: null
# set an existing path to an engine file, e.g. resnet50.trt/resnet50.plan/resnet50.engine
exist_compiler_path: null
3 changes: 3 additions & 0 deletions inference/configs/llama2_7b_mmlu/parameters.yaml
@@ -0,0 +1,3 @@
weight_dir: "llama2_7b_hf"
mmlu_dir: "mmlu_dataset/data"
few_shots: 5