
[mthreads] deepspeed llama2 (#354)
* [kunlunxin] fix tacotron2 running error and add 1x1 & 2x8 config (#346)

* [kunlunxin] fix tacotron2 running error and add 1x1 & 2x8 config

* [kunlunxin] modify tacotron2 test_config

* [kunlunxin] update tacotron2 readme

* [kunlunxin] modify tacotron2 torch.load()

* [iluvatar] swin_transformer-pytorch 1x1 2x8 (#340)

* update iluvatar/swin_transformer-pytorch

* update

* update

* update

* fix batch size mistake in readme

* correct val_loss to final acc1

* add final_acc1 and mem in readme

* correct readme mem

---------

Co-authored-by: 魏杰 <[email protected]>
Co-authored-by: 杨智超 <[email protected]>
Co-authored-by: clveryang <[email protected]>

* fix get_system_info for iluvatar_monitor (#351)

Co-authored-by: zhouyu <[email protected]>

* update iluvatar mobilenetv2 config (#356)

Co-authored-by: sen.li <[email protected]>

* Update README.md (#357)

* Update README.md

* Update README.md

* [iluvatar] bertlarge inference case (#353)

* iluvatar bertlarge MLM inference case

* update ixrt readme

---------

Co-authored-by: 杨智超 <[email protected]>

* [mthreads] bert_hf 1x8 (#350)

* support bert_hf fp32/amp/bf16 training for mthreads

* update readme

* prevent overrun

* 1x1/2x8 not supported

* 【mthreads】【block】resnet50 training (#246)

* support resnet50 training on mthreads

* fix typo

* support rn50 amp training on mthreads

* add test config (should revert this commit)

* update config & readme

* add get_system_info fn

* update

* 1x1/2x8 not supported

---------

Co-authored-by: Zhou Yu <[email protected]>

* fix llama, add TFLOPS log (#358)

* fix llama

* add t/tflops

* [mthreads] deepspeed llama2

* update readme for sdpa

---------

Co-authored-by: jamesruio <[email protected]>
Co-authored-by: swish swish <[email protected]>
Co-authored-by: 魏杰 <[email protected]>
Co-authored-by: 杨智超 <[email protected]>
Co-authored-by: clveryang <[email protected]>
Co-authored-by: Zhou Yu <[email protected]>
Co-authored-by: zhouyu <[email protected]>
Co-authored-by: forestlee95 <[email protected]>
Co-authored-by: sen.li <[email protected]>
Co-authored-by: uuup <[email protected]>
Co-authored-by: clveryang <[email protected]>
Co-authored-by: mingyuanw-mt <[email protected]>
Co-authored-by: shh2000 <[email protected]>
14 people authored Dec 21, 2023
1 parent 57fc52e commit 9c64606
Showing 49 changed files with 934 additions and 74 deletions.
21 changes: 21 additions & 0 deletions inference/benchmarks/bertLarge/README.md
@@ -58,6 +58,25 @@ bert_reference_results_text_md5.txt

- XTCL 2.1

#### 2.3 Iluvatar CoreX MR-100

- ##### Hardware environment
  - Machine and accelerator model: MR-100

- ##### Software environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 5.15.0-89-generic
  - Accelerator driver version: 3.2.0
  - Docker version: 24.0.4
  - Dependency versions:
    - torch-1.13.1+corex.3.2.1
    - onnxsim

- Inference toolkit

  - IXRT: ixrt-0.8.0+corex.3.2.1


### 4. Results (BERT-Large)

* Metric list
@@ -83,3 +102,5 @@ bert_reference_results_text_md5.txt
| tensorrt | fp16 | 32 | 1283.9 | 257.3 | 260.4 | 408.3 | 418.1 | 45.3% | 0.600/0.638 | 17.4/40.0 |
| tensorrt | fp32 | 32 | 1868.8 | 150.4 | 152.2 | 190.4 | 194.1 | 42.0% | 0.638/0.638 | 16.9/40.0 |
| kunlunxin_xtcl| W32A16 | 32 |/ | / | / | / | / | / | 0.638/0.638| /|
| iluvatar_ixrt| fp16 | 32 |/ | / | / | / | / | / | 0.599/0.638| /|

@@ -0,0 +1,2 @@
transformers
onnxsim
@@ -0,0 +1,5 @@
ixrt_tmp_path: iluvatar_tmp/bertLarge.trt
compiler: ixrt
# no_validation: true
has_dynamic_axis: false
torchtrt_full_compile: true
@@ -2,7 +2,7 @@

>Contact email: [email protected]
ixrt-0.7.0+corex.latest.version-cp310-cp310-linux_x86_64.whl
ixrt-0.8.0+corex.latest.version-cp310-cp310-linux_x86_64.whl

torchvision-0.14.1+corex.3.2.1.20231006.892-cp310-cp310-linux_x86_64.whl

26 changes: 15 additions & 11 deletions inference/inference_engine/iluvatar/ixrt.py
@@ -9,7 +9,6 @@
import time
import subprocess


class InferModel:

class HostDeviceMem(object):
@@ -66,27 +65,32 @@ def __init__(self, config, onnx_path, model):

def build_engine(self, config, onnx_path):
if config.exist_compiler_path is None:
trt_path = config.log_dir + "/" + config.ixrt_tmp_path
ixrt_path = config.log_dir + "/" + config.ixrt_tmp_path

dir_trt_path = os.path.dirname(trt_path)
dir_trt_path = os.path.dirname(ixrt_path)
os.makedirs(dir_trt_path, exist_ok=True)

time.sleep(10)

trtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + trt_path
onnxsim_cmd = f"onnxsim {onnx_path} {onnx_path}"

onnxsim_cmd = subprocess.Popen(onnxsim_cmd, shell=True)
onnxsim_cmd.wait()

ixrtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + ixrt_path
if config.fp16:
trtexec_cmd += " --precision fp16"
ixrtexec_cmd += " --precision fp16"
if config.has_dynamic_axis:
trtexec_cmd += " --minShapes=" + config.minShapes
trtexec_cmd += " --optShapes=" + config.optShapes
trtexec_cmd += " --maxShapes=" + config.maxShapes
ixrtexec_cmd += " --minShapes=" + config.minShapes
ixrtexec_cmd += " --optShapes=" + config.optShapes
ixrtexec_cmd += " --maxShapes=" + config.maxShapes

p = subprocess.Popen(trtexec_cmd, shell=True)
p = subprocess.Popen(ixrtexec_cmd, shell=True)
p.wait()
else:
trt_path = config.exist_compiler_path
ixrt_path = config.exist_compiler_path

with open(trt_path, "rb") as f:
with open(ixrt_path, "rb") as f:
return self.runtime.deserialize_cuda_engine(f.read())

def allocate_buffers(self, engine):
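For reference, a minimal sketch of the command strings the updated `build_engine` assembles for an fp16 bertLarge run; the ONNX path and log directory below are illustrative assumptions, while the real values come from `config.log_dir`, `config.ixrt_tmp_path`, and the exported model.

```python
# Sketch only: mirrors the command assembly in build_engine above.
# onnx_path and the log directory are placeholders, not the benchmark's real paths.
onnx_path = "/tmp/bertLarge.onnx"
ixrt_path = "/tmp/logs" + "/" + "iluvatar_tmp/bertLarge.trt"  # config.log_dir + "/" + config.ixrt_tmp_path

onnxsim_cmd = f"onnxsim {onnx_path} {onnx_path}"  # simplify the ONNX graph in place first
ixrtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + ixrt_path
ixrtexec_cmd += " --precision fp16"               # appended when config.fp16 is set

print(onnxsim_cmd)
print(ixrtexec_cmd)
```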
6 changes: 4 additions & 2 deletions training/benchmarks/aquila2_7b/flagscale/README.md
@@ -4,7 +4,9 @@ aquila2 is a language model open-sourced by the Beijing Academy of Artificial Intelligence, including base language

## Model configuration and tokenizer preparation

This test case is a pretraining case and requires downloading the tokenizer from https://github.com/FlagOpen/FlagScale/tree/main/examples/aquila/tokenizer. Create a tokenizer directory under data_dir and download the three files from the link above into it.
This test case is a pretraining case and requires downloading the tokenizer from https://github.com/FlagOpen/FlagScale/tree/main/examples/aquila/tokenizer

Use commit ed55532 of the FlagScale repository for this tokenizer. Create a tokenizer directory under data_dir and download the three files from the link above into it.

## Data preparation

@@ -14,4 +16,4 @@ https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.bin

https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.idx

Place the two files above under data_dir.
Place the two files above under data_dir.
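For convenience, a minimal download sketch (not part of the benchmark) that fetches the two dataset files listed above; the data_dir path is a placeholder and should point at the benchmark's configured data_dir.

```python
# Sketch only: download the two demo dataset files above into data_dir.
# data_dir is a placeholder; replace it with the benchmark's real data_dir.
import os
import urllib.request

data_dir = "/path/to/data_dir"  # assumed location
os.makedirs(data_dir, exist_ok=True)

urls = [
    "https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.bin",
    "https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.idx",
]
for url in urls:
    dest = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
```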
16 changes: 1 addition & 15 deletions training/benchmarks/bert_hf/pytorch/train/trainer.py
@@ -82,21 +82,7 @@ def train_one_epoch(self, train_dataloader, eval_dataloader):
dist_pytorch.barrier(self.config.vendor)
pure_start_time = time.time()

if scaler is not None:
with torch.cuda.amp.autocast(enabled=True):
output = model(input_ids=input_ids, labels=labels)
loss = output.loss

scaler.scale(loss).backward()
if step % self.config.gradient_accumulation_steps == 0:
scaler.step(optimizer)
scaler.update()
else:
output = model(input_ids=input_ids, labels=labels)
loss = output.loss
loss.backward()
if step % self.config.gradient_accumulation_steps == 0:
optimizer.step()
loss = self.adapter.train_one_step(model, (input_ids, labels), optimizer, step, scaler)

if step % self.config.log_freq == 0:
print("Train Step " + str(step) + "/" + str(len(data_loader)) +
21 changes: 21 additions & 0 deletions training/benchmarks/bert_hf/pytorch/train/trainer_adapter.py
@@ -41,3 +41,24 @@ def create_grad_scaler():
"""create_grad_scaler for mixed precision training"""
scaler = torch.cuda.amp.GradScaler() if config.amp else None
return scaler


def train_one_step(model, batch_data, optimizer, cur_step, scaler=None):
input_ids, labels = batch_data
if scaler:
with torch.cuda.amp.autocast(enabled=True):
output = model(input_ids=input_ids, labels=labels)
loss = output.loss

scaler.scale(loss).backward()
if cur_step % config.gradient_accumulation_steps == 0:
scaler.step(optimizer)
scaler.update()
else:
output = model(input_ids=input_ids, labels=labels)
loss = output.loss
loss.backward()
if cur_step % config.gradient_accumulation_steps == 0:
optimizer.step()

return loss
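For readers unfamiliar with the GradScaler flow that `train_one_step` wraps, a standalone sketch of the same pattern with a toy model; every object here is an illustrative stand-in (the benchmark uses its BERT model and a config-driven accumulation setting).

```python
# Standalone illustration of the scaled-AMP step pattern used by train_one_step above.
# Toy model, data, and accumulation setting are placeholders; requires a CUDA device.
import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
gradient_accumulation_steps = 2  # assumed; the benchmark reads this from its config

for step in range(1, 5):
    x = torch.randn(8, 16, device="cuda")
    y = torch.randint(0, 4, (8,), device="cuda")
    with torch.cuda.amp.autocast(enabled=True):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()              # backward on the scaled loss
    if step % gradient_accumulation_steps == 0:
        scaler.step(optimizer)                 # unscales gradients, then optimizer.step()
        scaler.update()
        optimizer.zero_grad()
```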
19 changes: 19 additions & 0 deletions training/benchmarks/driver/dist_pytorch.py
@@ -149,6 +149,8 @@ def barrier(vendor="nvidia"):
if torch.distributed.is_available() and torch.distributed.is_initialized():
if vendor == "kunlunxin":
torch.distributed.barrier()
elif vendor == "mthreads":
torch.distributed.barrier()
else:
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
torch.cuda.synchronize()
@@ -172,6 +174,23 @@ def init_dist_training_env(config):
rank=rank,
world_size=world_size)
config.n_device = torch.distributed.get_world_size()
elif config.vendor == "mthreads":
import torch_musa
if int(os.environ.get("WORLD_SIZE", 1)) <= 1:
config.device = torch.device("musa")
config.n_device = 1
else:
torch.musa.set_device(config.local_rank)
host_addr_full = 'tcp://' + os.environ[
"MASTER_ADDR"] + ':' + os.environ["MASTER_PORT"]
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.distributed.init_process_group(backend=config.dist_backend,
init_method=host_addr_full,
rank=rank,
world_size=world_size)
config.device = torch.device("musa", config.local_rank)
config.n_device = torch.distributed.get_world_size()
else: # nvidia
if int(os.environ.get("WORLD_SIZE", 1)) <= 1:
config.device = torch.device("cuda")
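As context for the new mthreads branch, a minimal sketch of the environment the multi-process path above reads before calling `init_process_group`; all values below are placeholders that a torchrun-style launcher would normally set.

```python
# Sketch only: environment variables consumed by the mthreads multi-process branch above.
import os

os.environ["MASTER_ADDR"] = "127.0.0.1"  # rendezvous address -> tcp://MASTER_ADDR:MASTER_PORT
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = "0"                 # global rank of this process
os.environ["WORLD_SIZE"] = "8"           # total number of processes
# config.local_rank selects the local MUSA device via torch.musa.set_device;
# config.dist_backend names the torch.distributed backend passed to init_process_group.
```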
6 changes: 6 additions & 0 deletions training/benchmarks/driver/helper.py
@@ -74,6 +74,12 @@ def set_seed(self, seed: int, vendor: str = None):
elif lower_vendor == "ascend":
import mindspore
mindspore.set_seed(seed)
elif lower_vendor == "mthreads":
import torch
import torch_musa
torch.manual_seed(seed)
torch.musa.manual_seed(seed)
torch.musa.manual_seed_all(seed)
else:
# TODO: other vendors set their seed here; extend as needed
pass
28 changes: 21 additions & 7 deletions training/benchmarks/llama2_7b/deepspeed/run_pretraining.py
@@ -10,6 +10,11 @@
from importlib import import_module

import torch
try:
import torch_musa
DEVICE = 'musa'
except:
DEVICE = 'cuda'
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

@@ -54,29 +59,32 @@

def train(model_engine, dataloader):
model_engine.train()
device = torch.device(f"{DEVICE}:{args.local_rank}")
ave_loss = 0.0
for step, data in enumerate(dataloader):

fake_data = torch.tensor(data).long()
input_ids = fake_data.to(args.local_rank)
labels = fake_data.to(args.local_rank)
input_ids = fake_data.to(device)
labels = fake_data.to(device)
loss = model_engine(input_ids=input_ids, labels=labels).loss
model_engine.backward(loss)
model_engine.step()

ave_loss += loss
if step % 10 == 0 and args.local_rank == 0:
if step > 0 and step % 10 == 0 and args.local_rank == 0:
print('Step {}/{}, Loss: {}'.format(step, len(dataloader),
ave_loss / 10))
ave_loss = 0.0


def get_deepspeed_engine(args, model_config_dir, flashattn):
def get_deepspeed_engine(args, model_config_dir):
with deepspeed.zero.Init(config_dict_or_path=args.deepspeed_config,
enabled=True,
mem_efficient_linear=False,
mpu=None):
model = get_llama_model(model_config_dir, flashattn)
model = get_llama_model(model_config_dir, args.flashattn)
if args.gradient_checkpointing_enable:
model.gradient_checkpointing_enable()

model_engine, _, _, _ = deepspeed.initialize(
args=args, model=model, model_parameters=model.parameters())
@@ -107,10 +115,12 @@ def get_metric(texts):
theoryflops = getattr(module, 'theoryflops')
epochs = getattr(module, 'epochs')
flashattn = getattr(module, 'flashattn')
gradient_checkpointing_enable = getattr(module, 'gradient_checkpointing_enable', False)
args.flashattn = flashattn
args.gradient_checkpointing_enable = gradient_checkpointing_enable

deepspeed.init_distributed()
model_engine = get_deepspeed_engine(args, os.path.join("llama2_7b_hf"),
flashattn)
model_engine = get_deepspeed_engine(args, os.path.join("llama2_7b_hf"))
dataset = get_llama_dataset(args, seqlength, datafilename)

logger = logging.getLogger("DeepSpeed")
@@ -138,4 +148,8 @@ def get_metric(texts):
chip_tps = whole_tps / args.nproc * args.nnodes
print("System tokens per second: ", whole_tps)
print("Tokens/p/s: ", chip_tps)

TFLOPS = int(theoryflops/1000000000000)
print("Theory TFLOPS: ", TFLOPS)
print("Tokens/TFLOPS: ", chip_tps / TFLOPS)
print("MFU: ", chip_tps * 7000000000.0 * 6 / theoryflops)
17 changes: 1 addition & 16 deletions training/benchmarks/resnet50/pytorch/train/trainer.py
@@ -82,22 +82,7 @@ def train_one_epoch(self, train_dataloader, eval_dataloader):
pure_start_time = time.time()
optimizer.zero_grad()

images, target = batch
if scaler is not None:
with torch.cuda.amp.autocast(enabled=True):
output = model(images)
loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
else:
output = model(images)

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(output, target)
loss.backward()
optimizer.step()
loss = self.adapter.train_step(model, batch, optimizer, scaler)

if step % self.config.log_freq == 0:
print("Train Step " + str(step) + "/" + str(len(data_loader)) +
20 changes: 20 additions & 0 deletions training/benchmarks/resnet50/pytorch/train/trainer_adapter.py
@@ -41,3 +41,23 @@ def create_grad_scaler():
"""create_grad_scaler for mixed precision training"""
scaler = torch.cuda.amp.GradScaler() if config.amp else None
return scaler


def train_step(model, batch, optimizer, scaler=None):
"""train one step"""
images, target = batch
criterion = torch.nn.CrossEntropyLoss()
if scaler:
with torch.cuda.amp.autocast(enabled=True):
output = model(images)
loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
else:
output = model(images)
loss = criterion(output, target)
loss.backward()
optimizer.step()

return loss
2 changes: 1 addition & 1 deletion training/iluvatar/iluvatar_monitor.py
@@ -231,7 +231,7 @@ def get_system_info():
cmd = cmd + r"echo ;"

cmd = cmd + r"echo Accelerator Model:;"
cmd = cmd + r"ixsmi -L;"
cmd = cmd + r"export PATH=/usr/local/corex/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/corex/lib; ixsmi -L;"
cmd = cmd + r"echo ;"

cmd = cmd + r"echo Accelerator Driver version:;"
3 changes: 2 additions & 1 deletion training/iluvatar/mobilenetv2-pytorch/README.md
@@ -40,7 +40,8 @@

| Config | precision | fix_hp | e2e_time | p_whole | p_train | p_core | acc | mem |
| --------------------- | --------- | -------------- | -------- | ------- | ------- | ------ | ------ | ----------- |
| BI-V100 single node 8 cards (1x8) | fp32 | bs=256,lr=0.72 | 103759 | 3520 | 3604 | 3651 | 68.61% | 21.6 / 32.0 |
| BI-V100 single node 8 cards (1x8) | fp32 | / | 174534 | 1857 | 1876 | 1885 | 68.52% | 3.6/32.0 |
| BI-V100 single node 8 cards (1x8) | fp32 | bs=256,lr=0.72 | 87559 | 4390 | 4543 | 4625 | 61.92% | 21.6 / 32.0 |
| BI-V100 single node 1 card (1x1) | fp32 | bs=256,lr=0.72 | / | 624 | 632 | 633 | / | 21.4 / 32.0 |
| BI-V100 two nodes, 8 cards each (2x8) | fp32 | bs=256,lr=0.72 | / | 6835 | 7058 | 7219 | / | 22.2 / 32.0 |

@@ -1,5 +1,5 @@
from config_common import *

train_batch_size = 256
eval_batch_size = 256
train_batch_size = 32
eval_batch_size = 32
