
[mthreads] deepspeed llama2 (#354)
* [kunlunxin] fix tacotron2 running error and add 1x1 & 2x8 config (#346)

* [kunlunxin] fix tacotron2 running error and add 1x1 & 2x8 config

* [kunlunxin] modify tacotron2 test_config

* [kunlunxin] update tacotron2 readme

* [kunlunxin] modify tacotron2 torch.load()

* [iluvatar] swin_transformer-pytorch 1x1 2x8 (#340)

* update iluvatar/swin_transformer-pytorch

* update

* update

* update

* fix batch size mistake in readme

* correct val_loss to final acc1

* add final_acc1 and mem in readme

* correct readme mem

---------

Co-authored-by: 魏杰 <[email protected]>
Co-authored-by: 杨智超 <[email protected]>
Co-authored-by: clveryang <[email protected]>

* fix get_system_info for iluvatar_monitor (#351)

Co-authored-by: zhouyu <[email protected]>

* update iluvatar mobilenetv2 config (#356)

Co-authored-by: sen.li <[email protected]>

* Update README.md (#357)

* Update README.md

* Update README.md

* [iluvatar] bertlarge inference case (#353)

* iluvatar bertlarge MLM inference case

* update ixrt readme

---------

Co-authored-by: 杨智超 <[email protected]>

* [mthreads] bert_hf 1x8 (#350)

* support bert_hf fp32/amp/bf16 training for mthreads

* update readme

* prevent overrun

* 1x1/2x8 not supported

* 【mthreads】【block】resnet50 training (#246)

* support resnet50 training on mthreads

* fix typo

* support rn50 amp training on mthreads

* add test config (should revert this commit)

* update config & readme

* add get_system_info fn

* update

* 1x1/2x8 not supported

---------

Co-authored-by: Zhou Yu <[email protected]>

* fix llama, add TFLOPS log (#358)

* fix llama

* add t/tflops

* [mthreads] deepspeed llama2

* update readme for sdpa

---------

Co-authored-by: jamesruio <[email protected]>
Co-authored-by: swish swish <[email protected]>
Co-authored-by: 魏杰 <[email protected]>
Co-authored-by: 杨智超 <[email protected]>
Co-authored-by: clveryang <[email protected]>
Co-authored-by: Zhou Yu <[email protected]>
Co-authored-by: zhouyu <[email protected]>
Co-authored-by: forestlee95 <[email protected]>
Co-authored-by: sen.li <[email protected]>
Co-authored-by: uuup <[email protected]>
Co-authored-by: clveryang <[email protected]>
Co-authored-by: mingyuanw-mt <[email protected]>
Co-authored-by: shh2000 <[email protected]>
14 people authored Dec 21, 2023
1 parent 57fc52e commit 9c64606
Showing 49 changed files with 934 additions and 74 deletions.
21 changes: 21 additions & 0 deletions inference/benchmarks/bertLarge/README.md
@@ -58,6 +58,25 @@ bert_reference_results_text_md5.txt

- XTCL 2.1

#### 2.3 Iluvatar CoreX MR-100

- ##### Hardware environment
  - Machine and accelerator model: MR-100

- ##### Software environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 5.15.0-89-generic
  - Accelerator driver version: 3.2.0
  - Docker version: 24.0.4
  - Dependency versions:
    - torch-1.13.1+corex.3.2.1
    - onnxsim

- Inference toolkit

  - IXRT: ixrt-0.8.0+corex.3.2.1


### 4. Results (BERT-Large)

* Metric list
@@ -83,3 +102,5 @@ bert_reference_results_text_md5.txt
| tensorrt | fp16 | 32 | 1283.9 | 257.3 | 260.4 | 408.3 | 418.1 | 45.3% | 0.600/0.638 | 17.4/40.0 |
| tensorrt | fp32 | 32 | 1868.8 | 150.4 | 152.2 | 190.4 | 194.1 | 42.0% | 0.638/0.638 | 16.9/40.0 |
| kunlunxin_xtcl| W32A16 | 32 |/ | / | / | / | / | / | 0.638/0.638| /|
| iluvatar_ixrt| fp16 | 32 |/ | / | / | / | / | / | 0.599/0.638| /|

@@ -0,0 +1,2 @@
transformers
onnxsim
@@ -0,0 +1,5 @@
ixrt_tmp_path: iluvatar_tmp/bertLarge.trt
compiler: ixrt
# no_validation: true
has_dynamic_axis: false
torchtrt_full_compile: true
@@ -2,7 +2,7 @@

>Contact email: [email protected]
ixrt-0.7.0+corex.latest.version-cp310-cp310-linux_x86_64.whl
ixrt-0.8.0+corex.latest.version-cp310-cp310-linux_x86_64.whl

torchvision-0.14.1+corex.3.2.1.20231006.892-cp310-cp310-linux_x86_64.whl

26 changes: 15 additions & 11 deletions inference/inference_engine/iluvatar/ixrt.py
@@ -9,7 +9,6 @@
import time
import subprocess


class InferModel:

class HostDeviceMem(object):
@@ -66,27 +65,32 @@ def __init__(self, config, onnx_path, model):

def build_engine(self, config, onnx_path):
if config.exist_compiler_path is None:
trt_path = config.log_dir + "/" + config.ixrt_tmp_path
ixrt_path = config.log_dir + "/" + config.ixrt_tmp_path

dir_trt_path = os.path.dirname(trt_path)
dir_trt_path = os.path.dirname(ixrt_path)
os.makedirs(dir_trt_path, exist_ok=True)

time.sleep(10)

trtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + trt_path
onnxsim_cmd = f"onnxsim {onnx_path} {onnx_path}"

onnxsim_cmd = subprocess.Popen(onnxsim_cmd, shell=True)
onnxsim_cmd.wait()

ixrtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + ixrt_path
if config.fp16:
trtexec_cmd += " --precision fp16"
ixrtexec_cmd += " --precision fp16"
if config.has_dynamic_axis:
trtexec_cmd += " --minShapes=" + config.minShapes
trtexec_cmd += " --optShapes=" + config.optShapes
trtexec_cmd += " --maxShapes=" + config.maxShapes
ixrtexec_cmd += " --minShapes=" + config.minShapes
ixrtexec_cmd += " --optShapes=" + config.optShapes
ixrtexec_cmd += " --maxShapes=" + config.maxShapes

p = subprocess.Popen(trtexec_cmd, shell=True)
p = subprocess.Popen(ixrtexec_cmd, shell=True)
p.wait()
else:
trt_path = config.exist_compiler_path
ixrt_path = config.exist_compiler_path

with open(trt_path, "rb") as f:
with open(ixrt_path, "rb") as f:
return self.runtime.deserialize_cuda_engine(f.read())

def allocate_buffers(self, engine):
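For reference, a minimal sketch of the command strings the updated `build_engine` assembles for an fp16 bertLarge run; the ONNX path and log directory below are illustrative assumptions, while the real values come from `config.log_dir`, `config.ixrt_tmp_path`, and the exported model.

```python
# Sketch only: mirrors the command assembly in build_engine above.
# onnx_path and the log directory are placeholders, not the benchmark's real paths.
onnx_path = "/tmp/bertLarge.onnx"
ixrt_path = "/tmp/logs" + "/" + "iluvatar_tmp/bertLarge.trt"  # config.log_dir + "/" + config.ixrt_tmp_path

onnxsim_cmd = f"onnxsim {onnx_path} {onnx_path}"  # simplify the ONNX graph in place first
ixrtexec_cmd = "ixrtexec --onnx=" + onnx_path + " --save_engine=" + ixrt_path
ixrtexec_cmd += " --precision fp16"               # appended when config.fp16 is set

print(onnxsim_cmd)
print(ixrtexec_cmd)
```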
6 changes: 4 additions & 2 deletions training/benchmarks/aquila2_7b/flagscale/README.md
@@ -4,7 +4,9 @@ aquila2 is a language model open-sourced by the Beijing Academy of Artificial Intelligence, including base language

## Model configuration and tokenizer preparation

This test case is a pretraining case and requires downloading the tokenizer from https://github.com/FlagOpen/FlagScale/tree/main/examples/aquila/tokenizer. Create a tokenizer directory under data_dir and download the three files from the link above into it.
This test case is a pretraining case and requires downloading the tokenizer from https://github.com/FlagOpen/FlagScale/tree/main/examples/aquila/tokenizer

Use commit ed55532 of the FlagScale repository for this tokenizer. Create a tokenizer directory under data_dir and download the three files from the link above into it.

## Data preparation

@@ -14,4 +16,4 @@ https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.bin

https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.idx

Place the two files above under data_dir.
Place the two files above under data_dir.
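For convenience, a minimal download sketch (not part of the benchmark) that fetches the two dataset files listed above; the data_dir path is a placeholder and should point at the benchmark's configured data_dir.

```python
# Sketch only: download the two demo dataset files above into data_dir.
# data_dir is a placeholder; replace it with the benchmark's real data_dir.
import os
import urllib.request

data_dir = "/path/to/data_dir"  # assumed location
os.makedirs(data_dir, exist_ok=True)

urls = [
    "https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.bin",
    "https://model.ks3-cn-beijing.ksyuncs.com/nlpdata/pile_wikipedia_demo.idx",
]
for url in urls:
    dest = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
```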
16 changes: 1 addition & 15 deletions training/benchmarks/bert_hf/pytorch/train/trainer.py
@@ -82,21 +82,7 @@ def train_one_epoch(self, train_dataloader, eval_dataloader):
dist_pytorch.barrier(self.config.vendor)
pure_start_time = time.time()

if scaler is not None:
with torch.cuda.amp.autocast(enabled=True):
output = model(input_ids=input_ids, labels=labels)
loss = output.loss

scaler.scale(loss).backward()
if step % self.config.gradient_accumulation_steps == 0:
scaler.step(optimizer)
scaler.update()
else:
output = model(input_ids=input_ids, labels=labels)
loss = output.loss
loss.backward()
if step % self.config.gradient_accumulation_steps == 0:
optimizer.step()
loss = self.adapter.train_one_step(model, (input_ids, labels), optimizer, step, scaler)

if step % self.config.log_freq == 0:
print("Train Step " + str(step) + "/" + str(len(data_loader)) +
21 changes: 21 additions & 0 deletions training/benchmarks/bert_hf/pytorch/train/trainer_adapter.py
@@ -41,3 +41,24 @@ def create_grad_scaler():
"""create_grad_scaler for mixed precision training"""
scaler = torch.cuda.amp.GradScaler() if config.amp else None
return scaler


def train_one_step(model, batch_data, optimizer, cur_step, scaler=None):
input_ids, labels = batch_data
if scaler:
with torch.cuda.amp.autocast(enabled=True):
output = model(input_ids=input_ids, labels=labels)
loss = output.loss

scaler.scale(loss).backward()
if cur_step % config.gradient_accumulation_steps == 0:
scaler.step(optimizer)
scaler.update()
else:
output = model(input_ids=input_ids, labels=labels)
loss = output.loss
loss.backward()
if cur_step % config.gradient_accumulation_steps == 0:
optimizer.step()

return loss
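For readers unfamiliar with the GradScaler flow that `train_one_step` wraps, a standalone sketch of the same pattern with a toy model; every object here is an illustrative stand-in (the benchmark uses its BERT model and a config-driven accumulation setting).

```python
# Standalone illustration of the scaled-AMP step pattern used by train_one_step above.
# Toy model, data, and accumulation setting are placeholders; requires a CUDA device.
import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
gradient_accumulation_steps = 2  # assumed; the benchmark reads this from its config

for step in range(1, 5):
    x = torch.randn(8, 16, device="cuda")
    y = torch.randint(0, 4, (8,), device="cuda")
    with torch.cuda.amp.autocast(enabled=True):
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()              # backward on the scaled loss
    if step % gradient_accumulation_steps == 0:
        scaler.step(optimizer)                 # unscales gradients, then optimizer.step()
        scaler.update()
        optimizer.zero_grad()
```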
19 changes: 19 additions & 0 deletions training/benchmarks/driver/dist_pytorch.py
@@ -149,6 +149,8 @@ def barrier(vendor="nvidia"):
if torch.distributed.is_available() and torch.distributed.is_initialized():
if vendor == "kunlunxin":
torch.distributed.barrier()
elif vendor == "mthreads":
torch.distributed.barrier()
else:
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
torch.cuda.synchronize()
@@ -172,6 +174,23 @@ def init_dist_training_env(config):
rank=rank,
world_size=world_size)
config.n_device = torch.distributed.get_world_size()
elif config.vendor == "mthreads":
import torch_musa
if int(os.environ.get("WORLD_SIZE", 1)) <= 1:
config.device = torch.device("musa")
config.n_device = 1
else:
torch.musa.set_device(config.local_rank)
host_addr_full = 'tcp://' + os.environ[
"MASTER_ADDR"] + ':' + os.environ["MASTER_PORT"]
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.distributed.init_process_group(backend=config.dist_backend,
init_method=host_addr_full,
rank=rank,
world_size=world_size)
config.device = torch.device("musa", config.local_rank)
config.n_device = torch.distributed.get_world_size()
else: # nvidia
if int(os.environ.get("WORLD_SIZE", 1)) <= 1:
config.device = torch.device("cuda")
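As context for the new mthreads branch, a minimal sketch of the environment the multi-process path above reads before calling `init_process_group`; all values below are placeholders that a torchrun-style launcher would normally set.

```python
# Sketch only: environment variables consumed by the mthreads multi-process branch above.
import os

os.environ["MASTER_ADDR"] = "127.0.0.1"  # rendezvous address -> tcp://MASTER_ADDR:MASTER_PORT
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = "0"                 # global rank of this process
os.environ["WORLD_SIZE"] = "8"           # total number of processes
# config.local_rank selects the local MUSA device via torch.musa.set_device;
# config.dist_backend names the torch.distributed backend passed to init_process_group.
```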
6 changes: 6 additions & 0 deletions training/benchmarks/driver/helper.py
@@ -74,6 +74,12 @@ def set_seed(self, seed: int, vendor: str = None):
elif lower_vendor == "ascend":
import mindspore
mindspore.set_seed(seed)
elif lower_vendor == "mthreads":
import torch
import torch_musa
torch.manual_seed(seed)
torch.musa.manual_seed(seed)
torch.musa.manual_seed_all(seed)
else:
# TODO: other vendors set their seed here; extend as needed
pass
28 changes: 21 additions & 7 deletions training/benchmarks/llama2_7b/deepspeed/run_pretraining.py
@@ -10,6 +10,11 @@
from importlib import import_module

import torch
try:
import torch_musa
DEVICE = 'musa'
except:
DEVICE = 'cuda'
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

@@ -54,29 +59,32 @@

def train(model_engine, dataloader):
model_engine.train()
device = torch.device(f"{DEVICE}:{args.local_rank}")
ave_loss = 0.0
for step, data in enumerate(dataloader):

fake_data = torch.tensor(data).long()
input_ids = fake_data.to(args.local_rank)
labels = fake_data.to(args.local_rank)
input_ids = fake_data.to(device)
labels = fake_data.to(device)
loss = model_engine(input_ids=input_ids, labels=labels).loss
model_engine.backward(loss)
model_engine.step()

ave_loss += loss
if step % 10 == 0 and args.local_rank == 0:
if step > 0 and step % 10 == 0 and args.local_rank == 0:
print('Step {}/{}, Loss: {}'.format(step, len(dataloader),
ave_loss / 10))
ave_loss = 0.0


def get_deepspeed_engine(args, model_config_dir, flashattn):
def get_deepspeed_engine(args, model_config_dir):
with deepspeed.zero.Init(config_dict_or_path=args.deepspeed_config,
enabled=True,
mem_efficient_linear=False,
mpu=None):
model = get_llama_model(model_config_dir, flashattn)
model = get_llama_model(model_config_dir, args.flashattn)
if args.gradient_checkpointing_enable:
model.gradient_checkpointing_enable()

model_engine, _, _, _ = deepspeed.initialize(
args=args, model=model, model_parameters=model.parameters())
@@ -107,10 +115,12 @@ def get_metric(texts):
theoryflops = getattr(module, 'theoryflops')
epochs = getattr(module, 'epochs')
flashattn = getattr(module, 'flashattn')
gradient_checkpointing_enable = getattr(module, 'gradient_checkpointing_enable', False)
args.flashattn = flashattn
args.gradient_checkpointing_enable = gradient_checkpointing_enable

deepspeed.init_distributed()
model_engine = get_deepspeed_engine(args, os.path.join("llama2_7b_hf"),
flashattn)
model_engine = get_deepspeed_engine(args, os.path.join("llama2_7b_hf"))
dataset = get_llama_dataset(args, seqlength, datafilename)

logger = logging.getLogger("DeepSpeed")
@@ -138,4 +148,8 @@ def get_metric(texts):
chip_tps = whole_tps / args.nproc * args.nnodes
print("System tokens per second: ", whole_tps)
print("Tokens/p/s: ", chip_tps)

TFLOPS = int(theoryflops/1000000000000)
print("Theory TFLOPS: ", TFLOPS)
print("Tokens/TFLOPS: ", chip_tps / TFLOPS)
print("MFU: ", chip_tps * 7000000000.0 * 6 / theoryflops)
17 changes: 1 addition & 16 deletions training/benchmarks/resnet50/pytorch/train/trainer.py
@@ -82,22 +82,7 @@ def train_one_epoch(self, train_dataloader, eval_dataloader):
pure_start_time = time.time()
optimizer.zero_grad()

images, target = batch
if scaler is not None:
with torch.cuda.amp.autocast(enabled=True):
output = model(images)
loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
else:
output = model(images)

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(output, target)
loss.backward()
optimizer.step()
loss = self.adapter.train_step(model, batch, optimizer, scaler)

if step % self.config.log_freq == 0:
print("Train Step " + str(step) + "/" + str(len(data_loader)) +
20 changes: 20 additions & 0 deletions training/benchmarks/resnet50/pytorch/train/trainer_adapter.py
@@ -41,3 +41,23 @@ def create_grad_scaler():
"""create_grad_scaler for mixed precision training"""
scaler = torch.cuda.amp.GradScaler() if config.amp else None
return scaler


def train_step(model, batch, optimizer, scaler=None):
"""train one step"""
images, target = batch
criterion = torch.nn.CrossEntropyLoss()
if scaler:
with torch.cuda.amp.autocast(enabled=True):
output = model(images)
loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
else:
output = model(images)
loss = criterion(output, target)
loss.backward()
optimizer.step()

return loss
2 changes: 1 addition & 1 deletion training/iluvatar/iluvatar_monitor.py
@@ -231,7 +231,7 @@ def get_system_info():
cmd = cmd + r"echo ;"

cmd = cmd + r"echo Accelerator Model:;"
cmd = cmd + r"ixsmi -L;"
cmd = cmd + r"export PATH=/usr/local/corex/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/corex/lib; ixsmi -L;"
cmd = cmd + r"echo ;"

cmd = cmd + r"echo Accelerator Driver version:;"
3 changes: 2 additions & 1 deletion training/iluvatar/mobilenetv2-pytorch/README.md
@@ -40,7 +40,8 @@

| Config | precision | fix_hp | e2e_time | p_whole | p_train | p_core | acc | mem |
| --------------------- | --------- | -------------- | -------- | ------- | ------- | ------ | ------ | ----------- |
| BI-V100 single node 8 cards (1x8) | fp32 | bs=256,lr=0.72 | 103759 | 3520 | 3604 | 3651 | 68.61% | 21.6 / 32.0 |
| BI-V100 single node 8 cards (1x8) | fp32 | / | 174534 | 1857 | 1876 | 1885 | 68.52% | 3.6/32.0 |
| BI-V100 single node 8 cards (1x8) | fp32 | bs=256,lr=0.72 | 87559 | 4390 | 4543 | 4625 | 61.92% | 21.6 / 32.0 |
| BI-V100 single node 1 card (1x1) | fp32 | bs=256,lr=0.72 | / | 624 | 632 | 633 | / | 21.4 / 32.0 |
| BI-V100 two nodes, 8 cards each (2x8) | fp32 | bs=256,lr=0.72 | / | 6835 | 7058 | 7219 | / | 22.2 / 32.0 |

@@ -1,5 +1,5 @@
from config_common import *

train_batch_size = 256
eval_batch_size = 256
train_batch_size = 32
eval_batch_size = 32
