Iluvatar update bert conf #165

Merged
merged 7 commits into from
Jul 25, 2023
49 changes: 28 additions & 21 deletions training/iluvatar/bert-pytorch/README.md
@@ -1,16 +1,6 @@
## Model Information
### Model Introduction

BERT stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Please refer to this paper for a detailed description of BERT:
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

### Model Checkpoint Download
[Model checkpoint download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)
### Test Dataset Download
[Test dataset download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)
[Test dataset download](../../benchmarks/bert/README.md#测试数据集下载)

### Iluvatar BI-V100 GPU Configuration and Run Information Reference
#### Environment Configuration
@@ -20,21 +10,38 @@ Please refer to this paper for a detailed description of BERT:
- ##### Software Environment
- OS version: Ubuntu 20.04
- OS kernel version: 4.15.0-156-generic x86_64
- Accelerator driver version: 3.0.0
- Accelerator driver version: 3.1.0
- Docker version: 20.10.8
- Training framework version: torch-1.10.2+corex.3.0.0
- Training framework version: torch-1.13.1+corex.3.1.0
- Dependency software versions: none


### Run Results
| Training resources | Config file | Runtime (s) | Target accuracy | Converged accuracy | Steps | Performance (samples/s) |
| ------------------ | ------------------ | ----------- | --------------- | ------------------ | ------- | --------------- |
| 1 node, 1 card     | config_BI-V100x1x1 | 17854.76    | 0.72            | 0.7325             | 25000   | 17.00           |
| 1 node, 8 cards    | config_BI-V100x1x8 | 20312.57    | 0.72            | 0.9619             | 25000   | 118.45          |
| 2 nodes, 8 cards   | config_BI-V100x2x8 | pending     | 0.72            | pending            | pending | pending         |

### License
* General metrics

| Metric name | Metric value | Notes |
| ---------------------- | ------------------------------------- | ------------------------------------------- |
| Task category          | Natural language encoding             |                                              |
| Model                  | bert-large-uncased                    |                                              |
| Dataset                | Wikipedia                             |                                              |
| Data precision         | precision, see "Performance metrics"  | fp32/amp/fp16 selectable                     |
| Hyperparameter changes | fix_hp, see "Performance metrics"     | Special hyperparameters needed to saturate the device during throughput evaluation |
| Hardware device        | BI-V100                               |                                              |
| Hardware memory use    | mem, see "Performance metrics"        | Commonly called "device memory", in GiB      |
| End-to-end time        | e2e_time, see "Performance metrics"   | Total time plus Perf initialization time, etc. |
| Overall throughput     | p_whole, see "Performance metrics"    | Number of sequences actually trained divided by total time (performance_whole) |
| Training throughput    | p_train, see "Performance metrics"    | Excludes the end-of-epoch evaluation time    |
| **Compute throughput** | **p_core, see "Performance metrics"** | Additionally excludes data I/O time (p3>p2>p1) |
| Training result        | mlm_acc, see "Performance metrics"    | Accuracy on the masked_lm task               |
| Extra modifications    | Uses the apex library                 |                                              |

* Performance metrics (a short sketch of how the throughput figures relate follows this table)

| Config | precision | fix_hp | e2e_time | p_whole | p_train | p_core | mlm_acc | mem |
| ------------------------------ | --------- | ---------------- | -------- | ------- | ------- | ------ | ------- | --- |
| BI-V100 1 node, 8 cards (1x8)  | amp       | /                |          |         |         |        |         |     |
| BI-V100 2 nodes, 8 cards (2x8) | amp       | bs=20,lr=0.00035 |          |         |         |        |         |     |
| BI-V100 1 node, 1 card (1x1)   | amp       | bs=20,lr=0.00035 |          |         |         |        |         |     |
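The throughput definitions in the general metrics table imply p_core > p_train > p_whole. A minimal sketch, using purely hypothetical numbers (none of these values come from this PR), of how the three figures relate:

```python
# Hedged sketch: hypothetical numbers only, illustrating the metric definitions above.
num_train_sequences = 2_000_000  # number of sequences actually trained
total_time_s = 20_000.0          # total wall-clock time, including Perf initialization
eval_time_s = 500.0              # end-of-epoch evaluation time (excluded from p_train)
data_io_time_s = 300.0           # data I/O time (additionally excluded from p_core)

p_whole = num_train_sequences / total_time_s
p_train = num_train_sequences / (total_time_s - eval_time_s)
p_core = num_train_sequences / (total_time_s - eval_time_s - data_io_time_s)
assert p_core > p_train > p_whole  # matches the p3 > p2 > p1 note in the table
```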

This project is licensed under the Apache 2.0 license.

Parts of this project's code are based on the MLCommons implementation at https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA.
8 changes: 4 additions & 4 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x1.py
@@ -1,19 +1,19 @@
from config_Ampere_common import *
from config_BI_common import *

gradient_accumulation_steps = 1
gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
exchange_padding = False
learning_rate = 3.5e-4
learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
max_steps = 240000

seed = 9031
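For orientation, a minimal sketch of the effective global batch size implied by this 1x1 config together with `train_batch_size = 20` from `config_BI_common.py`; the multiplication itself is an assumption about how the harness composes these fields, not code taken from the repository.

```python
# Hedged sketch: assumes the usual effective-batch formula, not taken from the repo.
train_batch_size = 20            # per-device micro-batch, set in config_BI_common.py
gradient_accumulation_steps = 4  # from config_BI-V100x1x1.py above
num_devices = 1                  # single BI-V100 card in the 1x1 setup

effective_global_batch = train_batch_size * gradient_accumulation_steps * num_devices
print(effective_global_batch)    # 80 samples per optimizer step
```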
7 changes: 4 additions & 3 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x8.py
@@ -1,18 +1,19 @@
from config_Ampere_common import *
from config_BI_common import *

gradient_accumulation_steps = 1
gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
learning_rate = 3.5e-4
learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
max_steps = 30000

seed = 9031
19 changes: 19 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x2x8.py
@@ -0,0 +1,19 @@
from config_BI_common import *

gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
max_steps = 20000

seed = 9031
@@ -3,14 +3,13 @@
import os

grad_scaler = GradScaler(init_scale=float(os.getenv("INIT_LOSS_SCALE", 2**20)),
                         growth_interval=2000)
                         growth_interval=2000, enabled=True)

fp16 = True
ddp_type = "apex"
dist_backend = "nccl"

train_batch_size = 12
max_steps = 1000000
train_batch_size = 20

fused_gelu_bias = True
fused_mha = True
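For context, a minimal sketch (not this repository's training loop; the model, optimizer, and data are placeholders) of how a GradScaler configured as above is typically driven in a PyTorch AMP training step:

```python
import os
import torch
from torch.cuda.amp import GradScaler, autocast

# Same construction as the config above: large initial scale, growth every 2000 steps.
grad_scaler = GradScaler(init_scale=float(os.getenv("INIT_LOSS_SCALE", 2**20)),
                         growth_interval=2000, enabled=True)

def train_step(model, optimizer, inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                        # forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    grad_scaler.scale(loss).backward()      # scale the loss to avoid fp16 gradient underflow
    grad_scaler.step(optimizer)             # unscales gradients; skips the step on inf/NaN
    grad_scaler.update()                    # adjusts the loss scale (per growth_interval)
    return loss.item()
```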
2 changes: 0 additions & 2 deletions training/iluvatar/bert-pytorch/config/config_common.py
@@ -1,7 +1,5 @@
import torch

vendor: str = "iluvatar"

# 'segmented' or 'full_iteration' options for CUDA graph capture.
# 'segmented' option: Pytorch Autograd orchestrates execution of backward ops every iteration.
# 'full_iteration' option: CUDA graph orchestrates execution of bwd ops every iteration without Autograd involvement (has composability limitations but could be more performant allowing optimizer and collectives capture).
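As an illustration of the 'full_iteration' style described in the comment above, here is a minimal sketch using a static-shape toy model; this is the generic whole-iteration capture pattern from PyTorch's CUDA graph API, not the code used by this benchmark.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.00035)
static_input = torch.randn(20, 1024, device="cuda")
static_target = torch.randn(20, 1024, device="cuda")

# Warm-up on a side stream so workspaces and allocations exist before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        F.mse_loss(model(static_input), static_target).backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward + backward + optimizer step into a single CUDA graph.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = F.mse_loss(model(static_input), static_target)
    static_loss.backward()
    optimizer.step()

# Replay: copy fresh data into the static input, then launch the whole iteration at once.
static_input.copy_(torch.randn(20, 1024, device="cuda"))
g.replay()
```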