Iluvatar update bert conf (#165)
* update iluvatar bert config

* update iluvatar bert README

* add iluvatar bert 1x1 2x8 conf

* update iluvatar bert README
forestlee95 authored Jul 25, 2023
1 parent bbe6454 commit edf7ce2
Showing 6 changed files with 57 additions and 33 deletions.
49 changes: 28 additions & 21 deletions training/iluvatar/bert-pytorch/README.md
@@ -1,16 +1,6 @@
## Model Information
### Model Introduction

BERT stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Please refer to this paper for a detailed description of BERT:
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

### Model Checkpoint Download
[Model checkpoint download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)
### Test Dataset Download
-[Test dataset download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)
+[Test dataset download](../../benchmarks/bert/README.md#测试数据集下载)

### Iluvatar CoreX BI-V100 GPU Configuration and Run Information
#### Environment Setup
@@ -20,21 +10,38 @@ Please refer to this paper for a detailed description of BERT:
- ##### Software Environment
  - OS version: Ubuntu 20.04
  - OS kernel version: 4.15.0-156-generic x86_64
-  - Accelerator driver version: 3.0.0
+  - Accelerator driver version: 3.1.0
  - Docker version: 20.10.8
-  - Training framework version: torch-1.10.2+corex.3.0.0
+  - Training framework version: torch-1.13.1+corex.3.1.0
  - Dependency software versions: none


### Run Results
| Training resources | Config file        | Runtime (s) | Target accuracy | Converged accuracy | Steps   | Performance (samples/s) |
| ------------------ | ------------------ | ----------- | --------------- | ------------------ | ------- | ----------------------- |
| 1 node, 1 GPU      | config_BI-V100x1x1 | 17854.76    | 0.72            | 0.7325             | 25000   | 17.00                   |
| 1 node, 8 GPUs     | config_BI-V100x1x8 | 20312.57    | 0.72            | 0.9619             | 25000   | 118.45                  |
| 2 nodes, 8 GPUs    | config_BI-V100x2x8 | pending     | 0.72            | pending            | pending | pending                 |

* General metrics

| Metric name            | Metric value                          | Notes                                                                                  |
| ---------------------- | ------------------------------------- | -------------------------------------------------------------------------------------- |
| Task category          | Natural language encoding             |                                                                                        |
| Model                  | bert-large-uncased                    |                                                                                        |
| Dataset                | Wikipedia                             |                                                                                        |
| Data precision         | precision, see "Performance metrics"  | One of fp32/amp/fp16                                                                   |
| Hyperparameter changes | fix_hp, see "Performance metrics"     | Special hyperparameters needed to saturate the hardware for the throughput evaluation  |
| Hardware device        | BI-V100                               |                                                                                        |
| Hardware memory usage  | mem, see "Performance metrics"        | Commonly called "device memory"; unit is GiB                                           |
| End-to-end time        | e2e_time, see "Performance metrics"   | Total time, plus Perf initialization time, etc.                                        |
| Total throughput       | p_whole, see "Performance metrics"    | Training sequences processed divided by total time (performance_whole)                 |
| Training throughput    | p_train, see "Performance metrics"    | Excludes end-of-epoch evaluation time                                                  |
| **Compute throughput** | **p_core, see "Performance metrics"** | Also excludes data-IO time (p3 > p2 > p1); see the sketch below this table             |
| Training result        | mlm_acc, see "Performance metrics"    | Accuracy on the masked_lm task                                                         |
| Additional changes     | Uses the apex library                 |                                                                                        |
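
A minimal sketch of how the three throughput metrics above relate, with purely illustrative numbers and hypothetical variable names (the benchmark harness computes these internally; this is not its code):

```python
# Illustrative numbers only, not benchmark output.
num_samples = 2_400_000      # sequences actually trained on
t_total     = 20_000.0       # total wall-clock seconds
t_eval      = 500.0          # end-of-epoch evaluation seconds
t_data_io   = 1_500.0        # data-loading / IO seconds

p_whole = num_samples / t_total                         # total throughput
p_train = num_samples / (t_total - t_eval)              # excludes eval time
p_core  = num_samples / (t_total - t_eval - t_data_io)  # excludes IO as well
assert p_core > p_train > p_whole                       # i.e. p3 > p2 > p1
```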

* Performance metrics

| Configuration                 | precision | fix_hp           | e2e_time | p_whole | p_train | p_core | mlm_acc | mem |
| ----------------------------- | --------- | ---------------- | -------- | ------- | ------- | ------ | ------- | --- |
| BI-V100 1 node, 8 GPUs (1x8)  | amp       | /                |          |         |         |        |         |     |
| BI-V100 2 nodes, 8 GPUs (2x8) | amp       | bs=20,lr=0.00035 |          |         |         |        |         |     |
| BI-V100 1 node, 1 GPU (1x1)   | amp       | bs=20,lr=0.00035 |          |         |         |        |         |     |

### License

This project is licensed under the Apache 2.0 license.

Parts of the code are based on the MLCommons implementation at https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA.
8 changes: 4 additions & 4 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x1.py
@@ -1,19 +1,19 @@
-from config_Ampere_common import *
+from config_BI_common import *

-gradient_accumulation_steps = 1
+gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
exchange_padding = False
-learning_rate = 3.5e-4
+learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
+max_steps = 240000

seed = 9031
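
As a side note, the LAMB hyperparameters in this config map directly onto an optimizer constructor. A sketch assuming apex's FusedLAMB is used (the README's metrics table mentions the apex library; the benchmark harness may wire this differently):

```python
import torch
from apex.optimizers import FusedLAMB  # requires NVIDIA apex

model = torch.nn.Linear(1024, 1024)    # stand-in for the real BERT model

optimizer = FusedLAMB(
    model.parameters(),
    lr=0.00035,              # learning_rate
    betas=(0.9, 0.999),      # opt_lamb_beta_1, opt_lamb_beta_2
    weight_decay=0.01,       # weight_decay_rate
)
```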
7 changes: 4 additions & 3 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x8.py
@@ -1,18 +1,19 @@
-from config_Ampere_common import *
+from config_BI_common import *

-gradient_accumulation_steps = 1
+gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
-learning_rate = 3.5e-4
+learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
+max_steps = 30000

seed = 9031
19 changes: 19 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x2x8.py
@@ -0,0 +1,19 @@
+from config_BI_common import *
+
+gradient_accumulation_steps = 4
+start_warmup_step = 0
+warmup_proportion = 0
+warmup_steps = 0
+
+distributed_lamb = False
+learning_rate = 0.00035
+weight_decay_rate = 0.01
+opt_lamb_beta_1 = 0.9
+opt_lamb_beta_2 = 0.999
+
+eval_batch_size = train_batch_size
+max_samples_termination = 4500000
+cache_eval_data = True
+max_steps = 20000
+
+seed = 9031
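
Taken together, the max_steps values of the three configs keep the total sample budget roughly constant as GPUs are added. A quick cross-check, assuming the conventional data-parallel global batch of train_batch_size × gradient_accumulation_steps × number of GPUs (an assumption about the harness's batch accounting):

```python
train_batch_size = 20   # from the shared config
grad_accum_steps = 4

configs = {"1x1": (240_000, 1), "1x8": (30_000, 8), "2x8": (20_000, 16)}

for name, (max_steps, n_gpus) in configs.items():
    global_batch = train_batch_size * grad_accum_steps * n_gpus
    print(name, max_steps * global_batch)
# 1x1 and 1x8 both budget 19,200,000 samples; 2x8 budgets 25,600,000.
```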
@@ -3,14 +3,13 @@
import os

grad_scaler = GradScaler(init_scale=float(os.getenv("INIT_LOSS_SCALE", 2**20)),
-                         growth_interval=2000)
+                         growth_interval=2000, enabled=True)

fp16 = True
ddp_type = "apex"
dist_backend = "nccl"

-train_batch_size = 12
-max_steps = 1000000
+train_batch_size = 20

fused_gelu_bias = True
fused_mha = True
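
For context, the grad_scaler above follows PyTorch's standard mixed-precision recipe: scale the loss before backward, let the scaler drive the optimizer step, then update the scale. A generic sketch of that loop (standard torch.cuda.amp usage, not code from this benchmark):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(128, 128).cuda()   # needs a CUDA device
optimizer = torch.optim.SGD(model.parameters(), lr=0.00035)
scaler = GradScaler(init_scale=2**20, growth_interval=2000, enabled=True)

for _ in range(10):
    x = torch.randn(20, 128, device="cuda")
    optimizer.zero_grad()
    with autocast():                  # forward pass in mixed precision
        loss = model(x).square().mean()
    scaler.scale(loss).backward()     # scale loss to avoid fp16 underflow
    scaler.step(optimizer)            # unscales grads; skips step on inf/NaN
    scaler.update()                   # grow or shrink the loss scale
```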
2 changes: 0 additions & 2 deletions training/iluvatar/bert-pytorch/config/config_common.py
@@ -1,7 +1,5 @@
import torch

-vendor: str = "iluvatar"

# 'segmented' or 'full_iteration' options for CUDA graph capture.
# 'segmented' option: Pytorch Autograd orchestrates execution of backward ops every iteration.
# 'full_iteration' option: CUDA graph orchestrates execution of bwd ops every iteration without Autograd involvement (has composability limitations but could be more performant allowing optimizer and collectives capture).
