Iluvatar update bert conf #165

Merged
merged 7 commits into from
Jul 25, 2023
49 changes: 28 additions & 21 deletions training/iluvatar/bert-pytorch/README.md
@@ -1,16 +1,6 @@
## Model Information
### Model Introduction

BERT stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Please refer to this paper for a detailed description of BERT:
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

### Model Checkpoint Download
[Model checkpoint download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)
### Test Dataset Download
[Test dataset download](../../benchmarks/bert/pytorch/readme.md#模型信息与数据集模型checkpoint下载)
[Test dataset download](../../benchmarks/bert/README.md#测试数据集下载)

### Iluvatar BI-V100 GPU Configuration and Run Information Reference
#### Environment Configuration
@@ -20,21 +10,38 @@ Please refer to this paper for a detailed description of BERT:
- ##### Software Environment
- OS version: Ubuntu 20.04
- OS kernel version: 4.15.0-156-generic x86_64
- Accelerator driver version: 3.0.0
- Accelerator driver version: 3.1.0
- Docker version: 20.10.8
- Training framework version: torch-1.10.2+corex.3.0.0
- Training framework version: torch-1.13.1+corex.3.1.0
- Dependency software versions: none


### Run Results
| Training resources | Config file | Runtime (s) | Target accuracy | Converged accuracy | Steps | Performance (samples/s) |
| ------------------ | ------------------ | ----------- | --------------- | ------------------ | ------- | --------------- |
| 1 node, 1 card     | config_BI-V100x1x1 | 17854.76    | 0.72            | 0.7325             | 25000   | 17.00           |
| 1 node, 8 cards    | config_BI-V100x1x8 | 20312.57    | 0.72            | 0.9619             | 25000   | 118.45          |
| 2 nodes, 8 cards   | config_BI-V100x2x8 | pending     | 0.72            | pending            | pending | pending         |

### License
* General metrics

| Metric name | Metric value | Notes |
| ---------------------- | ------------------------------------- | ------------------------------------------- |
| Task category          | Natural language encoding             |                                              |
| Model                  | bert-large-uncased                    |                                              |
| Dataset                | Wikipedia                             |                                              |
| Data precision         | precision, see "Performance metrics"  | fp32/amp/fp16 selectable                     |
| Hyperparameter changes | fix_hp, see "Performance metrics"     | Special hyperparameters needed to saturate the device during throughput evaluation |
| Hardware device        | BI-V100                               |                                              |
| Hardware memory use    | mem, see "Performance metrics"        | Commonly called "device memory", in GiB      |
| End-to-end time        | e2e_time, see "Performance metrics"   | Total time plus Perf initialization time, etc. |
| Overall throughput     | p_whole, see "Performance metrics"    | Number of sequences actually trained divided by total time (performance_whole) |
| Training throughput    | p_train, see "Performance metrics"    | Excludes the end-of-epoch evaluation time    |
| **Compute throughput** | **p_core, see "Performance metrics"** | Additionally excludes data I/O time (p3>p2>p1) |
| Training result        | mlm_acc, see "Performance metrics"    | Accuracy on the masked_lm task               |
| Extra modifications    | Uses the apex library                 |                                              |

* Performance metrics (a short sketch of how the throughput figures relate follows this table)

| Config | precision | fix_hp | e2e_time | p_whole | p_train | p_core | mlm_acc | mem |
| ------------------------------ | --------- | ---------------- | -------- | ------- | ------- | ------ | ------- | --- |
| BI-V100 1 node, 8 cards (1x8)  | amp       | /                |          |         |         |        |         |     |
| BI-V100 2 nodes, 8 cards (2x8) | amp       | bs=20,lr=0.00035 |          |         |         |        |         |     |
| BI-V100 1 node, 1 card (1x1)   | amp       | bs=20,lr=0.00035 |          |         |         |        |         |     |
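The throughput definitions in the general metrics table imply p_core > p_train > p_whole. A minimal sketch, using purely hypothetical numbers (none of these values come from this PR), of how the three figures relate:

```python
# Hedged sketch: hypothetical numbers only, illustrating the metric definitions above.
num_train_sequences = 2_000_000  # number of sequences actually trained
total_time_s = 20_000.0          # total wall-clock time, including Perf initialization
eval_time_s = 500.0              # end-of-epoch evaluation time (excluded from p_train)
data_io_time_s = 300.0           # data I/O time (additionally excluded from p_core)

p_whole = num_train_sequences / total_time_s
p_train = num_train_sequences / (total_time_s - eval_time_s)
p_core = num_train_sequences / (total_time_s - eval_time_s - data_io_time_s)
assert p_core > p_train > p_whole  # matches the p3 > p2 > p1 note in the table
```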

This project is licensed under the Apache 2.0 license.

Parts of this project's code are based on the MLCommons implementation at https://github.com/mlcommons/training_results_v1.0/tree/master/NVIDIA.
8 changes: 4 additions & 4 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x1.py
@@ -1,19 +1,19 @@
from config_Ampere_common import *
from config_BI_common import *

gradient_accumulation_steps = 1
gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
exchange_padding = False
learning_rate = 3.5e-4
learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
max_steps = 240000

seed = 9031
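For orientation, a minimal sketch of the effective global batch size implied by this 1x1 config together with `train_batch_size = 20` from `config_BI_common.py`; the multiplication itself is an assumption about how the harness composes these fields, not code taken from the repository.

```python
# Hedged sketch: assumes the usual effective-batch formula, not taken from the repo.
train_batch_size = 20            # per-device micro-batch, set in config_BI_common.py
gradient_accumulation_steps = 4  # from config_BI-V100x1x1.py above
num_devices = 1                  # single BI-V100 card in the 1x1 setup

effective_global_batch = train_batch_size * gradient_accumulation_steps * num_devices
print(effective_global_batch)    # 80 samples per optimizer step
```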
7 changes: 4 additions & 3 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x1x8.py
@@ -1,18 +1,19 @@
from config_Ampere_common import *
from config_BI_common import *

gradient_accumulation_steps = 1
gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
learning_rate = 3.5e-4
learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
max_steps = 30000

seed = 9031
19 changes: 19 additions & 0 deletions training/iluvatar/bert-pytorch/config/config_BI-V100x2x8.py
@@ -0,0 +1,19 @@
from config_BI_common import *

gradient_accumulation_steps = 4
start_warmup_step = 0
warmup_proportion = 0
warmup_steps = 0

distributed_lamb = False
learning_rate = 0.00035
weight_decay_rate = 0.01
opt_lamb_beta_1 = 0.9
opt_lamb_beta_2 = 0.999

eval_batch_size = train_batch_size
max_samples_termination = 4500000
cache_eval_data = True
max_steps = 20000

seed = 9031
@@ -3,14 +3,13 @@
import os

grad_scaler = GradScaler(init_scale=float(os.getenv("INIT_LOSS_SCALE", 2**20)),
                         growth_interval=2000)
                         growth_interval=2000, enabled=True)

fp16 = True
ddp_type = "apex"
dist_backend = "nccl"

train_batch_size = 12
max_steps = 1000000
train_batch_size = 20

fused_gelu_bias = True
fused_mha = True
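For context, a minimal sketch (not this repository's training loop; the model, optimizer, and data are placeholders) of how a GradScaler configured as above is typically driven in a PyTorch AMP training step:

```python
import os
import torch
from torch.cuda.amp import GradScaler, autocast

# Same construction as the config above: large initial scale, growth every 2000 steps.
grad_scaler = GradScaler(init_scale=float(os.getenv("INIT_LOSS_SCALE", 2**20)),
                         growth_interval=2000, enabled=True)

def train_step(model, optimizer, inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                        # forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    grad_scaler.scale(loss).backward()      # scale the loss to avoid fp16 gradient underflow
    grad_scaler.step(optimizer)             # unscales gradients; skips the step on inf/NaN
    grad_scaler.update()                    # adjusts the loss scale (per growth_interval)
    return loss.item()
```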
2 changes: 0 additions & 2 deletions training/iluvatar/bert-pytorch/config/config_common.py
@@ -1,7 +1,5 @@
import torch

vendor: str = "iluvatar"

# 'segmented' or 'full_iteration' options for CUDA graph capture.
# 'segmented' option: Pytorch Autograd orchestrates execution of backward ops every iteration.
# 'full_iteration' option: CUDA graph orchestrates execution of bwd ops every iteration without Autograd involvement (has composability limitations but could be more performant allowing optimizer and collectives capture).
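As an illustration of the 'full_iteration' style described in the comment above, here is a minimal sketch using a static-shape toy model; this is the generic whole-iteration capture pattern from PyTorch's CUDA graph API, not the code used by this benchmark.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.00035)
static_input = torch.randn(20, 1024, device="cuda")
static_target = torch.randn(20, 1024, device="cuda")

# Warm-up on a side stream so workspaces and allocations exist before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        F.mse_loss(model(static_input), static_target).backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward + backward + optimizer step into a single CUDA graph.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = F.mse_loss(model(static_input), static_target)
    static_loss.backward()
    optimizer.step()

# Replay: copy fresh data into the static input, then launch the whole iteration at once.
static_input.copy_(torch.randn(20, 1024, device="cuda"))
g.replay()
```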