How to use PyTorch to train a model with DistributedDataParallel #13

AlexiFeng · 2023-05-04T15:52:59Z

Many of the terms used here also appear in the course "Parallel Computing".

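import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
FLAGS = parser.parse_args()
local_rank = FLAGS.local_rank

# Addition 3: initialize the DDP backend
#   a. Use local_rank to select which GPU this process uses
torch.cuda.set_device(local_rank)
#   b. Initialize the process group; the default backend (nccl) is fine. If the model runs on CPU, choose another backend.
dist.init_process_group(backend='nccl')
device = torch.device("cuda", local_rank)
model = SimpleNet().to(device)  # initialize the model
model = DDP(model, device_ids=[local_rank], output_device=local_rank)  # wrap the model with DDP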
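The comment in step b says a CPU model needs a different backend. A minimal sketch of that case, assuming the gloo backend and a single-process group for a local check only. TinyNet, the helper names, and the file-based rendezvous in a temporary directory are hypothetical choices for illustration, not part of the original setup.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class TinyNet(nn.Module):
    """Hypothetical stand-in for SimpleNet."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


def init_single_process_cpu_group():
    # Assumption: one local process, so a private temp file can serve as
    # the rendezvous point. Real multi-process jobs normally get their
    # rendezvous settings from the launcher instead.
    store_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo",
        init_method="file://" + store_file,
        rank=0,
        world_size=1,
    )


def build_cpu_ddp_model():
    if not dist.is_initialized():
        init_single_process_cpu_group()
    # For a CPU module, device_ids and output_device are left unset.
    return DDP(TinyNet())
```

With this sketch, `build_cpu_ddp_model()(torch.randn(3, 4))` runs a forward pass through the wrapped model, and `dist.destroy_process_group()` cleans up afterwards.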

Next, you must use a DistributedSampler so that each process receives a different part of the data.

train_sampler  = torch.utils.data.distributed.DistributedSampler(train_dataset)
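To illustrate how the sampler splits the data, here is a small sketch. It passes `num_replicas` and `rank` explicitly so it can run without a process group. Inside an initialized group these arguments can be omitted, as in the line above. The helper `make_loader`, the toy dataset, and the batch size are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, rank, world_size, batch_size=4):
    # Each rank gets its own sampler; all ranks share the same seed so the
    # shuffled order is identical and the per-rank slices do not overlap.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=0
    )
    # Do not also pass shuffle=True to the DataLoader; the sampler
    # already controls the order.
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler


# Hypothetical toy dataset with 8 samples.
toy_dataset = TensorDataset(torch.arange(8))
```

In a training loop, calling `sampler.set_epoch(epoch)` at the start of each epoch makes the shuffle order change between epochs.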

Use only one process to save the model:

if dist.get_rank() == 0:
    # model.module is the original model inside the DDP wrapper
    torch.save(model.module, save_path + str(cur_epoch) + ".pth")

Should a barrier be used after that? I'm not sure.
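One possible answer, as a hedged sketch: the save on rank 0 does not itself need a barrier, but if other ranks read the checkpoint afterwards, a barrier after the save keeps them from loading a file that rank 0 has not finished writing. The helper name `save_on_rank0` is hypothetical, and this sketch saves the `state_dict` instead of the whole module, which is a substitution and not what the snippet above does.

```python
import torch
import torch.distributed as dist


def save_on_rank0(ddp_model, path):
    # Only rank 0 writes, so the processes do not overwrite each other.
    if dist.get_rank() == 0:
        # ddp_model.module is the original model inside the DDP wrapper.
        torch.save(ddp_model.module.state_dict(), path)
    # Every rank waits here until all ranks have arrived, which means
    # rank 0 has finished writing before anyone continues.
    dist.barrier()
```

After the barrier, any rank can call `torch.load(path)` and pass the result to `load_state_dict`. If no rank reads the file during training, the barrier can probably be dropped.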

AlexiFeng added TODO python labels May 4, 2023
