paddle v2 api support for distributed training #1732
#1680 is somewhat related to this.

Some preliminary thoughts below; feedback welcome.

I think there are two possible approaches:

Preliminary work for the paddle cluster-management (mpi/k8s) plan

Start a test MPI cluster on k8s (reference: https://hub.docker.com/r/dispel4py/docker.openmpi/):
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: mpi-header
  labels:
    app: mpi-header
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: mpi-header
    spec:
      containers:
      - image: dispel4py/docker.openmpi
        name: mpi-header
        resources:
          limits:
            cpu: 500m
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 2Gi
        ports:
        - containerPort: 22
```
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: mpi-nodes
  labels:
    app: mpi-nodes
spec:
  replicas: 8
  template:
    metadata:
      labels:
        app: mpi-nodes
    spec:
      containers:
      - image: dispel4py/docker.openmpi
        name: mpi-nodes
        resources:
          limits:
            cpu: 500m
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 2Gi
        ports:
        - containerPort: 22
```
```bash
git clone https://github.com/dispel4py/docker.openmpi.git
cd docker.openmpi
chmod 400 ssh/id_rsa.mpi
ssh -i ssh/id_rsa.mpi tutorial@[head node address]

# collect the pod IPs of the MPI worker nodes into a hostfile
kubectl get po -a -o wide | grep mpi-nodes | awk '{print $6}' > machines
mpiexec -hostfile machines -n 16 python helloworld.py
```
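For a quick sanity check of the cluster, `helloworld.py` only needs each MPI rank to identify itself. The actual script ships with the docker.openmpi repo; a minimal illustrative sketch, assuming mpi4py is available in the image, could look like this:

```python
# helloworld.py -- illustrative sketch, assuming mpi4py is installed in the image
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()            # index of this process in the MPI job
size = comm.Get_size()            # total processes started by mpiexec
host = MPI.Get_processor_name()   # node the process landed on

print("Hello from rank %d of %d on %s" % (rank, size, host))
```

If all 16 greetings show up across the mpi-nodes pods, the SSH keys and hostfile wiring are working.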
Did a brief investigation into enabling paddlepaddle to run in an HPC/MPI environment. The commands below start the parameter servers and the trainers:

```bash
# start two parameter servers, one per machine in the hostfile
mpirun -n 2 -ppn 1 -machinefile hosts paddle pserver \
    --num_gradient_servers=2 --nics=eno3 --port=7164 --ports_num=1 \
    --ports_num_for_sparse=0 --comment=paddle_process_by_paddle

# start two trainers against those parameter servers
mpirun -n 2 -ppn 1 -machinefile hosts paddle train \
    --num_gradient_servers=2 --nics=eno3 --port=7164 --ports_num=1 \
    --comment=paddle_process_by_paddle \
    --pservers=192.168.10.21,192.168.10.22 --ports_num_for_sparse=0 \
    --config=./vgg_16_cifar.py --trainer_count=4 --use_gpu=0 \
    --num_passes=1 --save_dir=./cifar_vgg_model --log_period=10 \
    --dot_period=10 --saving_period=1 --local=0 --trainer_id=1
```
There are also some remaining questions:

FROM @moting9
Snapshots are best stored in a distributed storage engine, for reasons including but not limited to the following two points:

In fact, more of the cluster-training design doc lives in #1696; comments welcome :)
Did a brief survey of the workflow for running TensorFlow on Google Cloud.

Updated a simple example program demonstrating distributed paddle training on openmpi: https://github.com/typhoonzero/paddle-openmpi

done
Current status
The current v2 API only supports single-machine execution. Distributed jobs still run the old way via `paddle train xxx`: the trainer/pserver binaries are launched, read trainer_conf.py, and then pull training data through a data_provider.
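For context, the legacy flow revolves around a config file like the one sketched below, written against the old trainer_config_helpers API. The file names and layer sizes here are made up for illustration; the real trainer_conf.py is whatever the job submits:

```python
# trainer_conf.py -- illustrative sketch of a legacy config read by `paddle train`.
# The module/file names ('dataprovider', 'train.list') are hypothetical.
from paddle.trainer_config_helpers import *

# data is pulled in through a separate data_provider module
define_py_data_sources2(
    train_list='train.list',
    test_list=None,
    module='dataprovider',
    obj='process')

settings(batch_size=128, learning_rate=1e-3, learning_method=AdamOptimizer())

img = data_layer(name='image', size=784)
lbl = data_layer(name='label', size=10)
hidden = fc_layer(input=img, size=128, act=ReluActivation())
predict = fc_layer(input=hidden, size=10, act=SoftmaxActivation())
outputs(classification_cost(input=predict, label=lbl))
```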
Requirements
This needs an upgrade: on each MPI node, start the trainer with `python xxx.py`, and drive the whole training process through the Python v2 API.
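A hypothetical sketch of what such a `python xxx.py` entry point could look like. The model, trainer, and reader calls below are the existing v2 API; the distributed keyword arguments to paddle.init (local, trainer_id, pservers, num_gradient_servers, ...) are assumptions mirroring the `paddle train` flags in the mpirun example above, not a confirmed interface:

```python
import os
import paddle.v2 as paddle

# ASSUMPTION: paddle.init forwards the same cluster flags the `paddle train`
# CLI accepts today; designing this mapping is exactly what this issue is about.
paddle.init(
    use_gpu=False,
    trainer_count=4,
    local=False,                                             # assumed flag
    trainer_id=int(os.getenv('OMPI_COMM_WORLD_RANK', '0')),  # rank from mpirun
    pservers='192.168.10.21,192.168.10.22',                  # assumed flag
    num_gradient_servers=2,                                  # assumed flag
    port=7164)

# a small MNIST model, written with the current v2 layer API
images = paddle.layer.data(name='image', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(name='label', type=paddle.data_type.integer_value(10))
hidden = paddle.layer.fc(input=images, size=128, act=paddle.activation.Relu())
predict = paddle.layer.fc(input=hidden, size=10, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=predict, label=label)

parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.Momentum(momentum=0.9)
trainer = paddle.trainer.SGD(cost=cost,
                             parameters=parameters,
                             update_equation=optimizer)

# the reader replaces the old data_provider; each trainer would read its shard
trainer.train(
    reader=paddle.batch(paddle.dataset.mnist.train(), batch_size=128),
    num_passes=5)
```

Launched as `mpirun -machinefile machines -n 4 python xxx.py` on the trainer nodes (parameter servers could still be started separately, as in the mpirun example above), every rank would run this same script, with paddle.init deciding each process's role from the flags.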
What needs to be done