Distributed training #6
Hello, I'd like to ask how distributed training should be carried out. Suppose there are 3 machines; what exactly are the steps? Could you provide some detailed instructions? Thank you very much.
worker-1 is 'task': {'type': 'evaluator', 'index': 0}. How often is a checkpoint (ckpt) saved?
See run_dist.sh.
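As a rough sketch of what run_dist.sh boils down to on each of the 3 machines (the host names, ports, and helper function below are placeholders, not the repository's actual values; the script itself is the authoritative reference): each process exports a TF_CONFIG describing the same cluster but its own task, then runs the training entry point.

```python
import json
import os

# Hypothetical 3-machine layout, mirroring the TF_CONFIG dicts in the
# logs below: one chief, one worker, one ps (the evaluator runs as an
# extra process with task type "evaluator").
CLUSTER = {
    "chief": ["machine-0:2222"],
    "worker": ["machine-1:2222"],
    "ps": ["machine-2:2222"],
}

def export_tf_config(task_type, task_index):
    """Each process sets TF_CONFIG before building its Estimator."""
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })

# On machine-0: export_tf_config("chief", 0)
# On machine-1: export_tf_config("worker", 0)
# On machine-2: export_tf_config("ps", 0)
# Each process then calls
#   tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
# with the same model_dir on a shared filesystem.
```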
Thanks. Another question then: if I train the model on a distributed cluster, do I need to split the data manually myself? For example, with 3 machines, does that mean the original data has to be split into 3 parts, with one part stored on each machine? Or is no split needed and every machine uses the entire dataset? (PS: in that case it feels like every machine processes all of the data, which wouldn't show any speedup from distributed training.)
@lambdaji OK, so only one of the two settings (save_checkpoints_secs or save_checkpoints_steps) should be set. There is still a problem, but thank you very much; I'll spend the next few days tracking it down.
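For illustration, a minimal RunConfig sketch that sets only one of the two checkpoint-frequency options (the step count and path are placeholders, not the repository's defaults):

```python
import tensorflow as tf

# Keep only the step-based setting; leaving save_checkpoints_secs unset
# avoids the conflict between the two mutually exclusive options.
run_config = tf.estimator.RunConfig(
    model_dir="/workspace/model_dir",   # placeholder path
    save_checkpoints_steps=1000,
    keep_checkpoint_max=5,
)
```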
You need to split the data yourself; then in the code, use the task index to decide which shard to read.
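A minimal sketch of that approach, assuming the training data has already been split into one CSV shard per training node (the file-naming scheme and shard assignment below are assumptions, not the repository's layout):

```python
import json
import os

# Each node decides which pre-split shard to read from its TF_CONFIG task.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {"type": "chief", "index": 0})

# Hypothetical assignment: the chief reads shard 0 and worker i reads
# shard i + 1; ps and evaluator processes do not read training data.
if task["type"] == "chief":
    shard_id = 0
elif task["type"] == "worker":
    shard_id = task["index"] + 1
else:
    shard_id = None

if shard_id is not None:
    train_file = "/workspace/data/train_part_%d.csv" % shard_id
    print("this node trains on:", train_file)
```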
One more question:
worker-1 keeps waiting:
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
worker-0, part of its log:
INFO:tensorflow:Saving checkpoints for 25076 into /workspace/wlc/model_dir/model.ckpt.
INFO:tensorflow:global_step/sec: 7.5244
E0712 18:25:49.778093 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 84.10749, average_loss = 0.65708977 (31.580 sec)
INFO:tensorflow:loss = 84.10749, step = 25285 (31.580 sec)
E0712 18:26:20.777756 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 81.15384, average_loss = 0.63401437 (25.918 sec)
worker-1 keeps waiting:
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-worker-1-0grc9:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'evaluator', 'index': 0}}
INFO:tensorflow:Using config: {'_num_worker_replicas': 0, '_num_ps_replicas': 0, '_global_id_in_cluster': None, '_master': '', '_save_checkpoints_steps': 1000, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 1000, '_keep_checkpoint_max': 5, '_log_step_count_steps': 1000, '_service': None, '_save_checkpoints_secs': None, '_is_chief': False, '_tf_random_seed': None, '_model_dir': '/workspace/wlc/model_dir/', '_evaluation_master': '', '_task_id': 0, '_cluster_spec': , '_task_type': 'evaluator'}
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999588 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999654 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999693 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999667 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999685 secs before starting next eval run.
worker-2 ran successfully:
INFO:tensorflow:loss = 84.003555, average_loss = 0.6562778 (26.016 sec)
INFO:tensorflow:loss = 84.003555, step = 25914 (26.016 sec)
INFO:tensorflow:Loss for final step: 84.82182.
ps_host ['tensorflow-wanglianchen-144-16-ps-0:2222']
worker_host ['tensorflow-wanglianchen-144-16-worker-2:2222']
chief_hosts ['tensorflow-wanglianchen-144-16-worker-0:2222']
{"task": {"index": 0, "type": "worker"}, "cluster": {"ps": ["tensorflow-wanglianchen-144-16-ps-0:2222"], "worker": ["tensorflow-wanglianchen-144-16-worker-2:2222"], "chief": ["tensorflow-wanglianchen-144-16-worker-0:2222"]}}
model_type:wide_deep
train_samples_num:3000000
Parsing /workspace/wlc/wide_deep_dist/data/train.csv
1.0hours
task train success.
modeldir=/workspace/wlc,modelname=model_dir
ps-0 log:
start checkWorkerIsFinish
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-ps-0-jrngn:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'ps', 'index': 0}}
INFO:tensorflow:Using config: {'_cluster_spec': , '_task_id': 0, '_model_dir': '/workspace/wlc/model_dir/', '_service': None, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_save_summary_steps': 1000, '_is_chief': False, '_save_checkpoints_secs': None, '_master': 'grpc://tensorflow-wanglianchen-144-16-ps-0:2222', '_global_id_in_cluster': 2, '_evaluation_master': '', '_keep_checkpoint_max': 5, '_save_checkpoints_steps': 1000, '_task_type': 'ps', '_tf_random_seed': None, '_num_worker_replicas': 2, '_log_step_count_steps': 1000, '_num_ps_replicas': 1, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Start Tensorflow server.
2018-07-12 17:26:33.154403: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-12 17:26:33.160418: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> tensorflow-wanglianchen-144-16-worker-0:2222}
2018-07-12 17:26:33.160444: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-12 17:26:33.160463: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tensorflow-wanglianchen-144-16-worker-2:2222}
2018-07-12 17:26:33.164749: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
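A general note on the evaluator logs above, offered as a hedged guess rather than a confirmed diagnosis from this thread: with tf.estimator.train_and_evaluate, the "Waiting ... secs" interval is governed by EvalSpec's start_delay_secs and throttle_secs, and the evaluator only starts once it can see a checkpoint under model_dir, so the chief and the evaluator need to share that directory (for example, over a shared filesystem). A minimal sketch with illustrative values:

```python
import tensorflow as tf

def eval_input_fn():
    # Placeholder input_fn; the real one would parse the eval CSV.
    features = {"x": tf.constant([[0.0]])}
    labels = tf.constant([0])
    return features, labels

# Shorter delays so the evaluator does not idle for 30 minutes at a time;
# the numbers are examples, not the repository's settings.
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=None,             # evaluate on the full eval set
    start_delay_secs=120,   # delay before the first evaluation
    throttle_secs=600,      # minimum gap between evaluation runs
)
```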