
Distributed training #6

Open
WLCOOLONGS opened this issue Jul 12, 2018 · 6 comments


@WLCOOLONGS

One more question, please:
worker-1 keeps waiting on INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.

worker-0
Part of its log:
INFO:tensorflow:Saving checkpoints for 25076 into /workspace/wlc/model_dir/model.ckpt.
INFO:tensorflow:global_step/sec: 7.5244
E0712 18:25:49.778093 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 84.10749, average_loss = 0.65708977 (31.580 sec)
INFO:tensorflow:loss = 84.10749, step = 25285 (31.580 sec)
E0712 18:26:20.777756 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 81.15384, average_loss = 0.63401437 (25.918 sec)

worker-1 keeps waiting:

TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-worker-1-0grc9:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'evaluator', 'index': 0}}
INFO:tensorflow:Using config: {'_num_worker_replicas': 0, '_num_ps_replicas': 0, '_global_id_in_cluster': None, '_master': '', '_save_checkpoints_steps': 1000, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 1000, '_keep_checkpoint_max': 5, '_log_step_count_steps': 1000, '_service': None, '_save_checkpoints_secs': None, '_is_chief': False, '_tf_random_seed': None, '_model_dir': '/workspace/wlc/model_dir/', '_evaluation_master': '', '_task_id': 0, '_cluster_spec': , '_task_type': 'evaluator'}
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999588 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999654 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999693 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999667 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999685 secs before starting next eval run.

worker-2 ran successfully:

INFO:tensorflow:loss = 84.003555, average_loss = 0.6562778 (26.016 sec)
INFO:tensorflow:loss = 84.003555, step = 25914 (26.016 sec)
INFO:tensorflow:Loss for final step: 84.82182.
ps_host ['tensorflow-wanglianchen-144-16-ps-0:2222']
worker_host ['tensorflow-wanglianchen-144-16-worker-2:2222']
chief_hosts ['tensorflow-wanglianchen-144-16-worker-0:2222']
{"task": {"index": 0, "type": "worker"}, "cluster": {"ps": ["tensorflow-wanglianchen-144-16-ps-0:2222"], "worker": ["tensorflow-wanglianchen-144-16-worker-2:2222"], "chief": ["tensorflow-wanglianchen-144-16-worker-0:2222"]}}
model_type:wide_deep
train_samples_num:3000000
Parsing /workspace/wlc/wide_deep_dist/data/train.csv
1.0hours
task train success.
modeldir=/workspace/wlc,modelname=model_dir

ps-0 log:
start checkWorkerIsFinish
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-ps-0-jrngn:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'ps', 'index': 0}}
INFO:tensorflow:Using config: {'_cluster_spec': , '_task_id': 0, '_model_dir': '/workspace/wlc/model_dir/', '_service': None, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_save_summary_steps': 1000, '_is_chief': False, '_save_checkpoints_secs': None, '_master': 'grpc://tensorflow-wanglianchen-144-16-ps-0:2222', '_global_id_in_cluster': 2, '_evaluation_master': '', '_keep_checkpoint_max': 5, '_save_checkpoints_steps': 1000, '_task_type': 'ps', '_tf_random_seed': None, '_num_worker_replicas': 2, '_log_step_count_steps': 1000, '_num_ps_replicas': 1, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Start Tensorflow server.
2018-07-12 17:26:33.154403: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-12 17:26:33.160418: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> tensorflow-wanglianchen-144-16-worker-0:2222}
2018-07-12 17:26:33.160444: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-12 17:26:33.160463: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tensorflow-wanglianchen-144-16-worker-2:2222}
2018-07-12 17:26:33.164749: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
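
The 1800-second waits in the evaluator log above are controlled by the timing parameters of tf.estimator.EvalSpec passed to tf.estimator.train_and_evaluate. A minimal sketch of that wiring, where my_model_fn, train_input_fn, eval_input_fn and the 1800-second values are illustrative assumptions rather than this repo's actual code:

```python
import tensorflow as tf

# Hypothetical estimator; in this repo it would be the wide&deep model.
estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,                           # assumed model_fn
    model_dir='/workspace/wlc/model_dir/',
    config=tf.estimator.RunConfig(save_checkpoints_steps=1000))

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)   # assumed input_fn

eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,     # assumed input_fn
    start_delay_secs=1800,      # "Waiting 1800.000000 secs before starting eval."
    throttle_secs=1800)         # "Waiting ... secs before starting next eval run."

# On the node whose TF_CONFIG task type is 'evaluator', this call only runs the
# evaluation loop: it sits out the delays above and then evaluates the newest
# checkpoint it can find in model_dir; the "not trained yet" warning is logged
# whenever no checkpoint is visible there yet.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```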

@Welchkimi

Hello, I'd like to ask how distributed training should be run. Suppose there are 3 machines; what are the concrete steps? Could you provide some detailed instructions? Thank you very much.

@lambdaji
Owner

worker-1 is 'task': {'type': 'evaluator', 'index': 0}. How often is the ckpt saved?

@lambdaji
Owner

See run_dist.sh for reference.
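
For reference, the TF_CONFIG values printed in the logs above are just this environment variable set differently on each node before the Estimator is built. A minimal sketch of that step, reusing the host names from the logs (the actual layout of run_dist.sh may differ):

```python
import json
import os

# Same cluster as in the logs: one chief, one worker, one ps (plus an evaluator task).
cluster = {
    'chief':  ['tensorflow-wanglianchen-144-16-worker-0:2222'],
    'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'],
    'ps':     ['tensorflow-wanglianchen-144-16-ps-0:2222'],
}

# Each node sets its own role before creating the RunConfig/Estimator.
# task_type is one of 'chief', 'worker', 'ps', 'evaluator'; task_index is the
# node's position in that role's host list (assumption: passed in via CLI or env).
task_type, task_index = 'worker', 0

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': cluster,
    'task': {'type': task_type, 'index': task_index},
})
```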

@Welchkimi

Thanks. Then let me ask: if I train the model on a cluster in distributed mode, do I need to split the data manually myself? For example, with 3 machines, does that mean the original data has to be split into 3 parts, one stored on each machine? Or is no splitting needed and every machine uses the full dataset? (PS: in that case it feels like every machine processes all of the data, which would not show any speedup from distributed training.)

@WLCOOLONGS
Author

@lambdaji OK, I set just one of the two intervals (_save_checkpoints_secs / save_checkpoints_steps), but the problem is still there. Thanks a lot anyway; I'll spend the next few days tracking it down.
@Welchkimi You should look into synchronous vs. asynchronous modes of distributed training.
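
For context, save_checkpoints_secs and save_checkpoints_steps are mutually exclusive on tf.estimator.RunConfig, so only one of them can be passed. A minimal sketch with illustrative values (only the 1000-step setting appears in the logs above):

```python
import tensorflow as tf

# Checkpoint either by wall-clock time ...
config_by_secs = tf.estimator.RunConfig(
    model_dir='/workspace/wlc/model_dir/',
    save_checkpoints_secs=600,      # write a checkpoint every 10 minutes (illustrative)
    keep_checkpoint_max=5)

# ... or by global-step count, but not both (RunConfig raises ValueError otherwise).
config_by_steps = tf.estimator.RunConfig(
    model_dir='/workspace/wlc/model_dir/',
    save_checkpoints_steps=1000,    # matches '_save_checkpoints_steps': 1000 in the logs
    keep_checkpoint_max=5)
```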

@lambdaji
Owner

You need to split the data yourself, and then in the code use the task index to decide which shard to read.
PS: change the line in the current code that uses glob.
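
A minimal sketch of that idea, selecting a slice of the globbed training files by the task index from TF_CONFIG (the file pattern and the every-Nth-file slicing are illustrative assumptions, not the repo's actual code):

```python
import glob
import json
import os

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task_index = tf_config.get('task', {}).get('index', 0)
num_workers = len(tf_config.get('cluster', {}).get('worker', [])) or 1

# Hypothetical layout: the training set pre-split into several CSV files.
all_files = sorted(glob.glob('/workspace/wlc/wide_deep_dist/data/train-*.csv'))

# Each worker reads only every num_workers-th file, offset by its own index,
# so the workers consume disjoint shards of the data.
my_files = all_files[task_index::num_workers]
```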
