
Distributed training #6

Open
WLCOOLONGS opened this issue Jul 12, 2018 · 6 comments


@WLCOOLONGS

One more question, please:
worker-1 keeps waiting on INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.

worker-0
Part of its log:
INFO:tensorflow:Saving checkpoints for 25076 into /workspace/wlc/model_dir/model.ckpt.
INFO:tensorflow:global_step/sec: 7.5244
E0712 18:25:49.778093 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 84.10749, average_loss = 0.65708977 (31.580 sec)
INFO:tensorflow:loss = 84.10749, step = 25285 (31.580 sec)
E0712 18:26:20.777756 ProjectorPluginIsActiveThread saver.py:1817] Couldn't match files for checkpoint /workspace/wlc/model_dir/model.ckpt-25076
INFO:tensorflow:loss = 81.15384, average_loss = 0.63401437 (25.918 sec)

worker-1 keeps waiting:

TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-worker-1-0grc9:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'evaluator', 'index': 0}}
INFO:tensorflow:Using config: {'_num_worker_replicas': 0, '_num_ps_replicas': 0, '_global_id_in_cluster': None, '_master': '', '_save_checkpoints_steps': 1000, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 1000, '_keep_checkpoint_max': 5, '_log_step_count_steps': 1000, '_service': None, '_save_checkpoints_secs': None, '_is_chief': False, '_tf_random_seed': None, '_model_dir': '/workspace/wlc/model_dir/', '_evaluation_master': '', '_task_id': 0, '_cluster_spec': , '_task_type': 'evaluator'}
INFO:tensorflow:Waiting 1800.000000 secs before starting eval.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999588 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999654 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999693 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999667 secs before starting next eval run.
WARNING:tensorflow:Estimator is not trained yet. Will start an evaluation when a checkpoint is ready.
INFO:tensorflow:Waiting 1799.999685 secs before starting next eval run.

worker-2 ran successfully:

INFO:tensorflow:loss = 84.003555, average_loss = 0.6562778 (26.016 sec)
INFO:tensorflow:loss = 84.003555, step = 25914 (26.016 sec)
INFO:tensorflow:Loss for final step: 84.82182.
ps_host ['tensorflow-wanglianchen-144-16-ps-0:2222']
worker_host ['tensorflow-wanglianchen-144-16-worker-2:2222']
chief_hosts ['tensorflow-wanglianchen-144-16-worker-0:2222']
{"task": {"index": 0, "type": "worker"}, "cluster": {"ps": ["tensorflow-wanglianchen-144-16-ps-0:2222"], "worker": ["tensorflow-wanglianchen-144-16-worker-2:2222"], "chief": ["tensorflow-wanglianchen-144-16-worker-0:2222"]}}
model_type:wide_deep
train_samples_num:3000000
Parsing /workspace/wlc/wide_deep_dist/data/train.csv
1.0hours
task train success.
modeldir=/workspace/wlc,modelname=model_dir

ps-0 log:
start checkWorkerIsFinish
TensorBoard 1.6.0 at http://tensorflow-wanglianchen-144-16-ps-0-jrngn:6006 (Press CTRL+C to quit)
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'], 'ps': ['tensorflow-wanglianchen-144-16-ps-0:2222'], 'chief': ['tensorflow-wanglianchen-144-16-worker-0:2222']}, 'task': {'type': 'ps', 'index': 0}}
INFO:tensorflow:Using config: {'_cluster_spec': , '_task_id': 0, '_model_dir': '/workspace/wlc/model_dir/', '_service': None, '_session_config': device_count {
key: "CPU"
value: 1
}
device_count {
key: "GPU"
}
, '_save_summary_steps': 1000, '_is_chief': False, '_save_checkpoints_secs': None, '_master': 'grpc://tensorflow-wanglianchen-144-16-ps-0:2222', '_global_id_in_cluster': 2, '_evaluation_master': '', '_keep_checkpoint_max': 5, '_save_checkpoints_steps': 1000, '_task_type': 'ps', '_tf_random_seed': None, '_num_worker_replicas': 2, '_log_step_count_steps': 1000, '_num_ps_replicas': 1, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Start Tensorflow server.
2018-07-12 17:26:33.154403: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-12 17:26:33.160418: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> tensorflow-wanglianchen-144-16-worker-0:2222}
2018-07-12 17:26:33.160444: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-12 17:26:33.160463: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tensorflow-wanglianchen-144-16-worker-2:2222}
2018-07-12 17:26:33.164749: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
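
The 1800-second waits in the evaluator log above are controlled by the timing parameters of tf.estimator.EvalSpec passed to tf.estimator.train_and_evaluate. A minimal sketch of that wiring, where my_model_fn, train_input_fn, eval_input_fn and the 1800-second values are illustrative assumptions rather than this repo's actual code:

```python
import tensorflow as tf

# Hypothetical estimator; in this repo it would be the wide&deep model.
estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,                           # assumed model_fn
    model_dir='/workspace/wlc/model_dir/',
    config=tf.estimator.RunConfig(save_checkpoints_steps=1000))

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)   # assumed input_fn

eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,     # assumed input_fn
    start_delay_secs=1800,      # "Waiting 1800.000000 secs before starting eval."
    throttle_secs=1800)         # "Waiting ... secs before starting next eval run."

# On the node whose TF_CONFIG task type is 'evaluator', this call only runs the
# evaluation loop: it sits out the delays above and then evaluates the newest
# checkpoint it can find in model_dir; the "not trained yet" warning is logged
# whenever no checkpoint is visible there yet.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```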

@Welchkimi

Hello, I'd like to ask how distributed training should be run. Suppose there are 3 machines; what are the concrete steps? Could you provide some detailed instructions? Thank you very much.

@lambdaji
Owner

worker-1 is 'task': {'type': 'evaluator', 'index': 0}. How often is the ckpt saved?

@lambdaji
Owner

See run_dist.sh for reference.
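
For reference, the TF_CONFIG values printed in the logs above are just this environment variable set differently on each node before the Estimator is built. A minimal sketch of that step, reusing the host names from the logs (the actual layout of run_dist.sh may differ):

```python
import json
import os

# Same cluster as in the logs: one chief, one worker, one ps (plus an evaluator task).
cluster = {
    'chief':  ['tensorflow-wanglianchen-144-16-worker-0:2222'],
    'worker': ['tensorflow-wanglianchen-144-16-worker-2:2222'],
    'ps':     ['tensorflow-wanglianchen-144-16-ps-0:2222'],
}

# Each node sets its own role before creating the RunConfig/Estimator.
# task_type is one of 'chief', 'worker', 'ps', 'evaluator'; task_index is the
# node's position in that role's host list (assumption: passed in via CLI or env).
task_type, task_index = 'worker', 0

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': cluster,
    'task': {'type': task_type, 'index': task_index},
})
```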

@Welchkimi

Thanks. Then let me ask: if I train the model on a cluster in distributed mode, do I need to split the data manually myself? For example, with 3 machines, does that mean the original data has to be split into 3 parts, one stored on each machine? Or is no splitting needed and every machine uses the full dataset? (PS: in that case it feels like every machine processes all of the data, which would not show any speedup from distributed training.)

@WLCOOLONGS
Author

@lambdaji OK, I set just one of the two intervals (_save_checkpoints_secs / save_checkpoints_steps), but the problem is still there. Thanks a lot anyway; I'll spend the next few days tracking it down.
@Welchkimi You should look into synchronous vs. asynchronous modes of distributed training.
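
For context, save_checkpoints_secs and save_checkpoints_steps are mutually exclusive on tf.estimator.RunConfig, so only one of them can be passed. A minimal sketch with illustrative values (only the 1000-step setting appears in the logs above):

```python
import tensorflow as tf

# Checkpoint either by wall-clock time ...
config_by_secs = tf.estimator.RunConfig(
    model_dir='/workspace/wlc/model_dir/',
    save_checkpoints_secs=600,      # write a checkpoint every 10 minutes (illustrative)
    keep_checkpoint_max=5)

# ... or by global-step count, but not both (RunConfig raises ValueError otherwise).
config_by_steps = tf.estimator.RunConfig(
    model_dir='/workspace/wlc/model_dir/',
    save_checkpoints_steps=1000,    # matches '_save_checkpoints_steps': 1000 in the logs
    keep_checkpoint_max=5)
```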

@lambdaji
Owner

You need to split the data yourself, and then in the code use the task index to decide which shard to read.
PS: change the line in the current code that uses glob.
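
A minimal sketch of that idea, selecting a slice of the globbed training files by the task index from TF_CONFIG (the file pattern and the every-Nth-file slicing are illustrative assumptions, not the repo's actual code):

```python
import glob
import json
import os

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task_index = tf_config.get('task', {}).get('index', 0)
num_workers = len(tf_config.get('cluster', {}).get('worker', [])) or 1

# Hypothetical layout: the training set pre-split into several CSV files.
all_files = sorted(glob.glob('/workspace/wlc/wide_deep_dist/data/train-*.csv'))

# Each worker reads only every num_workers-th file, offset by its own index,
# so the workers consume disjoint shards of the data.
my_files = all_files[task_index::num_workers]
```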
