[CAY-1189, 1171] Accept jobs with different configuration dynamically via client messaging channel #1191
The tests will fail because this PR uses a locally updated REEF API.
@JunhoeKim Great! I'll take a look!
…o job-launcher-impl
@@ -15,7 +15,7 @@

# EXAMPLE USAGE
# Classification
# ./run_gbt.sh -num_workers 2 -number_servers 1 -local true -input sample_gbt -max_num_eval_local 3 -test_data_path file://$(PWD)/sample_gbt_test -max_num_epochs 50 -mini_batch_size 54 -num_worker_blocks 10 -init_step_size 0.1 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -timeout 200000 -num_trainer_threads 2 -optimizer edu.snu.cay.dolphin.async.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -metadata_path file://$(PWD)/sample_gbt.meta -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
Please add a GitHub comment explaining what has been changed, especially when the change is not related to this PR.
Ah, you've changed PWD to pwd, right?
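For readers following the PWD/pwd exchange, here is a minimal sketch of why the spelling matters in a POSIX-style shell. `$PWD` is a shell variable and `$(pwd)` runs the `pwd` command, so both expand to the current directory, while `$(PWD)` tries to run a command literally named `PWD`. The variable names are illustrative only, and the behaviour of `$(PWD)` is an assumption about a case-sensitive filesystem; on a case-insensitive one it may happen to resolve to `pwd`.

```shell
#!/bin/sh
# Two equivalent ways to build the file:// URL used in the example usage lines.
from_variable="file://$PWD/sample_gbt_test"      # shell variable
from_command="file://$(pwd)/sample_gbt_test"     # command substitution

# $(PWD) runs a command named PWD. On a case-sensitive filesystem that
# command normally does not exist, so the substitution expands to nothing.
# The error message is discarded and the failure is ignored on purpose.
from_typo="file://$( { PWD; } 2>/dev/null || true )/sample_gbt_test"

echo "variable: $from_variable"
echo "command:  $from_command"
echo "typo:     $from_typo"
```

Either `$PWD` or `$(pwd)` is a reasonable replacement in the usage comments; the review settles on `pwd`.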
@@ -14,7 +14,7 @@
# limitations under the License.

# EXAMPLE USAGE
# ./run_lasso.sh -max_num_epochs 300 -num_workers 4 -number_servers 1 -mini_batch_size 50 -num_blocks_per_worker 9 -features 10 -max_num_eval_local 5 -input sample_lasso -test_data_path file://$(PWD)/sample_lasso_test -local true -lambda 0.5 -timeout 200000 -step_size 0.1 -decay_rate 0.95 -decay_period 5 -features_per_partition 2 -optimizer edu.snu.cay.dolphin.async.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
Please add a GitHub comment explaining what has been changed, especially when the change is not related to this PR.
@@ -14,7 +14,7 @@
# limitations under the License.

# EXAMPLE USAGE
# ./run_mlr.sh -num_workers 4 -number_servers 2 -local true -input sample_mlr -max_num_eval_local 6 -test_data_path file://$(PWD)/sample_mlr_test -max_num_epochs 20 -mini_batch_size 54 -num_worker_blocks 10 -init_step_size 0.1 -classes 10 -features 784 -features_per_partition 392 -model_gaussian 0.001 -lambda 0.005 -timeout 200000 -decay_period 5 -decay_rate 0.9 -num_trainer_threads 1 -optimizer edu.snu.cay.dolphin.async.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
Please add a GitHub comment explaining what has been changed, especially when the change is not related to this PR.
Oh, I got it. Thanks for the advice:)
@JunhoeKim I've just confirmed that all apps run well in both multi-job and stand-alone mode. Thanks!
dolphin/async/bin/submit_gbt.sh
Outdated

# EXAMPLE USAGE
# Classification
# ./submit_gbt.sh -num_workers 2 -number_servers 1 -input file://$(PWD)/sample_gbt -test_data_path file://$(PWD)/sample_gbt_test -max_num_epochs 50 -mini_batch_size 54 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -num_trainer_threads 2 -metadata_path file://$(PWD)/sample_gbt.meta -server_metric_flush_period_ms 1000
PWD -> pwd
I've also confirmed that it works well in a YARN environment!
I've finished testing and refining the code.
Resolves #1189
Resolves #1171
This PR enables users to run a JobServer (start_jobserver.sh), submit jobs specified by a job configuration, and close the JobServer (stop_jobserver.sh).
It separates the configurations related to the JobServer from those related to a specific job. It also lets any user send a job command to the JobServer through shell scripts that use JobCommandSender.
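The workflow described above could be sketched as the dry run below, which only prints the commands rather than executing them. It assumes that start_jobserver.sh and stop_jobserver.sh sit next to submit_gbt.sh and are invoked without arguments here; any JobServer-level options they accept are not shown in this conversation and are left out. The `run` helper and the `steps` counter are hypothetical names added for illustration, and the job-level flags are copied from the submit_gbt.sh usage comment with `$(pwd)` substituted.

```shell
#!/bin/sh
# Dry-run sketch: record and print each command instead of executing it.
steps=0
run() {
  steps=$((steps + 1))
  printf '+ %s\n' "$*"
}

# 1. Start the JobServer once (server-level options omitted: not shown in this PR).
run ./start_jobserver.sh

# 2. Submit a job with its own job-level configuration.
run ./submit_gbt.sh -num_workers 2 -number_servers 1 \
  -input "file://$(pwd)/sample_gbt" \
  -test_data_path "file://$(pwd)/sample_gbt_test" \
  -max_num_epochs 50 -mini_batch_size 54 -num_worker_blocks 10 \
  -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 \
  -leaf_min_size 3 -num_keys 5 -num_trainer_threads 2 \
  -metadata_path "file://$(pwd)/sample_gbt.meta" \
  -server_metric_flush_period_ms 1000

# 3. Close the JobServer when no more jobs will be submitted.
run ./stop_jobserver.sh
```

Because the server is started separately, step 2 could be repeated with a different submit script or different flags before step 3, which is the point of keeping job configuration apart from JobServer configuration.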