[CAY-1189, 1171] Accept jobs with different configuration dynamically via client messaging channel #1191

Merged
merged 62 commits into master from job-launcher-impl
Jul 19, 2017

Conversation

JunhoeKim
Contributor

@JunhoeKim JunhoeKim commented Jun 23, 2017

Resolves #1189
Resolves #1171

This PR enables a user to start a JobServer (start_jobserver.sh), submit jobs specified by a job configuration, and shut down the JobServer (stop_jobserver.sh).

It separates the configuration of the JobServer from the configuration of a specific job. It also lets any user send a job command to the JobServer through shell scripts that use JobCommandSender.

@JunhoeKim
Contributor Author

JunhoeKim commented Jun 23, 2017

Tests will fail because this PR uses a locally updated REEF API, parsedSevletRequest.getParameter(String key), to read a parameter from the request body. JobServerHttpHandler uses it to obtain the serialized job configuration.

@wynot12
Contributor

wynot12 commented Jun 23, 2017

@JunhoeKim Great! I'll take a look!
Thanks for your work.

@wynot12 wynot12 self-requested a review June 23, 2017 11:02
@wynot12 wynot12 changed the title [CAY-1189] Accept jobs with different configuration dinamically using REST API [CAY-1189] Accept jobs with different configuration dynamically using REST API Jul 2, 2017
@JunhoeKim JunhoeKim changed the title [CAY-1189] Accept jobs with different configuration dynamically using REST API [CAY-1189] Accept jobs with different configuration dynamically using WAKE Transport Jul 13, 2017
@JunhoeKim JunhoeKim changed the title [CAY-1189, 1171] Accept jobs with different configuration dynamically using WAKE Transport [CAY-1189, 1171] Accept jobs with different configuration dynamically via client messaging channel Jul 18, 2017
@@ -15,7 +15,7 @@

# EXAMPLE USAGE
# Classification
# ./run_gbt.sh -num_workers 2 -number_servers 1 -local true -input sample_gbt -max_num_eval_local 3 -test_data_path file://$(PWD)/sample_gbt_test -max_num_epochs 50 -mini_batch_size 54 -num_worker_blocks 10 -init_step_size 0.1 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -timeout 200000 -num_trainer_threads 2 -optimizer edu.snu.cay.dolphin.async.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -metadata_path file://$(PWD)/sample_gbt.meta -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
Contributor

@wynot12 wynot12 Jul 18, 2017

Please add a GitHub comment describing what has been changed, especially when the change is not related to this PR.

Contributor

@wynot12 wynot12 Jul 18, 2017

Ah, you've changed PWD to pwd, right?
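
For context on the PWD/pwd question, here is a minimal sketch of how the three spellings behave in a POSIX shell. The variable names are hypothetical, and `sample_gbt_test` is simply the file name from the example usage above. On a case-sensitive filesystem, `$(PWD)` tries to run a command named `PWD`, which normally does not exist, so the resulting path loses its directory part. On a case-insensitive filesystem (the macOS default) it may resolve to `pwd` and appear to work.

```shell
#!/bin/sh
# $(pwd) is command substitution: it runs `pwd` and captures its output.
dir_from_cmd=$(pwd)

# $PWD is a shell variable that holds the current working directory.
dir_from_var=$PWD

# $(PWD) is also command substitution, but for a command named `PWD`.
# Command names are case-sensitive, so this usually fails and yields an
# empty string; the error message is discarded and the failure ignored.
dir_from_upper=$(PWD 2>/dev/null) || true

# Build the kind of file:// URL used by -test_data_path in the usage line.
echo "file://${dir_from_cmd}/sample_gbt_test"
echo "file://${dir_from_var}/sample_gbt_test"
echo "file://${dir_from_upper}/sample_gbt_test"
```

Both `$(pwd)` and `$PWD` are reasonable choices in the usage comments; the sketch only illustrates why the uppercase command-substitution form is fragile.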

@@ -14,7 +14,7 @@
# limitations under the License.

# EXAMPLE USAGE
# ./run_lasso.sh -max_num_epochs 300 -num_workers 4 -number_servers 1 -mini_batch_size 50 -num_blocks_per_worker 9 -features 10 -max_num_eval_local 5 -input sample_lasso -test_data_path file://$(PWD)/sample_lasso_test -local true -lambda 0.5 -timeout 200000 -step_size 0.1 -decay_rate 0.95 -decay_period 5 -features_per_partition 2 -optimizer edu.snu.cay.dolphin.async.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
Contributor

@wynot12 wynot12 Jul 18, 2017

Please add a GitHub comment describing what has been changed, especially when the change is not related to this PR.

@@ -14,7 +14,7 @@
# limitations under the License.

# EXAMPLE USAGE
# ./run_mlr.sh -num_workers 4 -number_servers 2 -local true -input sample_mlr -max_num_eval_local 6 -test_data_path file://$(PWD)/sample_mlr_test -max_num_epochs 20 -mini_batch_size 54 -num_worker_blocks 10 -init_step_size 0.1 -classes 10 -features 784 -features_per_partition 392 -model_gaussian 0.001 -lambda 0.005 -timeout 200000 -decay_period 5 -decay_rate 0.9 -num_trainer_threads 1 -optimizer edu.snu.cay.dolphin.async.optimizer.impl.EmptyPlanOptimizer -optimization_interval_ms 3000 -delay_after_optimization_ms 10000 -opt_benefit_threshold 0.1 -server_metric_flush_period_ms 1000 -moving_avg_window_size 0 -metric_weight_factor 0.0 -num_initial_batch_metrics_to_skip 3 -min_num_required_batch_metrics 3
Contributor

@wynot12 wynot12 Jul 18, 2017

Please add a GitHub comment describing what has been changed, especially when the change is not related to this PR.

Contributor Author

Oh, I got it. Thanks for the advice:)

@wynot12
Contributor

wynot12 commented Jul 18, 2017

@JunhoeKim I've just confirmed that all apps run well in both multi-job and stand-alone mode.
I'll let you know after adding minor patches and testing it in a YARN environment.

Thanks!


# EXAMPLE USAGE
# Classification
# ./submit_gbt.sh -num_workers 2 -number_servers 1 -input file://$(PWD)/sample_gbt -test_data_path file://$(PWD)/sample_gbt_test -max_num_epochs 50 -mini_batch_size 54 -num_worker_blocks 10 -features 784 -lambda 0 -gamma 0 -max_depth_of_tree 5 -leaf_min_size 3 -num_keys 5 -num_trainer_threads 2 -metadata_path file://$(PWD)/sample_gbt.meta -server_metric_flush_period_ms 1000
Contributor

PWD -> pwd

@wynot12
Contributor

wynot12 commented Jul 18, 2017

I've also confirmed that it works well in a YARN environment!
Great!

@wynot12
Contributor

wynot12 commented Jul 18, 2017

I've finished testing and refining the code.
I'll merge this PR after double-checking with @JunhoeKim tomorrow.

@JunhoeKim JunhoeKim merged commit 95ffbe9 into master Jul 19, 2017
@JunhoeKim JunhoeKim deleted the job-launcher-impl branch July 19, 2017 10:09
@wynot12
Contributor

wynot12 commented Aug 7, 2017

I'm attaching a figure of the client-side architecture.

image

Development

Successfully merging this pull request may close these issues.

Accept jobs with different configurations
Support multi-job of Dolphin
2 participants