This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Kernel operator tuning #8686

Merged
merged 5 commits into from
Nov 21, 2017

@cjolivier01 cjolivier01 commented Nov 16, 2017

@piiswrong

Description

Automatic OMP operator tuning based on kernel operation workload.
Determines the "weight" of a unary or binary kernel op, then uses it to decide whether OMP should be used, given the number of iterations required and the number of threads available to perform the job.
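
A minimal sketch of the kind of decision described above, assuming a simple cost model that this description does not spell out: serial cost is N times the per-element op cost, and the OMP cost is a fixed overhead plus the serial cost divided across threads. The name `ShouldUseOMP` and its parameters are hypothetical, not the PR's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cost model (an assumption, not taken from the PR):
//   serial cost   = n * op_cost_ns
//   parallel cost = omp_overhead_ns + n * op_cost_ns / thread_count
// Returns true when the parallel estimate is strictly cheaper.
inline bool ShouldUseOMP(std::size_t n, std::size_t thread_count,
                         double op_cost_ns, double omp_overhead_ns) {
  assert(op_cost_ns >= 0.0 && omp_overhead_ns >= 0.0);
  if (thread_count < 2) return false;  // a single thread cannot split the work
  const double serial_ns = static_cast<double>(n) * op_cost_ns;
  const double parallel_ns =
      omp_overhead_ns + serial_ns / static_cast<double>(thread_count);
  return parallel_ns < serial_ns;
}
```

Under this model, `ShouldUseOMP(100, 8, 1.5, 8500.0)` is false, since 150 ns of serial work cannot amortize an 8500 ns overhead.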

Decision accuracy is tested in the gtest OMP_TUNING test suite by comparing the times measured with OMP, without OMP, and in Auto mode.

For example:

  • AWS c4.8xlarge (36 vCores):
    Success rate for type float: 0.90278
    Success rate for type double: 0.88889
    Success rate for type mshadow::half::half_t: 0.83333
    Success rate for type unsigned char: 0.86111
    Success rate for type int: 0.95833
    Success rate for type long: 0.88889

  • desktop: 12-core (6 real CPU cores + hyperthreading)
    Success rate for type float: 0.79167
    Success rate for type double: 0.75000
    Success rate for type unsigned char: 0.72222
    Success rate for type int: 0.94444
    Success rate for type long: 1.00000
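
One way a success rate like the ones above could be computed, sketched under the assumption that a trial counts as a success when the Auto decision matches whichever of the two measured times was lower. The `Trial` struct and `SuccessRate` function are hypothetical; the actual criterion used by the OMP_TUNING tests (for instance, any tolerance for near-ties) is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of one timing comparison.
struct Trial {
  double with_omp_ns;     // measured time with OMP forced on
  double without_omp_ns;  // measured time with OMP forced off
  bool auto_chose_omp;    // what the tuner decided for this workload
};

// Fraction of trials where the tuner picked the faster of the two modes.
inline double SuccessRate(const std::vector<Trial>& trials) {
  assert(!trials.empty());
  std::size_t ok = 0;
  for (const Trial& t : trials) {
    const bool omp_was_faster = t.with_omp_ns < t.without_omp_ns;
    if (t.auto_chose_omp == omp_was_faster) ++ok;
  }
  return static_cast<double>(ok) / static_cast<double>(trials.size());
}
```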

A sample output from the OMP_TUNING tests, including statistical data:
tune_all.txt

Currently autotuned kernel operators (tuning at startup takes a total of ~ 3ms):

mxnet::op::PopulateFullIdxRspKernel
mxnet::op::mxnet_op::set_to_int<0>
mxnet::op::mshadow_op::smooth_l1_gradient
mxnet::op::mshadow_op::smooth_l1_loss
mxnet::op::mshadow_op::eq
mxnet::op::mshadow_op::ne
mxnet::op::mshadow_op::le
mxnet::op::mshadow_op::lt
mxnet::op::mshadow_op::hypot_grad_right
mxnet::op::mshadow_op::hypot_grad_left
mxnet::op::mshadow_op::hypot
mxnet::op::mshadow_op::arctanh_grad
mxnet::op::mshadow_op::arctan_grad
mxnet::op::mshadow_op::cosh
mxnet::op::mshadow_op::rpower
mxnet::op::mshadow_op::minimum
mxnet::op::mshadow_op::arctan
mxnet::op::mshadow_op::reciprocal_square_root
mxnet::op::mshadow_op::rminus
mxnet::op::mshadow_op::arccosh_grad
mxnet::op::mshadow_op::square_root_grad
mxnet::op::mshadow_op::arctanh
mxnet::op::mshadow_op::floor
mxnet::op::mshadow_op::cosh_grad
mxnet::op::mshadow_op::ceil
mxnet::op::mshadow_op::cos_grad
mxnet::op::mshadow_op::reciprocal_cube_root_grad
mxnet::op::mshadow_op::arcsinh_grad
mxnet::op::mshadow_op::sin
mxnet::op::mshadow_op::arcsin
mxnet::op::mshadow_op::log10_grad
mxnet::op::mshadow_op::log1p_grad
mxnet::op::mshadow_op::mod_grad
mxnet::op::mshadow_op::arccos_grad
mxnet::op::mshadow_op::exp
mxnet::op::mshadow_op::tanh_grad
mxnet::op::mshadow_op::log1p
mxnet::op::mshadow_op::rint
mshadow::op::minus
mxnet::op::mshadow_op::relu_grad
mxnet::op::mshadow_op::identity
mxnet::op::mshadow_op::maximum
mxnet::op::mshadow_op::reciprocal_grad
mshadow::op::div
mxnet::op::mshadow_op::rmod_grad
mxnet::op::mshadow_op::arcsin_grad
mxnet::op::mshadow_op::ge
mxnet::op::mshadow_op::gammaln_grad
mxnet::op::mshadow_op::sigmoid
mxnet::op::mshadow_op::power_rgrad
mxnet::op::mshadow_op::identity_grad
mxnet::op::mshadow_op::tan
mxnet::op::mshadow_op::gamma
mxnet::op::mshadow_op::arcsinh
mshadow::op::identity
mxnet::op::mshadow_op::square_root
mxnet::op::mshadow_op::reciprocal_square_root_grad
mxnet::op::mshadow_op::cos
mxnet::op::mshadow_op::log2
mxnet::op::mshadow_op::tanh
mxnet::op::mshadow_op::arccosh
mxnet::op::mshadow_op::negation
mxnet::op::mshadow_op::log10
mxnet::op::mshadow_op::cube_root_grad
mxnet::op::mshadow_op::expm1
mxnet::op::mshadow_op::arccos
mxnet::op::mshadow_op::rmod
mxnet::op::mshadow_op::softrelu_grad
mxnet::op::mshadow_op::sinh
mxnet::op::mshadow_op::log_grad
mxnet::op::mshadow_op::sin_grad
mxnet::op::mshadow_op::rdiv_grad
mxnet::op::mshadow_op::log
mxnet::op::mshadow_op::softrelu
mxnet::op::mshadow_op::square_grad
mxnet::op::mshadow_op::log2_grad
mxnet::op::mshadow_op::cube_root
mxnet::op::mshadow_op::reciprocal_cube_root
mxnet::op::mshadow_op::sign
mxnet::op::mshadow_op::square
mxnet::op::mshadow_op::sign_grad
mxnet::op::mshadow_op::round
mxnet::op::mshadow_op::trunc
mxnet::op::mshadow_op::mod_rgrad
mxnet::op::mshadow_op::reciprocal
mxnet::op::mshadow_op::fix
mxnet::op::mshadow_op::gamma_grad
mxnet::op::mshadow_op::gammaln
mxnet::op::mshadow_op::degrees
mshadow::op::right
mxnet::op::mshadow_op::sinh_grad
mxnet::op::mshadow_op::degrees_grad
mshadow::op::plus
mxnet::op::mshadow_op::radians
mxnet::op::mshadow_op::sigmoid_grad
mxnet::op::mshadow_op::radians_grad
mxnet::op::mshadow_op::gt
mxnet::op::mshadow_op::mod
mshadow::op::mul
mxnet::op::mshadow_op::rdiv
mxnet::op::mshadow_op::tan_grad
mxnet::op::mshadow_op::div_grad
mxnet::op::mshadow_op::div_rgrad
mxnet::op::mshadow_op::left
mxnet::op::mshadow_op::right
mxnet::op::mshadow_op::power
mxnet::op::mshadow_op::power_grad
mxnet::op::mshadow_op::relu
mxnet::op::mshadow_op::abs
mxnet::op::mshadow_op::rpower_grad
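
A rough sketch of how a per-operator cost might be sampled at startup, assuming a plain timed loop over a small buffer. `MeasureNsPerOp` is a hypothetical helper; the PR's actual measurement code is not shown in this thread and may differ (data types, iteration counts, warm-up).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: average wall-clock nanoseconds per call of `op`,
// measured over `n` float elements.
template <typename Op>
double MeasureNsPerOp(Op op, std::size_t n) {
  assert(n > 0);
  std::vector<float> in(n), out(n);
  for (std::size_t i = 0; i < n; ++i) {
    in[i] = 0.001f * static_cast<float>(i % 1000);
  }
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < n; ++i) out[i] = op(in[i]);
  const auto stop = std::chrono::steady_clock::now();
  // Read a result through a volatile to discourage the compiler
  // from discarding the timed loop.
  volatile float sink = out[n - 1];
  (void)sink;
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(n);
}
```

A call such as `MeasureNsPerOp([](float x) { return x * x; }, 4096)` returns a machine-dependent figure, so it is only meaningful relative to other ops measured the same way.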

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

This was referenced Nov 16, 2017
@@ -249,8 +348,9 @@ void BinaryBroadcastBackwardUseIn(const nnvm::NodeAttrs& attrs,
 const std::vector<OpReqType>& req,
 const std::vector<TBlob>& outputs) {
 TShape new_lshape, new_rshape, new_oshape;
-bool need_bc = BinaryBroadcastShapeCompact(outputs[0].shape_, outputs[1].shape_, inputs[0].shape_,
-&new_lshape, &new_rshape, &new_oshape);
+const bool need_bc = BinaryBroadcastShapeCompact(outputs[0].shape_,
Member Author

@cjolivier01 cjolivier01 Nov 16, 2017

There was a problem converting to bool here

s, new_oshape.Size(), req[0], lstride, rstride, oshape,
inputs[0].dptr<DType>(), inputs[1].dptr<DType>(), outputs[0].dptr<DType>(),
inputs[0].Size(), inputs[1].Size());
mshadow::Shape<NDim> oshape = new_oshape.get<NDim>();
Member Author

Make my IDE stop complaining because it can't figure out the namespace

@cjolivier01 cjolivier01 force-pushed the bc_tune branch 6 times, most recently from 78844b7 to c8446f1 Compare November 18, 2017 00:11
inc(&coord, oshape, &lidx, lstride, &ridx, rstride);
KERNEL_ASSIGN(out[base+i], req, OP::Map(lhs[lidx], rhs[ridx]));
DType* out) {
if (req != kNullOp) {
Contributor

if req is null the kernel shouldn't be launched

Member Author

Currently in BinaryBroadcastCompute, there's no check for req -- kernel is called anyway. This is typical for most calls such as this (nnvm unary or binary ops). In some cases, it's done indirectly with the Req switch, but you said that wasn't worth the compile time+binary size.

Member Author

Added check to Compute call in the "remove broadcast" commit
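
The point raised in this thread, checking `req` before launching rather than inside the kernel, can be illustrated with a small sketch. The enum below mirrors MXNet's `OpReqType` values but is redefined locally so the example is self-contained; `LaunchIfRequested` is a hypothetical stand-in for a Compute call, not code from this PR.

```cpp
#include <cassert>
#include <cstddef>

// Local stand-in mirroring MXNet's OpReqType values.
enum OpReqType { kNullOp, kWriteTo, kWriteInplace, kAddTo };

// Hypothetical launcher: returns false without touching `out` when nothing
// is requested, so the per-element loop never has to test for kNullOp.
template <typename Op>
bool LaunchIfRequested(OpReqType req, std::size_t n,
                       const float* in, float* out, Op op) {
  if (req == kNullOp) return false;  // skip the kernel entirely
  for (std::size_t i = 0; i < n; ++i) {
    if (req == kAddTo) {
      out[i] += op(in[i]);  // accumulate into the existing output
    } else {
      out[i] = op(in[i]);   // kWriteTo / kWriteInplace overwrite
    }
  }
  return true;
}
```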

* \return true if OMP parallelization should be used for the N iterations
*/
template<typename ...Args>
static bool UseOMP(const size_t N, const size_t thread_count, OpReqType req, Args... args) {
Contributor

This looks wildly too complicated for a broadcasting kernel

Member Author

“Wildly”? Ok so won’t support broadcast. I’ll remove.

Member Author

broadcast support removed

cjolivier01 commented Nov 18, 2017

OMP overhead for this run: ~8500 ns.

Unary and binary op times: roughly 0.5-200 ns

OperatorTuneBase::duration_t OperatorTuneBase::omp_overhead_ns_ = 8495;
mshadow::op::identity time: 1.33008 ns
mxnet::op::mshadow_op::identity time: 1.32959 ns
mxnet::op::mshadow_op::identity_grad time: 1.22754 ns
mxnet::op::mshadow_op::negation time: 1.56836 ns
mxnet::op::mshadow_op::reciprocal time: 1.53467 ns
mxnet::op::mshadow_op::reciprocal_grad time: 2.48975 ns
mxnet::op::mshadow_op::sigmoid time: 29.3955 ns
mxnet::op::mshadow_op::sigmoid_grad time: 2.18262 ns
mxnet::op::mshadow_op::relu time: 1.77295 ns
mxnet::op::mshadow_op::relu_grad time: 2.14844 ns
mxnet::op::mshadow_op::tanh time: 38.1255 ns
mxnet::op::mshadow_op::tanh_grad time: 2.25098 ns
mxnet::op::mshadow_op::softrelu time: 51.7314 ns
mxnet::op::mshadow_op::softrelu_grad time: 33.5557 ns
mxnet::op::mshadow_op::exp time: 17.4258 ns
mxnet::op::mshadow_op::exp time: 18.1758 ns
mxnet::op::mshadow_op::expm1 time: 23.189 ns
mxnet::op::mshadow_op::log time: 25.1323 ns
mxnet::op::mshadow_op::log_grad time: 2.08008 ns
mxnet::op::mshadow_op::log1p time: 24.5527 ns
mxnet::op::mshadow_op::log1p_grad time: 2.52344 ns
mxnet::op::mshadow_op::log2 time: 24.1777 ns
mxnet::op::mshadow_op::log2_grad time: 2.0459 ns
mxnet::op::mshadow_op::log10 time: 26.4624 ns
mxnet::op::mshadow_op::log10_grad time: 2.08057 ns
mxnet::op::mshadow_op::sin time: 12.6177 ns
mxnet::op::mshadow_op::sin_grad time: 13.1626 ns
mxnet::op::mshadow_op::sinh time: 41.6372 ns
mxnet::op::mshadow_op::sinh_grad time: 29.5659 ns
mxnet::op::mshadow_op::arcsin time: 16.2319 ns
mxnet::op::mshadow_op::arcsin_grad time: 6.37695 ns
mxnet::op::mshadow_op::arcsinh time: 38.6709 ns
mxnet::op::mshadow_op::arcsinh_grad time: 15.4478 ns
mxnet::op::mshadow_op::cos time: 10.5376 ns
mxnet::op::mshadow_op::cos_grad time: 12.9243 ns
mxnet::op::mshadow_op::cosh time: 27.4175 ns
mxnet::op::mshadow_op::cosh_grad time: 41.7739 ns
mxnet::op::mshadow_op::arccos time: 19.2676 ns
mxnet::op::mshadow_op::arccos_grad time: 6.41064 ns
mxnet::op::mshadow_op::arccosh time: 15.2773 ns
mxnet::op::mshadow_op::arccosh_grad time: 19.4717 ns
mxnet::op::mshadow_op::tan time: 26.4966 ns
mxnet::op::mshadow_op::tan_grad time: 2.11426 ns
mxnet::op::mshadow_op::arctan time: 19.2329 ns
mxnet::op::mshadow_op::arctan_grad time: 3.00098 ns
mxnet::op::mshadow_op::arctanh time: 38.6025 ns
mxnet::op::mshadow_op::arctanh_grad time: 3.06934 ns
mxnet::op::mshadow_op::square time: 1.50049 ns
mxnet::op::mshadow_op::square_grad time: 1.90967 ns
mxnet::op::mshadow_op::square_root time: 13.1631 ns
mxnet::op::mshadow_op::square_root_grad time: 2.08008 ns
mxnet::op::mshadow_op::reciprocal_square_root time: 13.7085 ns
mxnet::op::mshadow_op::reciprocal_square_root_grad time: 16.062 ns
mxnet::op::mshadow_op::cube_root time: 38.6367 ns
mxnet::op::mshadow_op::cube_root_grad time: 3.00098 ns
mxnet::op::mshadow_op::reciprocal_cube_root time: 38.1255 ns
mxnet::op::mshadow_op::reciprocal_cube_root_grad time: 40.2393 ns
mxnet::op::mshadow_op::abs time: 1.50049 ns
mxnet::op::mshadow_op::sign time: 3.54639 ns
mxnet::op::mshadow_op::sign time: 4.09229 ns
mxnet::op::mshadow_op::sign_grad time: 1.53467 ns
mxnet::op::mshadow_op::round time: 7.74121 ns
mxnet::op::mshadow_op::floor time: 5.25146 ns
mxnet::op::mshadow_op::trunc time: 6.20654 ns
mxnet::op::mshadow_op::rint time: 9.10498 ns
mxnet::op::mshadow_op::fix time: 10.3667 ns
mxnet::op::mshadow_op::gamma time: 107.555 ns
mxnet::op::mshadow_op::gamma_grad time: 208.938 ns
mxnet::op::mshadow_op::gammaln time: 74.5112 ns
mxnet::op::mshadow_op::gammaln_grad time: 95.3125 ns
mxnet::op::mshadow_op::ceil time: 4.43311 ns
mxnet::op::mshadow_op::degrees time: 1.50049 ns
mxnet::op::mshadow_op::degrees_grad time: 1.50049 ns
mxnet::op::mshadow_op::radians time: 1.50049 ns
mxnet::op::mshadow_op::radians_grad time: 1.50049 ns
mshadow::op::plus time: 1.53467 ns
mshadow::op::minus time: 1.53467 ns
mshadow::op::mul time: 1.53467 ns
mshadow::op::div time: 1.63672 ns
mshadow::op::right time: 1.32959 ns
mxnet::op::mshadow_op::rminus time: 1.80762 ns
mxnet::op::mshadow_op::rdiv time: 1.84131 ns
mxnet::op::mshadow_op::div_grad time: 1.53467 ns
mxnet::op::mshadow_op::div_grad time: 1.90967 ns
mxnet::op::mshadow_op::div_rgrad time: 2.28467 ns
mxnet::op::mshadow_op::div_rgrad time: 2.83057 ns
mxnet::op::mshadow_op::rdiv_grad time: 2.62598 ns
mxnet::op::mshadow_op::mod time: 41.8423 ns
mxnet::op::mshadow_op::mod_grad time: 1.26172 ns
mxnet::op::mshadow_op::mod_rgrad time: 4.97852 ns
mxnet::op::mshadow_op::rmod time: 42.5581 ns
mxnet::op::mshadow_op::rmod_grad time: 5.0127 ns
mxnet::op::mshadow_op::left time: 1.19336 ns
mxnet::op::mshadow_op::left time: 1.53418 ns
mxnet::op::mshadow_op::right time: 1.22803 ns
mxnet::op::mshadow_op::right time: 1.50049 ns
mxnet::op::mshadow_op::power time: 71.7827 ns
mxnet::op::mshadow_op::rpower time: 68.4067 ns
mxnet::op::mshadow_op::power_grad time: 70.9302 ns
mxnet::op::mshadow_op::rpower_grad time: 22.0293 ns
mxnet::op::mshadow_op::power_rgrad time: 91.1865 ns
mxnet::op::mshadow_op::maximum time: 1.56836 ns
mxnet::op::mshadow_op::minimum time: 1.53467 ns
mxnet::op::mshadow_op::hypot time: 25.5757 ns
mxnet::op::mshadow_op::hypot_grad_left time: 19.3013 ns
mxnet::op::mshadow_op::hypot_grad_left time: 13.8789 ns
mxnet::op::mshadow_op::hypot_grad_right time: 13.1973 ns
mxnet::op::mshadow_op::hypot_grad_right time: 13.231 ns
mxnet::op::mshadow_op::lt time: 1.77295 ns
mxnet::op::mshadow_op::lt time: 2.5918 ns
mxnet::op::mshadow_op::le time: 1.73926 ns
mxnet::op::mshadow_op::le time: 2.62598 ns
mxnet::op::mshadow_op::gt time: 2.5918 ns
mxnet::op::mshadow_op::gt time: 2.08008 ns
mxnet::op::mshadow_op::ge time: 2.01172 ns
mxnet::op::mshadow_op::ge time: 2.55762 ns
mxnet::op::mshadow_op::ne time: 2.11426 ns
mxnet::op::mshadow_op::ne time: 2.62598 ns
mxnet::op::mshadow_op::eq time: 2.11426 ns
mxnet::op::mshadow_op::eq time: 2.62598 ns
mxnet::op::mshadow_op::smooth_l1_loss time: 4.29639 ns
mxnet::op::mshadow_op::smooth_l1_gradient time: 3.85352 ns
mxnet::op::mxnet_op::set_to_int<0> time: 1.09131 ns
mxnet::op::mxnet_op::set_to_int<1> time: 0.647949 ns
mxnet::op::PopulateFullIdxRspKernel time: 0.647949 ns
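
To put these figures in perspective, here is a worked break-even estimate under an assumed cost model (serial cost N·t versus overhead + N·t/T for T threads). Solving N·t > overhead + N·t/T gives N > overhead / (t·(1 − 1/T)). The function name `BreakEvenN` and the thread count of 8 used in the note below are illustrative assumptions, not values from the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Smallest element count at which the assumed parallel cost
// (overhead + n * t / T) drops strictly below the serial cost (n * t).
inline std::size_t BreakEvenN(double op_cost_ns, double omp_overhead_ns,
                              std::size_t thread_count) {
  assert(op_cost_ns > 0.0 && omp_overhead_ns >= 0.0 && thread_count >= 2);
  // Time saved per element by spreading the work across the threads.
  const double saved_per_item =
      op_cost_ns * (1.0 - 1.0 / static_cast<double>(thread_count));
  return static_cast<std::size_t>(
             std::floor(omp_overhead_ns / saved_per_item)) + 1;
}
```

With the 8495 ns overhead above and 8 threads, a ~1.5 ns op such as `mshadow::op::plus` would need several thousand elements before OMP pays off under this model, while a ~209 ns op such as `gamma_grad` would need only a few dozen.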

@cjolivier01 cjolivier01 changed the title Kernel operator tuning including broadcast Kernel operator tuning Nov 18, 2017
@cjolivier01
Member Author

Resetting commit in order to cleanly remove broadcast

@cjolivier01 cjolivier01 force-pushed the bc_tune branch 6 times, most recently from 296e502 to 56acc81 Compare November 20, 2017 15:52
@cjolivier01 cjolivier01 merged commit 068b589 into apache:master Nov 21, 2017
eric-haibin-lin pushed a commit to eric-haibin-lin/mxnet that referenced this pull request Dec 3, 2017
* Refreshed branch bc_tune

* local-build openmp as static

* trigger

* Somehow broadcast found its way back in, removed again

* Trigger rebuild
zhreshold pushed a commit to zhreshold/mxnet that referenced this pull request Dec 14, 2017
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018