Make AutoScheduler handling of errors during measure consistent with AutoTvm #6909

Merged
merged 4 commits into apache:main on Nov 17, 2020

Conversation

TaylorZowtuk
Contributor

While running scripts using both AutoScheduler and AutoTvm to consecutively search for schedules for a number of operators/shapes, I observed different behaviors during measurement following the output “Too many errors happened during tuning.”

After looking into the code, I determined that the difference comes from how AutoScheduler and AutoTvm handle the case where the number of accumulated errors during measurement exceeds some threshold.

I observed that while using AutoTvm, the program would switch to debug-level logging and continue the search.

Too many errors happen in the tuning. Now is in debug mode
No: 217	GFLOPS: 0.00/0.00	result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (5) /home/tanvir/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7fd9b685ee13]\n  [bt] (4) /home/tanvir/tvm/build/libtvm.so(+0x1309037) [0x7fd9b68c8037]\n  [bt] (3) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x3fa) [0x7fd9b68cc86a]\n  [bt] (2) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCClientSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)+0x57) [0x7fd9b68c0217]\n  [bt] (1) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCEndpoint::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)>)+0x6bd) [0x7fd9b68b546d]\n  [bt] (0) /home/tanvir/tvm/build/libtvm.so(+0x12f3668) [0x7fd9b68b2668]\n  File "/home/tanvir/tvm/src/runtime/rpc/rpc_endpoint.cc", line 807\nTVMError: Check failed: code == RPCCode: :kReturn: code=1'),), error_no=4, all_cost=10.765872716903687, timestamp=1604092331.4940712)	[('tile_f', [-1, 16]), ('tile_y', [-1, 2]), ('tile_x', [-1, 2]), ('tile_z', [-1, 16])],None,1719
…
<continues>

While using AutoScheduler, the program would crash after throwing an uncaught error.

Traceback (most recent call last):
  …
  File "runner.py", line 124, in fig_6
    m = run_operator(
  File "runner.py", line 58, in run_operator
    sch, args = auto_scheduler.auto_schedule(task, tuning_options=tune_option)
  File "/home/taylor/tvm/python/tvm/auto_scheduler/auto_schedule.py", line 213, in auto_schedule
    sch, tensors = _ffi_api.AutoSchedule(search_policy, tuning_options)
  File "/home/taylor/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /home/taylor/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7f11e187d7b3]
  [bt] (4) /home/taylor/tvm/build/libtvm.so(+0x6965ab) [0x7f11e0c755ab]
  [bt] (3) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)+0x11a) [0x7f11e0c74cca]
  [bt] (2) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, tvm::auto_scheduler::ProgramMeasurer)+0x760) [0x7f11e0cfb3d0]
  [bt] (1) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::ProgramMeasurerNode::Measure(tvm::auto_scheduler::SearchTask const&, tvm::auto_scheduler::SearchPolicy const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureInput, void> const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureResult, void>*, int)+0x11ed) [0x7f11e0cd7b2d]
  [bt] (0) /home/taylor/tvm/build/libtvm.so(+0x6f4af8) [0x7f11e0cd3af8]
  File "/home/taylor/tvm/src/auto_scheduler/measure.cc", line 268
TVMError: Too many errors happened during tuning

In my particular case, AutoScheduler crashing rather than continuing the search meant that my script terminated prematurely, even though it might have recovered from whatever was causing the errors.
In addition, it was unclear to me why this behavior occurred only in AutoScheduler and not in AutoTvm. This discrepancy can be confusing to new users who want to explore both methods of schedule search. This PR proposes bringing the AutoScheduler handling of errors during measurement in line with AutoTvm.

By removing the LOG(FATAL) and changing the verbosity for AutoScheduler in the same way we change the logging level in AutoTvm, the two programs behave the same. In addition, I changed the default verbosity of AutoScheduler to 0 (silent) in order to match the default logging level of AutoTvm.
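To make the proposed change concrete, here is a minimal, self-contained C++ sketch contrasting the two policies. It is not the actual TVM code (in TVM the check lives in ProgramMeasurerNode::Measure in src/auto_scheduler/measure.cc, the function visible in the traceback above); the threshold value and the names kMaxContinuousError, HandleErrorsFatal, HandleErrorsContinue, and verbose are illustrative assumptions.

// A minimal, self-contained sketch (not the actual TVM source) contrasting the two
// error-handling policies. The threshold value and identifiers below are
// illustrative assumptions, not the exact names touched by this PR.
#include <cstdlib>
#include <iostream>

constexpr int kMaxContinuousError = 150;  // assumed threshold, for illustration only

// Old behavior: once the threshold is crossed, abort the whole search
// (the LOG(FATAL) that produced the TVMError traceback above).
void HandleErrorsFatal(int error_ct) {
  if (error_ct > kMaxContinuousError) {
    std::cerr << "Too many errors happened during tuning\n";
    std::abort();
  }
}

// Proposed behavior: warn once, silence per-measurement output, and keep
// searching, mirroring AutoTvm's switch to debug-level logging.
void HandleErrorsContinue(int error_ct, int* verbose) {
  if (error_ct > kMaxContinuousError && *verbose > 0) {
    std::cerr << "Too many errors happened during tuning. Switching to debug mode.\n";
    *verbose = 0;  // silent from now on, but the search continues
  }
}

int main() {
  int verbose = 1;
  HandleErrorsContinue(151, &verbose);  // pretend we just exceeded the threshold
  std::cout << "search continues, verbose=" << verbose << "\n";
  return 0;
}

With the second policy, the warning is printed once, per-measurement output goes silent, and the search loop keeps running, matching the AutoTvm behavior shown in the log excerpt earlier.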

@TaylorZowtuk
Contributor Author

@tqchen @merrymercy Thoughts and review please?

@tqchen
Member

tqchen commented Nov 12, 2020

cc @jcf94 @merrymercy

Member

@merrymercy left a comment


How do you hit this part of the code? Generally, it means you have some fatal errors in the code.
It is very rare to recover from a case where you have so many continuous errors.

@merrymercy
Member

merrymercy commented Nov 15, 2020

Please rebase and fix the CI error.

@TaylorZowtuk
Contributor Author

TaylorZowtuk commented Nov 16, 2020

> How do you hit this part of the code? Generally, it means you have some fatal errors in the code.
> It is very rare to recover from a case where you have so many continuous errors.

I'm not entirely certain what causes us to hit this condition. In our case, we observed from the AutoTvm debug prints that it was due to error_no=4, which is a RUNTIME_DEVICE error (as you can see from the excerpt of the AutoTvm log I included previously). Hitting this condition happened very intermittently: we could run a particular op/shape once and hit it, and without changing anything it would work the next time. In addition, one op/shape reaching this condition didn't mean the rest of the op/shapes we were running in the same script would fail, meaning the system overall was able to recover. I think the main issue is that by terminating the program as soon as we meet this condition, we don't allow any chance to recover, and we also lose the useful, precise feedback about which error we are hitting while using the auto_scheduler.
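For reference, the error_no values in those MeasureResult lines come from the measurement error-code enums (MeasureErrorNo in autotvm, MeasureErrorNO in auto_scheduler). The sketch below restates the numbering as I recall it; only error_no=4 == RUNTIME_DEVICE is confirmed by the log above, so treat the other names and values as assumptions and check the TVM measure modules for the authoritative definitions.

// Hypothetical restatement of the measurement error codes, for quick reference only.
// Only error_no=4 == RUNTIME_DEVICE is confirmed by the log excerpt above.
#include <iostream>

enum class MeasureError : int {
  kNoError = 0,
  kInstantiationError = 1,
  kCompileHostError = 2,
  kCompileDeviceError = 3,
  kRuntimeDeviceError = 4,  // error_no=4 in the log: the kernel failed while running on the device
  kWrongAnswerError = 5,
  kBuildTimeoutError = 6,
  kRunTimeoutError = 7,
};

int main() {
  std::cout << "error_no=" << static_cast<int>(MeasureError::kRuntimeDeviceError)
            << " -> RUNTIME_DEVICE\n";
  return 0;
}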

I'll rebase and try to fix the CI issue.

@merrymercy merged commit 6c01998 into apache:main on Nov 17, 2020
@merrymercy
Member

Thanks, @TaylorZowtuk. It is merged.

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
…AutoTvm (apache#6909)

* Match ansor handling of 'too many errors' during measure to that of autoTVM and match default level of logging

* Set correct level of verbosity for debug mode

Co-authored-by: Lianmin Zheng <[email protected]>

* Lint

* trigger CI

Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Taylor Zowtuk 84152750 <[email protected]>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020