Make AutoScheduler handling of errors during measure consistent with AutoTvm #6909

Merged
merged 4 commits into apache:main on Nov 17, 2020

Conversation

TaylorZowtuk
Contributor

While running scripts using both AutoScheduler and AutoTvm to consecutively search for schedules for a number of operators/shapes, I observed different behaviors during measurement following the output “Too many errors happened during tuning.”

After looking into the code, I determined that the difference comes from how AutoScheduler and AutoTvm handle the case where the number of accumulated errors during measurement exceeds some threshold.

I observed that while using AutoTvm, the program would switch to debug-level logging and continue the search.

Too many errors happen in the tuning. Now is in debug mode
No: 217	GFLOPS: 0.00/0.00	result: MeasureResult(costs=(RuntimeError('Traceback (most recent call last):\n  [bt] (5) /home/tanvir/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7fd9b685ee13]\n  [bt] (4) /home/tanvir/tvm/build/libtvm.so(+0x1309037) [0x7fd9b68c8037]\n  [bt] (3) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x3fa) [0x7fd9b68cc86a]\n  [bt] (2) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCClientSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)+0x57) [0x7fd9b68c0217]\n  [bt] (1) /home/tanvir/tvm/build/libtvm.so(tvm::runtime::RPCEndpoint::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)>)+0x6bd) [0x7fd9b68b546d]\n  [bt] (0) /home/tanvir/tvm/build/libtvm.so(+0x12f3668) [0x7fd9b68b2668]\n  File "/home/tanvir/tvm/src/runtime/rpc/rpc_endpoint.cc", line 807\nTVMError: Check failed: code == RPCCode: :kReturn: code=1'),), error_no=4, all_cost=10.765872716903687, timestamp=1604092331.4940712)	[('tile_f', [-1, 16]), ('tile_y', [-1, 2]), ('tile_x', [-1, 2]), ('tile_z', [-1, 16])],None,1719
…
<continues>

While using AutoScheduler, the program would crash after throwing an uncaught error.

Traceback (most recent call last):
  …
  File "runner.py", line 124, in fig_6
    m = run_operator(
  File "runner.py", line 58, in run_operator
    sch, args = auto_scheduler.auto_schedule(task, tuning_options=tune_option)
  File "/home/taylor/tvm/python/tvm/auto_scheduler/auto_schedule.py", line 213, in auto_schedule
    sch, tensors = _ffi_api.AutoSchedule(search_policy, tuning_options)
  File "/home/taylor/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /home/taylor/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7f11e187d7b3]
  [bt] (4) /home/taylor/tvm/build/libtvm.so(+0x6965ab) [0x7f11e0c755ab]
  [bt] (3) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::AutoSchedule(tvm::auto_scheduler::SearchPolicy, tvm::auto_scheduler::TuningOptions)+0x11a) [0x7f11e0c74cca]
  [bt] (2) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::SketchPolicyNode::Search(int, int, int, tvm::auto_scheduler::ProgramMeasurer)+0x760) [0x7f11e0cfb3d0]
  [bt] (1) /home/taylor/tvm/build/libtvm.so(tvm::auto_scheduler::ProgramMeasurerNode::Measure(tvm::auto_scheduler::SearchTask const&, tvm::auto_scheduler::SearchPolicy const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureInput, void> const&, tvm::runtime::Array<tvm::auto_scheduler::MeasureResult, void>*, int)+0x11ed) [0x7f11e0cd7b2d]
  [bt] (0) /home/taylor/tvm/build/libtvm.so(+0x6f4af8) [0x7f11e0cd3af8]
  File "/home/taylor/tvm/src/auto_scheduler/measure.cc", line 268
TVMError: Too many errors happened during tuning

In my particular case, AutoScheduler crashing rather than continuing the search meant that my script terminated prematurely, even though it might have recovered from whatever was causing the errors.
In addition, it was unclear to me why this behavior occurred only in AutoScheduler and not in AutoTvm. This discrepancy can be confusing to new users who want to explore both methods of schedule search. This PR proposes bringing the AutoScheduler handling of errors during measurement in line with AutoTvm.

By removing the LOG(FATAL) and changing the verbosity for AutoScheduler in the same way we change the logging level in AutoTvm, the two programs behave the same. In addition, I changed the default verbosity of AutoScheduler to 0 (silent) in order to match the default logging level of AutoTvm.
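To make the proposed change concrete, here is a minimal, self-contained C++ sketch contrasting the two policies. It is not the actual TVM code (in TVM the check lives in ProgramMeasurerNode::Measure in src/auto_scheduler/measure.cc, the function visible in the traceback above); the threshold value and the names kMaxContinuousError, HandleErrorsFatal, HandleErrorsContinue, and verbose are illustrative assumptions.

// A minimal, self-contained sketch (not the actual TVM source) contrasting the two
// error-handling policies. The threshold value and identifiers below are
// illustrative assumptions, not the exact names touched by this PR.
#include <cstdlib>
#include <iostream>

constexpr int kMaxContinuousError = 150;  // assumed threshold, for illustration only

// Old behavior: once the threshold is crossed, abort the whole search
// (the LOG(FATAL) that produced the TVMError traceback above).
void HandleErrorsFatal(int error_ct) {
  if (error_ct > kMaxContinuousError) {
    std::cerr << "Too many errors happened during tuning\n";
    std::abort();
  }
}

// Proposed behavior: warn once, silence per-measurement output, and keep
// searching, mirroring AutoTvm's switch to debug-level logging.
void HandleErrorsContinue(int error_ct, int* verbose) {
  if (error_ct > kMaxContinuousError && *verbose > 0) {
    std::cerr << "Too many errors happened during tuning. Switching to debug mode.\n";
    *verbose = 0;  // silent from now on, but the search continues
  }
}

int main() {
  int verbose = 1;
  HandleErrorsContinue(151, &verbose);  // pretend we just exceeded the threshold
  std::cout << "search continues, verbose=" << verbose << "\n";
  return 0;
}

With the second policy, the warning is printed once, per-measurement output goes silent, and the search loop keeps running, matching the AutoTvm behavior shown in the log excerpt earlier.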

@TaylorZowtuk
Contributor Author

@tqchen @merrymercy Thoughts and review please?

@tqchen
Member

tqchen commented Nov 12, 2020

cc @jcf94 @merrymercy

Member

@merrymercy left a comment


How do you hit this part of the code? Generally, it means you have some fatal errors in the code.
It is very rare to recover from a case where you have so many continuous errors.

@merrymercy
Member

merrymercy commented Nov 15, 2020

Please rebase and fix the CI error.

@TaylorZowtuk
Contributor Author

TaylorZowtuk commented Nov 16, 2020

> How do you hit this part of the code? Generally, it means you have some fatal errors in the code.
> It is very rare to recover from a case where you have so many continuous errors.

I'm not entirely certain what causes us to hit this condition. In our case, we observed from the AutoTvm debug prints that it was due to error_no=4, which is a RUNTIME_DEVICE error (as you can see from the excerpt of the AutoTvm log I included previously). Hitting this condition happened very intermittently: we could run a particular op/shape once and hit it, and without changing anything it would work the next time. In addition, one op/shape reaching this condition didn't mean the rest of the op/shapes we were running in the same script would fail, meaning the system overall was able to recover. I think the main issue is that by terminating the program as soon as we meet this condition, we don't allow any chance to recover, and we also lose the useful, precise feedback about which error we are hitting while using the auto_scheduler.
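For reference, the error_no values in those MeasureResult lines come from the measurement error-code enums (MeasureErrorNo in autotvm, MeasureErrorNO in auto_scheduler). The sketch below restates the numbering as I recall it; only error_no=4 == RUNTIME_DEVICE is confirmed by the log above, so treat the other names and values as assumptions and check the TVM measure modules for the authoritative definitions.

// Hypothetical restatement of the measurement error codes, for quick reference only.
// Only error_no=4 == RUNTIME_DEVICE is confirmed by the log excerpt above.
#include <iostream>

enum class MeasureError : int {
  kNoError = 0,
  kInstantiationError = 1,
  kCompileHostError = 2,
  kCompileDeviceError = 3,
  kRuntimeDeviceError = 4,  // error_no=4 in the log: the kernel failed while running on the device
  kWrongAnswerError = 5,
  kBuildTimeoutError = 6,
  kRunTimeoutError = 7,
};

int main() {
  std::cout << "error_no=" << static_cast<int>(MeasureError::kRuntimeDeviceError)
            << " -> RUNTIME_DEVICE\n";
  return 0;
}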

I'll rebase and try to fix the CI issue.

@merrymercy merged commit 6c01998 into apache:main on Nov 17, 2020
@merrymercy
Member

Thanks, @TaylorZowtuk. It is merged.

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
…AutoTvm (apache#6909)

* Match ansor handling of 'too many errors' during measure to that of autoTVM and match default level of logging

* Set correct level of verbosity for debug mode

Co-authored-by: Lianmin Zheng <[email protected]>

* Lint

* trigger CI

Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Taylor Zowtuk 84152750 <[email protected]>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020