Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add MXEnginePushAsync and MXEnginePushSync C APIs #14615

Merged
merged 7 commits into from
Apr 12, 2019

Conversation

yuxihu
Copy link
Member

@yuxihu yuxihu commented Apr 4, 2019

std::function definition is not compatible between GCC 4.x and GCC 5.x+. The incompatibility caused segmentation fault when a third-party library compiled with GCC 5+ uses the PushAsync/PushSync interfaces in engine (MXNet pip package is compiled using GCC 4).

To avoid the above incompatibility issue, this PR addes MXEnginePushAsync/MXEnginePushSync C APIs. These two APIs take function pointer instead of std::function as the engine execution function.

@yuxihu yuxihu requested a review from anirudh2290 as a code owner April 4, 2019 06:46
@yuxihu
Copy link
Member Author

yuxihu commented Apr 4, 2019

@eric-haibin-lin @szha @apeforest @ctcyang Please help review

include/mxnet/engine.h Outdated Show resolved Hide resolved
include/mxnet/engine.h Outdated Show resolved Hide resolved
@eric-haibin-lin
Copy link
Member

Could you share what you tested to validate the change?

@yuxihu
Copy link
Member Author

yuxihu commented Apr 4, 2019

Could you share what you tested to validate the change?

Here is the PR in Horovod that adopts the new API. I built MXNet with GCC 4.8.4 and Horovod with GCC 5.4.0. I ran the unit tests, MNIST example, ResNet50 example in Horovod and there was no segmentation fault.

@wkcn
Copy link
Member

wkcn commented Apr 4, 2019

It is better to use the pure C type in the parameters list.
Because different compilers (such as gcc, clang, msvc) may have different implementations of STL containers (std::vector, std::shared_ptr, etc.), it will cause the ABI compatible problem too.

@yuxihu
Copy link
Member Author

yuxihu commented Apr 4, 2019

It is better to use the pure C type in the parameters list.
Because different compilers (such as gcc, clang, msvc) may have different implementations of STL containers (std::vector, std::shared_ptr, etc.), it will cause the ABI compatible problem too.

This is a good point. But we may not be able to use the pure C type here. We are pushing execution into another thread and the lifetime of the params need to be managed properly. shared_ptr is a straightforward choice. I am not aware of any ABI compatibility issue for shared_ptr as of now. Please suggest if you know any. Another option to consider is to implement a shared_ptr class by ourselves to remove the dependency on std::shared_ptr.

@piyushghai
Copy link
Contributor

Thanks for your contributions @yuxihu.
@mxnet-label-bot Add [Backend]

@marcoabreu marcoabreu added the Backend Issues related to the backend of MXNet label Apr 4, 2019
src/engine/engine.cc Outdated Show resolved Hide resolved
@wkcn
Copy link
Member

wkcn commented Apr 4, 2019

@yuxihu Could we add a callback function, which will be called to release the resource when the execution is finished?
There is an example in TVM project: https://github.com/dmlc/tvm/blob/master/src/runtime/c_runtime_api.cc#L443

@yuxihu
Copy link
Member Author

yuxihu commented Apr 5, 2019

@yuxihu Could we add a callback function, which will be called to release the resource when the execution is finished?
There is an example in TVM project: https://github.com/dmlc/tvm/blob/master/src/runtime/c_runtime_api.cc#L443

This sounds like a better option. Let me try it.

@yuxihu yuxihu force-pushed the engine_func_ptr branch 2 times, most recently from 49f118d to 2a4aa0f Compare April 8, 2019 22:37
@yuxihu
Copy link
Member Author

yuxihu commented Apr 8, 2019

@wkcn Changed based on your suggestion. It worked. Please review again.

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Apr 8, 2019
Copy link
Member

@wkcn wkcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is necessary to replace std::vector<T> with T* and size_t vector_size in the parameters list.
The rest looks good to me.
Thank you!

include/mxnet/engine.h Outdated Show resolved Hide resolved
include/mxnet/engine.h Outdated Show resolved Hide resolved
@yuxihu yuxihu force-pushed the engine_func_ptr branch from 2a4aa0f to e794d2f Compare April 9, 2019 20:40
@szha
Copy link
Member

szha commented Apr 9, 2019

@junrushao1994 would you mind take a look at this PR?

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a plan to revisit the direct call of MXNet engine from horovod. I am not very familiar with horovod but when I look at horovod tensorflow code it registers an op for broadcast and allreduce. Why couldn't we do this for MXNet also ? Horovod directly plugging into MXNet engine code means that we have to provide backward compatibility guarantees like C API or frontend API, which didnt apply to MXNet backend earlier and can cause maintenance overhead.

@yuxihu
Copy link
Member Author

yuxihu commented Apr 9, 2019

Is there a plan to revisit the direct call of MXNet engine from horovod. I am not very familiar with horovod but when I look at horovod tensorflow code it registers an op for broadcast and allreduce. Why couldn't we do this for MXNet also ? Horovod directly plugging into MXNet engine code means that we have to provide backward compatibility guarantees like C API or frontend API, which didnt apply to MXNet backend earlier and can cause maintenance overhead.

There is no plan to rewrite the Horovod-MXNet integration code.

FYI, Horovod-Tensorflow integration recently encountered the same GCC compatibility issue on tensorflow nightly build. The new APIs added here is not specific to Horovod use case. It can also be used by other third-party libraries.

@anirudh2290
Copy link
Member

@yuxihu these APIs weren't supposed to be directly called by third-party libraries AFAIK. MXNet provides no guarantees for backward compatibility or semver for these APIs in the backend. I think we are changing this now with horovod. Better would be to register the op and invoke op on horovod side or for other 3rd party libs for that matter.

@junrushao
Copy link
Member

Private APIs in MXNet Engine are not designed in the way that they interact with external libraries, but C APIs are. I am not super familiar with Horovod side, but is it possible to do the integration through C API instead of private API?

@yuxihu
Copy link
Member Author

yuxihu commented Apr 9, 2019

@yuxihu these APIs weren't supposed to be directly called by third-party libraries AFAIK. MXNet provides no guarantees for backward compatibility or semver for these APIs in the backend. I think we are changing this now with horovod. Better would be to register the op and invoke op on horovod side or for other 3rd party libs for that matter.

I am not familiar with how custom op is implemented in MXNet so I am not sure if the custom op approach fits into Horovod use cases at this moment. Even if it fits, it is a completely different design and will require significant amount of change. Solving this GCC incompatibility issue is high priority for our users. To me, evaluating and potentially using custom op is a longer term plan which we can discuss more offline.

@junrushao
Copy link
Member

@yuxihu If a third-party library would call MXNet, it should go through C API. A more correct way to do this is to add a C API pushing stuff to the engine.

@junrushao
Copy link
Member

Horovod already calls PushAsync these days. I was not involved in the original design of integrating Horovod and MXNet so I do not know how the decision was made.

So it is our opportunity to make it more correct, right?

This PR just replaces PushAsync with PushAsyncPtr to avoid GCC incompatibility issue.

As you mentioned, it only avoids incompatibility issue in a certain compiler (GCC 4.x and 5.x) in some test environments (some Linux distribution?), which seems ad-hoc, right? Adopting MXNet's more principled approach would be better for future maintenance.

To make it clear, I would say it could be simple to define PushAsyncPtr as a C API, which internally uses Engine::Get()->PushXXX. This should be simple sed/awk refactoring.

@yuxihu
Copy link
Member Author

yuxihu commented Apr 10, 2019

Horovod already calls PushAsync these days. I was not involved in the original design of integrating Horovod and MXNet so I do not know how the decision was made.

So it is our opportunity to make it more correct, right?

I agree with you C API is cleaner way of doing it.

This PR just replaces PushAsync with PushAsyncPtr to avoid GCC incompatibility issue.

As you mentioned, it only avoids incompatibility issue in a certain compiler (GCC 4.x and 5.x) in some test environments (some Linux distribution?), which seems ad-hoc, right? Adopting MXNet's more principled approach would be better for future maintenance.

It is not ad-hoc. Our MXNet pip packages are built with GCC 4.8.5. Most of recent OSes like Ubuntu 16.04 are GCC 5+. This is big usability problem for our users. The incompatibility is not only between 4 and 5. It should be between 4 and 5+.

To make it clear, I would say it could be simple to define PushAsyncPtr as a C API, which internally uses Engine::Get()->PushXXX. This should be simple sed/awk refactoring.

I got your point. But this also means we may need to expose data structures defined in Engine at C API level. Let me try it out.

@anirudh2290
Copy link
Member

I wasn't suggesting a custom op but a standard nnvm op like mxnet backend ops. This can then be invoked using MXImperativeInvokeOpEx. Since this may require considerable change I am also okay with adding C API for engine invoke directly ( though I am not aware of historical reason as to why it was not added yet. ). Also, depending on the complexity I am okay with moving forward with the current solution (to fix user issue) and follow up PRs on MXNet and horovod to fix this.

@yuxihu yuxihu changed the title Add PushAsyncPtr and PushSyncPtr APIs in engine Add MXEnginePushAsync and MXEnginePushSync C APIs Apr 10, 2019
@yuxihu
Copy link
Member Author

yuxihu commented Apr 10, 2019

I wasn't suggesting a custom op but a standard nnvm op like mxnet backend ops. This can then be invoked using MXImperativeInvokeOpEx. Since this may require considerable change I am also okay with adding C API for engine invoke directly ( though I am not aware of historical reason as to why it was not added yet. ). Also, depending on the complexity I am okay with moving forward with the current solution (to fix user issue) and follow up PRs on MXNet and horovod to fix this.

Changed to C APIs. Let's discuss the op approach you suggested offline.

@yuxihu
Copy link
Member Author

yuxihu commented Apr 10, 2019

@junrushao1994 Changed to C API. Please help review.

Copy link
Member

@junrushao junrushao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the contribution! Let's wait for the CI to pass.

Copy link
Member

@anirudh2290 anirudh2290 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test which pushes an op using this C API, to ensure we don't have regressions in the future.

src/c_api/c_api.cc Outdated Show resolved Hide resolved
include/mxnet/c_api.h Outdated Show resolved Hide resolved
include/mxnet/c_api.h Show resolved Hide resolved
Copy link
Member

@wkcn wkcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! LGTM : )

@yuxihu
Copy link
Member Author

yuxihu commented Apr 11, 2019

Please add a test which pushes an op using this C API, to ensure we don't have regressions in the future.

@anirudh2290 Addressed your comments and added tests. Please review and merge if it looks good to you. Thanks.

@anirudh2290 anirudh2290 dismissed their stale review April 11, 2019 06:57

LGTM, Thank you!

@yuxihu
Copy link
Member Author

yuxihu commented Apr 11, 2019

@mxnet-label-bot update [Backend, pr-awaiting-merge]

@marcoabreu marcoabreu added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Apr 11, 2019
Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! one comment:

include/mxnet/c_api.h Outdated Show resolved Hide resolved
src/c_api/c_api.cc Outdated Show resolved Hide resolved
@anirudh2290 anirudh2290 merged commit 800590e into apache:master Apr 12, 2019
@yuxihu yuxihu deleted the engine_func_ptr branch April 12, 2019 05:05
larroy pushed a commit to larroy/mxnet that referenced this pull request Apr 15, 2019
* add PushAsyncPtr and PushSyncPtr APIs in engine

* avoid using shared_ptr for param in new APIs

* avoid using std::vector in parameters

* change to C API

* address comments and add tests

* fix perl build

* use int instead of size_t
yuxihu added a commit to yuxihu/incubator-mxnet that referenced this pull request Apr 22, 2019
* add PushAsyncPtr and PushSyncPtr APIs in engine

* avoid using shared_ptr for param in new APIs

* avoid using std::vector in parameters

* change to C API

* address comments and add tests

* fix perl build

* use int instead of size_t
szha pushed a commit that referenced this pull request Apr 23, 2019
* add PushAsyncPtr and PushSyncPtr APIs in engine

* avoid using shared_ptr for param in new APIs

* avoid using std::vector in parameters

* change to C API

* address comments and add tests

* fix perl build

* use int instead of size_t
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* add PushAsyncPtr and PushSyncPtr APIs in engine

* avoid using shared_ptr for param in new APIs

* avoid using std::vector in parameters

* change to C API

* address comments and add tests

* fix perl build

* use int instead of size_t
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Backend Issues related to the backend of MXNet pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants