-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Add MXEnginePushAsync and MXEnginePushSync C APIs #14615
Conversation
@eric-haibin-lin @szha @apeforest @ctcyang Please help review |
Could you share what you tested to validate the change? |
Here is the PR in Horovod that adopts the new API. I built MXNet with GCC 4.8.4 and Horovod with GCC 5.4.0. I ran the unit tests, MNIST example, ResNet50 example in Horovod and there was no segmentation fault. |
It is better to use the pure C type in the parameters list. |
This is a good point. But we may not be able to use the pure C type here. We are pushing execution into another thread and the lifetime of the params need to be managed properly. shared_ptr is a straightforward choice. I am not aware of any ABI compatibility issue for shared_ptr as of now. Please suggest if you know any. Another option to consider is to implement a shared_ptr class by ourselves to remove the dependency on std::shared_ptr. |
Thanks for your contributions @yuxihu. |
@yuxihu Could we add a callback function, which will be called to release the resource when the execution is finished? |
This sounds like a better option. Let me try it. |
49f118d
to
2a4aa0f
Compare
@wkcn Changed based on your suggestion. It worked. Please review again. @mxnet-label-bot add [pr-awaiting-review] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is necessary to replace std::vector<T>
with T* and size_t vector_size in the parameters list.
The rest looks good to me.
Thank you!
@junrushao1994 would you mind take a look at this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a plan to revisit the direct call of MXNet engine from horovod. I am not very familiar with horovod but when I look at horovod tensorflow code it registers an op for broadcast and allreduce. Why couldn't we do this for MXNet also ? Horovod directly plugging into MXNet engine code means that we have to provide backward compatibility guarantees like C API or frontend API, which didnt apply to MXNet backend earlier and can cause maintenance overhead.
There is no plan to rewrite the Horovod-MXNet integration code. FYI, Horovod-Tensorflow integration recently encountered the same GCC compatibility issue on tensorflow nightly build. The new APIs added here is not specific to Horovod use case. It can also be used by other third-party libraries. |
@yuxihu these APIs weren't supposed to be directly called by third-party libraries AFAIK. MXNet provides no guarantees for backward compatibility or semver for these APIs in the backend. I think we are changing this now with horovod. Better would be to register the op and invoke op on horovod side or for other 3rd party libs for that matter. |
Private APIs in MXNet Engine are not designed in the way that they interact with external libraries, but C APIs are. I am not super familiar with Horovod side, but is it possible to do the integration through C API instead of private API? |
I am not familiar with how custom op is implemented in MXNet so I am not sure if the custom op approach fits into Horovod use cases at this moment. Even if it fits, it is a completely different design and will require significant amount of change. Solving this GCC incompatibility issue is high priority for our users. To me, evaluating and potentially using custom op is a longer term plan which we can discuss more offline. |
@yuxihu If a third-party library would call MXNet, it should go through C API. A more correct way to do this is to add a C API pushing stuff to the engine. |
So it is our opportunity to make it more correct, right?
As you mentioned, it only avoids incompatibility issue in a certain compiler (GCC 4.x and 5.x) in some test environments (some Linux distribution?), which seems ad-hoc, right? Adopting MXNet's more principled approach would be better for future maintenance. To make it clear, I would say it could be simple to define |
I agree with you C API is cleaner way of doing it.
It is not ad-hoc. Our MXNet pip packages are built with GCC 4.8.5. Most of recent OSes like Ubuntu 16.04 are GCC 5+. This is big usability problem for our users. The incompatibility is not only between 4 and 5. It should be between 4 and 5+.
I got your point. But this also means we may need to expose data structures defined in Engine at C API level. Let me try it out. |
I wasn't suggesting a custom op but a standard nnvm op like mxnet backend ops. This can then be invoked using MXImperativeInvokeOpEx. Since this may require considerable change I am also okay with adding C API for engine invoke directly ( though I am not aware of historical reason as to why it was not added yet. ). Also, depending on the complexity I am okay with moving forward with the current solution (to fix user issue) and follow up PRs on MXNet and horovod to fix this. |
Changed to C APIs. Let's discuss the op approach you suggested offline. |
@junrushao1994 Changed to C API. Please help review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the contribution! Let's wait for the CI to pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test which pushes an op using this C API, to ensure we don't have regressions in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! LGTM : )
@anirudh2290 Addressed your comments and added tests. Please review and merge if it looks good to you. Thanks. |
@mxnet-label-bot update [Backend, pr-awaiting-merge] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! one comment:
* add PushAsyncPtr and PushSyncPtr APIs in engine * avoid using shared_ptr for param in new APIs * avoid using std::vector in parameters * change to C API * address comments and add tests * fix perl build * use int instead of size_t
* add PushAsyncPtr and PushSyncPtr APIs in engine * avoid using shared_ptr for param in new APIs * avoid using std::vector in parameters * change to C API * address comments and add tests * fix perl build * use int instead of size_t
* add PushAsyncPtr and PushSyncPtr APIs in engine * avoid using shared_ptr for param in new APIs * avoid using std::vector in parameters * change to C API * address comments and add tests * fix perl build * use int instead of size_t
* add PushAsyncPtr and PushSyncPtr APIs in engine * avoid using shared_ptr for param in new APIs * avoid using std::vector in parameters * change to C API * address comments and add tests * fix perl build * use int instead of size_t
std::function definition is not compatible between GCC 4.x and GCC 5.x+. The incompatibility caused segmentation fault when a third-party library compiled with GCC 5+ uses the PushAsync/PushSync interfaces in engine (MXNet pip package is compiled using GCC 4).
To avoid the above incompatibility issue, this PR addes MXEnginePushAsync/MXEnginePushSync C APIs. These two APIs take function pointer instead of std::function as the engine execution function.