
MKLDNN: Fully Connected layer. #9197

Closed
mozga-intel opened this issue Mar 19, 2018 · 4 comments

@mozga-intel
Contributor

I will use the fully connected (FC) layer as an example to describe this problem. While implementing the FC layer with the MKLDNN library I ran into a few difficulties. The current FC layer in Paddle is split into two operations, a multiplication and an addition, and these are the operations the current version of Paddle uses. The MKLDNN version of the algorithm, however, gives us the opportunity to combine the two operations into one. So, if I wanted to kill two birds with one stone, I would have to write a new kernel for this layer, that is, a stand-alone implementation of the FC algorithm (a small sketch after the questions below shows the split versus the fused computation). While implementing this new kernel, I ran into a few problems:

  • First of all, am I forced to implement three versions of the same algorithm (CPU, GPU, and MKLDNN) in order to register the new MKLDNN op kernel?

  • Can I use the new FC kernel when I do not have a full implementation of the FC kernel for the CPU and GPU places, but only two fake kernels there?
    By a fake kernel I mean a kernel that is registered in the system, but whenever it is called the system reports that the kernel is not available. I added these fake objects because the PaddlePaddle platform appears to require every kernel to exist on every platform.

  • Referring to the point above, can I integrate a single FC kernel plus the fake CPU and GPU kernels into the current platform while the old version of the algorithm (matrix multiplication plus sum) is still in place?

  • Also, what can I do to merge several algorithms into one? Should we remove the old version (multiplication and sum), replace it with the new algorithm (fully connected on MKLDNN), or is it untouchable, so that we have to add a new op kernel alongside the current solution?

  • Can we have a kernel for only one specific platform, i.e. MKLDNN, without having to register new kernels for the other platforms, i.e. CPU (naive) and GPU?
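
To make the question concrete, here is a minimal NumPy sketch (not Paddle code) of the split computation versus the fused one; the shapes are arbitrary and purely illustrative:

```python
import numpy as np

# Current Paddle behaviour: FC is split into two separate ops.
x = np.random.rand(8, 16).astype(np.float32)   # batch of 8 samples, 16 features
w = np.random.rand(16, 4).astype(np.float32)   # weights of a 16 -> 4 FC layer
b = np.random.rand(4).astype(np.float32)       # bias

mul_out = x.dot(w)        # op 1: multiplication (mul)
out_split = mul_out + b   # op 2: addition (elementwise_add)

# What a fused MKLDNN FC kernel would compute in a single call.
out_fused = x.dot(w) + b

assert np.allclose(out_split, out_fused)
```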

Thank you.

@dzhwinter
Contributor

dzhwinter commented Mar 19, 2018

  • First of all, am I forced to implement three versions of the same algorithm (CPU, GPU, and MKLDNN) in order to register the new MKLDNN op kernel?

No. You can just add one big FC operator and implement only the MKLDNN kernel.

  • Can I use the new FC kernel when I do not have a full implementation of the FC kernel for the CPU and GPU places, but only two fake kernels there?

If you are familiar with Eigen Tensor, implementing a CPU/GPU FC kernel is similar to implementing the MKLDNN kernel. Since the user only calls the kernel through Python, we can fall back to a combination of small ops (mul + addition) when MKLDNN is not available.
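
For illustration, a minimal sketch of that small-op fallback from the Python side, assuming the 2018-era fluid.layers API (fluid.layers.mul and fluid.layers.elementwise_add existed at the time; the helper function name here is made up):

```python
import paddle.fluid as fluid

def fc_from_small_ops(input, w, bias):
    """Fallback FC built from the existing small ops: out = input * w + bias."""
    mul_out = fluid.layers.mul(x=input, y=w)                # matrix multiply
    return fluid.layers.elementwise_add(x=mul_out, y=bias)  # add the bias
```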

  • Also, what can I do to merge several algorithms into one? Should we remove the old version (multiplication and sum), replace it with the new algorithm (fully connected on MKLDNN), or is it untouchable, so that we have to add a new op kernel alongside the current solution?

I do not fully understand your point. The multiplication and sum operators are fundamental linear-algebra building blocks and are used everywhere. The FC kernel cannot replace these two operators; it is just a speed-up for the case where you want to do the fully connected operation.

  • Can we have a kernel for only one specific platform, i.e. MKLDNN, without having to register new kernels for the other platforms, i.e. CPU (naive) and GPU?

Yes. See the first comment.

@dzhwinter
Contributor

There is one point I want to clarify: why I did not implement the FC CPU/GPU kernel.
Kernel fusion (https://arxiv.org/abs/1305.1183, https://www.tensorflow.org/performance/xla/jit) is a big topic, and combining small ops into a big one by hand is the old-fashioned way to do it.
We can use some tricks to fuse batch normalization or the fully connected layer, but I think we need a general solution, because:

  1. You cannot write hand-fused kernels for every platform.
    For example, today we have more than 10 kinds of mobile chips. If we fuse kernels by hand, taking the FC kernel and the batch-norm kernel as examples, we would have to implement 10 variants of each, plus the multiplication and addition operators (you need these two basic ops everywhere anyway). But if we stick to small operators, say the multiplication and addition operations, we only need to port those two.

  2. Kernel fusion by hand leads to an explosion of op combinations.
    Take the FC kernel as an example:
    fc kernel = mul + addition + activation, right?
    Then the general rule is
    New Kernel = Kernel A + Kernel B + ....
    Whenever we can gain some benefit, we combine the kernels on the right-hand side into a new kernel like the one on the left. You can imagine how many combinations we would end up with; going down that road would be a disaster (the small snippet after this list gives a feel for the blow-up).
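
A toy illustration of that blow-up (plain Python, op names arbitrary): with only six basis ops, hand-fusing just the pairs and triples already yields 35 extra kernels, and every one of them would have to be ported to every platform.

```python
from itertools import combinations

basis_ops = ['mul', 'elementwise_add', 'relu', 'batch_norm', 'softmax', 'dropout']

# Candidate hand-fused kernels built from two or three basis ops.
fused_pairs = list(combinations(basis_ops, 2))
fused_triples = list(combinations(basis_ops, 3))

print(len(fused_pairs), len(fused_triples))  # 15 pairs and 20 triples
```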

These two reasons pushed the TensorFlow team toward the XLA (https://www.tensorflow.org/performance/xla/) approach. But AFAIK it makes debugging a nightmare, because you cannot tell what actually happened in your code.

We will follow TVM or a similar technology later. Currently the multi-node multi-GPU performance is suffering, and I am focusing on that topic.

@luotao1
Contributor

luotao1 commented Mar 19, 2018

You can just add one big FC operator and implement only the MKLDNN kernel

I agree. You can add fc_mkldnn_op.cc and fc_mkldnn_op.h, and modify the fc method in nn.py; a rough sketch of that change is below.
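
For reference, a hedged sketch of what that change to the fc method in nn.py could look like, assuming the 2018-era LayerHelper API (names approximate); the op type 'fc', its input/output names, and the use_mkldnn attribute are illustrative rather than the actual registered interface:

```python
from paddle.fluid.layer_helper import LayerHelper

def fc(input, size, param_attr=None, bias_attr=None, use_mkldnn=False, name=None):
    helper = LayerHelper('fc', **locals())
    dtype = helper.input_dtype()

    # Weight and bias parameters of the layer.
    w = helper.create_parameter(
        attr=helper.param_attr, shape=[input.shape[1], size], dtype=dtype)
    b = helper.create_parameter(
        attr=helper.bias_attr, shape=[size], dtype=dtype, is_bias=True)

    out = helper.create_tmp_variable(dtype)
    # Append one fused 'fc' op instead of the separate 'mul' + 'elementwise_add' pair.
    helper.append_op(
        type='fc',
        inputs={'Input': input, 'W': w, 'Bias': b},
        outputs={'Out': out},
        attrs={'use_mkldnn': use_mkldnn})
    return out
```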

@mozga-intel
Contributor Author

@luotao1, @dzhwinter, Thank you very much.
