
MKLDNN: Fully Connected layer. #9197

Closed
mozga-intel opened this issue Mar 19, 2018 · 4 comments

@mozga-intel
Contributor

I will use the fully connected (FC) layer as an example to describe this problem. While implementing the FC layer with the MKLDNN library I ran into a few difficulties. The current FC layer in Paddle is split into two operations, a multiplication and an addition, and these are the operations the current version of Paddle uses. The MKLDNN version of the algorithm, however, gives us the opportunity to combine the two operations into one. So, if I wanted to kill two birds with one stone, I would have to write a new kernel for this layer, that is, a stand-alone implementation of the FC algorithm (a small sketch after the questions below shows the split versus the fused computation). While implementing this new kernel, I ran into a few problems:

  • First of all, am I forced to implement three versions of the same algorithm (CPU, GPU, and MKLDNN) in order to register the new MKLDNN op kernel?

  • Can I use the new FC kernel when I do not have a full implementation of the FC kernel for the CPU and GPU places, but only two fake kernels there?
    By a fake kernel I mean a kernel that is registered in the system, but whenever it is called the system reports that the kernel is not available. I added these fake objects because the PaddlePaddle platform appears to require every kernel to exist on every platform.

  • Referring to the point above, can I integrate a single FC kernel plus the fake CPU and GPU kernels into the current platform while the old version of the algorithm (matrix multiplication plus sum) is still in place?

  • Also, what can I do to merge several algorithms into one? Should we remove the old version (multiplication and sum), replace it with the new algorithm (fully connected on MKLDNN), or is it untouchable, so that we have to add a new op kernel alongside the current solution?

  • Can we have a kernel for only one specific platform, i.e. MKLDNN, without having to register new kernels for the other platforms, i.e. CPU (naive) and GPU?
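
To make the question concrete, here is a minimal NumPy sketch (not Paddle code) of the split computation versus the fused one; the shapes are arbitrary and purely illustrative:

```python
import numpy as np

# Current Paddle behaviour: FC is split into two separate ops.
x = np.random.rand(8, 16).astype(np.float32)   # batch of 8 samples, 16 features
w = np.random.rand(16, 4).astype(np.float32)   # weights of a 16 -> 4 FC layer
b = np.random.rand(4).astype(np.float32)       # bias

mul_out = x.dot(w)        # op 1: multiplication (mul)
out_split = mul_out + b   # op 2: addition (elementwise_add)

# What a fused MKLDNN FC kernel would compute in a single call.
out_fused = x.dot(w) + b

assert np.allclose(out_split, out_fused)
```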

Thank you.

@dzhwinter
Contributor

dzhwinter commented Mar 19, 2018

  • First of all, am I forced to implement three versions of the same algorithm (CPU, GPU, and MKLDNN) in order to register the new MKLDNN op kernel?

No. You can just add one big FC operator and implement only the MKLDNN kernel.

  • Can I use the new FC kernel when I do not have a full implementation of the FC kernel for the CPU and GPU places, but only two fake kernels there?

If you are familiar with Eigen Tensor, implementing a CPU/GPU FC kernel is similar to implementing the MKLDNN kernel. Since the user only calls the kernel through Python, we can fall back to a combination of small ops (mul + addition) when MKLDNN is not available.
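
For illustration, a minimal sketch of that small-op fallback from the Python side, assuming the 2018-era fluid.layers API (fluid.layers.mul and fluid.layers.elementwise_add existed at the time; the helper function name here is made up):

```python
import paddle.fluid as fluid

def fc_from_small_ops(input, w, bias):
    """Fallback FC built from the existing small ops: out = input * w + bias."""
    mul_out = fluid.layers.mul(x=input, y=w)                # matrix multiply
    return fluid.layers.elementwise_add(x=mul_out, y=bias)  # add the bias
```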

  • Also, what can I do to merge several algorithms into one? Should we remove the old version (multiplication and sum), replace it with the new algorithm (fully connected on MKLDNN), or is it untouchable, so that we have to add a new op kernel alongside the current solution?

I do not fully understand your point. The multiplication and sum operators are fundamental linear-algebra building blocks and are used everywhere. The FC kernel cannot replace these two operators; it is just a speed-up for the case where you want to do the fully connected operation.

  • Can we have a kernel for only one specific platform, i.e. MKLDNN, without having to register new kernels for the other platforms, i.e. CPU (naive) and GPU?

Yes. See the first comment.

@dzhwinter
Contributor

There is one point I want to clarify: why I did not implement the FC CPU/GPU kernel.
Kernel fusion (https://arxiv.org/abs/1305.1183, https://www.tensorflow.org/performance/xla/jit) is a big topic, and combining small ops into a big one by hand is the old-fashioned way to do it.
We can use some tricks to fuse batch normalization or the fully connected layer, but I think we need a general solution, because:

  1. You cannot write hand-fused kernels for every platform.
    For example, today we have more than 10 kinds of mobile chips. If we fuse kernels by hand, taking the FC kernel and the batch-norm kernel as examples, we would have to implement 10 variants of each, plus the multiplication and addition operators (you need these two basic ops everywhere anyway). But if we stick to small operators, say the multiplication and addition operations, we only need to port those two.

  2. Kernel fusion by hand leads to an explosion of op combinations.
    Take the FC kernel as an example:
    fc kernel = mul + addition + activation, right?
    Then the general rule is
    New Kernel = Kernel A + Kernel B + ....
    Whenever we can gain some benefit, we combine the kernels on the right-hand side into a new kernel like the one on the left. You can imagine how many combinations we would end up with; going down that road would be a disaster (the small snippet after this list gives a feel for the blow-up).
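
A toy illustration of that blow-up (plain Python, op names arbitrary): with only six basis ops, hand-fusing just the pairs and triples already yields 35 extra kernels, and every one of them would have to be ported to every platform.

```python
from itertools import combinations

basis_ops = ['mul', 'elementwise_add', 'relu', 'batch_norm', 'softmax', 'dropout']

# Candidate hand-fused kernels built from two or three basis ops.
fused_pairs = list(combinations(basis_ops, 2))
fused_triples = list(combinations(basis_ops, 3))

print(len(fused_pairs), len(fused_triples))  # 15 pairs and 20 triples
```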

These two reasons pushed the TensorFlow team toward the XLA (https://www.tensorflow.org/performance/xla/) approach. But AFAIK it makes debugging a nightmare, because you cannot tell what actually happened in your code.

We will follow TVM or a similar technology later. Currently the multi-node multi-GPU performance is suffering, and I am focusing on that topic.

@luotao1
Contributor

luotao1 commented Mar 19, 2018

You can just add one big FC operator and implement only the MKLDNN kernel

I agree. You can add fc_mkldnn_op.cc and fc_mkldnn_op.h, and modify the fc method in nn.py; a rough sketch of that change is below.
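
For reference, a hedged sketch of what that change to the fc method in nn.py could look like, assuming the 2018-era LayerHelper API (names approximate); the op type 'fc', its input/output names, and the use_mkldnn attribute are illustrative rather than the actual registered interface:

```python
from paddle.fluid.layer_helper import LayerHelper

def fc(input, size, param_attr=None, bias_attr=None, use_mkldnn=False, name=None):
    helper = LayerHelper('fc', **locals())
    dtype = helper.input_dtype()

    # Weight and bias parameters of the layer.
    w = helper.create_parameter(
        attr=helper.param_attr, shape=[input.shape[1], size], dtype=dtype)
    b = helper.create_parameter(
        attr=helper.bias_attr, shape=[size], dtype=dtype, is_bias=True)

    out = helper.create_tmp_variable(dtype)
    # Append one fused 'fc' op instead of the separate 'mul' + 'elementwise_add' pair.
    helper.append_op(
        type='fc',
        inputs={'Input': input, 'W': w, 'Bias': b},
        outputs={'Out': out},
        attrs={'use_mkldnn': use_mkldnn})
    return out
```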

@mozga-intel
Contributor Author

@luotao1, @dzhwinter, Thank you very much.
