
TFLite: reduced redundant calculation in uint8/float conv.h #42477

Merged

Conversation

danielyou0230
Contributor

This PR contains almost identical changes to the TF Lite convolution kernel as in #41947, but for different versions (uint8 and float). The testing platform is identical as well; note that this platform has no FPU available, so floating-point operations are performed in software.

The following is the summary of improvements on TFLM examples in the repo:

  • [uint8] Sped up inference by 12.50% on person_detection (uint8 model)
  • [float] Sped up inference by 1.93% on magic_wand (float model)
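
To make the intent of the change concrete, here is a minimal, hypothetical sketch of the general pattern (this is not the actual conv.h diff; all names and shapes are illustrative): index arithmetic that does not depend on the innermost loop variable is hoisted out of that loop, so it is evaluated once per outer iteration instead of once per multiply-accumulate.

```cpp
#include <cstdint>

// Hypothetical sketch, not the actual diff: redundant index arithmetic in the
// inner loop is hoisted to the enclosing loop. All names are illustrative.
int32_t AccumulateBefore(const int32_t* input, const int32_t* filter,
                         int rows, int depth) {
  int32_t acc = 0;
  for (int r = 0; r < rows; ++r) {
    for (int d = 0; d < depth; ++d) {
      // Redundant: r * depth is re-evaluated for every d.
      acc += input[r * depth + d] * filter[r * depth + d];
    }
  }
  return acc;
}

int32_t AccumulateAfter(const int32_t* input, const int32_t* filter,
                        int rows, int depth) {
  int32_t acc = 0;
  for (int r = 0; r < rows; ++r) {
    const int base = r * depth;  // loop-invariant part computed once per row
    for (int d = 0; d < depth; ++d) {
      acc += input[base + d] * filter[base + d];
    }
  }
  return acc;
}
```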

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Aug 18, 2020
@gbaned gbaned self-assigned this Aug 19, 2020
@gbaned gbaned added the comp:lite TF Lite related issues label Aug 19, 2020
@gbaned gbaned requested a review from lu-wang-g August 19, 2020 03:02
@bjacob
Contributor

bjacob commented Aug 19, 2020

The diff is fine (so long as tests pass).
This reference code was written as explicitly reference-only code where speed was not a concern at all. It was never intended to be run in any application. If I understand the above comment correctly, TFLite-for-Microcontrollers breaks this assumption by actually using this as the user code path?

In that case, you have much more low-hanging fruit to pick here. The Offset() calculations in the inner loop are a good place to start: it would be more efficient to expand these Offset() calls as local code, then distribute the resulting code as much as possible out of the inner loop, typically performing only simple pointer increments in the inner loops.
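
As a hedged, self-contained illustration of that idiom (the names and shapes below are mine, not taken from conv.h): the base address is computed once outside the innermost loop instead of re-deriving it with an Offset()-style helper on every iteration, so the inner loop only performs unit-stride accesses.

```cpp
#include <cstddef>

// Hypothetical illustration of the suggested idiom; not TFLite code.
float SumChannels(const float* input, int width, int depth, int y, int x) {
  // Conceptually "before": input[Offset(shape, 0, y, x, c)] inside the loop,
  // which redoes the stride multiplications for every channel c.
  //
  // "After": one address computation, then simple unit-stride accesses.
  const float* in_ptr = input + (static_cast<std::size_t>(y) * width + x) *
                                    static_cast<std::size_t>(depth);
  float sum = 0.0f;
  for (int c = 0; c < depth; ++c) {
    sum += in_ptr[c];
  }
  return sum;
}
```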

However, doing so would start to break down the design of this code as easy-to-read reference code.

Bottom line: there is a contradiction between, on the one hand, TFLM using reference code, and on the other hand, TFLM users caring about 10% speed differences. It seems reasonable to care about 10% speed differences, so I suppose that the natural course of action here would be for TFLM to start having its own kernels rather than use reference code. Going in that direction, however, I would suggest checking with @Maratyszcza if XNNPACK might readily provide better optimized code suitable for the portability requirements of TFLM (likely meaning - no NEON code).

@lu-wang-g lu-wang-g requested review from advaitjain and removed request for lu-wang-g August 19, 2020 22:14
@advaitjain
Member

For TFLM as well, the goal is for the reference code to be readable so that it can serve as a way to implement and debug optimized kernels. As such, reference kernels are not recommended for applications where speed is important.

@danielyou0230 can you please give more context on why a speed up of 10% with the reference kernels is important for you?

I'll defer to @bjacob on the readability impact of this particular PR, but would also recommend against going down the path of optimizing the reference kernels.

@advaitjain advaitjain added the comp:micro Related to TensorFlow Lite Microcontrollers label Aug 19, 2020
@danielyou0230
Contributor Author

Thanks @bjacob and @advaitjain for letting me know what the reference kernels are for, I really appreciate your time and effort reviewing this PR.

@advaitjain I'm currently a SWE intern at Google and my project is to make TensorFlow faster on microcontrollers, e.g. Arduino or a soft CPU on an FPGA like VexRISCV (no Vector extension or SIMD), which is what I mainly test on. Most of these platforms have no hardware-optimized instructions available, e.g. NEON, SSE, AVX, or the RISC-V Vector extension. With that in mind, I looked at the kernels used by TFLM and saw that something could be changed to make inference faster on those microcontrollers. @bjacob your understanding is correct: TFLM is using those reference kernels from TF Lite right now, which is why I'm optimizing them.

I understand that TF Lite needs to keep those reference kernels (intentionally not optimized for speed) for people who want to write optimized code for their particular platforms. At the same time, when TFLM needs its own kernels in the future, it will probably start from the reference kernels that TF Lite has right now. But the changes I made are hardware-independent, so any platform without vector instruction support can still benefit from them. Hopefully, when TFLM releases its own kernels, people can use these changes as a reference for how to improve performance on microcontrollers without additional hardware instructions.

@bjacob
Contributor

bjacob commented Aug 21, 2020

I shouldn't be a reviewer here - I'm not actively working on TFLite.

Regarding this particular diff, I don't think it significantly changes readability, so, as I said above, I'm OK with it. That said, the code idiom here is shared with several other reference kernels, so changing it here will create a discrepancy.

Generally I agree with Advait that we shouldn't do anything to speed up reference code. But OK to make an exception if some production code (TFLM?) unwittingly started relying on reference code and now needs a quick change.

I would like to propose though to prioritize the migration of TFLM away from reference code. That should be as simple as forking the current reference code into a TFLM code directory, stripping any 'reference' in the identifier or namespace names. Then, optimizations like this could be done easily without such discussions, and as I said, there is much more still that could be done to speed up this code while still remaining self-contained portable standard C++ code, starting with dropping the Offset() calls.

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 21, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 21, 2020
@advaitjain
Member

Sounds good. I have approved this PR. We'll figure out the path for any future optimizations in a separate conversation.

Labels
cla: yes comp:lite TF Lite related issues comp:micro Related to TensorFlow Lite Microcontrollers ready to pull PR ready for merge process size:M CL Change Size: Medium
7 participants