
TFLite: reduced redundant calculation in uint8/float conv.h #42477

Merged

Conversation

danielyou0230
Contributor

This PR contains almost identical changes to the TF Lite convolution kernel as in #41947, but for different versions (uint8 and float). The testing platform is identical as well; note that this platform has no FPU available, so floating-point operations are performed in software.

The following is the summary of improvements on TFLM examples in the repo:

  • [uint8] Sped up inference by 12.50% on person_detection (uint8 model)
  • [float] Sped up inference by 1.93% on magic_wand (float model)
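
To make the intent of the change concrete, here is a minimal, hypothetical sketch of the general pattern (this is not the actual conv.h diff; all names and shapes are illustrative): index arithmetic that does not depend on the innermost loop variable is hoisted out of that loop, so it is evaluated once per outer iteration instead of once per multiply-accumulate.

```cpp
#include <cstdint>

// Hypothetical sketch, not the actual diff: redundant index arithmetic in the
// inner loop is hoisted to the enclosing loop. All names are illustrative.
int32_t AccumulateBefore(const int32_t* input, const int32_t* filter,
                         int rows, int depth) {
  int32_t acc = 0;
  for (int r = 0; r < rows; ++r) {
    for (int d = 0; d < depth; ++d) {
      // Redundant: r * depth is re-evaluated for every d.
      acc += input[r * depth + d] * filter[r * depth + d];
    }
  }
  return acc;
}

int32_t AccumulateAfter(const int32_t* input, const int32_t* filter,
                        int rows, int depth) {
  int32_t acc = 0;
  for (int r = 0; r < rows; ++r) {
    const int base = r * depth;  // loop-invariant part computed once per row
    for (int d = 0; d < depth; ++d) {
      acc += input[base + d] * filter[base + d];
    }
  }
  return acc;
}
```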

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Aug 18, 2020
@gbaned gbaned self-assigned this Aug 19, 2020
@gbaned gbaned added the comp:lite TF Lite related issues label Aug 19, 2020
@gbaned gbaned requested a review from lu-wang-g August 19, 2020 03:02
@bjacob
Contributor

bjacob commented Aug 19, 2020

The diff is fine (so long as tests pass).
This reference code was written as explicitly reference-only code where speed was not a concern at all. It was never intended to be run in any application. If I understand the above comment correctly, TFLite-for-Microcontrollers breaks this assumption by actually using this as the user code path?

In that case, you have much more low-hanging fruit to pick here. The Offset() calculations in the inner loop are a good place to start: it would be more efficient to expand these Offset() calls as local code, then distribute the resulting code as much as possible out of the inner loop, typically performing only simple pointer increments in the inner loops.
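
As a hedged, self-contained illustration of that idiom (the names and shapes below are mine, not taken from conv.h): the base address is computed once outside the innermost loop instead of re-deriving it with an Offset()-style helper on every iteration, so the inner loop only performs unit-stride accesses.

```cpp
#include <cstddef>

// Hypothetical illustration of the suggested idiom; not TFLite code.
float SumChannels(const float* input, int width, int depth, int y, int x) {
  // Conceptually "before": input[Offset(shape, 0, y, x, c)] inside the loop,
  // which redoes the stride multiplications for every channel c.
  //
  // "After": one address computation, then simple unit-stride accesses.
  const float* in_ptr = input + (static_cast<std::size_t>(y) * width + x) *
                                    static_cast<std::size_t>(depth);
  float sum = 0.0f;
  for (int c = 0; c < depth; ++c) {
    sum += in_ptr[c];
  }
  return sum;
}
```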

However, doing so would start to break down the design of this code as easy-to-read reference code.

Bottom line: there is a contradiction between, on the one hand, TFLM using reference code, and on the other hand, TFLM users caring about 10% speed differences. It seems reasonable to care about 10% speed differences, so I suppose that the natural course of action here would be for TFLM to start having its own kernels rather than use reference code. Going in that direction, however, I would suggest checking with @Maratyszcza if XNNPACK might readily provide better optimized code suitable for the portability requirements of TFLM (likely meaning - no NEON code).

@lu-wang-g lu-wang-g requested review from advaitjain and removed request for lu-wang-g August 19, 2020 22:14
@advaitjain
Member

For TFLM as well, the goal is for the reference code to be readable so that it can serve as a way to implement and debug optimized kernels. As such, reference kernels are not recommended for applications where speed is important.

@danielyou0230 can you please give more context on why a speed up of 10% with the reference kernels is important for you?

I'll defer to @bjacob on the readability impact of this particular PR, but would also recommend against going down the path of optimizing the reference kernels.

@advaitjain advaitjain added the comp:micro Related to TensorFlow Lite Microcontrollers label Aug 19, 2020
@danielyou0230
Contributor Author

Thanks @bjacob and @advaitjain for letting me know what the reference kernels are for, I really appreciate your time and effort reviewing this PR.

@advaitjain I'm currently a SWE intern at Google and my project is to make TensorFlow faster on microcontrollers, e.g. Arduino or a soft CPU on an FPGA like VexRISCV (no Vector extension or SIMD), which is what I mainly test on. Most of these platforms have no hardware-optimized instructions available, e.g. NEON, SSE, AVX, or the RISC-V Vector extension. With that in mind, I looked at the kernels used by TFLM and saw that something could be changed to make inference faster on those microcontrollers. @bjacob your understanding is correct: TFLM is using those reference kernels from TF Lite right now, which is why I'm optimizing them.

I understand that TF Lite needs to keep those reference kernels (intentionally not optimized for speed) for people who want to write optimized code for their particular platforms. At the same time, when TFLM needs its own kernels in the future, it will probably start from the reference kernels that TF Lite has right now. But the changes I made are hardware-independent, so any platform without vector instruction support can still benefit from them. Hopefully, when TFLM releases its own kernels, people can use these changes as a reference for how to improve performance on microcontrollers without additional hardware instructions.

@bjacob
Contributor

bjacob commented Aug 21, 2020

I shouldn't be a reviewer here - I'm not actively working on TFLite.

Regarding this particular diff, I don't think it significantly changes readability, so, as I said above, I'm OK with it. That said, the code idiom here is shared with several other reference kernels, so changing it here will create a discrepancy.

Generally I agree with Advait that we shouldn't do anything to speed up reference code. But OK to make an exception if some production code (TFLM?) unwittingly started relying on reference code and now needs a quick change.

I would like to propose though to prioritize the migration of TFLM away from reference code. That should be as simple as forking the current reference code into a TFLM code directory, stripping any 'reference' in the identifier or namespace names. Then, optimizations like this could be done easily without such discussions, and as I said, there is much more still that could be done to speed up this code while still remaining self-contained portable standard C++ code, starting with dropping the Offset() calls.

@google-ml-butler google-ml-butler bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 21, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 21, 2020
@advaitjain
Member

Sounds good. I have approved this PR. We'll figure out the path for any future optimizations in a separate conversation.

Labels
cla: yes comp:lite TF Lite related issues comp:micro Related to TensorFlow Lite Microcontrollers ready to pull PR ready for merge process size:M CL Change Size: Medium
7 participants