Support half precision sigmoid activation #378
Conversation
@hwu36 Thanks for the suggestion. I've added an alternative, vectorized implementation using …, but the faster one has a nontrivial accuracy difference from the TVM result. So I added an …
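The inline code in the comment above was dropped during extraction, so here is a minimal sketch of the two flavors presumably being compared, assuming the accurate path widens to fp32 while the fast path stays in half precision (names are illustrative, not the PR's actual code):

```cpp
#include <cuda_fp16.h>
#include <cmath>

// Accurate flavor: widen to fp32, evaluate, narrow back to half.
__device__ __half sigmoid_fp32_math(__half x) {
  float xf = __half2float(x);
  return __float2half(1.0f / (1.0f + expf(-xf)));
}

// Fast flavor: stay in fp16 end to end; hexp/__hdiv are CUDA half intrinsics.
// Cheaper, but rounding in the intermediate exp can shift results visibly.
__device__ __half sigmoid_fp16_math(__half x) {
  __half one = __float2half(1.0f);
  return __hdiv(one, __hadd(one, hexp(__hneg(x))));
}
```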
@shangz-ai, was it you who told me that we need to use fp32 to compute sigmoid for accuracy?
@hwu36 I'm not sure if we talked about this recently. But yes, in BERT, we usually use fp32 in softmax for better accuracy.
To be clear, this PR is about fp16 sigmoid, not softmax. I think doing sigmoid in fp16 is generally OK; I saw the accuracy issue only when using …
@hwu36 Can we get this in? I have an upcoming PR to TVM that could make use of this (sigmoid activation fusion with fp16 accum).
I will try to get it in this week. I need to go through our internal test process and I may need to make some changes to your PR directly.
I am working on this one now. Most code is rewritten. You can wait a little bit before I commit.
I just committed my change. It basically moves … I confess that I haven't done any correctness or performance testing of my code.
There seem to be some accuracy issues. On a simple test case like …
And on an end-to-end ImageNet model, the output is very different from the output of either of my previous implementations (scalar and vectorized).
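The failing test case itself was dropped from the comment above; a hypothetical host-side check of this shape, comparing a sigmoid seen through half-precision rounding against an fp32 reference, could look like this (compile with nvcc; values and names are illustrative):

```cpp
#include <cstdio>
#include <cmath>
#include <cuda_fp16.h>

int main() {
  float inputs[] = {-8.0f, -2.5f, -0.1f, 0.0f, 0.1f, 2.5f, 8.0f};
  for (float x : inputs) {
    // fp32 reference.
    float ref = 1.0f / (1.0f + std::exp(-x));
    // Round the input through half, as a half_t epilogue would see it.
    float x_h = __half2float(__float2half(x));
    float got = 1.0f / (1.0f + std::exp(-x_h));
    std::printf("x=%6.2f ref=%.6f half-rounded=%.6f diff=%.2e\n",
                x, ref, got, std::fabs(ref - got));
  }
  return 0;
}
```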
Sorry, on the …
There must be something wrong and we need to get to the bottom of it. We need to dump the value after every math step to check. Can you tell me which input to sigmoid generates the most ridiculous output? To debug, we can include this file and call … Do you want to give it a try during your daytime, or should I try it tomorrow?
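A throwaway debugging kernel along the lines suggested here, printing the value after every math step for one suspicious input, might look like this (entirely illustrative, not the helper the comment refers to):

```cpp
#include <cstdio>
#include <cuda_fp16.h>

// Single-thread kernel that prints each intermediate of the sigmoid
// evaluation so a bad math step can be isolated.
__global__ void debug_sigmoid_steps(__half x) {
  printf("input   = %f\n", __half2float(x));
  __half neg = __hneg(x);
  printf("negated = %f\n", __half2float(neg));
  __half e = hexp(neg);
  printf("exp     = %f\n", __half2float(e));
  __half one = __float2half(1.0f);
  __half y = __hdiv(one, __hadd(one, e));
  printf("sigmoid = %f\n", __half2float(y));
}

// Usage: debug_sigmoid_steps<<<1, 1>>>(__float2half(-6.0f));
```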
It seems … The performance on the EfficientNet v2 model improved from 8.26 to 8.05 msec.
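For reference, a sketch of what a sigmoid vectorized over the half-precision exp could look like, assuming an even element count so values pair up into `__half2` (`h2exp` is the CUDA intrinsic that evaluates exp on both lanes in one instruction):

```cpp
#include <cuda_fp16.h>

// Apply sigmoid to n2 packed pairs using the vectorized exp.
__device__ void sigmoid_half2(__half2 const* in, __half2* out, int n2) {
  __half2 one = __float2half2_rn(1.0f);
  for (int i = 0; i < n2; ++i) {
    out[i] = __h2div(one, __hadd2(one, h2exp(__hneg2(in[i]))));
  }
}
```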
Usually we don't hit the case when the size of …
This is an ImageNet model, so there are a lot of …
It depends on the number of channels: 8, 4, 2, or 1.
Then you are not supposed to hit the case where … Can you dump all the numbers of …?
So that code path is indeed used once.
Correct. What is the output channel …?
24
The regular conv that CUTLASS has now is functional for small input channels such as C = 3, but its performance is not great.
If the output channel is 24, you can set …
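The setting elided above is presumably the vector access alignment; the underlying constraint is that the widest vectorized access must divide the channel count. A standalone illustration of that arithmetic (not CUTLASS code):

```cpp
// Widest supported vector width (in elements) that divides the channel count.
constexpr int max_alignment(int channels) {
  return channels % 8 == 0 ? 8
       : channels % 4 == 0 ? 4
       : channels % 2 == 0 ? 2
       : 1;
}

static_assert(max_alignment(24) == 8, "24 output channels allow 8-wide access");
static_assert(max_alignment(3) == 1, "C = 3 falls back to scalar access");
```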
Do we still have the accuracy issue? Is the PR ready to be merged?
No, the accuracy issue has been resolved and it is ready to go!! Thanks
Do you have a new perf number after the vectorized sigmoid based on the vectorized exp? Do you need to change …?
The performance on the EfficientNet v2 model improved from 8.26 to 8.05 msec using the vectorized exp. No need to change …
Currently, trying to use sigmoid activation with `half_t` results in a compile error: …
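For context, this is the kind of instantiation the PR is meant to make compile; the exact template arguments are a sketch and depend on the CUTLASS version:

```cpp
#include "cutlass/half.h"
#include "cutlass/epilogue/thread/linear_combination_sigmoid.h"

// Epilogue functor that applies sigmoid to the linear combination of
// accumulator and source, with all element types in half precision.
// Instantiating this with half_t is what previously failed to compile.
using EpilogueOp = cutlass::epilogue::thread::LinearCombinationSigmoid<
    cutlass::half_t,   // output element type
    8,                 // elements per vectorized access (128-bit for half)
    cutlass::half_t,   // accumulator element type (fp16 accumulation)
    cutlass::half_t    // compute type used inside the epilogue
>;
```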