Second matmul for fully custom attention #227
base: master
Conversation
What is the speed of matmul_tri compared with cuBLAS?
On my A4000, cuBLAS (no tensor cores) is reported at 52% of FP32 capacity, whereas this kernel gets 33%. So it is slower per FLOP, but since it computes only half the matrix, it still wins out: runtime scales as work over throughput, so roughly 1.0/0.52 ≈ 1.9 units for the full cuBLAS matmul versus 0.5/0.33 ≈ 1.5 here. That changes with tensor cores, though. I think it's the writing back of results that is still quite bad here.
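To make the "computes only half" point concrete, here is a minimal sketch of a causal lower-triangular matmul in CUDA. This is my own illustration, not the PR's kernel: the name `matmul_tri_sketch`, the naive per-thread dot product, and the launch shape are all assumptions.

```cuda
// Minimal sketch (not the PR's actual kernel): a naive lower-triangular
// matmul att = scale * Q @ K^T for one attention head. Because the mask
// is causal, blocks strictly above the diagonal exit immediately, so only
// about half the FLOPs of a full T x T matmul are performed.
#include <cuda_runtime.h>

__global__ void matmul_tri_sketch(float* att, const float* q, const float* k,
                                  int T, int hs, float scale) {
    // whole block above the diagonal? bail out before doing any work
    if (blockIdx.x * blockDim.x > (blockIdx.y + 1) * blockDim.y - 1) return;
    int i = blockIdx.y * blockDim.y + threadIdx.y; // query row
    int j = blockIdx.x * blockDim.x + threadIdx.x; // key column
    if (i >= T || j > i) return; // causal mask: keep j <= i only
    float acc = 0.0f;
    for (int d = 0; d < hs; d++) {
        acc += q[i * hs + d] * k[j * hs + d];
    }
    att[i * T + j] = acc * scale;
}

// example launch:
//   dim3 block(32, 32);
//   dim3 grid((T + 31) / 32, (T + 31) / 32);
//   matmul_tri_sketch<<<grid, block>>>(att, q, k, T, hs, 1.0f / sqrtf((float)hs));
```

A production version would add shared-memory tiling and a better write-back path, which is presumably what the optimization discussion above is about.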
Some more optimizations, and now it's slightly faster than the tensor core counterparts. Together with getting rid of the permutes, this yields a substantial net speedup for the attention kernel.
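For context on "getting rid of the permutes": in an llm.c-style attention path, the QKV projection produces an interleaved (B, T, 3, NH, hs) buffer, and the cuBLAS route first permutes it into (B, NH, T, hs) so a strided batched matmul can run on contiguous heads. A fully custom kernel can skip that step by indexing the interleaved layout directly; the helper name and the exact layout below are assumptions for illustration.

```cuda
// Hedged sketch: address q and k in the interleaved (B, T, 3, NH, hs) QKV
// buffer directly, so no separate permute kernel (and no permuted copy of
// the activations) is needed before the attention matmul.
__device__ inline const float* qkv_ptr(const float* inp,
                                       int which, /* 0 = q, 1 = k, 2 = v */
                                       int b, int nh, int t,
                                       int T, int NH, int hs) {
    // element (b, t, which, nh, 0) in a contiguous (B, T, 3, NH, hs) tensor
    return inp + (((size_t)(b * T + t) * 3 + which) * NH + nh) * (size_t)hs;
}

// inside the matmul kernel, the dot product then reads directly from inp:
//   const float* qi = qkv_ptr(inp, 0, b, nh, i, T, NH, hs);
//   const float* kj = qkv_ptr(inp, 1, b, nh, j, T, NH, hs);
//   for (int d = 0; d < hs; d++) acc += qi[d] * kj[d];
```

The saving is both the permute kernels themselves and the extra round trip of the activations through global memory; the cost is that consecutive sequence positions are now 3*NH*hs floats apart, so tiling has to do more work to keep loads coalesced.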
So far this lives only in the /dev files, because wiring it into the main script also requires touching the backward pass.
For some reason, I see a considerable speed-up in the benchmarks here, but in my attempts to use this in the main model, that speed-up hasn't really translated.