Second matmul for fully custom attention #227

Open

wants to merge 10 commits into master
Conversation

ngc92 (Contributor) commented Apr 22, 2024

So far this lives only in the /dev files, because for the main script we would also need to touch the backward pass.
For some reason I see a considerable speed-up in the benchmarks here, but in my attempts to use this in the main model, that hasn't really translated.

FeSens (Contributor) commented Apr 24, 2024

How does the speed of matmul_tri compare with cublas?

ngc92 (Contributor, Author) commented Apr 24, 2024

On my A4000, cublas (without tensor cores) is reported at 52% of FP32 capacity, whereas this kernel gets 33%. So it is slower per FLOP, but since it only computes half of the score matrix, it still wins out overall. That changes with tensor cores, though.

I think it's the writing back of results that is still quite bad here.
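
For context, the "calculates only half" point comes from the causal mask: every entry of the (T, T) score matrix with key index j greater than query index i is masked out anyway, so tiles strictly above the diagonal never need to be computed. Below is a minimal sketch of that idea, assuming a row-major (T, HS) layout for Q and K of a single (batch, head) pair; the kernel name and signature are hypothetical and this is not the PR's actual matmul_tri:

```cuda
// Sketch: tiled Q @ K^T that only produces the lower triangle of the
// (T, T) attention score matrix. Thread blocks above the diagonal exit
// immediately, so roughly half the FLOPs of a full matmul are done.
#include <cuda_runtime.h>

#define TILE 32

__global__ void matmul_tri_sketch(float* out, const float* q, const float* k,
                                  int T, int HS, float scale) {
    int row = blockIdx.y * TILE + threadIdx.y;  // query index i
    int col = blockIdx.x * TILE + threadIdx.x;  // key index j

    // tiles strictly above the diagonal contain only masked entries (j > i)
    if (blockIdx.x > blockIdx.y) return;

    __shared__ float qs[TILE][TILE];
    __shared__ float ks[TILE][TILE];

    float acc = 0.0f;
    for (int t = 0; t < HS; t += TILE) {
        qs[threadIdx.y][threadIdx.x] = (row < T && t + threadIdx.x < HS)
            ? q[row * HS + t + threadIdx.x] : 0.0f;
        ks[threadIdx.y][threadIdx.x] = (col < T && t + threadIdx.y < HS)
            ? k[col * HS + t + threadIdx.y] : 0.0f;
        __syncthreads();
        for (int kk = 0; kk < TILE; kk++) {
            acc += qs[threadIdx.y][kk] * ks[kk][threadIdx.x];
        }
        __syncthreads();
    }
    // only the causal (lower-triangular) part is ever written back
    if (row < T && col <= row) {
        out[row * T + col] = acc * scale;
    }
}
```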

ngc92 (Contributor, Author) commented Apr 27, 2024

Some more optimizations, and now it's slightly faster than the tensor-core counterparts. Combined with getting rid of the permutes, this yields a substantial net speedup for the attention kernel (see the sketch below).
Unfortunately, we cannot yet use this in the main model, because the backward pass still assumes the permutations.
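
To illustrate what "getting rid of the permutes" means here: instead of first rearranging the fused QKV activations into a (B, NH, T, HS) layout with separate permute/unpermute kernels, the matmul can index the original fused layout directly, saving that extra memory traffic. A minimal sketch, assuming an llm.c-style (B, T, 3, NH, HS) QKV layout; the helper name is hypothetical and not the PR's actual code:

```cuda
// Sketch: compute the pointer to the query vector of head nh_idx at
// position t directly in the fused (B, T, 3, NH, HS) QKV buffer, so no
// standalone permute kernel is needed before the triangular matmul.
__device__ __forceinline__ const float* query_ptr(const float* qkv,
                                                  int b, int nh_idx, int t,
                                                  int T, int NH, int HS) {
    // queries occupy slot 0 of the "3" dimension; keys would be slot 1, values slot 2
    return qkv + ((size_t)b * T + t) * 3 * NH * HS + (size_t)nh_idx * HS;
}
```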
