Flash Attention #844
Comments
No, it's not implemented yet. I will merge it for the next version.
Appreciated, your work is amazing!
Truly a joyous occasion! This looks very promising!
Hi, can you check if this works fine for you on the latest version?
I checked with my old MX150 and it works now. The llama.cpp update adding CUDA flash attention without tensor cores must have solved it. Prompt processing is faster now (around 2x), but generation is a bit slower (around 20%). Still, this is a good tradeoff in the end.
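For anyone who wants to measure the tradeoff themselves, here is a minimal sketch. It assumes the llama-cpp-python bindings, where flash attention can be toggled with the `flash_attn` constructor flag (available in recent releases); the model path is a placeholder.

```python
# Minimal benchmark sketch: compare generation speed with and without
# flash attention, assuming llama-cpp-python exposes the `flash_attn` flag.
import time
from llama_cpp import Llama


def time_generation(flash_attn: bool) -> None:
    llm = Llama(
        model_path="model.gguf",  # placeholder path to a local GGUF model
        n_gpu_layers=-1,          # offload all layers to the GPU
        flash_attn=flash_attn,    # toggle the flash attention kernels
        verbose=False,
    )
    prompt = "Explain flash attention in one paragraph."
    start = time.perf_counter()
    out = llm(prompt, max_tokens=128)
    elapsed = time.perf_counter() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(
        f"flash_attn={flash_attn}: {n_tokens} tokens in {elapsed:.2f}s "
        f"({n_tokens / elapsed:.1f} tok/s)"
    )


for fa in (False, True):
    time_generation(fa)
```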
It seems to work fine, and holy hell it's quick too. Thank you!
So I noticed it runs WAY slow, then realized my card isn't set up for that: I'm running ye oldie P40, so no tensor cores. But this fellow over at flash attention apparently made it possible to work without them: ggml-org#7188. I assume this is not implemented here yet; any chance?