Flash Attention #844
Comments
No, it's not implemented yet. I will merge it for the next version.
Appreciated, your work is amazing!
Truly a joyous occasion! This looks very promising!
Hi, can you check if this works fine for you on the latest version?
I checked with my old MX150 and it works now. The llama.cpp update adding CUDA flash attention without tensor cores must have solved it. Prompt processing is faster now (around 2x), but generation is a bit slower (around 20%). Still, this is a good tradeoff in the end.
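For anyone who wants to measure the tradeoff themselves, here is a minimal sketch. It assumes the llama-cpp-python bindings, where flash attention can be toggled with the `flash_attn` constructor flag (available in recent releases); the model path is a placeholder.

```python
# Minimal benchmark sketch: compare generation speed with and without
# flash attention, assuming llama-cpp-python exposes the `flash_attn` flag.
import time
from llama_cpp import Llama


def time_generation(flash_attn: bool) -> None:
    llm = Llama(
        model_path="model.gguf",  # placeholder path to a local GGUF model
        n_gpu_layers=-1,          # offload all layers to the GPU
        flash_attn=flash_attn,    # toggle the flash attention kernels
        verbose=False,
    )
    prompt = "Explain flash attention in one paragraph."
    start = time.perf_counter()
    out = llm(prompt, max_tokens=128)
    elapsed = time.perf_counter() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(
        f"flash_attn={flash_attn}: {n_tokens} tokens in {elapsed:.2f}s "
        f"({n_tokens / elapsed:.1f} tok/s)"
    )


for fa in (False, True):
    time_generation(fa)
```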
It seems to work fine, and holy hell it's quick too. Thank you!
So I noticed it runs WAY slow, then realized my card isn't set up for that: I'm running ye oldie P40, so no tensor cores. But this fellow over at flash attention apparently made it possible to work without them: ggml-org#7188. I assume this is not implemented here yet; any chance?