Metal Prompt Feeding #403
I haven't been tracking this, but ggerganov/llama.cpp#2428 suggests that it's still an issue and it's because the Metal kernels don't support batching. You can probably find more on the repo if you look. Not sure if there's anything we can do from our end, outside of implementing it and submitting it upstream 😅
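To make the batching limitation concrete, here is a minimal sketch with entirely hypothetical names (`Backend`, `eval`, `TokenId`), not the actual llm or ggml-metal API: if the Metal kernels can only evaluate one token per call, prompt feeding degrades from a single batched evaluation into a per-token loop, and the per-call overhead is paid once for every prompt token.

```rust
// Hypothetical types for illustration only; not the real llm/ggml-metal API.
type TokenId = u32;

trait Backend {
    /// Evaluate `tokens` starting at position `pos` in a single graph run.
    fn eval(&mut self, tokens: &[TokenId], pos: usize);
}

/// Batched feeding: the whole prompt goes through one evaluation,
/// so per-call overhead is paid once.
fn feed_prompt_batched<B: Backend>(backend: &mut B, prompt: &[TokenId]) {
    backend.eval(prompt, 0);
}

/// Unbatched feeding: if the kernels only handle one token per call,
/// the prompt is fed token by token and the overhead is paid per token.
fn feed_prompt_unbatched<B: Backend>(backend: &mut B, prompt: &[TokenId]) {
    for (pos, &token) in prompt.iter().enumerate() {
        backend.eval(&[token], pos);
    }
}
```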
I've gotten a chance to look into this more, and I don't think it's an issue upstream. Here's my prompt:
And when I run the same prompt through each project:
llm did attach another token to the prompt, but the main issue is the prompt processing time difference. Both were compiled and run on an M1 Pro with Metal enabled and all layers offloaded to the GPU.
For reference, here's the perf for llm without Metal enabled:
Also, this link was put in the code referencing the llama.cpp fallback to Accelerate for matrix x matrix compute: https://github.com/ggerganov/llama.cpp/blob/e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a/llama.cpp#L1692. However, that branch is not in the newest llama.cpp, so I'm guessing it was worked out.
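For context on that fallback: during single-token generation the matmuls are matrix x vector, but during prompt processing more than one token is in flight, so they become matrix x matrix, and the referenced line routed those onto Accelerate/BLAS on the CPU rather than Metal. The sketch below only illustrates that dispatch idea; the names and the exact condition are assumptions, not the real ggml/llama.cpp code.

```rust
// Illustrative dispatch sketch; names and condition are assumptions.
enum MatMulPath {
    Metal,      // GPU kernels: matrix x vector, i.e. single-token generation
    Accelerate, // CPU BLAS fallback: matrix x matrix, i.e. prompt batches
}

fn pick_matmul_path(tokens_in_batch: usize) -> MatMulPath {
    if tokens_in_batch > 1 {
        // Prompt processing: activations form a matrix, so the old code
        // fell back to Accelerate for matrix x matrix multiplication.
        MatMulPath::Accelerate
    } else {
        // Generation: one token at a time, matrix x vector stays on Metal.
        MatMulPath::Metal
    }
}
```

If that branch really is gone upstream, prompt batches would now stay on the Metal path, which would be consistent with llama.cpp's much faster prompt processing above.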
@philpax
Thanks for the investigation! Yes, indeed, that's quite promising. Hoping the upgrade + changing the feeding logic will fix this 🙏
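As a rough idea of what "changing the feeding logic" could mean in practice, here is a hedged sketch; `eval` is a stand-in for one backend graph evaluation and is not a real llm API. The point is to feed the prompt in chunks of up to `n_batch` tokens instead of one token per evaluation, so the GPU path sees real batches during prompt processing.

```rust
// Hedged sketch of chunked prompt feeding; `eval` is a hypothetical callback
// standing in for one backend graph evaluation.
fn feed_prompt_chunked(
    prompt: &[u32],
    n_batch: usize,
    mut eval: impl FnMut(&[u32], usize),
) {
    let mut pos = 0;
    for chunk in prompt.chunks(n_batch.max(1)) {
        // Each call sees up to `n_batch` tokens, so the backend can batch them.
        eval(chunk, pos);
        pos += chunk.len();
    }
}

fn main() {
    let prompt: Vec<u32> = (0..10).collect();
    feed_prompt_chunked(&prompt, 4, |chunk, pos| {
        println!("evaluating {} tokens starting at position {}", chunk.len(), pos);
    });
}
```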
I'm trying to run llama on a Mac using Metal, but I noticed that the accelerators doc states Metal cannot be used for feeding in a prompt with more than one token. Is this an underlying limitation of ggml, or of llm?
I'd love to help enable this, but I'm not sure where to begin.