Metal Prompt Feeding #403
I haven't been tracking this, but ggerganov/llama.cpp#2428 suggests that it's still an issue and it's because the Metal kernels don't support batching. You can probably find more on the repo if you look. Not sure if there's anything we can do from our end, outside of implementing it and submitting it upstream 😅
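To make the batching limitation concrete, here is a minimal sketch with entirely hypothetical names (`Backend`, `eval`, `TokenId`), not the actual llm or ggml-metal API: if the Metal kernels can only evaluate one token per call, prompt feeding degrades from a single batched evaluation into a per-token loop, and the per-call overhead is paid once for every prompt token.

```rust
// Hypothetical types for illustration only; not the real llm/ggml-metal API.
type TokenId = u32;

trait Backend {
    /// Evaluate `tokens` starting at position `pos` in a single graph run.
    fn eval(&mut self, tokens: &[TokenId], pos: usize);
}

/// Batched feeding: the whole prompt goes through one evaluation,
/// so per-call overhead is paid once.
fn feed_prompt_batched<B: Backend>(backend: &mut B, prompt: &[TokenId]) {
    backend.eval(prompt, 0);
}

/// Unbatched feeding: if the kernels only handle one token per call,
/// the prompt is fed token by token and the overhead is paid per token.
fn feed_prompt_unbatched<B: Backend>(backend: &mut B, prompt: &[TokenId]) {
    for (pos, &token) in prompt.iter().enumerate() {
        backend.eval(&[token], pos);
    }
}
```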
I've gotten a chance to look into this more, and I don't think it's an issue upstream. Here's my prompt:
And when I run the same prompt through each project:
llm did attach another token to the prompt, but the main issue is the prompt processing time difference. Both were compiled and run on an M1 Pro with Metal enabled and all layers offloaded to the GPU.
For reference, here's the perf for llm without Metal enabled:
Also, this link was put in the code referencing the llama.cpp fallback to Accelerate for matrix x matrix compute: https://github.com/ggerganov/llama.cpp/blob/e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a/llama.cpp#L1692. However, that branch is not in the newest llama.cpp, so I'm guessing it was worked out.
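For context on that fallback: during single-token generation the matmuls are matrix x vector, but during prompt processing more than one token is in flight, so they become matrix x matrix, and the referenced line routed those onto Accelerate/BLAS on the CPU rather than Metal. The sketch below only illustrates that dispatch idea; the names and the exact condition are assumptions, not the real ggml/llama.cpp code.

```rust
// Illustrative dispatch sketch; names and condition are assumptions.
enum MatMulPath {
    Metal,      // GPU kernels: matrix x vector, i.e. single-token generation
    Accelerate, // CPU BLAS fallback: matrix x matrix, i.e. prompt batches
}

fn pick_matmul_path(tokens_in_batch: usize) -> MatMulPath {
    if tokens_in_batch > 1 {
        // Prompt processing: activations form a matrix, so the old code
        // fell back to Accelerate for matrix x matrix multiplication.
        MatMulPath::Accelerate
    } else {
        // Generation: one token at a time, matrix x vector stays on Metal.
        MatMulPath::Metal
    }
}
```

If that branch really is gone upstream, prompt batches would now stay on the Metal path, which would be consistent with llama.cpp's much faster prompt processing above.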
@philpax
Thanks for the investigation! Yes, indeed, that's quite promising. Hoping the upgrade + changing the feeding logic will fix this 🙏
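As a rough idea of what "changing the feeding logic" could mean in practice, here is a hedged sketch; `eval` is a stand-in for one backend graph evaluation and is not a real llm API. The point is to feed the prompt in chunks of up to `n_batch` tokens instead of one token per evaluation, so the GPU path sees real batches during prompt processing.

```rust
// Hedged sketch of chunked prompt feeding; `eval` is a hypothetical callback
// standing in for one backend graph evaluation.
fn feed_prompt_chunked(
    prompt: &[u32],
    n_batch: usize,
    mut eval: impl FnMut(&[u32], usize),
) {
    let mut pos = 0;
    for chunk in prompt.chunks(n_batch.max(1)) {
        // Each call sees up to `n_batch` tokens, so the backend can batch them.
        eval(chunk, pos);
        pos += chunk.len();
    }
}

fn main() {
    let prompt: Vec<u32> = (0..10).collect();
    feed_prompt_chunked(&prompt, 4, |chunk, pos| {
        println!("evaluating {} tokens starting at position {}", chunk.len(), pos);
    });
}
```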
I'm trying to run llama on a Mac using Metal, but I noticed that the accelerators doc states Metal cannot be used for feeding in a prompt with more than one token. Is this an underlying limitation of ggml, or of llm?
I'd love to help enable this, but I'm not sure where to begin.