
Metal Prompt Feeding #403

Open
jafioti opened this issue Aug 12, 2023 · 6 comments
Labels
issue:bug, topic:metal

Comments

@jafioti
Contributor

jafioti commented Aug 12, 2023

I'm trying to run LLaMA on a Mac using Metal, but I noticed the accelerators doc states that Metal cannot be used for feeding in a prompt with more than one token. Is this an underlying limitation of ggml, or of llm?

I'd love to help enable this, but I'm not sure where to begin.

@philpax
Collaborator

philpax commented Aug 12, 2023

I haven't been tracking this, but ggerganov/llama.cpp#2428 suggests that it's still an issue and it's because the Metal kernels don't support batching. You can probably find more on the repo if you look.

Not sure if there's anything we can do from our end, outside of implementing it and submitting it upstream 😅
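
For context, the limitation boils down to a dispatch decision along these lines; this is a hypothetical Rust sketch of the idea, not the actual ggml-metal code:

// Hypothetical: Metal kernels that only handle one token at a time force
// batched (multi-token) prompt evaluation back onto the CPU.
enum Backend {
    Metal,
    Cpu,
}

fn pick_backend(n_tokens: usize, metal_available: bool) -> Backend {
    if metal_available && n_tokens == 1 {
        Backend::Metal // single-token generation steps can run on the GPU
    } else {
        Backend::Cpu // multi-token prompt feeding falls back to the CPU
    }
}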

@philpax added the issue:bug and topic:metal labels Aug 12, 2023
@jafioti
Contributor Author

jafioti commented Aug 29, 2023

I've gotten a chance to look into this more, and I don't think it's an issue upstream. Here's my prompt:

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    
    mid = len(arr) // 2
    left = arr[:mid]
    right = arr[mid:]
    
    left = merge_sort(left)
    right = merge_sort(right)
    
    return merge(left, right)


def merge(left, right):
    if not left:
        return right
    if not right:
        return left
    
    result = []
    
    i = j = 0
    
    while len(left) > i and len(right) > j:
        if left[i] < right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
            
    result.extend(left[i:])
    result.extend(right[j:])
    
    return result

print(merge_sort([56,32,5,2,43]))

And when I run the same prompt through each project:
llama.cpp

llama_print_timings:        load time =   298.62 ms
llama_print_timings:      sample time =    90.26 ms /   128 runs   (    0.71 ms per token,  1418.17 tokens per second)
llama_print_timings: prompt eval time =  1676.02 ms /   250 tokens (    6.70 ms per token,   149.16 tokens per second)
llama_print_timings:        eval time =  3909.66 ms /   127 runs   (   30.78 ms per token,    32.48 tokens per second)
llama_print_timings:       total time =  5686.13 ms

llm

feed_prompt_duration: 6956ms
prompt_tokens: 251
predict_duration: 7818ms
predict_tokens: 269
per_token_duration: 29.063ms

llm did attach one extra token to the prompt, but the main issue is the difference in prompt-processing time. Both projects were compiled and run on an M1 Pro with Metal enabled and all layers offloaded to the GPU.
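
Putting numbers on it: 251 tokens / 6.956 s ≈ 36 tokens per second of prompt throughput for llm, versus llama.cpp's 149.16 tokens per second (roughly a 4x gap), even though the per-token generation speeds (29.06 ms vs 30.78 ms) are nearly identical.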

@jafioti
Contributor Author

jafioti commented Aug 29, 2023

For reference, here's the performance of llm without Metal enabled:

feed_prompt_duration: 8508ms
prompt_tokens: 251
predict_duration: 8565ms
predict_tokens: 252
per_token_duration: 33.988ms

@jafioti
Contributor Author

jafioti commented Aug 30, 2023

Also, this link was left in the code referencing llama.cpp's fallback to Accelerate for matrix × matrix compute: https://github.com/ggerganov/llama.cpp/blob/e1886cf4fe0d0f31661dda52a4a9f34bd9b9009a/llama.cpp#L1692

However, that branch is not in the newest llama.cpp, so I'm guessing it has been worked out.

@jafioti
Contributor Author

jafioti commented Aug 30, 2023

@philpax
For reference, this commit (ggerganov/llama.cpp@bf83bff) fixed the matrix-matrix mul, and it's about two weeks newer than the llama.cpp version we currently use, so we'd need to upgrade to a version past it. (Single-token generation only needs matrix × vector products, but feeding a multi-token prompt in one pass requires matrix × matrix products, which is what the Metal kernels were missing.) I'm sure you're already doing an upgrade as part of the GGUF work, which should land on a sufficiently new version to support the mul.

@philpax
Collaborator

philpax commented Aug 30, 2023

Thanks for the investigation! Yes, indeed, that's quite promising. Hoping the upgrade + changing the feeding logic will fix this 🙏
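
In case it's useful, the shape of the feeding change is roughly this; a hypothetical sketch where evaluate_batch stands in for whatever multi-token call the backend ends up exposing, not llm's actual API:

// Feed the prompt in multi-token chunks instead of one token at a time.
fn feed_prompt<F: FnMut(&[u32], usize)>(
    tokens: &[u32],
    batch_size: usize,
    mut evaluate_batch: F,
) {
    let mut n_past = 0;
    for chunk in tokens.chunks(batch_size) {
        // One forward pass per chunk; n_past tracks how many tokens the
        // KV cache already holds.
        evaluate_batch(chunk, n_past);
        n_past += chunk.len();
    }
}

With Metal kernels that support batching, each chunk runs as one GPU pass instead of batch_size separate single-token passes, which is where the prompt-throughput gap above should close.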
