Generate multiple outputs #2789

Closed
SimonBenhamou opened this issue Aug 25, 2023 · 10 comments

@SimonBenhamou

Hello,

I can't find this information anywhere: given a prompt, is it possible to generate multiple hypotheses simultaneously? This functionality is available via the num_return_sequences parameter of the model.generate method in the transformers library, or the num_hypotheses argument in ctranslate2.
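
For reference, here is a minimal sketch of what I do today with transformers (the model name, output length, and sampling settings are just placeholders; the ctranslate2 path is analogous via num_hypotheses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)

# One prompt, 20 sampled continuations, decoded together in a single batched pass.
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=64,
    num_return_sequences=20,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```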

Thanks,
Simon

@ghost

ghost commented Aug 25, 2023

Nope. @KerfuffleV2 mentions a possible method in #1623 (comment).

Edit: Technically possible, but not currently supported.

@SimonBenhamou
Author

Thanks a lot for your answer @JackJollimore.

I did find that issue before posting, but I'm not sure I understand @KerfuffleV2's answer: does this server example perform parallel generation?

Thanks,
Simon

@KerfuffleV2
Collaborator

> @KerfuffleV2 mentions a possible method: #1623 (comment)

Think this is a bit of a different problem. That person wanted to feed multiple different prompts and get all the results while not necessarily actually evaluating anything in parallel. This issue seems like it's about one prompt and producing multiple answers (with different RNG seeds, presumably).

It's very, very clunky, but if it doesn't need interactive mode you could hack together something using the prompt cache stuff from the main example to cache a prompt and then run evaluation with it however many times.
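
Rough sketch of what I mean (untested; the binary and model paths are placeholders, and it relies on the main example's --prompt-cache option with a different seed per run):

```python
import subprocess

MAIN = "./main"                                      # llama.cpp main example binary
MODEL = "./models/llama-7b-v2/ggml-model-q4_0.gguf"  # placeholder model path
PROMPT = "Hello my name is"
CACHE = "prompt.cache"
N_OUTPUTS = 20

outputs = []
for seed in range(1, N_OUTPUTS + 1):
    # The first run evaluates the prompt and writes the cache file;
    # subsequent runs reload it and only pay for the newly generated tokens.
    result = subprocess.run(
        [MAIN, "-m", MODEL, "-p", PROMPT, "-n", "64",
         "-s", str(seed), "--prompt-cache", CACHE],
        capture_output=True, text=True, check=True,
    )
    outputs.append(result.stdout)
```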

@SimonBenhamou
Author

SimonBenhamou commented Aug 25, 2023

Thanks @KerfuffleV2 for the answer.

My use case is the following: I'm currently using ctranslate2/transformers to generate about 20 outputs for the same prompt, after which an arbitration algorithm selects the top 5 outputs based on a number of criteria. Typically with those two frameworks, the generation time doesn't change much whether I generate 1 or 20 outputs, since the decoding logic is based on 2D tensor operations and multidimensional multinomial sampling at each step. That makes the generation of multiple outputs truly parallel.

My tests with llama.cpp on the GPU revealed that the q4 quantizations improve the generation speed quite significantly with respect to other frameworks, for a single output.

If I use your hack, the prompt cache will accelerate only the forward pass through the prompt, but I will still have to spend time proportional to the number of outputs I want, right? For example, I can currently generate 20 outputs at about 15 tokens per second with other frameworks, while I can generate 90 tokens per second for a single output with llama.cpp. If I implement your solution, do you agree that it would be slower than other frameworks when n_outputs > 4?
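
To spell out the back-of-the-envelope numbers behind that question (illustrative only; the 15 and 90 tok/s figures come from my measurements above, and the output length T is made up):

```python
T = 100  # illustrative output length in tokens

# One batched decode covers all outputs at ~15 tok/s per sequence.
batched_time = T / 15

def sequential_time(n: int) -> float:
    """Wall-clock seconds for n separate llama.cpp runs at ~90 tok/s each,
    ignoring prompt evaluation (which the cache hack already saves)."""
    return n * T / 90.0

for n in (1, 4, 6, 20):
    print(f"n={n}: sequential {sequential_time(n):.1f}s vs batched {batched_time:.1f}s")
# The crossover sits somewhere around n = 90 / 15 = 6, so past a handful of
# outputs the sequential approach falls behind.
```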

Or are you saying that once the prompt is cached I can then run the generations in parallel somehow?

@KerfuffleV2
Collaborator

> the generation time doesn't change much whether I generate 1 or 20 outputs, since the decoding logic is based on 2D tensor operations and multidimensional multinomial sampling at each step.

This part is above my head. I don't really understand how it's possible to follow 20 different paths of evaluation, since everything (like the KV state) can diverge at each step, without running a full evaluation through the model.

I'm 98% sure this doesn't exist in llama.cpp currently. If it's possible, it certainly would be a great addition.

> If I implement your solution, do you agree that it would be slower than other frameworks when n_outputs > 4?

Yes, that's correct. My "optimization" only saves having to reevaluate the prompt, so you understood correctly.

@ggerganov
Member

The functionality is possible, but currently it is not implemented in llama.cpp. It's called batched inference / decoding.

An initial implementation has been demonstrated in one of the examples by @xaedes :
https://github.com/ggerganov/llama.cpp/blob/eff86d4f1334c08300d3cb1110dbac3c8e26286c/examples/baby-llama/baby-llama.cpp#L785-L794

But it will take some time to make it part of the main API

@SimonBenhamou
Author

Thanks a lot @ggerganov. Do you think I can adapt the example you linked for my use case, even if it's not part of the API?

@ggerganov
Member

You'll have to make a similar function in llama.cpp and prepare the KV cache correctly to support parallel runs.
I haven't looked into all the details, but I feel it shouldn't be too complicated.

@grilikin

I believe this pull request added the ability to do what is described in this issue?
#3228

# the prompt is "Hello my name is" and the number of sequences that will be generated is 8
./batched ./models/llama-7b-v2/ggml-model-f16.gguf "Hello my name is" 8

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024