Generate multiple outputs #2789
Comments
Nope. @KerfuffleV2 mentions a possible method. Edit: Technically possible, but not currently supported.
Thanks a lot for your answer @JackJollimore. I did find this issue before posting, but I'm not sure I understand @KerfuffleV2's answer: does this server example perform parallel generation? Thanks,
Think this is a bit of a different problem. That person wanted to feed multiple different prompts and get all the results while not necessarily evaluating anything in parallel. This issue seems like it's about one prompt and producing multiple answers (with different RNG seeds, presumably). It's very, very clunky, but if it doesn't need interactive mode you could hack together something using the prompt cache stuff from the main example to cache a prompt and then run evaluation with it however many times.
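A rough sketch of that prompt-cache hack (not from the thread, just an illustration): loop over seeds while reusing the main example's `--prompt-cache` file so the prompt is only evaluated once. The binary path, model path, and cache file name are assumptions, and flag names may differ between llama.cpp versions.

```python
# Hypothetical driver for the prompt-cache hack: run the `main` example once per
# seed, reusing a cached prompt evaluation. Paths and flag names are assumptions.
import subprocess

PROMPT = "Write a short product description for a mechanical keyboard."
N_OUTPUTS = 5

outputs = []
for seed in range(N_OUTPUTS):
    result = subprocess.run(
        [
            "./main",                                 # llama.cpp main example binary (assumed path)
            "-m", "./models/7B/ggml-model-q4_0.bin",  # quantized model (assumed path)
            "-p", PROMPT,
            "-n", "128",                              # number of tokens to generate
            "-s", str(seed),                          # different RNG seed per run
            "--prompt-cache", "prompt.cache",         # reuse the cached prompt evaluation
        ],
        capture_output=True,
        text=True,
    )
    outputs.append(result.stdout)

for i, text in enumerate(outputs):
    print(f"--- candidate {i} ---\n{text}")
```

Note that this only shares the prompt evaluation across runs; the per-token generation cost is still paid once per output, which is exactly the limitation discussed in the next comments.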
Thanks @KerfuffleV2 for the answer. My use case is the following: I'm currently using ctranslate2/transformers to generate about 20 outputs for the same prompt, after which I have an arbitrage algorithm to select the top 5 outputs depending on a bunch of criteria. Typically with those two frameworks, the generation time doesn't change much whether I generate 1 or 20 outputs, since the decoding logic is based on 2D tensor operations and multidimensional multinomial sampling at each step. That makes the generation of multiple outputs truly parallel.

My tests with llama.cpp on the GPU revealed that the q4 quantizations improve the generation speed quite significantly with respect to other frameworks, for a single output. If I use your hack, the prompt cache will accelerate only the forward pass through the prompt, but then I will still have to spend time proportional to the number of outputs I want, right? For example, currently I can generate 20 outputs at about 15 tokens per second with other frameworks, and I can generate 90 tokens per second for a single output with llama.cpp. If I implement your solution, do you agree that it would be slower than other frameworks when n_outputs > 4? Or are you saying that once the prompt is cached I can then run the generations in parallel somehow?
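For reference, a minimal sketch of the batched sampling this comment describes, using the transformers `num_return_sequences` parameter; the model name and sampling settings below are placeholders, not from the thread.

```python
# Minimal sketch of batched sampling with transformers' num_return_sequences.
# Model name and sampling settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-org/my-7b-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("My prompt goes here.", return_tensors="pt").to("cuda")

# Each decoding step runs one forward pass over a batch of 20 sequences, so
# sampling is done with 2D tensor ops and 20 outputs cost roughly the same
# wall-clock time per step as 1.
generated = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=20,
    max_new_tokens=128,
)
candidates = tokenizer.batch_decode(generated, skip_special_tokens=True)
# `candidates` would then be passed to the selection ("arbitrage") step.
```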
This part is above my head. I don't really understand how it's possible to follow 20 different paths of evaluation (since everything, like the KV state, can diverge at each step) without running a full evaluation through the model for each one. I'm 98% sure this doesn't exist in llama.cpp currently. If it's possible, it certainly would be a great addition.
Yes, that's correct. My "optimization" only saves having to reevaluate the prompt, so you understood correctly.
The functionality is possible, but currently it is not implemented in llama.cpp. An initial implementation has been demonstrated in one of the examples by @xaedes, but it will take some time to make it part of the main API.
Thanks a lot @ggerganov. Do you think I can adapt the example you linked for my use case, even if it's not part of the API?
You'll have to make a similar function in llama.cpp.
I believe this pull added the ability to do what is described in the issue?
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hello,

I can't find the following information: is it possible to generate, given a prompt, multiple hypotheses simultaneously? This functionality is available via the num_return_sequences parameter in the model.generate method of the transformers library, or the num_hypotheses argument in ctranslate2.

Thanks,
Simon
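For comparison, a hedged sketch of the ctranslate2 side of the question: the model directory and tokenizer are placeholders, and whether num_hypotheses > 1 combines with pure sampling depends on the ctranslate2 version, so check the documentation for yours.

```python
# Rough sketch of num_hypotheses with a ctranslate2 Generator.
# Model directory and tokenizer are placeholders; verify parameter support
# against your ctranslate2 version.
import ctranslate2
import transformers

generator = ctranslate2.Generator("ct2_model_dir", device="cuda")          # converted model (assumed path)
tokenizer = transformers.AutoTokenizer.from_pretrained("my-org/my-model")  # placeholder

prompt_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("My prompt goes here."))

results = generator.generate_batch(
    [prompt_tokens],
    max_length=128,
    sampling_topk=40,
    sampling_temperature=0.8,
    num_hypotheses=20,              # return 20 continuations for the single prompt
    include_prompt_in_result=False,
)

for seq in results[0].sequences:
    print(tokenizer.convert_tokens_to_string(seq))
```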