Generate multiple outputs #2789

Closed
SimonBenhamou opened this issue Aug 25, 2023 · 10 comments

@SimonBenhamou

Hello,

I can't find this information anywhere: given a prompt, is it possible to generate multiple hypotheses simultaneously? This functionality is available via the num_return_sequences parameter of the model.generate method in the transformers library, or the num_hypotheses argument in ctranslate2.
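
For reference, here is a minimal sketch of what I do today with transformers (the model name, output length, and sampling settings are just placeholders; the ctranslate2 path is analogous via num_hypotheses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)

# One prompt, 20 sampled continuations, decoded together in a single batched pass.
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=64,
    num_return_sequences=20,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```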

Thanks,
Simon

@ghost

ghost commented Aug 25, 2023

Nope. @KerfuffleV2 mentions a possible method in #1623 (comment).

Edit: Technically possible, but not currently supported.

@SimonBenhamou
Author

Thanks a lot for your answer @JackJollimore.

I did find that issue before posting, but I'm not sure I understand @KerfuffleV2's answer: does this server example perform parallel generation?

Thanks,
Simon

@KerfuffleV2
Collaborator

> @KerfuffleV2 mentions a possible method: #1623 (comment)

Think this is a bit of a different problem. That person wanted to feed multiple different prompts and get all the results while not necessarily actually evaluating anything in parallel. This issue seems like it's about one prompt and producing multiple answers (with different RNG seeds, presumably).

It's very, very clunky, but if it doesn't need interactive mode you could hack together something using the prompt cache stuff from the main example to cache a prompt and then run evaluation with it however many times.
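
Rough sketch of what I mean (untested; the binary and model paths are placeholders, and it relies on the main example's --prompt-cache option with a different seed per run):

```python
import subprocess

MAIN = "./main"                                      # llama.cpp main example binary
MODEL = "./models/llama-7b-v2/ggml-model-q4_0.gguf"  # placeholder model path
PROMPT = "Hello my name is"
CACHE = "prompt.cache"
N_OUTPUTS = 20

outputs = []
for seed in range(1, N_OUTPUTS + 1):
    # The first run evaluates the prompt and writes the cache file;
    # subsequent runs reload it and only pay for the newly generated tokens.
    result = subprocess.run(
        [MAIN, "-m", MODEL, "-p", PROMPT, "-n", "64",
         "-s", str(seed), "--prompt-cache", CACHE],
        capture_output=True, text=True, check=True,
    )
    outputs.append(result.stdout)
```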

@SimonBenhamou
Author

SimonBenhamou commented Aug 25, 2023

Thanks @KerfuffleV2 for the answer.

My use case is the following: I'm currently using ctranslate2/transformers to generate about 20 outputs for the same prompt, after which an arbitration algorithm selects the top 5 outputs based on a number of criteria. Typically with those two frameworks, the generation time doesn't change much whether I generate 1 or 20 outputs, since the decoding logic is based on 2D tensor operations and multidimensional multinomial sampling at each step. That makes the generation of multiple outputs truly parallel.

My tests with llama.cpp on the GPU revealed that the q4 quantizations improve the generation speed quite significantly with respect to other frameworks, for a single output.

If I use your hack, the prompt cache will accelerate only the forward pass through the prompt, but I will still have to spend time proportional to the number of outputs I want, right? For example, I can currently generate 20 outputs at about 15 tokens per second with other frameworks, while I can generate 90 tokens per second for a single output with llama.cpp. If I implement your solution, do you agree that it would be slower than other frameworks when n_outputs > 4?
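
To spell out the back-of-the-envelope numbers behind that question (illustrative only; the 15 and 90 tok/s figures come from my measurements above, and the output length T is made up):

```python
T = 100  # illustrative output length in tokens

# One batched decode covers all outputs at ~15 tok/s per sequence.
batched_time = T / 15

def sequential_time(n: int) -> float:
    """Wall-clock seconds for n separate llama.cpp runs at ~90 tok/s each,
    ignoring prompt evaluation (which the cache hack already saves)."""
    return n * T / 90.0

for n in (1, 4, 6, 20):
    print(f"n={n}: sequential {sequential_time(n):.1f}s vs batched {batched_time:.1f}s")
# The crossover sits somewhere around n = 90 / 15 = 6, so past a handful of
# outputs the sequential approach falls behind.
```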

Or are you saying that once the prompt is cached I can then run the generations in parallel somehow?

@KerfuffleV2
Collaborator

> the generation time doesn't change much whether I generate 1 or 20 outputs, since the decoding logic is based on 2D tensor operations and multidimensional multinomial sampling at each step.

This part is above my head. I don't really understand how it's possible to follow 20 different paths of evaluation, since everything (like the KV state) can diverge at each step, without running a full evaluation through the model.

I'm 98% sure this doesn't exist in llama.cpp currently. If it's possible, it certainly would be a great addition.

> If I implement your solution, do you agree that it would be slower than other frameworks when n_outputs > 4?

Yes, that's correct. My "optimization" only saves having to reevaluate the prompt, so you understood correctly.

@ggerganov
Member

The functionality is possible, but currently it is not implemented in llama.cpp. It's called batched inference / decoding.

An initial implementation has been demonstrated in one of the examples by @xaedes :
https://github.com/ggerganov/llama.cpp/blob/eff86d4f1334c08300d3cb1110dbac3c8e26286c/examples/baby-llama/baby-llama.cpp#L785-L794

But it will take some time to make it part of the main API

@SimonBenhamou
Author

Thanks a lot @ggerganov. Do you think I can adapt the example you linked for my use case, even if it's not part of the API?

@ggerganov
Member

You'll have to make a similar function in llama.cpp and prepare the KV cache correctly to support parallel runs.
I haven't looked into all the details, but I feel it shouldn't be too complicated.

@grilikin

I believe this pull request added the ability to do what is described in this issue?
#3228

# the prompt is "Hello my name is" and the number of sequences that will be generated is 8
./batched ./models/llama-7b-v2/ggml-model-f16.gguf "Hello my name is" 8

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024