
BatchedExecutor Double Buffering #748

Merged

Conversation

martindevans
Member

Added "double buffering" to the batched executor, this allows conversations to be prompted (adding tokens to one batch) while inference is still running on the other batch.

There's some special consideration given to errors. If an error occurs during inference, the previous batch is swapped back to be the "next" batch. This allows infer to be called again (after fixing whatever caused the error).
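
Roughly, the swap-and-swap-back-on-error idea looks like this (a standalone sketch with made-up names, not the actual implementation in this PR):

```csharp
// Standalone sketch of the double-buffering idea described above.
// "TokenBatch" and "DecodeAsync" are placeholders, not real LLamaSharp types.
using System.Collections.Generic;
using System.Threading.Tasks;

class TokenBatch
{
    public List<int> Tokens { get; } = new();
}

class DoubleBufferedExecutor
{
    private TokenBatch _next = new();      // conversations add prompt tokens here
    private TokenBatch _inFlight = new();  // this one is being decoded

    public void Prompt(int token) => _next.Tokens.Add(token);

    public async Task InferAsync()
    {
        // Swap: the filled batch goes in-flight, the other becomes writable again.
        (_inFlight, _next) = (_next, _inFlight);
        try
        {
            await DecodeAsync(_inFlight);
            _inFlight.Tokens.Clear(); // decoded successfully, safe to reuse
        }
        catch
        {
            // Swap back so the failed batch is "next" again and
            // InferAsync can simply be retried once the cause is fixed.
            (_inFlight, _next) = (_next, _inFlight);
            throw;
        }
    }

    private static Task DecodeAsync(TokenBatch batch) => Task.CompletedTask; // placeholder
}
```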

@zsogitbe
Contributor

Martin,
This is probably not the right place to mention this, but I find the batched executor examples confusing and difficult to understand. From a user's point of view I would prefer something more logical and easy to understand, focusing on the important use cases.
Some explanation of what I mean:

  • BatchedExecutorFork.cs:
    o I find the term 'Fork' confusing. Why don't you just use 'Clone'? Clone is immediately clear to anyone.
    o "This demonstrates generating multiple replies to the same prompt, with a shared cache" is clear, so you expect to see something great, but then it gets ugly starting with "// Create the root node of the tree". The whole "Node" class and the use of it is confusing. I think that in some way the Node class should not be needed in this example or should be hidden.
    o then "// Occasionally fork all the active conversations" => "root.Split();". Here it is "fork" and then "split"... this is confusing and, honestly, I do not clearly understand what is happening.
    o The example promises "generating multiple replies to the same prompt", but I do not see where those replies are.

  • BatchedExecutorGuidance.cs:
    o here again the term "Fork" is a bit confusing. Is this equivalent to "Clone"?
    o "var unguidedSampler = new GuidedSampler(null, weight);" here the use of "GuidedSampler" for "unguidedSampler" is a bit confusing.
    o I would rename "GuidedSampler" to "Sampler" and add a method called "Sample" that also does "guidedDecoder.Add(g); guided.Prompt(g); guidance.Prompt(g);", so that the example only needs the single line "guidedSampler.Sample()" and the Add and Prompt calls happen once, inside the "Sampler".
    o I am not sure that the separation of the decoder and sampler is logical from a user's point of view. As a user I would expect decoding and sampling to be hidden. I think that there is just too much detail here.
    o what is the meaning of the "n_len" parameter here? Is it equivalent to the max-tokens parameter we use to limit generation? Is this how we would have to do it in a real application using the library?

  • BatchedExecutorRewind.cs:
    o here the "Node" class is used again... I am not sure what its purpose is here.
    o it is not clear what rewind means. Is it equivalent to saving the context state and then reloading it to that former state? If so, all of this should be hidden or not even needed, and one could simply use save/load of the context state.

So, what I mean here is that for an end user all of this will be very overwhelming and confusing. What they want are simple but real world scenarios that they can immediately understand and use in their work.

@martindevans
Member Author

I find the batched executor examples confusing and difficult to understand

To some extent that's expected: the BatchedExecutor is intended as a primitive that we use to build higher level "user friendly" abstractions (i.e. replacements for the executors). It's meant to expose all of the power of the lower levels, with as much extra safety as possible (e.g. it's not possible to sample a conversation at the wrong time, whereas normally in llama.cpp that would silently fail and return garbage).

That said, any feedback that makes it nicer to use is valuable, keep it coming 😁

I find the term 'Fork' confusing. Why don't you just use 'Clone'

I used "Clone" when first developing this and switched to calling it "Fork" instead. To me "Clone" has a connotation of copying the conversation, which is definitely not what is happening! A fork of a conversation is basically free in terms of memory usage.

It's analogous to the Unix fork system call, which "forks" a process without copying the memory involved.

The whole "Node" class and the use of it is confusing. I think that in some way the Node class should not be needed in this example or should be hidden.

Yeah agreed, it's all a bit ugly. Due to the low level nature of this executor the examples are a bit hard to follow.

In this case the "Node" class represents the tree of conversations; each "Split" of a node corresponds to a "Fork" of the conversation. An easy improvement would be to at least rename "Split" to "Fork" so the same terminology is used everywhere!
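
Roughly, the shape is something like this (a simplified sketch with a stand-in Conversation type, not the actual example code):

```csharp
// Simplified sketch of the tree used in the example.
// "Conversation" here is a minimal stand-in, not the real LLamaSharp class.
using System.Collections.Generic;

class Conversation
{
    // A fork shares the existing KV cache rather than copying it.
    public Conversation Fork() => new Conversation();
}

class Node
{
    private readonly Conversation _conversation;
    private readonly List<Node> _children = new();

    public Node(Conversation conversation) => _conversation = conversation;

    // Renamed from "Split" so the tree uses the same terminology as the executor:
    // each child node wraps a Fork of this node's conversation.
    public void Fork(int count)
    {
        for (var i = 0; i < count; i++)
            _children.Add(new Node(_conversation.Fork()));
    }
}
```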

The example promises "generating multiple replies to the same prompt", but I do not see where those replies are.

You should see an output like this:

[screenshot of the example's console output]

Where you can follow any line (such as the one I've highlighted) to get a reply.

In this example: "Not many people know that in addition to his numerous films and televisions series, the actor George Peppard, best known as Shepard Fairey on "Fawlty Towers," had quite an interesting voice career. He recorded narrations for dozens of documentaries, films, and cartoon shows. Let's take a closer look at some of his best roles".

here the use of "GuidedSampler" for "unguidedSampler" is a bit confusing.

I think this could be switched to use the DefaultSamplingPipeline, would that be clearer?

I am not sure that the separation of the decoder and sampler is logical in point of view of a user. As a user I would expect decoding and the sampling should be hidden. I think that there is just too much detail here.

This goes back to the low level nature of the sampler. If you want to build a custom executor, or define a custom sampling mechanism dealing directly with raw logits, these are the things you would be using. Long term, there should be new executors which wrap this stuff up in more limited but much more "user-friendly" classes.

We've already got that to some extent, where the current high level executors can accept an ISamplingPipeline and if you don't specify one it just uses DefaultSamplingPipeline.

what is the meaning of the "n_len" parameter here?

This is a terrible name; it should really be something like "TokenCount" - it's simply the number of tokens that will be sampled in the main inference loop (line 61).

Originally these demos were written by porting across bits of llama.cpp demos (with all the terrible C++ naming conventions) and calling the low level native functions. n_len is a leftover that never got changed.
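
In other words it only bounds a loop like this (an illustrative sketch, not the example's actual code):

```csharp
// Illustrative sketch only: n_len (better named "TokenCount") just bounds the
// main generation loop, playing the same role as a max-tokens limit.
const int n_len = 64;

for (var i = 0; i < n_len; i++)
{
    // 1. decode the current batch
    // 2. sample one token per conversation from the resulting logits
    // 3. prompt each conversation with its sampled token, feeding the next batch
}
```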

not clear what the meaning of rewind is. Is this equivalent to saving the context state and then reloading it to that former state? If yes, then all of this should be hidden or not even needed and just simply use save/load context state.

Say you have a conversation that generates tokens [A, B, C, D, E, F]. You can rewind that state to [A, B, C] and continue generating from there.

This is not equivalent to save/load for two reasons.

The first is simply usability - if you can't rewind you'd need to know ahead of time when you need to save a "checkpoint" to rewind to. Saving is fairly expensive (it's a large copy), so you want to minimise the need to defensively save just in case you ever want to come back to this point.

More importantly, it's much less memory efficient to save and load! If you save a state and then load it into a new conversation you have copied that entire state, consuming a lot of extra memory. Whereas if you rewind, you are using the same memory cells for [A, B, C].
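
A toy way to see the difference (this models the conversation as a plain token list, not the real KV cache):

```csharp
// Toy model: the "conversation" is just a list of token ids.
// Rewinding trims the list in place; save/load copies the whole thing.
using System;
using System.Collections.Generic;
using System.Linq;

var conversation = new List<char> { 'A', 'B', 'C', 'D', 'E', 'F' };

// Rewind to [A, B, C]: the surviving entries are the same cells, nothing is copied.
conversation.RemoveRange(3, conversation.Count - 3);
Console.WriteLine(string.Join(", ", conversation)); // A, B, C

// Save/load equivalent: a checkpoint is a full copy of the state,
// so the same tokens now occupy memory twice.
var checkpoint = conversation.ToList();
Console.WriteLine(checkpoint.Count); // 3, but stored separately from the original
```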

@zsogitbe
Contributor

OK, if this is for library developers, then I would suggest creating a separate solution for this type of example. The 'examples' should help the users of the library, not confuse them 😵. Or, when you split the code into low level and high level parts, maybe these examples can be moved to the low level part.

I think that several names and parts of the logic should be changed, even for library developers. I will not repeat the above, but just mention that you did not convince me with your explanation of 'fork' 😅. If I fork a GitHub repository it is copied completely, and I do not use Unix, so neither analogy helps here. Maybe you should find a better name. It would probably be easier to understand if you 'Clone' the context, which is probably what you mean, if I understand you well.

@AsakusaRinne
Collaborator

I will not repeat the above, but just mention that you did not convince me with your explanation of 'fork'

@zsogitbe For the current implementations in LLamaSharp, there's little semantic difference between using Fork and using Clone. However, I agree with using Fork because it becomes clearer when it comes to beam search. In llama.cpp, beam search is fully wrapped as an exported C API, so an abstraction like SequenceGroup is not strictly necessary. However, since beam search is not too difficult for users to implement themselves, we should take it into account when naming the APIs.

Beam search in an LLM can be described as the following steps (a compact sketch follows the list).

  1. Create a SequenceGroup with one sequence, which contains the prompt. Then set the beam search size to 3.
  2. Prefill the first sequence (let's call it seq1), and fork seq2 and seq3 from seq1.
  3. Sample the output of seq1 and take the 3 highest scoring tokens. The highest one will be added to seq1, the second to seq2, the third to seq3.
  4. Compute seq1, seq2 and seq3 simultaneously, and get the top-3 scoring tokens for each sequence. Now we have 9 candidate sequences.
  5. Sort the 9 candidates by their global score, and take the highest 3 as seq1, seq2 and seq3 for the next loop. The global score here refers to a score computed over all the tokens generated so far. Let's assume the global scores are [[0.2, 0.3, 0.1], [0.5, 0.2, 0.7], [0.8, 0.9, 0.2]]. Two of the highest 3 scores appear in the original seq3, and the other appears in the original seq2. Now we keep the candidate with 0.9 as the next seq3 (only append the token), turn the one with 0.8 into the next seq1 (fork seq3 and append), and keep the one with 0.7 as the next seq2 (only append the token).
  6. Repeat steps 4 and 5.
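
A compact, self-contained sketch of that loop (the Beam type and the random scorer are toy stand-ins, not LLamaSharp APIs; the point is only to show where Fork fits in):

```csharp
// Toy sketch of the beam-search loop described above (beam width 3). The first
// winner from a parent appends in place; any further winner forks that parent.
using System;
using System.Collections.Generic;
using System.Linq;

const int beamWidth = 3;
var rng = new Random(42);

// Steps 1-2: one prefilled sequence, then fork the remaining beams from it.
var beams = new List<Beam> { new Beam() };
while (beams.Count < beamWidth)
    beams.Add(beams[0].Fork());

for (var step = 0; step < 4; step++)
{
    // Steps 3-4: "decode" each beam and collect its top candidate tokens,
    // each with a global score covering everything generated so far.
    var candidates = new List<(Beam parent, int token, double globalScore)>();
    foreach (var beam in beams)
        for (var k = 0; k < beamWidth; k++)
            candidates.Add((beam, rng.Next(1000), beam.Score + rng.NextDouble()));

    // Step 5: keep the best `beamWidth` candidates overall. Forks are taken
    // before any appends, so a fork never contains a sibling's new token
    // (this is the 0.9 / 0.8 case from the list above).
    var used = new HashSet<Beam>();
    var next = new List<(Beam seq, int token, double globalScore)>();
    foreach (var (parent, token, globalScore) in candidates.OrderByDescending(c => c.globalScore).Take(beamWidth))
        next.Add((used.Add(parent) ? parent : parent.Fork(), token, globalScore));

    foreach (var (seq, token, globalScore) in next)
        seq.Append(token, globalScore);

    beams = next.Select(n => n.seq).ToList(); // Step 6: loop.
}

foreach (var b in beams)
    Console.WriteLine($"score={b.Score:F2} tokens=[{string.Join(",", b.Tokens)}]");

// Minimal stand-in for a sequence; in the real executor Fork would share the
// KV cache of the common prefix instead of copying the token list.
class Beam
{
    public List<int> Tokens { get; } = new();
    public double Score { get; private set; }

    public Beam Fork()
    {
        var copy = new Beam { Score = Score };
        copy.Tokens.AddRange(Tokens);
        return copy;
    }

    public void Append(int token, double globalScore)
    {
        Tokens.Add(token);
        Score = globalScore;
    }
}
```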

After the Fork (or Clone, as you prefer), the sequences are still closely related to each other within the same SequenceGroup.

Taking your analogy of repo clone & fork: regard the SequenceGroup as GitHub and the earliest sequence as SciSharp/LLamaSharp, with the constraint that at most 3 of the best LLamaSharp repos can exist at the same time. Then zsogitbe/LLamaSharp, martindevans/LLamaSharp and SciSharp/LLamaSharp will each have 3 branches under development. If two of your branches have the highest scores, one of them will be merged back into SciSharp/LLamaSharp. The semantics here are clearly closer to Fork than to git clone.

There's also another reason: after forking a sequence, its KV cache is not copied immediately. The copying happens only when one of the sequences is used to generate the next token.

@martindevans
Member Author

There's also another reason: after forking a sequence, its KV cache is not copied immediately. The copying happens only when one of the sequences is used to generate the next token.

The copying doesn't happen then either!

If you do this:

  1. Seq1, generate [A, B, C]
  2. Fork into Seq2
  3. Seq1, generate [A, B, C, D, E, F]
  4. Seq2, generate [A, B, C, X, Y, Z]

Then you will have 9 occupied KV slots, like this: [A, B, C, D, E, F, X, Y, Z]. [A, B, C] is never copied; it is always shared between the two sequences.

This is why using save and load to clone a conversation, instead of forking it, is bad. As soon as you save and load, the slots are no longer shared and you are taking up a lot more memory and compute.
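
A toy way to see the slot accounting (just counting cells in a shared prefix, not the real KV cache):

```csharp
// Toy accounting of the example above: a shared prefix is stored once,
// and each fork only adds cells for the tokens where it diverges.
using System;
using System.Collections.Generic;

var shared = new List<char> { 'A', 'B', 'C' };    // generated before the fork
var seq1Tail = new List<char> { 'D', 'E', 'F' };  // Seq1 after the fork
var seq2Tail = new List<char> { 'X', 'Y', 'Z' };  // Seq2 after the fork

// Fork: 3 (shared) + 3 + 3 = 9 occupied KV slots.
Console.WriteLine($"fork:      {shared.Count + seq1Tail.Count + seq2Tail.Count} slots");

// Save/load "clone": the prefix is duplicated, so 6 + 6 = 12 slots.
Console.WriteLine($"save/load: {2 * (shared.Count + seq1Tail.Count)} slots");
```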

@martindevans martindevans merged commit ad6f22c into SciSharp:master May 23, 2024
6 checks passed
@martindevans martindevans deleted the batched_executor_double_buffer branch May 23, 2024 02:30