
Introduced a new BatchedExecutor #503

Merged: martindevans merged 5 commits into SciSharp:master from batched_executor_again on Feb 15, 2024

Conversation

martindevans (Member) commented on Feb 9, 2024

Created a new BatchedExecutor which processes multiple "Conversations" in a single inference batch. This is faster even when the conversations are unrelated, and much faster if the conversations share some overlap (e.g. a common system prompt prefix).

Conversations can be "forked" to create a copy of a conversation at a given point. This allows, for example, prompting a conversation with a system prefix just once and then forking it repeatedly, once per individual conversation. Conversations can also be "rewound" to an earlier state.

Added two new examples, demonstrating forking and rewinding.
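
Roughly, the forking pattern looks like the following sketch. This is a sketch only: Fork() is assumed from the "forked" terminology above, and the executor/Infer usage follows the steps in "How To Use" below.

```csharp
// Evaluate a shared system prefix once, then fork it for each conversation.
var root = executor.Prompt("You are a helpful assistant.");
await executor.Infer();          // the shared prefix is evaluated only once

var left = root.Fork();          // copies share the prefix's KV cache
var right = root.Fork();

// ...continue left and right independently using the Infer/Sample/Prompt
// loop described under "How To Use" below.
```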

This is currently much "lower level" than the existing executors, and is really just a minimum viable system for moving LLamaSharp over to batching. More work is needed in the future:

  • Wrap Conversation with functionality that the existing executors have (e.g. prompt templates, sampling)
  • Extend Conversation with new capabilities (whatever is needed to handle running out of context)
  • Extend the executor with new capabilities, such as saving/loading the entire batch

How To Use

Brief guide to using the BatchedExecutor:

  1. Create a BatchedExecutor
  2. Create one or more new Conversations with executor.Prompt("hello");
  3. Call executor.Infer() to run inference for all conversations that need it, simultaneously in a single batch
  4. Sample each conversation individually (conversation.Sample())
  5. Prompt the conversation with the token chosen by sampling, or with more user input
  6. goto 3

Conversation objects have flags which indicate what state they're in (waiting for sampling, waiting for inference) and will throw exceptions if you try to use them incorrectly.
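
Putting those steps together, a minimal sketch of the loop might look like this. Only BatchedExecutor, Conversation, executor.Prompt(...), executor.Infer() and conversation.Sample() come from this PR; the namespaces, the model-loading calls (ModelParams, LLamaWeights.LoadFromFile), awaiting Infer() and the ChooseNextToken helper are assumptions/placeholders rather than the exact API.

```csharp
using LLama;
using LLama.Batched;
using LLama.Common;

// Load a model (ModelParams / LLamaWeights.LoadFromFile are assumed here,
// they are not part of this PR).
var parameters = new ModelParams("path/to/model.gguf");
using var model = LLamaWeights.LoadFromFile(parameters);

// 1. Create a BatchedExecutor
using var executor = new BatchedExecutor(model, parameters);

// 2. Create a new Conversation
var conversation = executor.Prompt("hello");

for (var i = 0; i < 32; i++)
{
    // 3. Run inference for every conversation which currently needs it,
    //    all in one batch.
    await executor.Infer();

    // 4. Sample this conversation individually. ChooseNextToken is a
    //    hypothetical placeholder for whatever sampling logic you apply to
    //    the result of Sample().
    var token = ChooseNextToken(conversation.Sample());

    // 5. Prompt the conversation with the sampled token (or more user input),
    //    then loop back to step 3.
    conversation.Prompt(token);
}
```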

…ns" in one single inference batch. This is faster, even when the conversations are unrelated, and is much faster if the conversations share some overlap (e.g. a common system prompt prefix).

Conversations can be "forked", to create a copy of a conversation at a given point. This allows e.g. prompting a conversation with a system prefix just once and then forking it again and again for each individual conversation. Conversations can also be "rewound" to an earlier state.

Added two new examples, demonstrating forking and rewinding.
martindevans force-pushed the batched_executor_again branch from 225ba47 to b0acecf on February 9, 2024 at 23:57
…* access to directly modify the KV cache.

 - Re-implemented `Rewind` as an extension method using `Modify` internally
 - Implemented `ShiftLeft`, which shifts everything over except for some starting tokens. This is the same as the `StatelessExecutor` out-of-context handling.
 - The batch now starts at epoch 1, which ensures that conversations (which start at epoch zero) are below the current epoch. It also means `0` can always be used as a value guaranteed to be below the current epoch.
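
A rough sketch of how these additions might be used (the method names come from the bullets above, but their exact signatures and parameter order are assumptions):

```csharp
// Rewind: discard, for example, the last 8 tokens from the conversation so
// they can be re-prompted or replaced.
conversation.Rewind(8);

// ShiftLeft: drop 32 tokens and shift everything over, leaving the first 4
// tokens untouched (the same idea as the StatelessExecutor out-of-context
// handling). Parameter order here is an assumption.
conversation.ShiftLeft(32, 4);
```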
martindevans merged commit d03c1a9 into SciSharp:master on Feb 15, 2024
3 checks passed
martindevans deleted the batched_executor_again branch on February 15, 2024 at 14:27