Chat session state management #560

Merged


eublefar
Contributor

@eublefar eublefar commented Mar 2, 2024

As discussed here #559

I've added a few things:

  • StatefulExecutorBase.AddPromptAsync - runs the text through decode without sampling new tokens.
    It works by running InferInternal twice with WaitForInput = true, but that relies on the specific implementation of the child classes rather than on the abstraction itself, so maybe you'll have some suggestions on how to improve it.
  • SessionState record class representing ChatSession state in memory.
  • void LoadSession(SessionState state) and SessionState GetSessionState() for the ChatSession class, to save session state at an arbitrary point.
  • static Task<ChatSession> ChatSession.InitializeSessionFromHistoryAsync(ILLamaExecutor executor, ChatHistory history, CancellationToken cancellationToken = default) and Task<ChatSession> ChatSession.ProcessSystemMessage(string content) to pre-process KV cache.
  • Functions to reset states of StatefulExecutors and LlamaContext.
  • Example ChatSessionWithRestart to show off how it works
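
To illustrate the save/restore idea behind GetSessionState() and LoadSession(), here is a minimal, self-contained sketch. It does not use LLamaSharp: ToySession, GetState and LoadState are hypothetical names invented for this example, and the state is just a list of strings. The point it shows is why a snapshot is serialized (here to JSON via System.Text.Json) rather than kept as a reference: later changes to the live session cannot alter the saved copy.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in for a chat session; NOT the LLamaSharp ChatSession class.
public sealed class ToySession
{
    public List<string> History { get; private set; } = new List<string>();

    public void AddMessage(string message) => History.Add(message);

    // Snapshot as a JSON string, so later changes to History
    // cannot leak into the saved state by reference.
    public string GetState() => JsonSerializer.Serialize(History);

    // Restore by deserializing into a fresh list.
    public void LoadState(string state) =>
        History = JsonSerializer.Deserialize<List<string>>(state) ?? new List<string>();
}

public static class SnapshotDemo
{
    // Saves a checkpoint, mutates the session, then rolls back.
    // Returns the number of messages left after the rollback.
    public static int Run()
    {
        var session = new ToySession();
        session.AddMessage("system: you are a helpful assistant");
        string checkpoint = session.GetState();

        session.AddMessage("user: hello");
        session.AddMessage("assistant: hi!");

        session.LoadState(checkpoint);
        return session.History.Count;
    }
}
```

After the rollback only the system message remains, so SnapshotDemo.Run() returns 1. A real session state would also have to carry the executor and context state described above; this sketch only covers the copy-versus-reference aspect.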

I didn't find any ChatSession unit tests so I did not write any.

BR,
Mykyta

eublefar added 3 commits March 2, 2024 14:51
… from history and process system message methods for pre-processing prompts. Serializing executor state to JSON, to avoid saved states from being updated by reference.
…ration that resets chat to original point of history without extra processing
@martindevans
Member

martindevans commented Mar 2, 2024

Thanks for this PR! I'm not very familiar with the higher level executors, but I've just left a few comments on things I spotted :)

As a general note: I'm personally working on bringing a brand new executor to LLamaSharp based on the BatchedExecutor which I introduced in the last release. It's currently very "low level" compared to these executors, but you may be interested in taking a look at it since it can do a lot of the things you want to do here. I'd be interested in any feedback you have about it.

@eublefar
Contributor Author

eublefar commented Mar 2, 2024

Sorry, I've over-thought it a bit with nullable ExecutorBaseState and context State. I've just made those non-nullable and removed all the ResetState methods (as now there is no need for them, there will always be some state to load so no need to reset).

@eublefar
Contributor Author

eublefar commented Mar 2, 2024

@martindevans

As a general note: I'm personally working on bringing a brand new executor to LLamaSharp based on the BatchedExecutor which I introduced in the last release. It's currently very "low level" compared to these executors, but you may be interested in taking a look at it since it can do a lot of the things you want to do here. I'd be interested in any feedback you have about it.

Huh, it indeed does, I should've checked it out.
As for feedback, my use case is a bit weird for this project I guess, but I shall share it nonetheless :)
Basically the use case for me is that I am making a game in Unity with a lot of prompts.
I started this project of mine some time ago and decided to write my own high-level interface based on NativeAPI/handles directly.

Some notes on what would make BatchedExecutor viable for my use case:

  • I found that it's impossible for me to keep everything in one KV cache on the GPU because of the limited context size, so saving and loading state in memory is very important. Any way to spill the context into RAM seamlessly and efficiently, while still being able to run inference in batch mode, would be much appreciated.
    On a side note: while writing this PR I found out that the vanilla executors actually do something like this (I cleared the KV cache, but when I did not also remove the executor state, the LLM continued the conversation that should have been removed from the KV cache). I'm really curious whether that can be applied to the batched setting.
  • If I process batches that are too big, frames per second suffer, so I need to be able to batch requests implicitly (e.g. split a single prompt/message into multiple runs). If I don't do it manually, llama.cpp just crashes the whole application with an assert that the batch size is too small, so the C# side needs to be smart in this regard.
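
The "split a single prompt into multiple runs" idea from the second point can be sketched as follows. This is a minimal, self-contained illustration, not LLamaSharp code: PromptChunker and Split are hypothetical names, and the tokens are plain integers. It only shows the chunking step; each chunk would then be decoded in its own run, never exceeding the configured batch size.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of LLamaSharp.
public static class PromptChunker
{
    // Splits tokens into consecutive chunks of at most maxBatchSize tokens.
    // Arguments are validated eagerly, before iteration starts.
    public static IEnumerable<ArraySegment<int>> Split(int[] tokens, int maxBatchSize)
    {
        if (tokens is null) throw new ArgumentNullException(nameof(tokens));
        if (maxBatchSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxBatchSize));
        return Iterate(tokens, maxBatchSize);
    }

    private static IEnumerable<ArraySegment<int>> Iterate(int[] tokens, int maxBatchSize)
    {
        for (int offset = 0; offset < tokens.Length; offset += maxBatchSize)
        {
            // The last chunk may be shorter than maxBatchSize.
            int count = Math.Min(maxBatchSize, tokens.Length - offset);
            yield return new ArraySegment<int>(tokens, offset, count);
        }
    }
}
```

For example, splitting 10 tokens with a maximum batch size of 4 yields chunks of 4, 4 and 2 tokens. In a game loop, one chunk could be processed per frame (or per few frames) so that a long prompt does not stall rendering; how many tokens fit into a frame budget is something to measure, not something this sketch decides.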

I'd be happy to help, if that direction would be useful for you.

Member

@martindevans martindevans left a comment


Thanks for fixing those previous comments, this looks good to me 👍

I won't merge it myself immediately, since I'm not very familiar with the high level chat/executor stuff. If no one else has any comments I'll merge it before the next release happens :)

Collaborator

@AsakusaRinne AsakusaRinne left a comment


Thank you for the contribution! The prompt pre-filling part looks good to me, but I have some concerns about the session state part. I'm open to any discussion, so please feel free to ping me if something is blocking you. :)

@martindevans
Member

@eublefar I opened a discussion thread here about your BatchedExecutor feedback, to keep this PR on topic :)

@Xsanf

Xsanf commented Mar 12, 2024

@eublefar > If I process batches too big - frames per second suffer....

I'm not sure if this will help, but it helped me in a similar situation.

By default, we yield control with await UniTask.NextFrame(); and assume that this is sufficient from a control-flow point of view.

                await foreach (var text in executor.InferAsync(query, inferenceParams))
                {
                    output += text;
                    Output.text = output;
                    await UniTask.NextFrame();
                }

At a fixed frequency of 35 FPS

        Application.targetFrameRate = 35;
        QualitySettings.vSyncCount = 1;

During execution, FPS may drop to 6-7 for the entire duration of inference.
This raises the difficult problem of synchronizing two sequences with different periods (a Moiré-like effect): at what exact moment in the frame does the control transfer occur?

It's quite a crude technique, but you can simply add extra await UniTask.NextFrame(); calls:

                await foreach (var text in executor.InferAsync(query, inferenceParams))
                {
                    output += text;
                    Output.text = output;
                    await UniTask.NextFrame();
                    await UniTask.NextFrame();
                    await UniTask.NextFrame();
                }

With one additional UniTask.NextFrame(); FPS reaches 25-35.
With two additional UniTask.NextFrame(); calls, FPS does not drop.

This has almost no effect on the output speed itself, and it eliminates the Moiré-like effect (freezes), since regardless of the generation length, only the generation time of a single token matters.

An additional question: why is the "seed" set in ModelParams, when it would be more logical in inferenceParams, where it is actually needed?
This is not very convenient, because if you need to regenerate a response to get a different variant, you cannot change the "seed" without additional manipulations.

@martindevans
Member

Why is the "seed" set in ModelParams.

LLamaSharp has IModelParams which is everything required to load a model into memory and IContextParams which is everything required to create an inference context. Seed is set on the context params. The ModelParams class is a convenience that implements both interfaces in one place, so you can set all config options in one go. But if you want more control you can implement those two params interfaces yourself.
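
The split described above can be pictured with a toy sketch. Everything in it is hypothetical and simplified for illustration: IToyModelParams, IToyContextParams, ToyParams and ToyContextParams are invented names, not the LLamaSharp interfaces, which have many more members. It only shows the shape of the design: one convenience class implementing both interfaces, plus a context-only implementation that lets the seed change without touching the model-loading settings.

```csharp
using System;

// Hypothetical, simplified interfaces; NOT the real LLamaSharp IModelParams / IContextParams.
public interface IToyModelParams { string ModelPath { get; } }
public interface IToyContextParams { uint Seed { get; } }

// Convenience class implementing both interfaces, mirroring the role described for ModelParams.
public sealed class ToyParams : IToyModelParams, IToyContextParams
{
    public ToyParams(string modelPath) => ModelPath = modelPath;
    public string ModelPath { get; }
    public uint Seed { get; set; }
}

// Context-only parameters, for creating another context with a different seed.
public sealed class ToyContextParams : IToyContextParams
{
    public uint Seed { get; set; }
}

public static class ParamsDemo
{
    // Stand-in for "create a context": it only reads the context-side interface.
    public static string DescribeContext(IToyContextParams p) => $"context(seed={p.Seed})";
}
```

With this shape, ParamsDemo.DescribeContext(new ToyContextParams { Seed = 42 }) returns "context(seed=42)", and a ToyParams instance can be passed to the same method because it also implements the context-side interface.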

(If you want to ask any followup questions please open an issue and ping me, to keep this PR on topic)

@AsakusaRinne
Collaborator

@eublefar Hi, how's it going? Many thanks for your contribution. We'd be happy if you could complete this PR, but no one will blame you if you don't have time to continue. Please let us know if you're not available in the next two weeks and I'll finish this work myself. :)

@eublefar
Contributor Author

@AsakusaRinne Hey, sorry, just got back from the vacation. I'll get on it today :)

@eublefar
Contributor Author

eublefar commented Mar 17, 2024

@AsakusaRinne I've suffered a bit with serializing/deserializing transforms, but it's working now, I think.

@AsakusaRinne AsakusaRinne added the enhancement New feature or request label Mar 17, 2024
@AsakusaRinne AsakusaRinne added this to the v0.11.0 milestone Mar 17, 2024
@martindevans martindevans self-requested a review March 17, 2024 18:33
@eublefar
Contributor Author

eublefar commented Mar 18, 2024

Tests pass on my machine and I can't figure out which one is crashing here, any suggestions?

@martindevans
Member

martindevans commented Mar 18, 2024

Tests are a little flaky at the moment (language models are huge, so I think we're just trying to do too much work in some tests for the GitHub runners to handle). Since you passed on Linux, and this PR isn't really platform specific, you're probably ok. I've restarted the failed CI runs.

@eublefar eublefar requested a review from AsakusaRinne March 20, 2024 08:22
Collaborator

@AsakusaRinne AsakusaRinne left a comment


Hi, sorry for the delay in reviewing. It's impressive and overall it looks good to me; I've left a few comments. :)

@eublefar eublefar requested a review from AsakusaRinne March 21, 2024 11:19
Collaborator

@AsakusaRinne AsakusaRinne left a comment


LGTM, many thanks for this contribution! :)

@AsakusaRinne AsakusaRinne linked an issue Mar 26, 2024 that may be closed by this pull request
@AsakusaRinne AsakusaRinne merged commit b677cdc into SciSharp:master Mar 26, 2024
3 checks passed
Labels
break change enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Switching ChatSessions without writing to file.
4 participants