Microsoft.KernelMemory version 0.68+ compatibility fix #862
I've submitted a few review comments. I'm not really sure how best to handle the one about the empty strings. If you want to go ahead with the current implementation, I'm happy with that as long as there's at least one test covering this weirdness and a comment explaining what's going on.
…of redundant tokens resulting from multi-token characters with ref to PR #862
@martindevans I pushed the relevant changes. I created a duplicate unit test with only the Unicode cases and added an explanatory comment (also referenced in the GetTokens implementations).
fixes #859
Issue details
The latest version of Microsoft.KernelMemory (0.68.240716.1 in my case) adds IReadOnlyList<string> GetTokens(string) to the Microsoft.KernelMemory.AI.ITextTokenizer interface.
This breaks any project that references the latest LLamaSharp.kernel-memory and Microsoft.KernelMemory.Core packages together, which mostly affects developers who are just getting started with LLamaSharp.
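To make the break concrete, here is a minimal sketch of what the added member means for an implementer. The interface below is a local stand-in that mirrors the shape described above, so the snippet compiles without the package, and WhitespaceTokenizer is a hypothetical class used only for illustration. When an implementer like this is compiled from source without GetTokens, the compiler reports CS0535.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in mirroring the interface shape described in this PR (assumption:
// declared here so the sketch compiles without referencing Microsoft.KernelMemory).
public interface ITextTokenizer
{
    int CountTokens(string text);
    IReadOnlyList<string> GetTokens(string text); // the member added in 0.68+
}

// Hypothetical implementer, not part of LLamaSharp. Deleting GetTokens from this
// class reproduces the break at compile time:
//   error CS0535: 'WhitespaceTokenizer' does not implement interface member
//   'ITextTokenizer.GetTokens(string)'
public sealed class WhitespaceTokenizer : ITextTokenizer
{
    private static readonly char[] Separators = { ' ', '\t', '\n', '\r' };

    // Counting is defined in terms of GetTokens so the two can never disagree.
    public int CountTokens(string text) => GetTokens(text).Count;

    // Arrays implement IReadOnlyList<T>, so the split result can be returned directly.
    public IReadOnlyList<string> GetTokens(string text) =>
        text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
}
```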
How it's solved in this commit
This commit provides a tentative implementation using LLamaContext.Tokenizer to get the tokens in embedding form and StreamingTokenDecoder to turn them back into (parts of) words and return them.
My assumptions for the overall expected behavior are based on the implementation of CountTokens in LLamaSharpTextEmbeddingGenerator and LLamaSharpTextGenerator. This means that it breaks on null input and returns an empty token that corresponds to the BOS embedding. Unit tests also check that the result of CountTokens matches the actual count of the tokens returned from GetTokens.
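The behavior described above (a leading BOS entry, empty strings for characters that span several tokens, and CountTokens agreeing with GetTokens) can be illustrated with a toy model. This is not the code from this PR and does not use LLamaSharp. ToyTokenizer is a hypothetical byte-level tokenizer that emits one token per UTF-8 byte, and a stateful System.Text.Decoder stands in for StreamingTokenDecoder. Representing BOS as an empty string is a simplification made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for illustration only: one token per UTF-8 byte, plus a BOS marker.
public static class ToyTokenizer
{
    public const int Bos = -1; // assumed sentinel, not a real vocabulary id

    public static int[] Tokenize(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text)); // breaks on null input
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        var tokens = new int[bytes.Length + 1];
        tokens[0] = Bos; // BOS is always prepended
        for (int i = 0; i < bytes.Length; i++) tokens[i + 1] = bytes[i];
        return tokens;
    }

    public static int CountTokens(string text) => Tokenize(text).Length;

    public static IReadOnlyList<string> GetTokens(string text)
    {
        // The decoder is stateful: it buffers incomplete byte sequences and only
        // emits characters once a sequence is complete.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var result = new List<string>();
        var one = new byte[1];
        var chars = new char[4];

        foreach (int token in Tokenize(text))
        {
            if (token == Bos)
            {
                result.Add(string.Empty); // BOS carries no text in this toy model
                continue;
            }

            one[0] = (byte)token;
            int written = decoder.GetChars(one, 0, 1, chars, 0, false);
            // written == 0 means the character is still incomplete, which yields
            // an empty entry so the list length stays equal to the token count.
            result.Add(new string(chars, 0, written));
        }

        return result;
    }
}
```

In this toy model, GetTokens("é") yields three entries: an empty string for BOS, an empty string for the first byte of the two-byte character, and "é" once the second byte arrives. Keeping the empty entries is what makes the list length equal to CountTokens, at the cost of callers having to tolerate empty strings.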
Other considerations
In the unit tests I trim the 'actual' result to match the 'expected' to account for the added empty space that corresponds to the BOS token. Issues such as #856 indicate that further clarity will emerge with respect to how this should be properly handled.