The `GPT3Tokenizer` uses the `gpt-3-encoder` package, which uses the `r50k_base` encoding. However, gpt-4 and gpt-3.5 use the `cl100k_base` encoding. As the primary models used with the Teams-AI library are gpt-4 and gpt-3.5, using `gpt-3-encoder` by default may cause issues.
We need to either update `GPT3Tokenizer` to use the `cl100k_base` encoding, or create a new tokenizer for gpt-4/gpt-3.5 and set it as the default.
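For illustration, here is a minimal sketch (not part of this PR) using OpenAI's `tiktoken` package to show that the two encodings diverge:

```python
# Minimal sketch comparing the two encodings with OpenAI's tiktoken package.
import tiktoken

text = "Hello, Teams AI!"

r50k = tiktoken.get_encoding("r50k_base")      # what gpt-3-encoder implements
cl100k = tiktoken.get_encoding("cl100k_base")  # what gpt-4/gpt-3.5 actually use

# The two encodings produce different token IDs and often different counts,
# so budgets computed with r50k_base can be wrong for gpt-4/gpt-3.5 prompts.
print(len(r50k.encode(text)), len(cl100k.encode(text)))

# encoding_for_model resolves the correct encoding for a given model name.
assert tiktoken.encoding_for_model("gpt-4").name == "cl100k_base"
```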
## Linked issues
closes: #1066
## Details
1. Implement tokenizers for Python based on the JS SDK.
2. Change the underlying encoding to `cl100k_base`, which is used by gpt-4
and gpt-3.5. JS still uses `r50k_base`; I have created
#1171 to track this issue.
3. Rename `GPT3Tokenizer` to `GPTTokenizer`, which better reflects its
functionality, as both gpt-4 and gpt-3.5 can use this tokenizer (see the
sketch after this list).
4. Add unit tests for the code.
5. Add docstrings for the code.
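As a rough sketch of the shape of the renamed tokenizer (the actual class and method names in the SDK may differ), the core of the change is just the encoding it wraps:

```python
# Hypothetical sketch of the renamed tokenizer; actual SDK names may differ.
from typing import List

import tiktoken


class GPTTokenizer:
    """Tokenizer backed by the cl100k_base encoding (gpt-4 / gpt-3.5)."""

    def __init__(self) -> None:
        self._encoding = tiktoken.get_encoding("cl100k_base")

    def encode(self, text: str) -> List[int]:
        """Encode a string into a list of token IDs."""
        return self._encoding.encode(text)

    def decode(self, tokens: List[int]) -> str:
        """Decode a list of token IDs back into a string."""
        return self._encoding.decode(tokens)
```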
## Attestation Checklist
- [x] My code follows the style guidelines of this project
- I have checked for/fixed spelling, linting, and other errors
- I have commented my code for clarity
- I have made corresponding changes to the documentation (we use
[TypeDoc](https://typedoc.org/) to document our code)
- My changes generate no new warnings
- I have added tests that validate my changes and provide sufficient
test coverage. I have tested with:
- Local testing
- E2E testing in Teams
- New and existing unit tests pass locally with my changes
corinagum added the JS label (Change/fix applies to JS. If all three, use the 'JS & dotnet & Python' label) on Jan 19, 2024
The issue is there's no 3rd-party implementation of tiktoken for JS that I'm aware of. Everyone just uses gpt-3-encoder because it's close enough. Token counting in general (tiktoken or not) is a rough estimate at best for numerous reasons these days: OpenAI adds all sorts of hidden tokens to the prompt that you can't easily account for. It's best to just overestimate by a small percentage.
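A counting helper along those lines might pad the raw count by a few percent; a minimal sketch (the 10% factor below is an arbitrary illustration, not a documented value):

```python
import math

import tiktoken


def estimate_tokens(text: str, model: str = "gpt-4", padding: float = 0.10) -> int:
    """Count tokens with tiktoken, then pad to absorb hidden prompt tokens."""
    encoding = tiktoken.encoding_for_model(model)
    return math.ceil(len(encoding.encode(text)) * (1 + padding))
```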
Language: Javascript/Typescript
Version: latest
Description
The `GPT3Tokenizer` uses the `gpt-3-encoder` package, which uses the `r50k_base` encoding. However, gpt-4 and gpt-3.5 use the `cl100k_base` encoding. As the primary models used with the Teams-AI library are gpt-4 and gpt-3.5, using `gpt-3-encoder` by default may cause issues. We need to either update `GPT3Tokenizer` to use the `cl100k_base` encoding, or create a new tokenizer for gpt-4/gpt-3.5 and set the new tokenizer as the default one.

Reference: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
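The linked cookbook notebook counts chat messages with a fixed per-message overhead on top of the encoded content; a simplified sketch of that approach (overhead constants taken from the notebook for gpt-3.5-turbo/gpt-4):

```python
import tiktoken


def num_tokens_from_messages(messages, model="gpt-4"):
    """Simplified version of the cookbook's counter for gpt-3.5-turbo/gpt-4 chats."""
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # every message is wrapped with start/role/end markers
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for value in message.values():
            num_tokens += len(encoding.encode(value))
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
```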
Reproduction Steps