
[Bug]: GPT3Tokenizer is not using same encoding with gpt-4 and gpt-3.5 #1171

Closed
blackchoey opened this issue Jan 16, 2024 · 2 comments · Fixed by #1352
Assignees
Labels
bug Something isn't working JS Change/fix applies to JS. If all three, use the 'JS & dotnet & Python' label

Comments

@blackchoey
Contributor

Language

Javascript/Typescript

Version

latest

Description

The GPT3Tokenizer uses the gpt-3-encoder package, which implements the r50k_base encoding. However, gpt-4 and gpt-3.5 use the cl100k_base encoding. Since gpt-4 and gpt-3.5 are the models most commonly used with the Teams-AI library, defaulting to gpt-3-encoder may cause problems.
We need to either update GPT3Tokenizer to use the cl100k_base encoding, or create a new tokenizer for gpt-4/3.5 and make it the default.

Reference: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
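The mismatch comes down to the model-to-encoding mapping documented in the tiktoken cookbook linked above. A minimal sketch of that mapping (the model names and encoding names are from OpenAI's documentation; the lookup function itself is illustrative, not SDK code):

```typescript
// Model-to-encoding mapping per OpenAI's tiktoken documentation.
// gpt-4 and gpt-3.5-turbo use cl100k_base; the older base GPT-3
// models (which gpt-3-encoder targets) use r50k_base.
const MODEL_TO_ENCODING: Record<string, string> = {
  "gpt-4": "cl100k_base",
  "gpt-3.5-turbo": "cl100k_base",
  "text-davinci-003": "p50k_base",
  davinci: "r50k_base", // the family gpt-3-encoder implements
};

function encodingForModel(model: string): string {
  const enc = MODEL_TO_ENCODING[model];
  if (!enc) throw new Error(`Unknown model: ${model}`);
  return enc;
}

console.log(encodingForModel("gpt-4")); // cl100k_base
console.log(encodingForModel("davinci")); // r50k_base
```

Because the encodings have different vocabularies and merge rules, token counts computed with r50k_base will not match what gpt-4/gpt-3.5 actually consume.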

Reproduction Steps

N/A
@blackchoey blackchoey added the bug Something isn't working label Jan 16, 2024
aacebo pushed a commit that referenced this issue Jan 17, 2024
## Linked issues

closes: #1066 

## Details

1. Implement tokenizers for Python based on the JS SDK
2. Changes the underlying encoding to `cl100k_base`, which is used by gpt-4
and gpt-3.5. JS still uses `r50k_base`; I have created
#1171 to track this issue.
3. Rename `GPT3Tokenizer` to `GPTTokenizer`, which better reflects its
functionality, as both gpt-4 and gpt-3.5 can use this
tokenizer.
4. Add unit tests for the code
5. Add docstring for the code
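The rename in point 3 reflects that the class is not model-specific; only the encoding it delegates to is. A hedged sketch of that shape (the interface and class names here are assumptions for illustration, not the SDK's actual code, and the whitespace "encoding" is a stand-in so the example stays self-contained — a real implementation would plug in a cl100k_base BPE encoder):

```typescript
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Stand-in encoding: NOT real BPE, purely illustrative.
class StubEncoding {
  private vocab = new Map<string, number>();
  private words: string[] = [];
  encode(text: string): number[] {
    return text
      .split(/\s+/)
      .filter(Boolean)
      .map((w) => {
        if (!this.vocab.has(w)) {
          this.vocab.set(w, this.words.length);
          this.words.push(w);
        }
        return this.vocab.get(w)!;
      });
  }
  decode(tokens: number[]): string {
    return tokens.map((t) => this.words[t]).join(" ");
  }
}

// Renamed from GPT3Tokenizer: the encoding, not the class, is model-specific.
class GPTTokenizer implements Tokenizer {
  constructor(private encoding: StubEncoding = new StubEncoding()) {}
  encode(text: string): number[] {
    return this.encoding.encode(text);
  }
  decode(tokens: number[]): string {
    return this.encoding.decode(tokens);
  }
}

const tok = new GPTTokenizer();
const ids = tok.encode("hello tokenizer world");
console.log(ids.length); // 3
console.log(tok.decode(ids)); // "hello tokenizer world"
```

Keeping the encoding behind a constructor parameter means a future switch of default encoding is a one-line change rather than a new class.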

## Attestation Checklist

- [x] My code follows the style guidelines of this project

- I have checked for/fixed spelling, linting, and other errors
- I have commented my code for clarity
- I have made corresponding changes to the documentation (we use
[TypeDoc](https://typedoc.org/) to document our code)
- My changes generate no new warnings
- I have added tests that validate my changes and provide sufficient
test coverage. I have tested with:
  - Local testing
  - E2E testing in Teams
- New and existing unit tests pass locally with my changes
@corinagum corinagum added the JS Change/fix applies to JS. If all three, use the 'JS & dotnet & Python' label label Jan 19, 2024
@Stevenic
Collaborator

The issue is there's no 3rd-party implementation of tiktoken for JS that I'm aware of. Everyone just uses gpt-3-encoder because it's close enough. Token counting in general (tiktoken or not) is a rough estimate at best these days, for numerous reasons. OpenAI adds all sorts of hidden tokens to the prompt that you can't easily account for. It's best to just overestimate by a small percentage.
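The overestimation Stevenic suggests can be sketched as a small padding step on top of whatever raw count a tokenizer produces. The 5% margin and the ~4-characters-per-token fallback below are illustrative assumptions (the latter is a common English-text rule of thumb), not values from the SDK:

```typescript
// Pad a raw token count by a small safety margin to absorb the hidden
// per-message tokens OpenAI adds to prompts. Margin is an assumption.
function paddedTokenEstimate(rawCount: number, margin = 0.05): number {
  return Math.ceil(rawCount * (1 + margin));
}

// Crude fallback when no tokenizer is available: roughly 4 characters
// per token for English text. Not exact; illustrative only.
function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(paddedTokenEstimate(100)); // 105
console.log(paddedTokenEstimate(roughTokenCount("a".repeat(400)))); // 105
```

Overestimating slightly trades a little unused context window for never blowing past the model's limit mid-request.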

@corinagum
Collaborator

https://github.com/niieani/gpt-tokenizer seems like a good alternative.
