
[Bug]: GPT3Tokenizer is not using same encoding with gpt-4 and gpt-3.5 #1171

Closed
blackchoey opened this issue Jan 16, 2024 · 2 comments · Fixed by #1352
Assignees
Labels
bug Something isn't working JS Change/fix applies to JS. If all three, use the 'JS & dotnet & Python' label

Comments

@blackchoey
Contributor

Language

Javascript/Typescript

Version

latest

Description

The GPT3Tokenizer uses the gpt-3-encoder package, which implements the r50k_base encoding. However, gpt-4 and gpt-3.5 use the cl100k_base encoding. Since gpt-4 and gpt-3.5 are the models most commonly used with the Teams-AI library, defaulting to gpt-3-encoder may cause problems.
We need to either update GPT3Tokenizer to use the cl100k_base encoding, or create a new tokenizer for gpt-4/3.5 and make it the default.

Reference: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
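The mismatch comes down to the model-to-encoding mapping documented in the tiktoken cookbook linked above. A minimal sketch of that mapping (the model names and encoding names are from OpenAI's documentation; the lookup function itself is illustrative, not SDK code):

```typescript
// Model-to-encoding mapping per OpenAI's tiktoken documentation.
// gpt-4 and gpt-3.5-turbo use cl100k_base; the older base GPT-3
// models (which gpt-3-encoder targets) use r50k_base.
const MODEL_TO_ENCODING: Record<string, string> = {
  "gpt-4": "cl100k_base",
  "gpt-3.5-turbo": "cl100k_base",
  "text-davinci-003": "p50k_base",
  davinci: "r50k_base", // the family gpt-3-encoder implements
};

function encodingForModel(model: string): string {
  const enc = MODEL_TO_ENCODING[model];
  if (!enc) throw new Error(`Unknown model: ${model}`);
  return enc;
}

console.log(encodingForModel("gpt-4")); // cl100k_base
console.log(encodingForModel("davinci")); // r50k_base
```

Because the encodings have different vocabularies and merge rules, token counts computed with r50k_base will not match what gpt-4/gpt-3.5 actually consume.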

Reproduction Steps

N/A
@blackchoey blackchoey added the bug Something isn't working label Jan 16, 2024
aacebo pushed a commit that referenced this issue Jan 17, 2024
## Linked issues

closes: #1066 

## Details

1. Implement tokenizers for Python based on the JS SDK
2. Changes the underlying encoding to `cl100k_base`, which is used by gpt-4
and gpt-3.5. JS still uses `r50k_base`; I have created
#1171 to track this issue.
3. Rename `GPT3Tokenizer` to `GPTTokenizer`, which better reflects its
functionality, as both gpt-4 and gpt-3.5 can use this
tokenizer.
4. Add unit tests for the code
5. Add docstring for the code
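The rename in point 3 reflects that the class is not model-specific; only the encoding it delegates to is. A hedged sketch of that shape (the interface and class names here are assumptions for illustration, not the SDK's actual code, and the whitespace "encoding" is a stand-in so the example stays self-contained — a real implementation would plug in a cl100k_base BPE encoder):

```typescript
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Stand-in encoding: NOT real BPE, purely illustrative.
class StubEncoding {
  private vocab = new Map<string, number>();
  private words: string[] = [];
  encode(text: string): number[] {
    return text
      .split(/\s+/)
      .filter(Boolean)
      .map((w) => {
        if (!this.vocab.has(w)) {
          this.vocab.set(w, this.words.length);
          this.words.push(w);
        }
        return this.vocab.get(w)!;
      });
  }
  decode(tokens: number[]): string {
    return tokens.map((t) => this.words[t]).join(" ");
  }
}

// Renamed from GPT3Tokenizer: the encoding, not the class, is model-specific.
class GPTTokenizer implements Tokenizer {
  constructor(private encoding: StubEncoding = new StubEncoding()) {}
  encode(text: string): number[] {
    return this.encoding.encode(text);
  }
  decode(tokens: number[]): string {
    return this.encoding.decode(tokens);
  }
}

const tok = new GPTTokenizer();
const ids = tok.encode("hello tokenizer world");
console.log(ids.length); // 3
console.log(tok.decode(ids)); // "hello tokenizer world"
```

Keeping the encoding behind a constructor parameter means a future switch of default encoding is a one-line change rather than a new class.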

## Attestation Checklist

- [x] My code follows the style guidelines of this project

- I have checked for/fixed spelling, linting, and other errors
- I have commented my code for clarity
- I have made corresponding changes to the documentation (we use
[TypeDoc](https://typedoc.org/) to document our code)
- My changes generate no new warnings
- I have added tests that validate my changes and provide sufficient
test coverage. I have tested with:
  - Local testing
  - E2E testing in Teams
- New and existing unit tests pass locally with my changes
@corinagum corinagum added the JS Change/fix applies to JS. If all three, use the 'JS & dotnet & Python' label label Jan 19, 2024
@Stevenic
Collaborator

The issue is there's no 3rd-party implementation of tiktoken for JS that I'm aware of. Everyone just uses gpt-3-encoder because it's close enough. Token counting in general (tiktoken or not) is a rough estimate at best these days, for numerous reasons. OpenAI adds all sorts of hidden tokens to the prompt that you can't easily account for. It's best to just overestimate by a small percentage.
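The overestimation Stevenic suggests can be sketched as a small padding step on top of whatever raw count a tokenizer produces. The 5% margin and the ~4-characters-per-token fallback below are illustrative assumptions (the latter is a common English-text rule of thumb), not values from the SDK:

```typescript
// Pad a raw token count by a small safety margin to absorb the hidden
// per-message tokens OpenAI adds to prompts. Margin is an assumption.
function paddedTokenEstimate(rawCount: number, margin = 0.05): number {
  return Math.ceil(rawCount * (1 + margin));
}

// Crude fallback when no tokenizer is available: roughly 4 characters
// per token for English text. Not exact; illustrative only.
function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(paddedTokenEstimate(100)); // 105
console.log(paddedTokenEstimate(roughTokenCount("a".repeat(400)))); // 105
```

Overestimating slightly trades a little unused context window for never blowing past the model's limit mid-request.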

@corinagum
Collaborator

https://github.com/niieani/gpt-tokenizer seems like a good alternative.
