Tiktoken failed to decode Math symbols #202

Closed
ahmedmoorsy opened this issue Oct 6, 2023 · 1 comment

@ahmedmoorsy

Hello,

I am trying to use tiktoken to tokenize text that contains math symbols such as `∩` and `A⊇B`, but tiktoken does not decode each of these symbols to a single token.

code:

import tiktoken
encoding = tiktoken.encoding_for_model('gpt-4')

encoded_message = encoding.encode("∩")

decoded_tokens = [encoding.decode_single_token_bytes(token).decode("utf-8") for token in encoded_message]
print(decoded_tokens)

I tried the encoding.decode() method and it works well, but it gives me the full text, and I need a list of decoded tokens instead.

Any help?

@hauntsaninja
Collaborator

In general, we do not guarantee that every Unicode codepoint is a single token (because there are like a million of them), nor do we guarantee that individual tokens are valid UTF-8. It's a "byte" pair encoding, after all. [encoding.decode_single_token_bytes(token) for token in encoded_message] will show you the bytes of your tokens.
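For anyone who still wants a per-token list of strings, here is a rough sketch of two ways to handle tokens that are not valid UTF-8 on their own. It uses only the standard library. The helper names `describe_tokens` and `decode_incrementally` are made up for illustration, and the byte split shown for `∩` is a hypothetical one, not necessarily the split tiktoken produces.

```python
import codecs

def describe_tokens(token_bytes):
    """Return one display string per token, never raising on partial UTF-8."""
    out = []
    for b in token_bytes:
        try:
            out.append(b.decode("utf-8"))
        except UnicodeDecodeError:
            # This token is not valid UTF-8 by itself: show the raw bytes instead.
            out.append(repr(b))
    return out

def decode_incrementally(token_bytes):
    """Decode tokens in order; a character appears with the token that completes it."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    return [decoder.decode(b) for b in token_bytes]

# Hypothetical split of "∩" (UTF-8 bytes e2 88 a9) across two tokens.
# With tiktoken, this list would instead come from
# [encoding.decode_single_token_bytes(t) for t in encoded_message].
pieces = [b"A", b"\xe2\x88", b"\xa9", b"B"]

print(describe_tokens(pieces))
print(decode_incrementally(pieces))
```

The first helper keeps a strict one-string-per-token mapping but shows raw bytes for partial characters. The second yields an empty string for a token that only starts a character, so joining its result gives back the original text.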

Please see https://github.com/openai/tiktoken#what-is-bpe-anyway and the tiktoken._educational submodule if you have more questions about BPE.
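To illustrate why byte-level merges can leave a token holding only part of a character, here is a toy sketch of byte pair merging. It is not tiktoken's implementation (there is no pre-tokenization, and training runs on a single string), and the function name `train_toy_bpe` is invented for this example.

```python
from collections import Counter

def train_toy_bpe(text, num_merges):
    """Greedily merge the most frequent adjacent pair of byte tokens."""
    # Start with one token per UTF-8 byte.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no merge is worthwhile
        merges.append(left + right)
        # Rewrite the token list with every occurrence of the pair merged.
        merged_tokens = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged_tokens.append(left + right)
                i += 2
            else:
                merged_tokens.append(tokens[i])
                i += 1
        tokens = merged_tokens
    return tokens, merges

def is_valid_utf8(b):
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "∩∩∩"  # each "∩" is three bytes in UTF-8
tokens, merges = train_toy_bpe(text, num_merges=1)
print(tokens)
print([is_valid_utf8(t) for t in tokens])
```

After a single merge, each `∩` is covered by one two-byte token and one one-byte token, so no token decodes as UTF-8 by itself even though the concatenated bytes still round-trip to the original text.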
