Tiktoken failed to decode Math symbols #202

ahmedmoorsy · 2023-10-06T04:29:01Z

Hello,

I am trying to use tiktoken to tokenize some texts that contain math symbols like ∩, ⊆, and A⊇B, but tiktoken fails to decode each of these symbols as a single token.

code:

import tiktoken
encoding = tiktoken.encoding_for_model('gpt-4')

encoded_message = encoding.encode("∩")

decoded_tokens = [encoding.decode_single_token_bytes(token).decode("utf-8") for token in encoded_message]
print(decoded_tokens)

I tried the encoding.decode() method and it works well, but it gives me the full text, and I need a list of decoded tokens instead.

Any help?

hauntsaninja · 2023-10-06T04:33:53Z

In general, we do not guarantee that every Unicode codepoint is a single token (because there are like a million of them), nor do we guarantee that individual tokens are valid UTF-8. It's a "byte" pair encoding, after all. [encoding.decode_single_token_bytes(token) for token in encoded_message] will show you the bytes of your tokens.

Please see https://github.com/openai/tiktoken#what-is-bpe-anyway and the tiktoken._educational submodule for more questions about BPE.
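For readers who still want a list of printable pieces, one possible workaround is to merge adjacent token byte chunks until each group decodes as UTF-8. This is only a sketch: the helper name group_token_bytes is hypothetical and not part of tiktoken, and the two-chunk split of ∩ used in the example is an assumed illustration, not necessarily how the gpt-4 encoding splits it.

```python
def group_token_bytes(token_bytes):
    """Merge adjacent byte chunks until each group is valid UTF-8.

    Hypothetical helper (not a tiktoken API). `token_bytes` is a list of
    bytes objects, e.g. the per-token output of
    encoding.decode_single_token_bytes(token).
    """
    pieces = []
    buffer = b""
    for chunk in token_bytes:
        buffer += chunk
        try:
            pieces.append(buffer.decode("utf-8"))
        except UnicodeDecodeError:
            # Decoding failed, most likely because the buffer ends in the
            # middle of a multi-byte character; keep accumulating chunks.
            continue
        buffer = b""
    if buffer:
        # Leftover bytes that never formed valid UTF-8; mark them visibly.
        pieces.append(buffer.decode("utf-8", errors="replace"))
    return pieces


# Assumed split for illustration: "∩" is b"\xe2\x88\xa9" in UTF-8, and here
# it is pretended to span two tokens.
example_chunks = [b"A", b"\xe2\x88", b"\xa9", b"B"]
print(group_token_bytes(example_chunks))

# With tiktoken installed, the input could be built as in the comment above:
# token_bytes = [encoding.decode_single_token_bytes(t) for t in encoded_message]
```

The trade-off is that a piece may then correspond to more than one token, so the output list is not guaranteed to be the same length as the token list.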

hauntsaninja closed this as not planned · Oct 6, 2023

patrickvonplaten mentioned this issue Sep 19, 2024

[Bugfix][Core] Fix tekken edge case for mistral tokenizer vllm-project/vllm#8640

Merged
