I am trying to use tiktoken to tokenize some texts that contain math symbols like ∩, ⊆, and A⊇B. But tiktoken fails to decode such a symbol as a single token.
code:
import tiktoken
encoding = tiktoken.encoding_for_model('gpt-4')
encoded_message = encoding.encode("∩")
decoded_tokens = [encoding.decode_single_token_bytes(token).decode("utf-8") for token in encoded_message]
print(decoded_tokens)
I tried the encoding.decode() method and it works very well, but it gives me the full text, and I need a list of decoded tokens instead.
Any help?
In general, we do not guarantee that every Unicode codepoint is a single token (because there are like a million of them), nor do we guarantee that individual tokens are valid UTF-8. It's a "byte" pair encoding, after all. [encoding.decode_single_token_bytes(token) for token in encoded_message] will show you the bytes of your tokens.
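If you still want one string per token, one option is to feed the per-token bytes through an incremental UTF-8 decoder. The sketch below uses only the standard library; the helper name decode_token_pieces and the byte split of "A∩B" are hypothetical, chosen for illustration. With tiktoken you would build token_bytes from encoding.decode_single_token_bytes(token) as shown above, and the actual split may differ.

```python
import codecs


def decode_token_pieces(token_bytes):
    """Return one string per byte chunk, buffering incomplete UTF-8 sequences.

    A chunk that ends in the middle of a character yields "" and the
    character is emitted with the chunk that completes it.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in token_bytes]
    # Flush the decoder; raises UnicodeDecodeError if the input ends mid-character.
    decoder.decode(b"", final=True)
    return pieces


# Hypothetical token boundaries: "∩" is three bytes in UTF-8 (E2 88 A9),
# and here we assume it is split across two tokens.
raw = "A∩B".encode("utf-8")
token_bytes = [raw[:1], raw[1:3], raw[3:4], raw[4:]]

pieces = decode_token_pieces(token_bytes)
print(pieces)

# Lossy alternative: decode each chunk on its own and substitute U+FFFD
# for bytes that are not valid UTF-8 by themselves.
lossy = [chunk.decode("utf-8", errors="replace") for chunk in token_bytes]
print(lossy)
```

The incremental version keeps the list aligned with the tokens, and joining it reproduces the original text. The cost is that some entries are empty strings, because a token holding only part of a character has no text of its own.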