Hello,

I'm encountering a decoding issue in the tokenizers library, specifically with some Latin characters included in the added_tokens. The issue shows up with the DeepSeek-coder model, which has the following token definition:
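Presumably this refers to the added-token entry with id 32000 and content 'õ'. Here is a minimal sketch of what such an entry typically looks like, written as a Python dict; only the id and content are taken from this issue, and the remaining fields are common defaults (assumptions):

# Sketch of the relevant added_tokens entry; the id and content come from this
# issue, while the remaining fields are typical defaults (assumptions).
added_token_32000 = {
    "id": 32000,
    "content": "õ",
    "single_word": False,
    "lstrip": False,
    "rstrip": False,
    "normalized": True,
    "special": False,
}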
While encoding this character works as expected, the decoding process does not produce the correct result. Here's an example illustrating the issue:
tok.encode('õ', add_special_tokens=False)
# Output: [32000]  (correct)

tok.decode([32000])
# Output: '�'  (incorrect)
Decoding token ID 32000 should return 'õ', but it instead returns the Unicode replacement character. The problem appears to be specific to the decoding step.
Could you please investigate this problem? Any assistance in resolving this would be greatly appreciated.
Thank you for your help.
Hi @44670, thanks for your interest in DeepSeek models. The problem is explained in #1392. This issue cannot be resolved for the time being; we will update our tokenizer in subsequent model releases.
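As an illustration, here is a minimal, self-contained sketch of one way a standalone 'õ' added token can decode to '�' under a GPT-2-style byte-level decoder. The mechanism shown is an assumption made for the example and is not confirmed in this thread:

def bytes_to_unicode():
    # Standard GPT-2 byte-to-unicode table: printable latin-1 bytes map to
    # themselves; the remaining bytes are shifted into the U+0100+ range.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

unicode_to_byte = {c: b for b, c in bytes_to_unicode().items()}

print("õ".encode("utf-8"))                    # b'\xc3\xb5' -- the real UTF-8 bytes for 'õ'
raw = bytes([unicode_to_byte["õ"]])
print(raw)                                    # b'\xf5'     -- what a byte-level decoder reconstructs
print(raw.decode("utf-8", errors="replace"))  # '�'         -- 0xF5 is not a valid UTF-8 start byte

In other words, if the decoder interprets the added token's text as byte-level-encoded data, it reconstructs the single invalid byte 0xF5 instead of the two-byte UTF-8 sequence for 'õ', and the result is rendered as '�'.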