added_tokens with byte-map characters in ByteLevel could not be decoded correctly #1392
Comments
Hey! Thanks for the report. AddedTokens usually have to be sent to the decoder, because pre-tokenization is applied to them.
Is it possible to let added_tokens_map_r of AddedVocabulary store the mapping of id to tokens after pre-tokenization, so that it generates the same output after the decoder?
It's not possible, no; normalizers are used for that purpose, however. And you can also add the token like this:
Ah, I see. Thanks for your explanation. Is there any plan to fix this bug?
This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
A PR for a fix that is backward compatible is welcome! Otherwise I won't have time to dive into this 🤗
meta-llama/llama3#67 (comment) TL;DR, this should help:
>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False, False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False, special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'
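The byte-level behaviour at the heart of this thread can be illustrated without the library. The sketch below is a plain-Python re-implementation for illustration (not the tokenizers API): it builds the GPT-2 byte-to-unicode table that ByteLevel uses, and shows why an added token that skips pre-tokenization gets corrupted when it is pushed through the byte-level decoder anyway.

```python
def bytes_to_unicode():
    """GPT-2 style byte -> unicode-character table used by ByteLevel."""
    # Printable/Latin-1 bytes map to themselves; the rest are shifted above U+0100.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def byte_level_encode(text: str) -> str:
    """What the ByteLevel pre-tokenizer does to ordinary text before BPE."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def byte_level_decode(text: str) -> str:
    """What the ByteLevel decoder does to every token string it receives."""
    return bytes(CHAR_TO_BYTE[c] for c in text).decode("utf-8", errors="replace")

print(byte_level_encode("Bác"))                      # the form BPE actually sees
print(byte_level_decode(byte_level_encode("Bác")))   # round-trip is lossless
print(byte_level_decode("Bác"))                      # raw added token: corrupted
```

Decoding the pre-tokenized form round-trips cleanly, which is why the decoder assumes every token string it sees has already gone through ByteLevel; an added token stored verbatim, as in this issue, violates that assumption and its multi-byte characters come back as replacement characters.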
Re-opening as the merge on main will be reverted for a better fix soon |
I just found that added tokens containing characters that appear in the byte map of the ByteLevel pre-tokenizer could not be decoded correctly.
This is a script to reproduce the problem with version 0.14.1:
The output will be:
I believe the problem comes from:
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/tokenizer/mod.rs#L832-L836
I don't think added tokens should be sent to the ByteLevel decoder, since they are extracted before pre-tokenization.
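The behaviour the report asks for can be sketched in minimal Python pseudocode (all names here are hypothetical; the real logic is in Rust in `tokenizer/mod.rs`): decode added tokens verbatim, and only run the byte-level decoder over runs of ordinary tokens.

```python
def decode(ids, id_to_token, added_ids, byte_level_decode):
    """Concatenate tokens, bypassing the byte-level decoder for added tokens."""
    out, run = [], []
    for i in ids:
        tok = id_to_token[i]
        if i in added_ids:
            # Flush the pending run of ordinary tokens through the decoder.
            if run:
                out.append(byte_level_decode("".join(run)))
                run = []
            # Added tokens are extracted before pre-tokenization,
            # so their stored string is already plain text.
            out.append(tok)
        else:
            run.append(tok)
    if run:
        out.append(byte_level_decode("".join(run)))
    return "".join(out)
```

With this split, the added token "Bác" survives decoding unchanged, while ordinary tokens still get their byte-map characters converted back to real bytes.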