Question about the tokenizer.json vocabulary containing no Chinese #111
Comments
OK, thanks for the kind words!
You can test this with the following script:

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("./model/minimind_tokenizer")

# Test text
text = "这是一个测试,看看分词器是否能正确处理中文。"

# Encode: convert the text to token IDs
encoded_ids = tokenizer.encode(text)
print("Encoded (token IDs):", encoded_ids)

# Decode: convert the token IDs back to text
decoded_text = tokenizer.decode(encoded_ids)
print("Decoded (text):", decoded_text)

# Decode two individual token IDs on their own
decoded_text = tokenizer.decode([434])
print("Decoded (text):", decoded_text)
decoded_text = tokenizer.decode([1589])
print("Decoded (text):", decoded_text)

# To inspect the individual tokens (subword units)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

The result is:
This means they are in fact all meaningful Chinese tokens.

Going a step further, the following script measures what share of the vocabulary each kind of token takes up:

from transformers import AutoTokenizer
import re

tokenizer = AutoTokenizer.from_pretrained("./model/minimind_tokenizer")
vocab = tokenizer.get_vocab()
total_tokens = len(vocab)

chinese_tokens = 0
english_tokens = 0
non_chinese_or_english_tokens = 0

def is_chinese(text):
    # True only for non-empty text made up entirely of CJK Unified Ideographs
    return bool(text) and all('\u4e00' <= char <= '\u9fff' for char in text)

def is_english(text):
    # True only for non-empty text made up entirely of ASCII letters
    return re.fullmatch(r"[a-zA-Z]+", text) is not None

# Decode every token ID on its own and classify the resulting string
for token_id in range(total_tokens):
    token = tokenizer.decode([token_id])
    if is_chinese(token):
        chinese_tokens += 1
    elif is_english(token):
        english_tokens += 1
    else:
        non_chinese_or_english_tokens += 1

chinese_ratio = chinese_tokens / total_tokens * 100
english_ratio = english_tokens / total_tokens * 100
other_ratio = non_chinese_or_english_tokens / total_tokens * 100

print(f"Total vocabulary size: {total_tokens}")
print(f"Chinese tokens: {chinese_tokens} share: {chinese_ratio:.2f}%")
print(f"English tokens: {english_tokens} share: {english_ratio:.2f}%")
print(f"Other tokens: {non_chinese_or_english_tokens} share: {other_ratio:.2f}%")

The result is:
For comparison, the result for Qwen2.5 is:
Closed
Thanks to the author for clearing this up~
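Hello, and many thanks to the author for building such a wonderful model; it has given me a much better entry point for learning and understanding large language models. While playing with the model I ran into something I don't quite understand. I can't find any Chinese among the entries in the vocab.json produced by train_tokenizer.py, even though tokenizer_train.jsonl actually contains Chinese text. I don't know why it comes out this way after training.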
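One possible reason the vocab.json entries do not look like Chinese is that the tokenizer is a byte-level BPE in the style of GPT-2. The thread does not say this, so it is an assumption. Such tokenizers store each UTF-8 byte as a printable stand-in character, which makes a Chinese token look like a run of accented Latin letters in the file even though it decodes back to Chinese. The sketch below rebuilds that byte-to-character table in plain Python. The helper names `to_vocab_form` and `from_vocab_form` are hypothetical and are not part of any library.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted to an unused code point starting at 256.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_vocab_form(text):
    """Hypothetical helper: how text would be spelled in a byte-level vocab file."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_vocab_form(entry):
    """Hypothetical helper: turn a vocab-file spelling back into readable text."""
    raw = bytes(CHAR_TO_BYTE[c] for c in entry)
    # A token holding only part of a character cannot be decoded cleanly.
    return raw.decode("utf-8", errors="replace")


print(to_vocab_form("中"))                       # ä¸Ń
print(from_vocab_form(to_vocab_form("中文")))    # 中文
```

If that assumption holds for this tokenizer, passing a strange-looking vocab.json key to `from_vocab_form` should give back readable Chinese, which would agree with what `tokenizer.decode` shows in the scripts above.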