
[text] huggingface tokenizer #2186

Merged
merged 3 commits into from
Dec 1, 2023

Conversation

Mddct
Collaborator

@Mddct Mddct commented Nov 30, 2023

For LLMs, Hugging Face tokenizers are much more popular.

this pr

  • unit test

next pr

@Mddct
Collaborator Author

Mddct commented Nov 30, 2023

One question: do we need to add transformers to the requirements?

Or make it a conditional (optional) dependency:

pip install wenet[transformers]?

@xingchensong
Member

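One question: do we need to add transformers to the requirements?

If we want to build something like Qwen-Audio, I think adding it is unavoidable, unless we reinvent the wheel ourselves.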

@Mddct Mddct force-pushed the Mddct-tokenizer-huggface branch from 7de63b8 to e3d5c6e Compare November 30, 2023 15:41
@Mddct Mddct force-pushed the Mddct-tokenizer-huggface branch from e3d5c6e to 891f8fd Compare November 30, 2023 15:42
@Mddct
Collaborator Author

Mddct commented Nov 30, 2023

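One question: do we need to add transformers to the requirements?

If we want to build something like Qwen-Audio, I think adding it is unavoidable, unless we reinvent the wheel ourselves.

We can add transformers as a dependency later. This PR uses a conditional import, and the unit test also tries to install it.
Once the time is right (when we work on audio + LLM), we can add it properly.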
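
The conditional import mentioned above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: `load_optional` and `LazyHuggingFaceTokenizer` are hypothetical names used only for illustration, and the only transformers calls assumed are `AutoTokenizer.from_pretrained` and the tokenizer's `tokenize` method.

```python
import importlib
from typing import Any, List


def load_optional(module_name: str, extra: str) -> Any:
    """Import an optional dependency, or raise an error with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is required for this feature; "
            f"install it with `pip install {extra}`"
        ) from err


class LazyHuggingFaceTokenizer:
    """Hypothetical wrapper: transformers is imported on first use only."""

    def __init__(self, model: str) -> None:
        self.model = model
        self._tokenizer = None  # built lazily, so construction never imports

    def _build(self) -> None:
        if self._tokenizer is None:
            # The import happens here, not at module import time.
            transformers = load_optional("transformers", "transformers")
            self._tokenizer = transformers.AutoTokenizer.from_pretrained(
                self.model)

    def tokenize(self, text: str) -> List[Any]:
        self._build()
        return self._tokenizer.tokenize(text)
```

With this shape, users who never touch the Hugging Face path do not need transformers installed, and those who do get a clear install hint instead of a bare ImportError.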

@Mddct Mddct mentioned this pull request Nov 30, 2023
15 tasks
@Mddct Mddct force-pushed the Mddct-tokenizer-huggface branch from 5d9ee46 to 77a265f Compare December 1, 2023 08:03
@Mddct
Collaborator Author

Mddct commented Dec 1, 2023

1. The Tongyi audio tokenizer works with HuggingFaceTokenizer.

2. The token type is changed to Union[str, bytes] here, because in the LLM setting Hugging Face, tiktoken, and similar tokenizers use bytes as tokens; see for example https://github.com/QwenLM/Qwen-Audio/blob/main/tokenization_qwen.py#L280-L292
Screenshot 2023-12-01 at 15 49 05

Unit test: the corresponding output in test_tongyi_tokenizer should be as follows:
Screenshot 2023-12-01 at 15 53 51
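
To illustrate why a Union[str, bytes] token type is useful, here is a small sketch. It assumes byte-level tokens that may split a multi-byte UTF-8 character, so a single token is not always a valid string on its own. The `Token` alias and `tokens_to_text` helper are hypothetical and are not taken from this PR.

```python
from typing import List, Union

# Hypothetical alias mirroring the Union[str, bytes] token type discussed above.
Token = Union[str, bytes]


def tokens_to_text(tokens: List[Token]) -> str:
    """Join mixed str/bytes tokens into text.

    Everything is concatenated as bytes first, because one character may be
    spread over several byte tokens and can only be decoded after joining.
    """
    raw = b"".join(
        t if isinstance(t, bytes) else t.encode("utf-8") for t in tokens
    )
    return raw.decode("utf-8", errors="replace")


# "你" is b"\xe4\xbd\xa0" in UTF-8; here it is split across two byte tokens.
print(tokens_to_text([b"\xe4\xbd", b"\xa0", "好"]))
```

Decoding each token separately would fail or produce replacement characters for the first two tokens, which is why the tokens are kept as bytes until they are joined.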

@xingchensong xingchensong merged commit 711264d into main Dec 1, 2023
6 checks passed
@xingchensong xingchensong deleted the Mddct-tokenizer-huggface branch December 1, 2023 08:37