English support #70

deweihu96 · 2024-10-17T10:45:55Z

Dear minimind's contributors,

I love this repo! Do you plan to provide an English training dataset and tokenizer in the future?

It would be very nice if the repo were more international!

jingyaogong · 2024-10-18T06:15:52Z

There are currently no plans to retrain on an English dataset.
However, the only difference between Chinese and English support is the training data. Replacing the dataset with a suitable English one should address this, although finding a high-quality dataset will take some exploration.

The tokenizer does not need to be replaced and can be reused as it currently has sufficient capabilities for both Chinese and English.
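One rough way to check this on your own data is to compare how many characters each token covers for English and Chinese samples. The sketch below is only an illustration and not part of the repo: `chars_per_token` and `utf8_byte_encode` are hypothetical names, and the byte-level encoder is a stand-in so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def chars_per_token(encode: Callable[[str], List[int]], texts: Iterable[str]) -> float:
    """Average number of characters covered by one token across `texts`."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("encoder produced no tokens")
    return total_chars / total_tokens


def utf8_byte_encode(text: str) -> List[int]:
    """Stand-in encoder (hypothetical): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


english = ["The quick brown fox jumps over the lazy dog."]
chinese = ["今天天气很好。"]

# Higher values mean the encoder represents that language more compactly.
print("English chars/token:", chars_per_token(utf8_byte_encode, english))
print("Chinese chars/token:", chars_per_token(utf8_byte_encode, chinese))
```

To measure the real tokenizer, pass its own encode function in place of `utf8_byte_encode`. Assuming it is loaded through Hugging Face `transformers`, that would be the tokenizer object's `encode` method. A much lower English ratio than expected would suggest the vocabulary is less suited to English text.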

If I complete this, I will post an update under this issue.
English support may be ready this month, or possibly later in 2024.
Thank you for your attention. Wishing you all the best.

cpp2016 · 2024-10-25T02:31:50Z

I think an English dataset is important for the model's English-Chinese translation ability, and for handling Chinese corpora with scattered English words.

jingyaogong closed this as completed Oct 28, 2024
