Add French version "vigogne" #127
Conversation
Thanks!
Thanks! Is there a reason you translated the prompt template as well? Eagerly waiting for the translated stanford_alpaca data!
Hi @AngainorDev, IMO it's more consistent to have the prompt and the instruction/input in the same language. But you can also try with the default one! I just shared the translated dataset and the training script in this repo :)
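For reference, a language-matched template could look roughly like the sketch below. The English text is the standard Stanford Alpaca template for records with an input field; the French wording, the constant names, and `build_prompt` are illustrative assumptions, not necessarily what vigogne-lora actually uses.

```python
# Standard Alpaca prompt template for records that have an "input" field.
ALPACA_TEMPLATE_EN = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Hypothetical French counterpart: illustrative wording only, not necessarily
# the exact template used in vigogne-lora.
ALPACA_TEMPLATE_FR = (
    "Ci-dessous se trouve une instruction qui décrit une tâche, accompagnée "
    "d'une entrée qui fournit plus de contexte. Écrivez une réponse qui "
    "complète la demande de manière appropriée.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Entrée:\n{input}\n\n"
    "### Réponse:\n"
)


def build_prompt(example: dict, template: str = ALPACA_TEMPLATE_FR) -> str:
    """Fill one Alpaca-style record ({'instruction', 'input', 'output'}) into a template."""
    return template.format(
        instruction=example["instruction"], input=example.get("input", "")
    )
```

Note that if the section markers themselves are translated ("### Réponse:" instead of "### Response:"), any script that splits generated text on the English marker has to be adjusted to match, which is the compatibility concern raised in the next comment.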
Thanks for the dataset! I was wondering whether there have been studies or tests on that precise prompt. On the other hand, I also tried to train LLaMA with simpler templates, and every time the training took longer to reach the same quality level. I assume the LLaMA model itself, before Alpaca tuning, had already seen that style of prompt, so it's better to stick to it; otherwise the model has to "override" previous knowledge, which penalizes training. Also, changing the template means adjusting all the related scripts as well as the output splitting, which can be an issue with ready-made text generation tools and chatbots that don't expose that setting. As for the dataset, maybe your repo is a better place to discuss this, but it could benefit from a custom cleaning pass by French-speaking people, because the GPT translation has introduced artifacts. Two quick samples:
1. Some Python code (not all of it) was translated, and this surfaces in the results.
2. Rhythms and similar material are broken by nature.
Thanks again for your work and for providing the dataset!
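As a rough illustration of the kind of cleaning pass suggested above, the sketch below flags records whose output looks like Python code yet contains accented characters, a hint that the translation step touched the code. The filename, the helper name, and the heuristics are assumptions and only a starting point for manual review.

```python
import json
import re

# Very rough heuristics: an output that looks like Python code (keywords, calls)
# but also contains accented characters is likely code the translation touched.
CODE_HINTS = re.compile(r"(^|\n)\s*(def |import |for |while |return |print\()")
ACCENTED = re.compile(r"[àâçéèêëîïôùûüœ]", re.IGNORECASE)


def looks_like_translated_code(output: str) -> bool:
    return bool(CODE_HINTS.search(output)) and bool(ACCENTED.search(output))


# "vigogne_data_cleaned.json" is a placeholder filename; point this at the actual dataset file.
with open("vigogne_data_cleaned.json", encoding="utf-8") as f:
    data = json.load(f)

suspicious = [ex for ex in data if looks_like_translated_code(ex.get("output", ""))]
print(f"{len(suspicious)} of {len(data)} records look like partially translated code")
```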
Another example.
is translated to
This is a big issue. For one thing, the model loses some NLP knowledge and capabilities, but it also learns to output back the same sentence it was given, and this generalizes to more outputs. Another thing I can see, though it could be a matter of personal preference, is that prompts are either in the impersonal infinitive style ("Expliquer pourquoi...") or in the formal "vous" form ("Expliquez pourquoi"), but very rarely use "tu" ("Explique pourquoi"). This is a ChatGPT artifact: it prefers "vous" over "tu" by default, whereas I feel most users use either "tu" or the impersonal form when asking a bot/LM. What a rabbit hole!
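To make those two observations measurable, a quick pass over the dataset could look like the sketch below; the filename, the keyword lists, and the exact-match criterion are assumptions meant only as a starting point.

```python
import json
import re

# "vigogne_data_cleaned.json" is a placeholder filename; point this at the actual dataset file.
with open("vigogne_data_cleaned.json", encoding="utf-8") as f:
    data = json.load(f)

# 1. Records whose output simply echoes the sentence given as input.
echoes = [
    ex for ex in data
    if ex.get("input") and ex["output"].strip().lower() == ex["input"].strip().lower()
]
print(f"{len(echoes)} records output exactly the sentence they were given")

# 2. Rough count of address forms used in the instructions.
vous_style = sum(
    bool(re.search(r"\b(vous|expliquez|décrivez|donnez)\b", ex["instruction"], re.IGNORECASE))
    for ex in data
)
tu_style = sum(
    bool(re.search(r"\b(tu|explique|décris|donne)\b", ex["instruction"], re.IGNORECASE))
    for ex in data
)
print(f"'vous'-style instructions: {vous_style}, 'tu'-style instructions: {tu_style}")
```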
@AngainorDev Totally agree with you about cleaning the translated dataset! A better method might be to have some initial seed tasks in French, then generate more tasks in French as done in Stanford Alpaca. But that would cost much more than translating the provided dataset :( For the translation, I chose ChatGPT over simpler MT models (e.g., DeepL, NLLB) precisely to avoid introducing these artifacts, but as you can see, there are still a lot...
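For context, translating a single record with ChatGPT might look roughly like the sketch below, using the chat completion endpoint of the openai Python package (pre-1.0 interface). The prompt wording, the function name, and the post-processing are assumptions; the actual vigogne translation pipeline may differ.

```python
import openai  # openai-python < 1.0 style client; requires OPENAI_API_KEY to be set


def translate_example(example: dict, model: str = "gpt-3.5-turbo") -> str:
    """Ask ChatGPT to translate one Alpaca record into French (illustrative prompt wording)."""
    prompt = (
        "Translate the following fields into French. Keep any code unchanged and "
        "answer with the same 'instruction:/input:/output:' layout.\n\n"
        f"instruction: {example['instruction']}\n"
        f"input: {example.get('input', '')}\n"
        f"output: {example['output']}"
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # The raw translated text still has to be parsed back into the three fields.
    return response["choices"][0]["message"]["content"]
```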
It's a great basis and gives nice results anyway. That dataset is not that huge, btw; a team of several French-speaking people could filter or improve it in a reasonable time.
Hi @tloen 👋,
Thank you for this amazing work!
Here is a French version of alpaca-lora, called vigogne-lora.
The translated stanford_alpaca data and the repository will be shared as soon as possible :)