
Add french version "vigogne" #127

Merged 1 commit into tloen:main on Mar 22, 2023

Conversation

bofenghuang (Contributor)

Hi @tloen 👋,

Thank you for this amazing work!

Here is a French version of alpaca-lora, called vigogne-lora.

The translated stanford_alpaca data and the repo will be shared as soon as possible :)

tloen (Owner) left a comment

Merci !

tloen merged commit c7eabb8 into tloen:main on Mar 22, 2023
bofenghuang deleted the patch-1 branch on March 24, 2023 at 09:36
AngainorDev (Contributor)

Thanks!

Is there a reason you translated the prompt template as well?

Eagerly waiting for the translated stanford_alpaca data!

bofenghuang (Contributor, Author)

Hi @AngainorDev,

IMO it's more consistent to have the prompt and the instruction/input in the same language. But you can also try with the default one!
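For illustration, an Alpaca-style prompt kept entirely in French could look something like the sketch below (the wording is only an example, not necessarily the exact template used in vigogne-lora):

# Illustrative sketch only: an Alpaca-style prompt kept entirely in French,
# so the prompt matches the language of the instruction/input.
# The wording is an assumption, not necessarily the exact vigogne-lora template.
def generate_prompt(instruction: str, input: str = "") -> str:
    if input:
        return (
            "Ci-dessous se trouve une instruction qui décrit une tâche, "
            "associée à une entrée qui fournit plus de contexte. "
            "Écrivez une réponse qui complète correctement la demande.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Entrée:\n{input}\n\n"
            "### Réponse:\n"
        )
    return (
        "Ci-dessous se trouve une instruction qui décrit une tâche. "
        "Écrivez une réponse qui complète correctement la demande.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Réponse:\n"
    )

print(generate_prompt("Expliquez pourquoi le ciel est bleu."))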

Just shared the translated dataset and the training script in this repo :)

AngainorDev (Contributor)

Thanks for the dataset!

I was wondering if there have been studies or tests about that precise prompt.
I've seen some people just drop the prefix sentence and use only the ### fields.
As a matter of fact, using generate without the prefix sentence works just as well, even on a model trained with it.

But on the other hand, I also tried to train LLaMA with simpler templates, and every time the training took longer to reach the same quality. I assume the LLaMA model itself, before Alpaca tuning, had already seen that style of prompt, so it's better to stick to it; otherwise it has to "override" previous knowledge, and that penalizes the training.

Also, changing the template means adjusting all related scripts as well as the output splitting, which can be an issue with ready-made text-generation tools and chatbots that don't expose that setting.
Btw, thanks for including the sample and prompt on HF so there is no confusion.

As for the dataset, maybe your repo is a better place to discuss this, but it could benefit from custom cleaning by French-speaking people, because the GPT translation has introduced artifacts.

Two quick samples:

1: Some Python code snippets - not all of them - were translated (this surfaces in the results; see the detection sketch after these two samples)

{
    "instruction": "Créez une boucle for en Python pour imprimer des nombres de 1 à 10.",
    "input": "",
    "output": "pour i dans la plage(1,11):\n    imprimer(i)"
},

2: Rhymes and similar things are broken, by nature

{
    "instruction": "Générer une liste de mots qui riment avec 'chat'.",
    "input": "",
    "output": "Les mots qui riment avec chat incluent bat, chapeau, tapis, gras, patte, rat, assis, bavardage et fuite."
},
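
For the translated-code issue in sample 1, a rough heuristic sketch like the one below could flag suspicious entries (it assumes the dataset is a JSON list of instruction/input/output records as above; the file name is a hypothetical placeholder):

# Rough heuristic sketch: flag entries whose output looks like Python code that
# was itself translated into French (e.g. "imprimer(" instead of "print(").
# Assumes a JSON list of {"instruction", "input", "output"} records, as in the
# samples above; the file name below is a hypothetical placeholder.
import json

TRANSLATED_CODE_MARKERS = ("imprimer(", "dans la plage(", "pour i dans")

with open("translated_alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

suspicious = [
    entry for entry in data
    if any(marker in entry["output"] for marker in TRANSLATED_CODE_MARKERS)
]
print(f"{len(suspicious)} entries look like translated code and need a manual pass")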

Thanks again for your work and providing the dataset!

AngainorDev (Contributor)

Another example.
This is not specific to French; all GPT-translated Alpaca datasets will face the same issues.

{
    "instruction": "Edit the given sentence to make it grammatically correct.",
    "input": "I dont has a pen",
    "output": "I don't have a pen."
},

is translated to

{
    "instruction": "Modifier la phrase donnée pour la rendre grammaticalement correcte.",
    "input": "Je n'ai pas de stylo.",
    "output": "Je n'ai pas de stylo."
},

This is a big issue.
There are several similar instructions where the source sentence deliberately contains spelling and/or grammatical errors, so the model learns correct grammar and how to fix it.
With the translated dataset, the input gets corrected by the translation, and we end up with input == output.

For one, the model loses some NLP knowledge and capabilities, but it also learns to output the same sentence it was given, and this generalizes to more outputs.
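
A quick sketch to surface these degenerate pairs (same hypothetical file name and field names as in the previous snippet):

# Sketch: find entries where the translation also "fixed" the input,
# leaving input == output, which is the failure mode described above.
import json

with open("translated_alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

degenerate = [
    entry for entry in data
    if entry["input"].strip() and entry["input"].strip() == entry["output"].strip()
]
print(f"{len(degenerate)} entries have input == output and likely need manual repair")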

Another thing I can see, though it could be a matter of personal preference, is that prompts are either in the impersonal infinitive style ("Expliquer pourquoi...") or in the active "vous" form ("Expliquez pourquoi"), but very rarely use "tu" ("Explique pourquoi").

This is a ChatGPT artifact: it prefers "vous" over "tu" by default, whereas I feel most users use either "tu" or the impersonal form when asking a bot/LM.
This style change could maybe be automated on the inputs. But knowing that the dataset uses the "vous" style is important, so you can query the model the same way for more precise answers.
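
As a rough way to measure how prevalent each register is, a crude sketch could classify instructions by the ending of their first word (only a heuristic, not real morphological analysis; same hypothetical file as in the earlier snippets):

# Crude sketch: classify each instruction's register by its first word's ending.
# "-ez" roughly means the "vous" imperative ("Expliquez"), "-er/-ir/-re" roughly
# means the infinitive ("Expliquer"); everything else (including "tu" forms)
# falls into "other".
import json
from collections import Counter

def register(instruction: str) -> str:
    first = instruction.split()[0].lower().strip(",.!?'\"")
    if first.endswith("ez"):
        return "vous"
    if first.endswith(("er", "ir", "re")):
        return "infinitive"
    return "other"

with open("translated_alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

print(Counter(register(entry["instruction"]) for entry in data))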

What a rabbit hole!

bofenghuang (Contributor, Author)

@AngainorDev, totally agree with you about cleaning the translated dataset!

A better method might be to have some initial seed tasks in French, then generate more tasks in French as done in Stanford Alpaca. But it would cost much more than translating the provided dataset :(

For the translation, I chose ChatGPT over simple MT models (e.g., DeepL, NLLB) to avoid introducing these artifacts, but as you can see, there are still a lot...

AngainorDev (Contributor)

It's a great basis and gives nice results anyway.
Quite impressive, especially given the translation glitches on top of the already imperfect source dataset.

The dataset is not that huge, btw. A team of several French-speaking people could filter or improve it in a reasonable time.
Maybe there are distributed tools, like those used for translation (POEditor), that would allow collaborative curation of such things.
