
Add french version "vigogne" #127

Merged 1 commit into tloen:main on Mar 22, 2023

Conversation

bofenghuang (Contributor)

Hi @tloen 👋,

Thank you for this amazing work!

Here is a French version of alpaca-lora, called vigogne-lora.

The translated stanford_alpaca data and the repo will be shared as soon as possible :)

tloen (Owner) left a comment

Merci !

tloen merged commit c7eabb8 into tloen:main on Mar 22, 2023
bofenghuang deleted the patch-1 branch on March 24, 2023 at 09:36
AngainorDev (Contributor)

Thanks!

Is there a reason you translated the prompt template as well?

Eagerly waiting for the translated stanford_alpaca data!

bofenghuang (Contributor, Author)

Hi @AngainorDev,

IMO it's more consistent to have the prompt and the instruction/input in the same language. But you can also try with the default one!
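For illustration, an Alpaca-style prompt kept entirely in French could look something like the sketch below (the wording is only an example, not necessarily the exact template used in vigogne-lora):

# Illustrative sketch only: an Alpaca-style prompt kept entirely in French,
# so the prompt matches the language of the instruction/input.
# The wording is an assumption, not necessarily the exact vigogne-lora template.
def generate_prompt(instruction: str, input: str = "") -> str:
    if input:
        return (
            "Ci-dessous se trouve une instruction qui décrit une tâche, "
            "associée à une entrée qui fournit plus de contexte. "
            "Écrivez une réponse qui complète correctement la demande.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Entrée:\n{input}\n\n"
            "### Réponse:\n"
        )
    return (
        "Ci-dessous se trouve une instruction qui décrit une tâche. "
        "Écrivez une réponse qui complète correctement la demande.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Réponse:\n"
    )

print(generate_prompt("Expliquez pourquoi le ciel est bleu."))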

Just shared the translated dataset and the training script in this repo :)

AngainorDev (Contributor)

Thanks for the dataset!

I was wondering if there have been studies or tests about that precise prompt.
I've seen some people just drop the prefix sentence and use only the ### fields.
As a matter of fact, using generate without the prefix sentence works just as well, even on a model trained with it.

But on the other hand, I also tried to train LLaMA with simpler templates, and every time the training took longer to reach the same quality. I assume the LLaMA model itself, before Alpaca tuning, had already seen that style of prompt, so it's better to stick to it; otherwise it has to "override" previous knowledge, and that penalizes the training.

Also, changing the template means adjusting all related scripts as well as the output splitting, which can be an issue with ready-made text-generation tools and chatbots that don't expose that setting.
Btw, thanks for including the sample and prompt on HF so there is no confusion.

As for the dataset, maybe your repo is a better place to discuss this, but it could benefit from custom cleaning by French-speaking people, because the GPT translation has introduced artifacts.

Two quick samples:

1: Some Python code snippets - not all of them - were translated (this surfaces in the results; see the detection sketch after these two samples)

{
    "instruction": "Créez une boucle for en Python pour imprimer des nombres de 1 à 10.",
    "input": "",
    "output": "pour i dans la plage(1,11):\n    imprimer(i)"
},

2: Rhymes and similar things are broken, by nature

{
    "instruction": "Générer une liste de mots qui riment avec 'chat'.",
    "input": "",
    "output": "Les mots qui riment avec chat incluent bat, chapeau, tapis, gras, patte, rat, assis, bavardage et fuite."
},
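
For the translated-code issue in sample 1, a rough heuristic sketch like the one below could flag suspicious entries (it assumes the dataset is a JSON list of instruction/input/output records as above; the file name is a hypothetical placeholder):

# Rough heuristic sketch: flag entries whose output looks like Python code that
# was itself translated into French (e.g. "imprimer(" instead of "print(").
# Assumes a JSON list of {"instruction", "input", "output"} records, as in the
# samples above; the file name below is a hypothetical placeholder.
import json

TRANSLATED_CODE_MARKERS = ("imprimer(", "dans la plage(", "pour i dans")

with open("translated_alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

suspicious = [
    entry for entry in data
    if any(marker in entry["output"] for marker in TRANSLATED_CODE_MARKERS)
]
print(f"{len(suspicious)} entries look like translated code and need a manual pass")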

Thanks again for your work and providing the dataset!

AngainorDev (Contributor)

Another example.
This is not specific to French; all GPT-translated Alpaca datasets will face the same issues.

{
    "instruction": "Edit the given sentence to make it grammatically correct.",
    "input": "I dont has a pen",
    "output": "I don't have a pen."
},

is translated to

{
    "instruction": "Modifier la phrase donnée pour la rendre grammaticalement correcte.",
    "input": "Je n'ai pas de stylo.",
    "output": "Je n'ai pas de stylo."
},

This is a big issue.
There are several similar instructions where the source sentence deliberately contains spelling and/or grammatical errors, so the model learns correct grammar and how to fix it.
With the translated dataset, the input gets corrected by the translation, and we end up with input == output.

For one, the model loses some NLP knowledge and capabilities, but it also learns to output the same sentence it was given, and this generalizes to more outputs.
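
A quick sketch to surface these degenerate pairs (same hypothetical file name and field names as in the previous snippet):

# Sketch: find entries where the translation also "fixed" the input,
# leaving input == output, which is the failure mode described above.
import json

with open("translated_alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

degenerate = [
    entry for entry in data
    if entry["input"].strip() and entry["input"].strip() == entry["output"].strip()
]
print(f"{len(degenerate)} entries have input == output and likely need manual repair")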

Another thing I can see, though it could be a matter of personal preference, is that prompts are either in the impersonal infinitive style ("Expliquer pourquoi...") or in the active "vous" form ("Expliquez pourquoi"), but very rarely use "tu" ("Explique pourquoi").

This is a ChatGPT artifact: it prefers "vous" over "tu" by default, whereas I feel most users use either "tu" or the impersonal form when asking a bot/LM.
This style change could maybe be automated on the inputs. But knowing that the dataset uses the "vous" style is important, so you can query the model the same way for more precise answers.
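
As a rough way to measure how prevalent each register is, a crude sketch could classify instructions by the ending of their first word (only a heuristic, not real morphological analysis; same hypothetical file as in the earlier snippets):

# Crude sketch: classify each instruction's register by its first word's ending.
# "-ez" roughly means the "vous" imperative ("Expliquez"), "-er/-ir/-re" roughly
# means the infinitive ("Expliquer"); everything else (including "tu" forms)
# falls into "other".
import json
from collections import Counter

def register(instruction: str) -> str:
    first = instruction.split()[0].lower().strip(",.!?'\"")
    if first.endswith("ez"):
        return "vous"
    if first.endswith(("er", "ir", "re")):
        return "infinitive"
    return "other"

with open("translated_alpaca_data.json", encoding="utf-8") as f:
    data = json.load(f)

print(Counter(register(entry["instruction"]) for entry in data))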

What a rabbit hole!

bofenghuang (Contributor, Author)

@AngainorDev, totally agree with you about cleaning the translated dataset!

A better method might be to have some initial seed tasks in French, then generate more tasks in French as done in Stanford Alpaca. But it would cost much more than translating the provided dataset :(

For the translation, I chose ChatGPT over simple MT models (e.g., DeepL, NLLB) to avoid introducing these artifacts, but as you can see, there are still a lot...

AngainorDev (Contributor)

It's a great basis and gives nice results anyway.
Quite impressive, especially given the translation glitches on top of the already imperfect source dataset.

The dataset is not that huge, btw. A team of several French-speaking people could filter or improve it in a reasonable time.
Maybe there are distributed tools, like those used for translation (POEditor), that would allow collaborative curation of such things.
