It would be cool if there were an official finetuning script.
I have tried Qwen2.5-Coder in various sizes, but only the 32B model reached barely usable quality. The latency on an RTX 3090 was amazing with all models. 🚀
I then finetuned unsloth/Qwen2.5-Coder-7B on my own code and the resulting model was good enough for the code I usually write. If I did not have a free Copilot student subscription, I'd use this model from now on. The biggest advantage is that the context became much less important since most of it resides in the model now.
However, my finetuning script could probably use tons of improvement, so I wanted to suggest an official finetuning script written by someone with more experience with finetuning or with llama.vscode.
Some uncertainties I've had:
Which data format should be used for training? Currently, I am finetuning with the FIM template (f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>" + eos_token), but maybe I should finetune on raw code first and train on the FIM template format afterwards?
I have skipped the global context/extra chunks entirely for now. I skimmed over the technical description, but was unsure how to best sample that data. Not sure how important it is, since the model has already seen the context during finetuning, but maybe it helps?
How should the prefix/suffix/middle parts be sampled? For now, I chose them like this (see the code sketch after the list):
Choose a random file
Choose a random character index within that file
Choose the rest of the line (or the next line if the next character is a newline) as the middle to be predicted, truncated to 256 characters if the line is longer.
Completing only the current/next line is sufficient for me. I'd rather get lines one by one and press TAB if I am happy with the predicted line. But maybe other people have different tastes. This could probably be controlled in the extension instead of being baked into the model. I wanted to have at least one line all the time, even if the cursor is at the end of a line.
Choose a prefix of random length, up to 2048 characters, ending directly before the middle (is that a good length?).
Skip a random number of characters (up to 512; the number is made up) from the end of the middle until the suffix starts. The idea is that the model should not be forced to complete a function in just one line. It is fine to do it in multiple lines.
Start the suffix, of random length, after that random offset behind the middle. I chose up to 1024 characters because the suffix is probably less important than the prefix, but again, the number is entirely made up.
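To make that concrete, here is a rough sketch of the sampling in code (simplified, not my exact script; the files argument is a list of (path, text) pairs collected from the repository, and eos_token is the tokenizer's end-of-sequence token):

```python
import random

# Character limits from the description above (all of these numbers are made up).
MAX_MIDDLE = 256
MAX_PREFIX = 2048
MAX_SKIP = 512
MAX_SUFFIX = 1024

def sample_fim_example(files, eos_token):
    """Draw one FIM-formatted training sample from a list of (path, text) pairs."""
    # Choose a random file and a random character index within it.
    _, text = random.choice(files)
    i = random.randrange(len(text))

    # If the cursor sits on a newline, predict the next line instead,
    # so there is always at least one line to complete.
    if text[i] == "\n" and i + 1 < len(text):
        i += 1

    # Middle: the rest of the current line, truncated to MAX_MIDDLE characters.
    line_end = text.find("\n", i)
    if line_end == -1:
        line_end = len(text)
    middle_end = min(line_end, i + MAX_MIDDLE)
    middle = text[i:middle_end]

    # Prefix: random length, ending directly before the middle.
    prefix = text[max(0, i - random.randint(1, MAX_PREFIX)):i]

    # Skip a random gap after the middle so the model is not forced to finish
    # everything in a single line, then take a suffix of random length.
    suffix_start = middle_end + random.randint(0, MAX_SKIP)
    suffix = text[suffix_start:suffix_start + random.randint(0, MAX_SUFFIX)]

    # Assemble the sample with the Qwen2.5-Coder FIM template from above.
    return (f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}"
            f"<|fim_middle|>{middle}<|endoftext|>" + eos_token)
```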
I have not done any overly scientific ablation studies to validate any choices I've made, except for a small test with Qwen2.5-Coder-0.5B, which was not great when finetuned. It was able to recite samples from the training data, but did not generalize very well.
How long should I train, and with which context size? I trained the 7B model for about 10 hours on 60k samples on a V100 overnight, which seemed to work okay, but maybe more or less training would be better.
Should the training data be filtered by some more advanced criteria? I keep most of my code in a large repository, about 2000 files with a total size of around 11 MB. I excluded unsloth files, some autogenerated files, and very tiny files, but did no other filtering.
Which rank to choose for the LoRA? I chose 64 because it seemed like a nice number.
Should lora_alpha be adjusted?
How can I make SFTTrainer's packing option work? It only seemed to make training much slower.
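For the LoRA rank, lora_alpha, context size, and packing questions, my setup is roughly the following sketch (mostly the pattern from the Colab notebook my script is copied from, see below; exact argument names depend on the unsloth/trl versions, and the dataset here is just a one-entry placeholder for the FIM samples above):

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

max_seq_length = 4096  # training context size; one of the open questions above

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,           # LoRA rank, chosen because it seemed like a nice number
    lora_alpha=64,  # commonly set to r or 2*r; unclear what is best here
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset with a single dummy entry; the real one holds ~60k
# FIM-formatted samples drawn as sketched above.
dataset = Dataset.from_dict({"text": [
    "<|fim_prefix|>def add(a, b):\n    return<|fim_suffix|>\n<|fim_middle|> a + b<|endoftext|>"
]})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=False,  # enabling packing only seemed to slow training down for me
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,  # how long to train is another open question
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```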
Most of those questions could be answered with a validation dataset, but I have not checked whether there is one and I do not have the compute to check all possible variations anyway.

I'll attach my training code here as a starting point, but it requires some modifications to be usable. Probably still better than nothing. It is mostly copied from https://colab.research.google.com/drive/1Kose-ucXO1IBaZq5BvbwWieuubP7hxvQ

dev_dataset.py
train.py
@99991 Thank you for sharing your script and your thoughts! I like the idea of making it easier for the users of llama.vscode to finetune the model (using LoRA) on their codebase. Will do some research.