[Feature request] Finetuning script for Qwen2.5-Coder FIM #40

Open
99991 opened this issue Mar 2, 2025 · 1 comment

99991 commented Mar 2, 2025

It would be cool if there was an official finetuning script.

I have tried Qwen2.5-Coder at various sizes, but only the 32B model was even barely usable quality-wise. The latency on an RTX 3090 was amazing with all models. 🚀

I then finetuned unsloth/Qwen2.5-Coder-7B on my own code and the resulting model was good enough for the code I usually write. If I did not have a free Copilot student subscription, I'd use this model from now on. The biggest advantage is that the context became much less important since most of it resides in the model now.

However, my finetuning script can probably use tons of improvement, so I wanted to suggest an official finetuning script from someone who has more experience with finetuning or llama.vscode.

Some uncertainties I've had:

  • Which data format should be used for training? Currently, I am finetuning with the FIM template (f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>" + eos_token), but maybe I should finetune on raw code instead and teach the FIM template format in a second stage?
  • I have skipped the global context/extra chunks entirely for now. I skimmed the technical description, but was unsure how to best sample that data. I am not sure how important it is, since the model has already seen the context during finetuning, but maybe it helps?
  • How should the prefix/suffix/middle parts be sampled? For now, I chose them like this:
    1. Choose a random file.
    2. Choose a random character index within that file.
    3. Choose the rest of the line (or the next line if the next character is EOL) as the middle to be predicted, or up to 256 characters if the line is longer.
    • Completing only the current/next line is sufficient for me. I'd rather get lines one by one and press TAB when I am happy with the predicted line, but maybe other people have different tastes. This could probably be controlled in the extension instead of being baked into the model. I wanted to have at least one line all the time, even if the cursor is at the end of a line.
    4. Choose a prefix of random length (up to 2048 characters; is that a good length?) ending directly before the middle.
    5. Skip a random number of characters (up to 512; the number is made up) after the end of the middle before the suffix starts. The idea is that the model should not be forced to complete a function in just one line; it is fine to do it over multiple lines.
    6. Start the suffix of random length after that random offset behind the middle. I chose up to 1024 characters because the suffix is probably less important than the prefix, but again, the number is entirely made up.

Relative to the source file, the sampled regions then look like this (the offset part is skipped and does not appear in the final training sample):

[prefix][middle][offset][suffix]

len(prefix) < 2048
len(middle) <= 257
offset < 512
len(suffix) < 1024
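
For concreteness, here is a hand-made toy example of how one sampled region fills the FIM template (illustrative only, not taken from real training data):

# Illustrative only: a tiny file split into the regions described above
text = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"

prefix = "def add(a, b):\n    ret"   # up to 2048 characters ending right before the middle
middle = "urn a + b"                 # the rest of the current line
# a short random stretch after the middle (the "offset") is skipped entirely
suffix = "def sub(a, b):\n    return a - b\n"  # up to 1024 characters after the offset

sample = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>"
# the eos_token is appended to this string during training, as in the template above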

I have not done any rigorous ablation studies to validate these choices, except for a small test with Qwen2.5-Coder-0.5B, which was not great when finetuned: it was able to recite samples from the training data, but did not generalize very well.

  • How long to train, and at which context size? I trained the 7B model for about 10 hours on 60k samples on a V100 overnight, which seemed to work okay, but maybe more or less training would be better.
  • Should the training data be filtered by some advanced criteria? I keep most of my code in a large repository, about 2000 files with a total size of around 11MB. I excluded unsloth files, some autogenerated files and very tiny files, but did no filtering otherwise.
  • Which rank to choose for the LoRA? I chose 64 because it seemed like a nice number.
  • Should lora_alpha be adjusted?
  • How to make packing of SFTTrainer work? It only seemed to make training much slower.
  • Most of those questions could be answered with a validation dataset (rough sketch below), but I have not checked whether one exists and I do not have the compute to check all possible variations anyway.
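
A rough sketch of that validation idea, assuming yield_samples() were parameterized by a file list; split_files is a made-up helper, and the exact TrainingArguments name for the evaluation schedule depends on the installed transformers version:

# Hypothetical sketch: hold out whole files (not individual samples) so the
# validation code is truly unseen, then let the trainer report eval loss.
import random

def split_files(texts, val_fraction=0.1, seed=0):
    texts = list(texts)
    random.Random(seed).shuffle(texts)
    n_val = max(1, int(len(texts) * val_fraction))
    return texts[n_val:], texts[:n_val]  # train_texts, val_texts

# train_texts, val_texts = split_files(dev_dataset.texts)
# ...build train_dataset and val_dataset from the two lists, then:
# trainer = SFTTrainer(
#     ...,
#     train_dataset = train_dataset,
#     eval_dataset  = val_dataset,    # standard Trainer argument
#     args = TrainingArguments(
#         ...,
#         eval_strategy = "steps",    # "evaluation_strategy" in older transformers
#         eval_steps = 500,
#     ),
# )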

I'll attach my training code here as a starting point, but it requires some modifications to be usable. Probably still better than nothing. It is mostly copied from https://colab.research.google.com/drive/1Kose-ucXO1IBaZq5BvbwWieuubP7hxvQ

dev_dataset.py

from pathlib import Path
import random

# adjust this to the directory with the Python files you want to train on
train_dir = "../.."

texts = []
for path in Path(train_dir).rglob("*.py"):
    if "unsloth" in str(path): continue
    if path.name.startswith("__"): continue

    text = path.read_text()

    # Small files are probably not interesting
    if len(text) < 100: continue

    texts.append(text)

def apply_fim_template(prefix, suffix, middle=None, eos_token=None):
    text = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

    if middle is not None:
        text += f"{middle}<|endoftext|>" + eos_token

    return text

def yield_samples(n, eos_token, debug=False):
    rng = random.Random(0)

    for _ in range(n):
        text = rng.choice(texts)

        start_middle = rng.randrange(len(text) - 1)

        # Complete current line
        end_middle = start_middle + 1
        for _ in range(256):
            if end_middle >= len(text) or text[end_middle] == "\n": break
            end_middle += 1

        # make prefix start up to 2048 characters before middle
        start_prefix = max(0, start_middle - rng.randrange(2048))

        # make suffix start way after middle
        suffix_offset = rng.randrange(512)

        start_suffix = min(end_middle + suffix_offset, len(text))
        end_suffix = min(start_suffix + rng.randrange(1024), len(text))

        prefix = text[start_prefix:start_middle]
        middle = text[start_middle:end_middle]
        suffix = text[start_suffix:end_suffix]

        text = apply_fim_template(prefix, suffix, middle, eos_token)

        if debug:
            print("#" * 80)
            print(red(prefix) + green(middle) + yellow(suffix))
            print()

        yield {"text": text}

# ANSI color helpers for the debug output (prefix = red, middle = green, suffix = yellow)
def red(text):
    return f"\x1b[31m{text}\x1b[0m"

def green(text):
    return f"\x1b[32m{text}\x1b[0m"

def yellow(text):
    return f"\x1b[33m{text}\x1b[0m"

if __name__ == "__main__":
    for _ in yield_samples(5, eos_token="<TODO eos token>", debug=True):
        pass
    print("files:", len(texts))
    print(sum(len(t) for t in texts))

train.py

from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import dev_dataset
from dev_dataset import red, green, yellow

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    # More models: https://huggingface.co/unsloth
    model_name = "unsloth/Qwen2.5-Coder-7B",
    max_seq_length = max_seq_length,
    #dtype = torch.bfloat16, # bfloat16 not supported by V100
    dtype = torch.float16,
    load_in_4bit = True,
)

# Set to True for a quick test run; do this once first to ensure that saving
# works (llama.cpp failed to compile automatically on some of my computers).
quick_test = False

def gen():
    n = 10 if quick_test else 60_000
    yield from dev_dataset.yield_samples(n=n, eos_token=tokenizer.eos_token)

# The tokenizer must be loaded before this point, because Dataset.from_generator()
# runs the generator, which uses tokenizer.eos_token.
dataset = Dataset.from_generator(gen)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,

        # choose either
        num_train_epochs = 1, # Set this for 1 full training run.
        #max_steps = 60,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer_stats = trainer.train()

FastLanguageModel.for_inference(model) # Enable inference

# Test on some strings which occur in my code to check whether the model completes them correctly
for prefix, suffix in [
    ['os.path.expanduser("../../../data/rh', "def"],
    ['print(f"{gb', "def"],
    ['extern "C"\n__glob', "}"],
    ['from cup', "def"],
]:
    input_text = dev_dataset.apply_fim_template(prefix, suffix)

    inputs = tokenizer([input_text], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=64)

    # Decode only the newly generated tokens; decoding the full output and slicing
    # off len(input_text) would be misaligned, because skip_special_tokens drops
    # the FIM tokens that input_text contains.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    new_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    print(red(prefix) + green(new_text) + yellow(suffix))
    print()
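
# Optional (not part of the original script): also save the LoRA adapters on their
# own as a safeguard, in case the GGUF export below fails (e.g. when llama.cpp
# cannot be compiled automatically); these are standard PEFT/transformers calls.
model.save_pretrained("lora_adapters")
tokenizer.save_pretrained("lora_adapters")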

model.save_pretrained_gguf("model_quantized", tokenizer, quantization_method = "q4_k_s")

igardev (Collaborator) commented Mar 6, 2025

@99991 Thank you for sharing your script and your thoughts! I like the idea of making it easier for users of llama.vscode to finetune the model (using LoRA) on their own codebase. Will do some research.
