My Experiments with Finetuning Seamless M4T #459
-
Thanks for the great work! I was following your scripts, but I ran them on a multi-GPU setup (4 GPUs with 16 GB VRAM each) and came across the issue below when running your notebook on English. Do you have any ideas on this? It would be great if you could share a bit more. Thanks!

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.34 GiB. GPU 0 has a total capacity of 14.75 GiB of which 3.09 GiB is free. Process 241510 has 11.66 GiB memory in use. Of the allocated memory 8.14 GiB is allocated by PyTorch, and 3.38 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/m4t_finetune", line 8, in <module>
    sys.exit(main())
  File "/content/seamless_communication/src/seamless_communication/cli/m4t/finetune/finetune.py", line 212, in main
    finetune.run(stop_at=args.max_batches)
  File "/content/seamless_communication/src/seamless_communication/cli/m4t/finetune/trainer.py", line 401, in run
    self._eval_model()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/seamless_communication/src/seamless_communication/cli/m4t/finetune/trainer.py", line 341, in _eval_model
    loss = self.calc_loss(batch, *self.model(batch))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/seamless_communication/src/seamless_communication/cli/m4t/finetune/trainer.py", line 104, in forward
    with torch.no_grad() if self.freeze_s2t else dummy_context:  # type:ignore
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
AttributeError: 'list_iterator' object has no attribute 'throw'
```
-
When I use m4t_evaluate to load checkpoints from the fine-tuning step,
-
Hi again! I have another question. How is it possible to fine-tune using torchrun with a fixed batch size? When I fine-tune on a single GPU, I can use a batch size of 10. However, when I use two GPUs, I can't use a batch size of 10. Thanks in advance.
-
Hi, I am working on TTS for Indic languages, since some of them don't have speech support. But while fine-tuning, I am getting the following error. I don't know what is happening, since I am following the documentation.
-
Hello, I'm fine-tuning SeamlessStreaming and would like to ask if you're working on something similar that you could share?
-
Any luck fine-tuning on a new language?
-
I started working on finetuning Seamless M4T, and in this post I will describe how I got on and what the process was like. If you are unaware, finetuning is the term LLM circles use for what used to be called transfer learning. The idea is that a pretrained model may not perform at its best on every task it supports, so to improve performance on a specific task we train it further on examples of that task.

SeamlessM4T supports several tasks, such as Speech-to-Text Translation (S2TT), Text-to-Text Translation (T2TT), and transcription, also called Automatic Speech Recognition (ASR). In this post I will focus on the ASR task.
Installation and Setup
The simplest way to get the Seamless CLIs installed and running is with the commands below.
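Roughly, the setup I used installs the package from source, which also registers the CLI entry points (fairseq2 is pulled in as a dependency); your environment may need extra steps:

```bash
# Install seamless_communication from source; this registers the
# m4t_* CLI entry points (m4t_prepare_dataset, m4t_finetune, m4t_evaluate).
git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication
pip install .
```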
Preparing a Dataset for Finetuning
The first thing to do is prepare a dataset on which we will run our experiments. There are two ways to do this: use one of the datasets supported by the `m4t_prepare_dataset` CLI, or build your own manifest file.
Use Supported Datasets
Currently the `m4t_prepare_dataset` CLI supports two datasets: `google/fleurs` and `speechcolab/gigaspeech`.
Run this to download Google FLEURS into a folder:

```bash
mkdir -p fleurs/english
m4t_prepare_dataset \
  --name google/fleurs \
  --split test \
  --source_lang eng \
  --target_lang eng \
  --save_dir fleurs/english
```
To download Gigaspeech, change the argument `--name google/fleurs` to `--name speechcolab/gigaspeech`. However, you will first need to go to the Gigaspeech page, fill in the access form, and get a Hugging Face token to supply as an argument.

After running these commands you will notice that the CLI generates a `<...>_manifest.json` file. This is the file that the evaluation and finetune CLIs can digest.

Build your Own
To use your own dataset you need to write some code to generate the `manifest.json` file. This is just a long file containing the locations of the audio files and some metadata, one JSON object per line. Below I have shown an example for Gigaspeech; to use this for another dataset, you need to know its columns and build the JSON objects accordingly.
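A hedged sketch of what this can look like. The field names here are my reading of the manifests that `m4t_prepare_dataset` writes, so inspect one of its output files and adjust if your version differs:

```python
# Sketch: build a finetune manifest from Gigaspeech (one JSON object per line).
# The "source"/"target" field names are assumptions based on prepared manifests;
# verify them against a file produced by m4t_prepare_dataset.
import json
from datasets import load_dataset  # Hugging Face datasets library

# "xs" is the smallest Gigaspeech config; a gated dataset, so a token is needed.
ds = load_dataset("speechcolab/gigaspeech", "xs", split="validation", token="<HF_TOKEN>")

with open("gigaspeech_validation_manifest.json", "w") as fp:
    for idx, row in enumerate(ds):
        sample = {
            "source": {
                "id": str(idx),
                "lang": "eng",
                "text": row["text"],
                "audio_local_path": row["audio"]["path"],
                "sampling_rate": row["audio"]["sampling_rate"],
            },
            # For ASR the target is the transcription in the same language.
            "target": {
                "id": str(idx),
                "lang": "eng",
                "text": row["text"],
            },
        }
        fp.write(json.dumps(sample) + "\n")
```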
Getting Baseline Performance Figures
The next thing we want to do is get some performance metrics for the task we want to finetune on, so that we have numbers to compare results against. We simply run the model and see how well it performs out of the box.

Continuing from the Google FLEURS example, we will now start training a model for ASR. We will use the `seamlessM4T_medium` model from now on, because it is not too big to manage; the same steps apply to `seamlessM4T_small` and `seamlessM4T_v2_large`.
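For the baseline, I ran the `m4t_evaluate` CLI over the test manifest prepared earlier. A hedged sketch of the invocation; the flag names follow the CLI's help and the manifest filename is whatever the prepare step produced, so double-check both against your installed version:

```bash
# Evaluate the stock medium model on ASR using the prepared FLEURS manifest.
m4t_evaluate \
  --data_file fleurs/english/test_manifest.json \
  --task ASR \
  --tgt_lang eng \
  --model_name seamlessM4T_medium \
  --output_path eval_results
```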
The output shows the Word Error Rate (WER), which is the standard metric for evaluating ASR.
I ran the same evaluation on 6 languages; you can find the results and details in my notebook.
Running a Training Session
We are now ready to run our first round of training. For this we need to have already downloaded a dataset of our choice; we will use the `manifest.json` files of its train and validation splits.

There are various modes in which we can run our finetune:
Training the Full Model
Run the `m4t_finetune` CLI to start training a model, as in the sketch below. We use a small batch size and a very small learning rate, and we use a patience of 10 evaluations, which means that if the loss does not improve after 10 evaluations we stop training (early stopping).

This method trains the whole model, so we need to make sure we don't use a very high learning rate, as that can lead to the model forgetting its pre-training and focusing only on the finetune examples.
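A hedged sketch of the invocation; the flag names follow the finetune CLI's help, and the learning rate, batch size, and paths are illustrative, so verify them against your version:

```bash
# Full-model finetuning on the prepared FLEURS manifests.
m4t_finetune \
  --mode SPEECH_TO_TEXT \
  --train_dataset fleurs/english/train_manifest.json \
  --eval_dataset fleurs/english/validation_manifest.json \
  --model_name seamlessM4T_medium \
  --batch_size 10 \
  --learning_rate 1e-6 \
  --patience 10 \
  --save_model_to checkpoint.pt
```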
Training with Frozen Layers
Alternatively, you can choose to train only some parts of the model that you expect to improve performance. Usually the later layers are trained for this purpose, as it is assumed that these will have learned the task-specific details that we want to finetune.
Run the `m4t_finetune` CLI with the `--freeze_layers` option and the names of the layers to freeze.

Naturally you may not know the names of all the layers in an M4T model. To get them, all you need to do is iterate over `model.named_parameters()`, as in the sketch below. This will give you a pretty long list that includes all the nested modules, but you can use the name of a parent module to freeze everything inside it. For example, `--freeze_layers model.text_decoder` freezes all of `model.text_decoder.layers[0...6]`.
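A minimal sketch of listing the layer names. The `load_unity_model` import is my assumption about how to load the model outside the trainer, so adjust it to your version of the repo:

```python
import torch
# Assumed loader; the finetune trainer loads the model in a similar way.
from seamless_communication.models.unity import load_unity_model

model = load_unity_model("seamlessM4T_medium", device=torch.device("cpu"))

# Print every parameter name; parent prefixes (e.g. "text_decoder")
# can be passed to --freeze_layers to freeze whole submodules.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
```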
You can find the names of the layers in `seamlessM4T_medium` in my notebook.
Evaluating the Trained Model
Now that we have a trained model checkpoint, we can evaluate its performance again to see how much better the new model is compared to the stock model.
Run the `m4t_evaluate` CLI with the `--load_checkpoint` option to load the checkpoint, as in the sketch below.

However, this takes quite a while because it evaluates over the whole test dataset. Alternatively, you can run a mini-evaluation, which you will find in my Finetuning with Frozen Layers notebook. I recommend the mini-evaluation, because it makes iterations faster.
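A hedged sketch of the full evaluation; the flags mirror the baseline run, with `--load_checkpoint` pointing at the weights saved by `m4t_finetune`:

```bash
# Re-run the ASR evaluation, this time loading the finetuned checkpoint.
m4t_evaluate \
  --data_file fleurs/english/test_manifest.json \
  --task ASR \
  --tgt_lang eng \
  --model_name seamlessM4T_medium \
  --load_checkpoint checkpoint.pt \
  --output_path eval_results_finetuned
```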
Here is an example of the output from the evaluation.
Results
Here I will also mention the results I got from the different modes of finetuning.
Though I know I could have gotten a lower error rate with full tuning, the interesting fact is that partial tuning gave almost the same score: I got nearly identical results while training only around 40% of the model weights.
Notes and Observations
Further Work - LoRA Finetuning
Low-Rank Adaptation (LoRA) is a fine-tuning method that can use even less memory than freezing layers. It is yet to be implemented in the finetune CLI, although the basic components already exist in FairSeq2; the implementation would just be adding another option to the CLI to choose LoRA finetuning and set its parameters.
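To give a flavour of the idea, here is a generic PyTorch sketch of a LoRA layer (my own illustration, not the FairSeq2 API): the pretrained weight is frozen and only a low-rank update is trained, which is where the memory savings come from.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad_(False)
        # A starts as small noise and B as zeros, so training begins
        # from the unmodified pretrained behaviour.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```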
Training Hardware
All of this was run on an A100 GPU provided by Google Colab Pro+, with 40 GB of GPU RAM and 85 GB of system RAM. The `seamlessM4T_v2_large` model needs more memory than this, so I based all of this on the medium model. However, it may be possible to run the large model with frozen layers.
Notebooks