
Garbled output during whisper v3 fine-tuning #13

Open
Kevinarcsin001 opened this issue Dec 21, 2024 · 6 comments

@Kevinarcsin001

I'm running two RTX 4090 GPUs; the experiment data is AISHELL-1 with punctuation.
My environment is as follows:
Package Version Editable project location


accelerate 0.28.0
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.7.0
async-timeout 5.0.1
attrs 24.2.0
audioread 3.0.1
av 14.0.1
bitsandbytes 0.41.3
Brotli 1.0.9
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.3.2
click 8.1.7
coloredlogs 15.0.1
ctranslate2 4.5.0
dataclasses 0.6
datasets 3.2.0
decorator 5.1.1
dill 0.3.8
evaluate 0.4.3
exceptiongroup 1.2.2
fastapi 0.115.6
faster-whisper 1.1.0
filelock 3.13.1
flatbuffers 24.3.25
frozenlist 1.5.0
fsspec 2024.9.0
gmpy2 2.1.2
h11 0.14.0
huggingface-hub 0.26.5
humanfriendly 10.0
idna 3.10
Jinja2 3.1.4
jiwer 3.0.5
joblib 1.4.2
lazy_loader 0.4
librosa 0.10.2.post1
llvmlite 0.43.0
MarkupSafe 3.0.2
mkl_fft 1.3.11
mkl_random 1.2.8
mkl-service 2.4.0
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.16
networkx 3.2.1
numba 0.60.0
numpy 2.0.1
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
onnxruntime 1.16.3
packaging 24.2
pandas 2.2.3
peft 0.7.0 # my own path; installed from source
pillow 11.0.0
pip 24.2
platformdirs 4.3.6
pooch 1.8.2
propcache 0.2.1
protobuf 5.29.1
psutil 6.1.0
pyarrow 18.1.0
pycparser 2.22
pydantic 2.10.3
pydantic_core 2.27.1
pydub 0.25.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
RapidFuzz 3.10.1
regex 2024.11.6
requests 2.32.3
safetensors 0.4.5
scikit-learn 1.5.2
scipy 1.13.1
setuptools 75.1.0
six 1.17.0
sniffio 1.3.1
SoundCard 0.4.3
soundfile 0.12.1
soxr 0.5.0.post1
starlette 0.41.3
sympy 1.13.1
tensorboardX 2.6.2.2
threadpoolctl 3.5.0
tokenizers 0.21.0
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
tqdm 4.67.1
transformers 4.47.0
triton 3.1.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.1
wheel 0.44.0
xxhash 3.5.0
yarl 1.18.3
zhconv 1.4.3

The experiment results are as follows:

[screenshot of garbled fine-tuning output]

Could you offer some suggestions? Thanks!

@gody7334

I have the same issue. I reckon it's due to the discrepancy between the training (and testing) data and the real data: the model overfits to the training data.

You need to start thinking about what kind of data (or augmentation) would better represent your real data.

@shuaijiang
Owner

Have you checked that the training data (text) is in UTF-8 format?
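For reference, a minimal sketch of that check (the path is a placeholder; point it at your own transcript or data-list files):

    # Verify that a transcript file decodes cleanly as UTF-8.
    from pathlib import Path

    def check_utf8(path):
        raw = Path(path).read_bytes()
        try:
            raw.decode("utf-8")
            print(f"{path}: OK (valid UTF-8)")
        except UnicodeDecodeError as e:
            print(f"{path}: invalid UTF-8 at byte {e.start}: {raw[e.start:e.start + 8]!r}")

    check_utf8("./dataset/train.json")  # hypothetical path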

@Kevinarcsin001
Author

I have checked, and the data is fine; it is UTF-8 encoded. I rented an A800 GPU with 80 GB of memory and tried fine-tuning all parameters with the script 'finetune_all.py', but the output still contained garbled characters. May I ask how you prepare data on your end? Could you share your training parameters and environment information?

@gody7334

gody7334 commented Jan 14, 2025

Use this project; it is actively maintained: https://github.com/yeyupiaoling/Whisper-Finetune
Try fine-tuning with the WenetSpeech dataset. The AISHELL data is too short, which can break the original model: the model overfits to short audio rather than to full 30-second audio.

I have successfully trained on the WenetSpeech dataset without getting garbled output.
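If you want to verify how short your clips actually are, here is a quick sketch (illustrative; it assumes you have a list of wav paths, so adapt it to your manifest format):

    # Summarize clip durations to see how far the data is from the 30 s window.
    import numpy as np
    import soundfile as sf

    def duration_stats(wav_paths):
        durs = np.array([sf.info(p).duration for p in wav_paths])
        print(f"n={len(durs)}  mean={durs.mean():.1f}s  "
              f"median={np.median(durs):.1f}s  max={durs.max():.1f}s")
        print(f"clips >= 20 s: {(durs >= 20).mean():.1%}")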

@Kevinarcsin001
Author

Thank you for your guidance. I fine-tuned with the WenetSpeech dataset, using a learning rate of 1e-5 and 200 hours of data, with audio lengths between 20 and 30 seconds, but the output is still garbled. Could it be a problem with my environment? Can you give me some training advice? Which fine-tuning script from https://github.com/yeyupiaoling/Whisper-Finetune did you use?

[screenshot of the garbled transcription output]

@gody7334

gody7334 commented Jan 20, 2025

Try writing the data out to files and listening to what you are actually feeding the model:

  • In reader.py, around the augmentation step, write out the audio and the label (requires import soundfile and import json at the top of reader.py):

        soundfile.write(f'./auditorize/{idx}-origin.wav', sample, sample_rate)

        # data augmentation
        if self.augment_configs:
            sample, sample_rate, transcript = self.augment(sample, sample_rate, transcript)
        # resample
        if self.sample_rate != sample_rate:
            sample = self.resample(sample, orig_sr=sample_rate, target_sr=self.sample_rate)

        soundfile.write(f'./auditorize/{idx}-augmented.wav', sample, self.sample_rate)
        with open(f'./auditorize/{idx}-transcript.json', 'w', encoding='utf8') as json_file:
            json.dump(transcript, json_file)

  • Write a simple uttest.py to loop through the data like this:

        import ipdb
        from pprint import pprint as pp
        from transformers import WhisperProcessor
        from utils.reader import CustomDataset  # adjust the import path to your project layout

        def test_dataset():
            processor = WhisperProcessor.from_pretrained(
                "./output/whisper-large-v3-turbo-tw/",
                language="Chinese",
                task="transcribe",
                no_timestamps=False,
                local_files_only=False)

            train_dataset = CustomDataset(
                # data_list_path="./dataset/data-aishell-1/train.json",
                # data_list_path="./dataset/data-wenetspeech/dataset/train_wenet.json",
                data_list_path="./dataset/data-taiwanese-youtube/dataset/train_taiwanese.json",
                processor=processor,
                language="taiwanese",
                timestamps=True,
                min_duration=0.5,
                max_duration=30,
                augment_config_path='./configs/augmentation.json')

            for d in train_dataset:
                ipdb.set_trace()
                pp(d)
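When stepping through, a useful extra check is to decode the label ids back to text and confirm they round-trip without mojibake. A sketch to drop inside the loop above (it assumes each sample dict carries its token ids under "labels"; adjust the key to your dataset):

    # Decode label ids back to text to spot encoding problems early.
    ids = [t for t in d["labels"] if t != -100]  # drop ignore-index padding, if present
    print(processor.tokenizer.decode(ids, skip_special_tokens=True))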

Another thing is timestamps: they will break the model easily if you don't handle them properly. In the original paper, they train 50% of the time with timestamps and 50% without. I also modified my setup to replicate the original training process.
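For the 50/50 timestamp mix, the idea is a per-sample coin flip at data-loading time. A hypothetical sketch (not this project's actual code):

    import random

    def pick_transcript(text_with_timestamps, text_plain, p_timestamps=0.5):
        # The Whisper paper trains on timestamped targets roughly half the
        # time; flipping a coin per sample approximates that mix.
        if random.random() < p_timestamps:
            return text_with_timestamps
        return text_plain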
