
Garbled output during whisper v3 fine-tuning #13

Open
Kevinarcsin001 opened this issue Dec 21, 2024 · 6 comments

@Kevinarcsin001

I'm running two RTX 4090 GPUs; the experiment data is AISHELL-1 with punctuation.
My environment is as follows:
Package Version Editable project location


accelerate 0.28.0
aiohappyeyeballs 2.4.4
aiohttp 3.11.10
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.7.0
async-timeout 5.0.1
attrs 24.2.0
audioread 3.0.1
av 14.0.1
bitsandbytes 0.41.3
Brotli 1.0.9
certifi 2024.8.30
cffi 1.17.1
charset-normalizer 3.3.2
click 8.1.7
coloredlogs 15.0.1
ctranslate2 4.5.0
dataclasses 0.6
datasets 3.2.0
decorator 5.1.1
dill 0.3.8
evaluate 0.4.3
exceptiongroup 1.2.2
fastapi 0.115.6
faster-whisper 1.1.0
filelock 3.13.1
flatbuffers 24.3.25
frozenlist 1.5.0
fsspec 2024.9.0
gmpy2 2.1.2
h11 0.14.0
huggingface-hub 0.26.5
humanfriendly 10.0
idna 3.10
Jinja2 3.1.4
jiwer 3.0.5
joblib 1.4.2
lazy_loader 0.4
librosa 0.10.2.post1
llvmlite 0.43.0
MarkupSafe 3.0.2
mkl_fft 1.3.11
mkl_random 1.2.8
mkl-service 2.4.0
mpmath 1.3.0
msgpack 1.1.0
multidict 6.1.0
multiprocess 0.70.16
networkx 3.2.1
numba 0.60.0
numpy 2.0.1
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
onnxruntime 1.16.3
packaging 24.2
pandas 2.2.3
peft 0.7.0 # my own path; installed from source
pillow 11.0.0
pip 24.2
platformdirs 4.3.6
pooch 1.8.2
propcache 0.2.1
protobuf 5.29.1
psutil 6.1.0
pyarrow 18.1.0
pycparser 2.22
pydantic 2.10.3
pydantic_core 2.27.1
pydub 0.25.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
RapidFuzz 3.10.1
regex 2024.11.6
requests 2.32.3
safetensors 0.4.5
scikit-learn 1.5.2
scipy 1.13.1
setuptools 75.1.0
six 1.17.0
sniffio 1.3.1
SoundCard 0.4.3
soundfile 0.12.1
soxr 0.5.0.post1
starlette 0.41.3
sympy 1.13.1
tensorboardX 2.6.2.2
threadpoolctl 3.5.0
tokenizers 0.21.0
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1
tqdm 4.67.1
transformers 4.47.0
triton 3.1.0
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
uvicorn 0.32.1
wheel 0.44.0
xxhash 3.5.0
yarl 1.18.3
zhconv 1.4.3

The experiment results are as follows:

[screenshot of garbled fine-tuning output]

Could you offer some suggestions? Thanks!

@gody7334

I have the same issue. I reckon it's due to the discrepancy between the training (and testing) data and the real data: the model overfits to the training data.

You need to start thinking about what kind of data (or augmentation) would better represent your real data.

@shuaijiang
Owner

Have you checked that the training data (text) is in UTF-8 format?
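For reference, a minimal sketch of that check (the path is a placeholder; point it at your own transcript or data-list files):

    # Verify that a transcript file decodes cleanly as UTF-8.
    from pathlib import Path

    def check_utf8(path):
        raw = Path(path).read_bytes()
        try:
            raw.decode("utf-8")
            print(f"{path}: OK (valid UTF-8)")
        except UnicodeDecodeError as e:
            print(f"{path}: invalid UTF-8 at byte {e.start}: {raw[e.start:e.start + 8]!r}")

    check_utf8("./dataset/train.json")  # hypothetical path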

@Kevinarcsin001
Author

I have checked, and the data is fine; it is UTF-8 encoded. I rented an A800 GPU with 80 GB of memory and tried fine-tuning all parameters with the script 'finetune_all.py', but the output still contained garbled characters. May I ask how you prepare data on your end? Could you share your training parameters and environment information?

@gody7334

gody7334 commented Jan 14, 2025

Use this project; it is actively maintained: https://github.com/yeyupiaoling/Whisper-Finetune
Try fine-tuning with the WenetSpeech dataset. The AISHELL data is too short, which can break the original model: the model overfits to short audio rather than to full 30-second audio.

I have successfully trained on the WenetSpeech dataset without getting garbled output.
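If you want to verify how short your clips actually are, here is a quick sketch (illustrative; it assumes you have a list of wav paths, so adapt it to your manifest format):

    # Summarize clip durations to see how far the data is from the 30 s window.
    import numpy as np
    import soundfile as sf

    def duration_stats(wav_paths):
        durs = np.array([sf.info(p).duration for p in wav_paths])
        print(f"n={len(durs)}  mean={durs.mean():.1f}s  "
              f"median={np.median(durs):.1f}s  max={durs.max():.1f}s")
        print(f"clips >= 20 s: {(durs >= 20).mean():.1%}")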

@Kevinarcsin001
Author

Thank you for your guidance. I fine-tuned with the WenetSpeech dataset, using a learning rate of 1e-5 and 200 hours of data, with audio lengths between 20 and 30 seconds, but the output is still garbled. Could it be a problem with my environment? Can you give me some training advice? Which fine-tuning script from https://github.com/yeyupiaoling/Whisper-Finetune did you use?

[screenshot of the garbled transcription output]

@gody7334

gody7334 commented Jan 20, 2025

Try writing the data out to files and listening to what you are actually feeding the model:

  • In reader.py, around the augmentation step, write out the audio and the label (requires import soundfile and import json at the top of reader.py):

        soundfile.write(f'./auditorize/{idx}-origin.wav', sample, sample_rate)

        # data augmentation
        if self.augment_configs:
            sample, sample_rate, transcript = self.augment(sample, sample_rate, transcript)
        # resample
        if self.sample_rate != sample_rate:
            sample = self.resample(sample, orig_sr=sample_rate, target_sr=self.sample_rate)

        soundfile.write(f'./auditorize/{idx}-augmented.wav', sample, self.sample_rate)
        with open(f'./auditorize/{idx}-transcript.json', 'w', encoding='utf8') as json_file:
            json.dump(transcript, json_file)

  • Write a simple uttest.py to loop through the data like this:

        import ipdb
        from pprint import pprint as pp
        from transformers import WhisperProcessor
        from utils.reader import CustomDataset  # adjust the import path to your project layout

        def test_dataset():
            processor = WhisperProcessor.from_pretrained(
                "./output/whisper-large-v3-turbo-tw/",
                language="Chinese",
                task="transcribe",
                no_timestamps=False,
                local_files_only=False)

            train_dataset = CustomDataset(
                # data_list_path="./dataset/data-aishell-1/train.json",
                # data_list_path="./dataset/data-wenetspeech/dataset/train_wenet.json",
                data_list_path="./dataset/data-taiwanese-youtube/dataset/train_taiwanese.json",
                processor=processor,
                language="taiwanese",
                timestamps=True,
                min_duration=0.5,
                max_duration=30,
                augment_config_path='./configs/augmentation.json')

            for d in train_dataset:
                ipdb.set_trace()
                pp(d)
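When stepping through, a useful extra check is to decode the label ids back to text and confirm they round-trip without mojibake. A sketch to drop inside the loop above (it assumes each sample dict carries its token ids under "labels"; adjust the key to your dataset):

    # Decode label ids back to text to spot encoding problems early.
    ids = [t for t in d["labels"] if t != -100]  # drop ignore-index padding, if present
    print(processor.tokenizer.decode(ids, skip_special_tokens=True))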

Another thing is timestamps: they will break the model easily if you don't handle them properly. In the original paper, they train 50% of the time with timestamps and 50% without. I also modified my setup to replicate the original training process.
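For the 50/50 timestamp mix, the idea is a per-sample coin flip at data-loading time. A hypothetical sketch (not this project's actual code):

    import random

    def pick_transcript(text_with_timestamps, text_plain, p_timestamps=0.5):
        # The Whisper paper trains on timestamped targets roughly half the
        # time; flipping a coin per sample approximates that mix.
        if random.random() < p_timestamps:
            return text_with_timestamps
        return text_plain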
