Fine-tuning on a mix of text-only and multimodal instruction data #179

Luccadoremi · 2023-11-27T11:05:18Z

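Fine-tuning on multimodal data alone or on text-only data alone works without problems.

But when text-only and multimodal instruction data are mixed in one training run, training hangs. Could there be a problem in the data-processing logic?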

vealocia · 2023-11-28T22:54:57Z

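Hello, thanks for your interest in our work.
I suggest checking whether mixing multimodal and text-only data during fine-tuning leaves the ViT without gradients on some GPUs. If that happens, training will hang outright.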
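
One way to test this hypothesis before launching a run is to check, for each per-GPU micro-batch, whether it contains at least one image sample. The sketch below is a minimal, framework-free illustration. It assumes samples follow the Qwen-VL conversation format, in which images appear as `<img>...</img>` tags inside the message text, and that samples are dealt to ranks in simple contiguous chunks. The function names and the chunking scheme are hypothetical; a real distributed sampler may shuffle and shard differently.

```python
from typing import Dict, List, Tuple

IMG_TAG = "<img>"  # assumption: images are referenced as <img>path</img> in the text


def has_image(sample: Dict) -> bool:
    """Return True if any turn of the conversation references an image."""
    return any(IMG_TAG in turn.get("value", "") for turn in sample.get("conversations", []))


def text_only_batches(samples: List[Dict], world_size: int, per_gpu_batch: int) -> List[Tuple[int, int]]:
    """List (step, rank) pairs whose micro-batch contains no image sample.

    Assumes contiguous chunking: at step s, rank r receives the slice that
    starts at (s * world_size + r) * per_gpu_batch.
    """
    suspicious = []
    global_batch = world_size * per_gpu_batch
    num_steps = len(samples) // global_batch  # the incomplete last step is dropped
    for step in range(num_steps):
        for rank in range(world_size):
            start = (step * world_size + rank) * per_gpu_batch
            chunk = samples[start:start + per_gpu_batch]
            if not any(has_image(s) for s in chunk):
                suspicious.append((step, rank))
    return suspicious
```

Any `(step, rank)` pair it reports is a step at which that rank would run no image through the ViT, which is the situation described above.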

Luccadoremi · 2023-11-29T02:34:13Z

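Hello, thanks for your interest in our work. I suggest checking whether mixing multimodal and text-only data during fine-tuning leaves the ViT without gradients on some GPUs. If that happens, training will hang outright.

Thanks for the reply. Is there a concrete way to fix this?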
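
One data-side workaround that follows from the diagnosis quoted above is to order the training data so that every per-GPU micro-batch contains at least one image sample, so the ViT takes part in the backward pass on every rank. The sketch below is an illustration under stated assumptions, not the maintainers' fix: `build_mixed_batches` is a hypothetical helper, it reuses image samples cyclically when there are fewer of them than batches, and it only helps if the sampler preserves this grouping (no shuffling across batch boundaries).

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_mixed_batches(image_samples: Sequence[T], text_samples: Sequence[T], batch_size: int) -> List[List[T]]:
    """Group samples into micro-batches that each contain at least one image sample.

    Each mixed batch holds one image sample followed by up to batch_size - 1
    text samples. Image samples are reused cyclically if they run out first;
    image samples left over at the end form image-only batches.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to mix both kinds of sample")
    if not image_samples:
        raise ValueError("at least one image sample is required")

    text_per_batch = batch_size - 1
    n_img = len(image_samples)
    used = 0  # how many image slots have been filled so far
    batches: List[List[T]] = []
    for start in range(0, len(text_samples), text_per_batch):
        batch = [image_samples[used % n_img]]
        used += 1
        batch.extend(text_samples[start:start + text_per_batch])
        batches.append(batch)
    # Image samples not consumed above are batched on their own.
    for start in range(used, n_img, batch_size):
        batches.append(list(image_samples[start:start + batch_size]))
    return batches
```

The last batches may be shorter than `batch_size`, so a collator that pads or drops incomplete batches is still needed.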

ShuaiBai623 · 2023-12-05T07:32:40Z

Please sync the latest code from Hugging Face; it resolves joint training on text-only and image data.

Luccadoremi · 2023-12-06T01:48:56Z

thx

tcye · 2023-12-06T06:20:03Z

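Please sync the latest code from Hugging Face; it resolves joint training on text-only and image data.

@ShuaiBai623 Has the code actually been uploaded to Hugging Face? The only recent change seems to be a tokenizer edit that added a space. After syncing the latest code, training still hangs.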

limitedfxw · 2023-12-11T10:33:47Z

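Please sync the latest code from Hugging Face; it resolves joint training on text-only and image data.

Hugging Face currently only has the tokenizer update. That can't fix the hang in mixed training, can it?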

ShuaiBai623 · 2023-12-18T07:41:28Z

Are there any new errors?

ZhihuaGao · 2023-12-18T17:02:06Z

@ShuaiBai623 Has this issue been resolved?

ZhihuaGao · 2023-12-19T05:50:31Z

thx

Did you manage to solve this problem?

ShuaiBai623 · 2023-12-19T07:50:50Z

hi @ZhihuaGao, these updates are aimed at mixed training.

luxinglong · 2023-12-21T07:10:07Z

I updated, and it still hangs.

ZhihuaGao · 2023-12-21T07:11:57Z

That's not what I see; after updating, it works fine for me.

luxinglong · 2023-12-21T12:44:15Z

Is your setup full-parameter fine-tuning with ZeRO-3? @ZhihuaGao
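
Since the answer depends on the DeepSpeed config actually passed to the launcher, it can help to read the stage straight from the JSON file when comparing setups. A minimal sketch, assuming the standard DeepSpeed layout in which the stage lives under `zero_optimization.stage` and a missing entry means stage 0; the helper name and the example config are illustrative and not taken from this thread.

```python
import json
from typing import Any, Dict


def zero_stage(ds_config: Dict[str, Any]) -> int:
    """Return the ZeRO stage from a DeepSpeed config dict (0 if not configured)."""
    return int(ds_config.get("zero_optimization", {}).get("stage", 0))


# Illustrative config fragment; load your own file with json.load(open(path)).
example = json.loads('{"zero_optimization": {"stage": 3}, "bf16": {"enabled": "auto"}}')
print(zero_stage(example))
```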

chuangzhidan · 2024-03-28T09:50:13Z

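hi @ZhihuaGao, these updates are aimed at mixed training.

A question for the Qwen-VL team: I changed only img_size, more than doubling the resolution, and after fine-tuning on 3.5k samples hallucinations increased significantly. Is it a mistake to change only the config?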

TAOSHss · 2024-04-03T02:42:53Z

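hi @ZhihuaGao, these updates are aimed at mixed training.

A question for the Qwen-VL team: I changed only img_size, more than doubling the resolution, and after fine-tuning on 3.5k samples hallucinations increased significantly. Is it a mistake to change only the config?

The technical guide says stage 2 and stage 3 training both use 448*448. In my view, simply enlarging the resolution requires considerably more training data, and all ViT parameters need to be unfrozen for it to work.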
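
To make the cost of a resolution change concrete, the arithmetic below counts the patch tokens a ViT produces at a given input size. It assumes square images and a patch size of 14; the patch size is an assumption about the vision encoder rather than something stated in this thread, so substitute the value from your model config.

```python
def num_patches(img_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (patch_size=14 is an assumption)."""
    if img_size % patch_size != 0:
        raise ValueError("img_size should be a multiple of patch_size")
    side = img_size // patch_size
    return side * side


base = num_patches(448)     # 32 x 32 grid of patches
doubled = num_patches(896)  # 64 x 64 grid of patches
print(base, doubled, doubled / base)
```

Doubling the side length quadruples the number of patch tokens, and positional embeddings learned on the smaller grid have to be adapted to the larger one. That is consistent with the point that a few thousand samples may be too little for the change.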

chuangzhidan · 2024-04-08T02:04:21Z

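hi @ZhihuaGao, these updates are aimed at mixed training.

A question for the Qwen-VL team: I changed only img_size, more than doubling the resolution, and after fine-tuning on 3.5k samples hallucinations increased significantly. Is it a mistake to change only the config?

The technical guide says stage 2 and stage 3 training both use 448*448. In my view, simply enlarging the resolution requires considerably more training data, and all ViT parameters need to be unfrozen for it to work.

I'm trying that, but I ran into an error.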
self._dummy_overflow_buf = get_accelerator().IntTensor([0])
Traceback (most recent call last):
File "finetune.py", line 367, in <module>
train()
File "finetune.py", line 360, in train
trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1675, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1255, in prepare
result = self._prepare_deepspeed(*args)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1234, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1497, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in __init__
self.initialize_optimizer_states()
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 647, in initialize_optimizer_states
self.optimizer.step()
File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb49b474617 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb49b42f98d in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb49b530128 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x16e76 (0x7fb49b4f8e76 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x19bad (0x7fb49b4fbbad in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x19fcd (0x7fb49b4fbfcd in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x510d36 (0x7fb4de26dd36 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x55ca7 (0x7fb49b459ca7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7fb49b451cb3 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb49b451e49 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0x7c18f8 (0x7fb4de51e8f8 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fb4de51eca5 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x1586a7 (0x557bd830d6a7 in /root/miniconda3/bin/python)
frame #13: _PyModule_ClearDict + 0x714 (0x557bd8363364 in /root/miniconda3/bin/python)
frame #14: PyImport_Cleanup + 0x537 (0x557bd838af47 in /root/miniconda3/bin/python)
frame #15: Py_FinalizeEx + 0x79 (0x557bd83bca49 in /root/miniconda3/bin/python)
frame #16: Py_RunMain + 0x183 (0x557bd83be893 in /root/miniconda3/bin/python)
frame #17: Py_BytesMain + 0x39 (0x557bd83beca9 in /root/miniconda3/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fb51cd12083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: <unknown function> + 0x1e21c7 (0x557bd83971c7 in /root/miniconda3/bin/python)
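
Because CUDA kernels run asynchronously, the Python frame reported for an illegal memory access is not necessarily the one that caused it. A common first debugging step is to rerun with `CUDA_LAUNCH_BLOCKING=1` so the error surfaces at the offending call. The sketch below builds such an environment and a launch command. The script name comes from the traceback; the single-process launch and the helper names are assumptions, and a multi-GPU run would go through your usual launcher instead.

```python
import os
import sys
from typing import Dict, List, Mapping


def debug_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy an environment and force synchronous CUDA kernel launches."""
    env = dict(base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the failing call
    return env


def debug_command(script: str, extra_args: List[str]) -> List[str]:
    """Build a single-process command line for the training script."""
    return [sys.executable, script, *extra_args]


# Usage (not executed here; pass the same arguments as the failing run):
#   import subprocess
#   subprocess.run(debug_command("finetune.py", args), env=debug_env(os.environ), check=True)
```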

HalcyonLiang · 2024-07-02T12:08:18Z

Is your setup full-parameter fine-tuning with ZeRO-3? @ZhihuaGao

Has the mixed-training problem been solved?

liuyijiang1994 · 2025-01-06T08:49:10Z

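Is your setup full-parameter fine-tuning with ZeRO-3? @ZhihuaGao

Has the mixed-training problem been solved?

Has it been solved? Does anyone have an idea of how to approach it?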

liuyijiang1994 · 2025-01-06T08:49:30Z

That's not what I see; after updating, it works fine for me.

Are you using ZeRO-3?
