
Mixing pure-text instruction data and multimodal instruction data for finetuning #179

Luccadoremi opened this issue Nov 27, 2023 · 19 comments

@Luccadoremi

I've found that finetuning on multimodal data alone, or on pure-text data alone, works fine.

But if pure-text instruction data and multimodal instruction data are mixed and trained together, training hangs. Could there be a problem in the data-handling logic?

@vealocia
Contributor

Hi, thanks for your interest in our work.
I'd suggest checking whether the mixed multimodal + pure-text finetune leaves the ViT without gradients on some GPUs; if that happens, training will hang outright.
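For reference, a minimal sketch of the usual workaround for this class of hang, assuming the visual tower is reachable at model.transformer.visual as in Qwen-VL; the helper name is hypothetical and this is not the repository's official fix. The idea: on ranks whose current batch is text-only, push a dummy image through the ViT and fold the result into the loss with weight zero, so every rank still produces ViT gradients and the distributed collectives stay in sync.

```python
import torch

def keep_vit_in_graph(model, loss, image_size=448):
    """Hypothetical helper: call on steps where this rank's batch has no
    images, so the otherwise-unused ViT still receives (zero-valued)
    gradients instead of stalling the gradient all-reduce."""
    param = next(model.transformer.visual.parameters())
    dummy = torch.zeros(1, 3, image_size, image_size,
                        dtype=param.dtype, device=param.device)
    vis_out = model.transformer.visual(dummy)
    # Weight 0 leaves the loss value unchanged but keeps the ViT in the graph.
    return loss + 0.0 * vis_out.sum()
```

Used as `loss = keep_vit_in_graph(model, loss)` right before the backward pass on text-only steps; multimodal steps are unchanged.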

@Luccadoremi
Author

Hi, thanks for your interest in our work. I'd suggest checking whether the mixed multimodal + pure-text finetune leaves the ViT without gradients on some GPUs; if that happens, training will hang outright.

Thanks for the reply. Is there a concrete way to fix this?

@ShuaiBai623
Collaborator

You can sync the latest code from Hugging Face; it resolves joint training on pure text and images.
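If it is unclear whether a stale local cache is still shadowing the updated remote code, a small sketch for forcing a fresh pull; `trust_remote_code` and `force_download` are standard `from_pretrained` arguments, and the checkpoint name here is just an example, substitute whichever one you finetune from:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Force a fresh download so the cached copy of the remote modeling code
# does not shadow the updated version on the Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True, force_download=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True, force_download=True)
```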

@Luccadoremi
Author

Luccadoremi commented Dec 6, 2023 via email

@tcye

tcye commented Dec 6, 2023

You can sync the latest code from Hugging Face; it resolves joint training on pure text and images.

@ShuaiBai623 Has updated code actually been uploaded to Hugging Face? The only recent change seems to be the tokenizer adding a space. After syncing the latest code, training still hangs.

@limitedfxw

You can sync the latest code from Hugging Face; it resolves joint training on pure text and images.

Only the tokenizer has been updated on Hugging Face so far; that can't fix the hang in mixed training, can it?

@ShuaiBai623
Collaborator

Are there any new errors?

@ZhihuaGao

@ShuaiBai623 Has this issue been resolved?

@ZhihuaGao

thx

Have you solved this problem?

@ShuaiBai623
Collaborator

[screenshot of the recent commits]
Hi @ZhihuaGao, these updates are specifically for mixed training.

@luxinglong

I updated, and it still hangs.

@ZhihuaGao

ZhihuaGao commented Dec 21, 2023 via email

@luxinglong

Is your setup full-parameter finetuning with ZeRO-3? @ZhihuaGao

@chuangzhidan

[screenshot of the recent commits] Hi @ZhihuaGao, these updates are specifically for mixed training.

A question for the Qwen-VL team: I only changed img_size, raising the resolution to a bit more than double, and after finetuning on about 3.5k samples the hallucinations increased noticeably. Is changing only that config setting a problem?

@TAOSHss

TAOSHss commented Apr 3, 2024

[screenshot of the recent commits] Hi @ZhihuaGao, these updates are specifically for mixed training.

A question for the Qwen-VL team: I only changed img_size, raising the resolution to a bit more than double, and after finetuning on about 3.5k samples the hallucinations increased noticeably. Is changing only that config setting a problem?

In the technical report, stage 2 and stage 3 are both trained at 448*448. If you only enlarge the resolution, I think you need considerably more data and have to fully unfreeze the ViT parameters for training.
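A hedged sketch of the mechanical part of that suggestion (fully unfreezing the ViT); the attribute path model.transformer.visual follows the public Qwen-VL code, but treat the exact names as assumptions:

```python
def unfreeze_visual_tower(model):
    # Make the whole visual encoder trainable so the larger input
    # resolution can actually be adapted to, not just the LLM side.
    if hasattr(model, "transformer") and hasattr(model.transformer, "visual"):
        model.transformer.visual.requires_grad_(True)
    # Sanity check: report how many parameters are now trainable.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params: {trainable / 1e6:.1f}M")
```

The other half of the comment still applies: with only ~3.5k samples, unfreezing the ViT without more data may simply trade one failure mode for another.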

@chuangzhidan

chuangzhidan commented Apr 8, 2024

[screenshot of the recent commits] Hi @ZhihuaGao, these updates are specifically for mixed training.

A question for the Qwen-VL team: I only changed img_size, raising the resolution to a bit more than double, and after finetuning on about 3.5k samples the hallucinations increased noticeably. Is changing only that config setting a problem?

In the technical report, stage 2 and stage 3 are both trained at 448*448. If you only enlarge the resolution, I think you need considerably more data and have to fully unfreeze the ViT parameters for training.

I'm trying that, but I hit an error.
self._dummy_overflow_buf = get_accelerator().IntTensor([0])
Traceback (most recent call last):
File "finetune.py", line 367, in
train()
File "finetune.py", line 360, in train
trainer.train()
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1675, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1255, in prepare
result = self._prepare_deepspeed(*args)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/init.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 304, in init
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1234, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1497, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in init
self.initialize_optimizer_states()
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 647, in initialize_optimizer_states
self.optimizer.step()
File "/root/miniconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
File "/root/miniconda3/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in call
return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
what():  CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb49b474617 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb49b42f98d in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb49b530128 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x16e76 (0x7fb49b4f8e76 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x19bad (0x7fb49b4fbbad in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x19fcd (0x7fb49b4fbfcd in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x510d36 (0x7fb4de26dd36 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x55ca7 (0x7fb49b459ca7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x1e3 (0x7fb49b451cb3 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb49b451e49 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #10: + 0x7c18f8 (0x7fb4de51e8f8 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fb4de51eca5 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x1586a7 (0x557bd830d6a7 in /root/miniconda3/bin/python)
frame #13: _PyModule_ClearDict + 0x714 (0x557bd8363364 in /root/miniconda3/bin/python)
frame #14: PyImport_Cleanup + 0x537 (0x557bd838af47 in /root/miniconda3/bin/python)
frame #15: Py_FinalizeEx + 0x79 (0x557bd83bca49 in /root/miniconda3/bin/python)
frame #16: Py_RunMain + 0x183 (0x557bd83be893 in /root/miniconda3/bin/python)
frame #17: Py_BytesMain + 0x39 (0x557bd83beca9 in /root/miniconda3/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7fb51cd12083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: + 0x1e21c7 (0x557bd83971c7 in /root/miniconda3/bin/python)
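Not a fix for the illegal memory access itself, but a hedged debugging aid: with asynchronous CUDA launches the Python traceback often points at a later call (here the fused Adam step) rather than the kernel that actually faulted. Forcing synchronous launches before any CUDA work starts usually localizes it:

```python
import os

# Must be set before the first CUDA call (e.g. at the very top of finetune.py,
# or exported in the launch command) so kernels run synchronously and the
# traceback points at the operation that actually triggered the fault.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```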

@HalcyonLiang

Is your setup full-parameter finetuning with ZeRO-3? @ZhihuaGao

Has the mixed-training issue been solved?

@liuyijiang1994

Is your setup full-parameter finetuning with ZeRO-3? @ZhihuaGao

Has the mixed-training issue been solved?

Has it been solved? Any leads?

@liuyijiang1994

No, it doesn't hang for me. I updated and it works fine. Sent from my iPhone

Are you using ZeRO-3?
