Pretrained for 2.5 epochs, then force-stopped; during task fine-tuning I get: NameError: name '加完班回到家窝在沙发里' is not defined #20

Closed
ipfgao opened this issue Sep 14, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

ipfgao commented Sep 14, 2024

# Pretraining command
deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
# Fine-tuning command
torchrun --nproc_per_node 2 3-full_sft.py

The error during fine-tuning is as follows:

[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] 
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] *****************************************
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-14 13:31:27,107] torch.distributed.run: [WARNING] *****************************************
LLM总参数量:26.878 百万
Epoch:[0/19](0/24681) loss:8.882 lr:0.00020000 epoch_Time:351.0min:
Epoch:[0/19](100/24681) loss:5.368 lr:0.00020000 epoch_Time:82.0min:
Traceback (most recent call last):
  File "/data/minimind/3-full_sft.py", line 212, in <module>
    train_epoch(epoch)
  File "/data/minimind/3-full_sft.py", line 48, in train_epoch
    for step, (X, Y, loss_mask) in enumerate(train_loader):
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
NameError: Caught NameError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/minimind/model/dataset.py", line 74, in __getitem__
    history = eval(sample['history'])
  File "<string>", line 1, in <module>
NameError: name '加完班回到家窝在沙发里' is not defined

[2024-09-14 13:32:12,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 157780 closing signal SIGTERM
[2024-09-14 13:32:12,221] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 157779) of binary: /home/nlp/anaconda3/envs/minimind/bin/python
Traceback (most recent call last):
  File "/home/nlp/anaconda3/envs/minimind/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
3-full_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-14_13:32:12
  host      : nlp-Z790-UD
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 157779)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

So it won't let me curl up on the sofa after getting home from overtime?
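
For reference, a minimal sketch (not code from the repo) of how this NameError arises: dataset.py calls eval() on the 'history' column, so a cell that holds a bare, unquoted sentence gets parsed as a variable name and looked up:

```python
# Minimal sketch: eval() on plain text treats it as a name lookup,
# which matches the NameError in the worker traceback above.
bad_history_cell = "加完班回到家窝在沙发里"  # a CSV cell holding plain text, not a Python literal
try:
    eval(bad_history_cell)                   # same shape as eval(sample['history'])
except NameError as err:
    print(err)                               # name '加完班回到家窝在沙发里' is not defined
```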

jingyaogong (Owner) commented:

https://github.com/jingyaogong/minimind/blob/master/model/dataset.py

    def safe_eval(self, s):
        # Return the evaluated 'history' field, or an empty list if the
        # cell cannot be evaluated (e.g. a bare, unquoted string).
        try:
            res = eval(s)
        except Exception:
            return []
        return res

Use the new code to skip the occasional anomalous sample like this, with its so-called "invalid Python identifier" string.
Funny one, haha
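
A stricter variant (just a sketch, not what the repo currently ships) would be ast.literal_eval, which only parses Python literals and never resolves names, so a plain-text cell still falls through to the empty-list fallback:

```python
import ast

def safe_eval(self, s):
    # Parse only Python literals (lists, strings, numbers, ...).
    # A bare sentence like '加完班回到家窝在沙发里' raises ValueError/SyntaxError
    # here instead of NameError, and the sample is skipped.
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []
```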

ipfgao commented Sep 14, 2024

Still getting an error:

[2024-09-14 15:06:04,513] torch.distributed.run: [WARNING] 
[2024-09-14 15:06:04,513] torch.distributed.run: [WARNING] *****************************************
[2024-09-14 15:06:04,513] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-09-14 15:06:04,513] torch.distributed.run: [WARNING] *****************************************
LLM总参数量:26.878 百万
Epoch:[0/19](0/24681) loss:8.875 lr:0.00020000 epoch_Time:372.0min:
Traceback (most recent call last):
  File "/data/minimind/3-full_sft.py", line 212, in <module>
    train_epoch(epoch)
  File "/data/minimind/3-full_sft.py", line 48, in train_epoch
    for step, (X, Y, loss_mask) in enumerate(train_loader):
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 7.
Original Traceback (most recent call last):
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/minimind/model/dataset.py", line 101, in __getitem__
    new_prompt = self.tokenizer.apply_chat_template(
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1844, in apply_chat_template
    rendered_chat = compiled_template.render(
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
TypeError: can only concatenate str (not "float") to str

[2024-09-14 15:06:29,551] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 213248 closing signal SIGTERM
[2024-09-14 15:06:29,615] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 213249) of binary: /home/nlp/anaconda3/envs/minimind/bin/python
Traceback (most recent call last):
  File "/home/nlp/anaconda3/envs/minimind/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/nlp/anaconda3/envs/minimind/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
3-full_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-14_15:06:29
  host      : nlp-Z790-UD
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 213249)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
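
This second failure looks like a data issue rather than a code issue: pandas stores an empty CSV cell as NaN, which is a float, so a missing q/a value reaches the Jinja chat template as a float and the template's string concatenation fails. A minimal sketch with made-up data (not the actual sft_data_single.csv):

```python
import io
import pandas as pd

# One row has an empty 'a' cell; pandas loads it as NaN, whose type is float.
df = pd.read_csv(io.StringIO("q,a\n你好,你好!\n第二个问题,\n"))
a = df.iloc[1]["a"]
print(type(a))            # <class 'float'>
try:
    "assistant: " + a     # the same kind of concatenation the chat template performs
except TypeError as err:
    print(err)            # can only concatenate str (not "float") to str
```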

jingyaogong (Owner) commented:

Could you share your sft_data.csv file? Otherwise I may not be able to reproduce your result.

ipfgao commented Sep 14, 2024

> Could you share your sft_data.csv file? Otherwise I may not be able to reproduce your result.

I downloaded it from 百度网盘 (Baidu Netdisk) yesterday; it's the one linked in the README:
百度网盘
I'm using sft_data_single.csv.
I modified model/dataset.py to force-convert the fields to strings, and now it runs fine.

    def __getitem__(self, index: int):
        # Rebuild the chat history for this sample
        sample = self.df.iloc[index]
        history = self.safe_eval(sample['history'])
        q = sample['q']
        a = sample['a']

        messages = []
        for history_message in history:
            if len(history_message) <= 1:
                continue
            # Ensure content is a string
            messages.append(
                {"role": 'user', "content": str(history_message[0])[:self.max_length // 2]}
            )
            messages.append(
                {"role": 'assistant', "content": str(history_message[1])[:self.max_length // 2]}
            )

        # Ensure q and a are strings
        messages += [
            {"role": "user", "content": str(q)},
            {"role": "assistant", "content": str(a)},
        ]
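
As an alternative to per-field str() casts, the NaN cells could also be cleaned once when the CSV is loaded (a sketch, assuming the dataset class reads the file with pandas.read_csv; the path below is illustrative):

```python
import pandas as pd

# Hypothetical load-time cleanup: turn NaN cells into empty strings so every
# field handed to apply_chat_template is already a str.
df = pd.read_csv("sft_data_single.csv")
df = df.fillna("")
```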

jingyaogong (Owner) commented:

> I downloaded it from 百度网盘 (Baidu Netdisk) yesterday; it's the one linked in the README. I'm using sft_data_single.csv. I modified model/dataset.py to force-convert the fields to strings, and now it runs fine.

OK. Odd that it runs fine in my tests here, but I'll add the str() cast as well for now.

Thanks!
