
DataLoader support dict str #31481

Merged

Conversation

Contributor

@heavengate heavengate commented Mar 8, 2021

PR types

Function optimization

PR changes

APIs

Describe

DataLoader optimization

  • support data formats: dict, list, and str (see the usage sketch after this list)

  • log ERROR info when shared memory is insufficient

  • refine the blocking queue kill ENFORCE check

  • re-raise worker exceptions in the main process

  • add a CPU place guard for collate in workers to ensure tensor operations run on CPU
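
For illustration, here is a minimal usage sketch of the dict-format support described above, assuming the standard paddle.io.Dataset/DataLoader API; the RandomDictDataset below is a made-up example, not code from this PR:

```python
import numpy as np
from paddle.io import Dataset, DataLoader

class RandomDictDataset(Dataset):
    """Hypothetical dataset whose samples are dicts (the format added by this PR)."""
    def __init__(self, num_samples=16):
        self.num_samples = num_samples

    def __getitem__(self, idx):
        image = np.random.random([3, 32, 32]).astype('float32')
        label = np.random.randint(0, 10)
        # dict-format sample; list and str fields are supported as well
        return {'image': image, 'label': label}

    def __len__(self):
        return self.num_samples

if __name__ == '__main__':
    loader = DataLoader(RandomDictDataset(), batch_size=4, num_workers=2)
    for batch in loader:
        # each batch keeps the dict structure, with fields batched per key
        print(batch['image'].shape, batch['label'].shape)
```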

No effect on speed:

| Model       | batch_size | develop           | This PR           |
|-------------|------------|-------------------|-------------------|
| ResNet50    | 1*128      | 343.19 samples/s  | 343.73 samples/s  |
| ResNet50    | 8*128      | 2456.51 samples/s | 2462.64 samples/s |
| MobileNetV1 | 1*128      | 1045.33 samples/s | 1043.87 samples/s |
| MobileNetV1 | 8*128      | 3227.82 samples/s | 3225.13 samples/s |

TODO:

  • remove ENFORCE check in blocking queue Receive
  • refine CPU tensor pipeline
  • enhance the main process check when SIGBUS kills a sub-process

@paddle-bot-old

paddle-bot-old bot commented Mar 8, 2021

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@@ -0,0 +1,87 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Contributor

2021 for new file

Contributor Author

Done, thanks!

structure.append('{}{}'.format(FIELD_PREFIX, field_idx))
flat_batch.append(field.numpy())
field_idx += 1
elif isinstance(field, (str, bytes, numbers.Number, np.number)):
Contributor

what is the difference between numbers.Number and np.number

Contributor Author

Done, thanks!
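
For readers following along, here is a quick plain-Python/NumPy illustration of the distinction the reviewer is asking about (not code from the PR): numbers.Number is the abstract base class at the top of Python's numeric tower, while np.number only matches NumPy scalar types.

```python
import numbers
import numpy as np

print(isinstance(3, numbers.Number))            # True:  builtin int is part of Python's numeric tower
print(isinstance(3, np.number))                 # False: builtin int is not a NumPy scalar
print(isinstance(np.float32(3.0), np.number))   # True:  NumPy scalar types subclass np.number
# On recent NumPy versions, NumPy scalars are also registered with the numbers ABCs,
# so checking both types is mostly a belt-and-braces measure:
print(isinstance(np.float32(3.0), numbers.Number))
```

Checking both keeps the branch robust regardless of whether a dataset field is a builtin or a NumPy scalar.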

@heavengate heavengate force-pushed the dataloader_supprot_dict_str branch from 7eab9e2 to 4be15f0 on March 10, 2021 16:34
"DataLoader workers.\n");
REGISTER_SIGNAL_HANDLER(
SIGBUS, SIGBUS_handler,
"ERROR: Unexpected BUS error encountered in DataLoader worker. "
Contributor

Would it be convenient to attach a sample of this kind of error output in a comment? I'd like to see the format.

Contributor Author

Done, thanks!

self.exc_msg = "".join(traceback.format_exception(*exc_info))

def reraise(self):
msg = "DataLoader worker({}) caught {} with message:\n{}".format(
Contributor

I'd also like to see an example of this improved error message.

Contributor Author

Done, thanks!
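
As context for the snippet above, here is a generic sketch of the capture-and-re-raise pattern (plain multiprocessing, not the PR's actual worker code; WorkerError and _worker_loop are illustrative names): the worker formats its traceback as a string, since traceback objects cannot be pickled, and the main process raises a new error that embeds the worker id and the original traceback.

```python
import sys
import traceback
import multiprocessing as mp

class WorkerError:
    """Illustrative carrier for an exception caught in a worker process."""
    def __init__(self, worker_id, exc_type_name, exc_msg):
        self.worker_id = worker_id
        self.exc_type_name = exc_type_name
        self.exc_msg = exc_msg

def _worker_loop(worker_id, queue):
    try:
        raise ValueError("bad sample")  # stand-in for a failing __getitem__
    except Exception:
        exc_info = sys.exc_info()
        # format the traceback inside the worker; traceback objects don't pickle
        msg = "".join(traceback.format_exception(*exc_info))
        queue.put(WorkerError(worker_id, exc_info[0].__name__, msg))

if __name__ == '__main__':
    queue = mp.Queue()
    proc = mp.Process(target=_worker_loop, args=(0, queue))
    proc.start()
    result = queue.get()
    proc.join()
    if isinstance(result, WorkerError):
        # re-raise in the main process with the worker id and original traceback attached
        raise RuntimeError("DataLoader worker({}) caught {} with message:\n{}".format(
            result.worker_id, result.exc_type_name, result.exc_msg))
```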

chenwhql previously approved these changes Mar 12, 2021

Contributor

@chenwhql chenwhql left a comment

Approving for now, but judging from these two error formats, users are still likely to be confused, DataLoader issues will probably keep coming in, and Kaipeng will likely keep having his daily work interrupted by all kinds of questions, so I hope this can be further polished in follow-up work. The main issues are:

  1. When users see the error there is no red highlight box, so they will most likely miss the key point.
  2. For the first type of error, we previously wrote a very detailed error message that, read through to the end, lets users solve the problem; we can revisit this later.
  3. The blocking queue error message is probably of no help for user debugging, so I suggest removing it; if necessary, the existing unit tests can be modified as well.

@heavengate
Contributor Author

Approving for now, but judging from these two error formats, users are still likely to be confused, DataLoader issues will probably keep coming in, and Kaipeng will likely keep having his daily work interrupted by all kinds of questions, so I hope this can be further polished in follow-up work. The main issues are:

  1. When users see the error there is no red highlight box, so they will most likely miss the key point.
  2. For the first type of error, we previously wrote a very detailed error message that, read through to the end, lets users solve the problem; we can revisit this later.
  3. The blocking queue error message is probably of no help for user debugging, so I suggest removing it; if necessary, the existing unit tests can be modified as well.

Yes. The EnforceNotKilled check in the blocking queue's Receive is currently relied on by the legacy PyReader cases in the test_multiprocess_reader unit test, so it cannot be removed yet. It will be further optimized in the next PR together with the CPU tensor pipeline adjustment; after that, the capture and handling of the SIGBUS signal should also improve. This will keep being refined alongside the follow-up work.

@heavengate heavengate requested a review from TCChenlong March 12, 2021 04:48

def default_collate_fn(batch):
"""
Default batch collating function for :code:`fluid.io.DataLoader`,
Contributor

paddle.io.DataLoader

Contributor Author

Done, thanks!
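
To make the discussion concrete, here is a rough NumPy-only sketch of what a collate function of this kind does (an illustration of the idea only, not Paddle's actual default_collate_fn): it recurses into dict and list samples and stacks the leaf fields across the batch.

```python
import numbers
import numpy as np

def collate_sketch(batch):
    """Stack a list of samples field by field, preserving dict/list structure (illustrative)."""
    sample = batch[0]
    if isinstance(sample, np.ndarray):
        return np.stack(batch, axis=0)
    elif isinstance(sample, (numbers.Number, np.number)):
        return np.array(batch)
    elif isinstance(sample, (str, bytes)):
        return list(batch)  # string fields are passed through untouched
    elif isinstance(sample, dict):
        # collate each key across the batch, keeping the dict structure
        return {key: collate_sketch([s[key] for s in batch]) for key in sample}
    elif isinstance(sample, (list, tuple)):
        # transpose the batch: list of samples -> list of batched fields
        return [collate_sketch(fields) for fields in zip(*batch)]
    raise TypeError("unsupported field type: {}".format(type(sample)))

# Example:
# collate_sketch([{'image': np.zeros([3, 4]), 'label': 1},
#                 {'image': np.ones([3, 4]), 'label': 0}])
# -> {'image': ndarray of shape (2, 3, 4), 'label': array([1, 0])}
```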

WorkerInfo: an instance of WorkerInfo which contains fields above.

.. note::
For mode usage and exampls, please see :code:`paddle.io.IterableDataset`
Contributor

For mode usage and exampls -> For more usage and examples

Contributor Author

Done, thanks!

Contributor

@TCChenlong TCChenlong left a comment

LGTM

Contributor

@qingqing01 qingqing01 left a comment

Need to test whether flatten_batch affects the original speed or not.

@heavengate
Contributor Author

Need to test whether flatten_batch affects the original speed or not.

The effect on original models is tested above; this PR has no effect on them (original model data are all in list format):

| Model       | batch_size | develop           | This PR           |
|-------------|------------|-------------------|-------------------|
| ResNet50    | 1*128      | 343.19 samples/s  | 343.73 samples/s  |
| ResNet50    | 8*128      | 2456.51 samples/s | 2462.64 samples/s |
| MobileNetV1 | 1*128      | 1045.33 samples/s | 1043.87 samples/s |
| MobileNetV1 | 8*128      | 3227.82 samples/s | 3225.13 samples/s |

After changing the PaddleClas data format to dict as shown below, this PR also has no effect on models with dict-format data; the speed test results are as follows:

{'image': transform(img, self.ops), 'label': int(label)}

| Model       | batch_size | develop           | This PR           |
|-------------|------------|-------------------|-------------------|
| ResNet50    | 1*128      | 343.19 samples/s  | 343.45 samples/s  |
| ResNet50    | 8*128      | 2456.51 samples/s | 2459.33 samples/s |
| MobileNetV1 | 1*128      | 1045.33 samples/s | 1044.12 samples/s |
| MobileNetV1 | 8*128      | 3227.82 samples/s | 3225.76 samples/s |
