Describe the Bug
Precondition:
export FLAGS_cache_inference_while_scope=true
Run the test script:
card_num=8
for((i=0;i<$card_num;i++))
do
echo "start card $i"
numactl --cpunodebind=$i --membind $i python -m paddle.distributed.launch --log_dir=$log_dir --master 127.0.0.1:49123 --rank $i --devices $i --nnodes $card_num xxx/PaddleNLP/model_zoo/gpt-3/projects/gpt/benchmark.py --seq_len 128 --iter 1 --mp_degree $card_num --model_dir xxx/models/gpt-13b-mp8 &> multi_$i.log &
done
echo "=====RUNNING====="
Result:
Inference fails with an error and the program terminates abnormally.
W0829 05:36:41.933918 170078 operator.cc:804] while raises an exception std::bad_alloc, std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Call stack:
C++ Traceback (most recent call last):
0 paddle::distributed::TaskLoopThread::Loop()
1 paddle::distributed::TaskLoop::Loop()
2 paddle::distributed::Interceptor::LoopOnce()
3 paddle::distributed::Interceptor::Handle(paddle::distributed::InterceptorMessage const&)
4 paddle::distributed::ComputeInterceptor::Compute(paddle::distributed::InterceptorMessage const&)
5 paddle::distributed::ComputeInterceptor::Run()
6 paddle::distributed::ComputeInterceptor::RunOps()
7 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
8 paddle::operators::WhileOp::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
9 paddle::framework::InterpreterCore::Run(std::vector<std::string, std::allocator<std::string > > const&, bool)
10 paddle::framework::ProgramInterpreter::Run(std::vector<std::string, std::allocator<std::string > > const&, bool)
11 paddle::framework::ProgramInterpreter::RunImpl()
12 paddle::framework::ProgramInterpreter::ExecuteInstructionList(std::vector<paddle::framework::Instruction, std::allocator<paddle::framework::Instruction> > const&)
13 paddle::framework::ProgramInterpreter::RunInstructionAsync(unsigned long)
14 paddle::framework::ProgramInterpreter::RunInstruction(paddle::framework::Instruction const&)
15 paddle::framework::ProgramInterpreter::RunOperator(paddle::framework::Instruction const&)
16 std::_Function_handler<void (paddle::framework::InferShapeContext*), paddle::framework::details::OpInfoFiller<MemcpyD2HInferShapeFunctor, (paddle::framework::details::OpInfoFillType)4>::operator()(char const*, paddle::framework::OpInfo*) const::{lambda(paddle::framework::InferShapeContext*)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::InferShapeContext*&&)
17 MemcpyD2HInferShapeFunctor::operator()(paddle::framework::InferShapeContext*) const
18 paddle::framework::CompatMetaTensor::share_meta(phi::MetaTensor const&)
19 paddle::framework::CompatMetaTensor::share_dims(phi::MetaTensor const&)
20 paddle::framework::CompatMetaTensor::set_dims(phi::DDim const&)
Additional Supplementary Information
With GLOG_v=10, inference cannot proceed; progress stalls at this log line:
LAUNCH INFO 2023-08-27 02:29:02,321 Waiting peer start...
With GLOG_v=4, the error call stack changes. This is likely because log printing accesses the memory-corrupted var earlier.
C++ Traceback (most recent call last):
0 paddle::distributed::TaskLoopThread::Loop()
1 paddle::distributed::TaskLoop::Loop()
2 paddle::distributed::Interceptor::LoopOnce()
3 paddle::distributed::Interceptor::Handle(paddle::distributed::InterceptorMessage const&)
4 paddle::distributed::ComputeInterceptor::Compute(paddle::distributed::InterceptorMessage const&)
5 paddle::distributed::ComputeInterceptor::Run()
6 paddle::distributed::ComputeInterceptor::RunOps()
7 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
8 paddle::operators::WhileOp::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
9 paddle::framework::InterpreterCore::Run(std::vector<std::string, std::allocator<std::string > > const&, bool)
10 paddle::framework::ProgramInterpreter::Run(std::vector<std::string, std::allocator<std::string > > const&, bool)
11 paddle::framework::ProgramInterpreter::RunImpl()
12 paddle::framework::ProgramInterpreter::ExecuteInstructionList(std::vector<paddle::framework::Instruction, std::allocator<paddle::framework::Instruction> > const&)
13 paddle::framework::ProgramInterpreter::RunInstructionAsync(unsigned long)
14 paddle::framework::ProgramInterpreter::RunInstruction(paddle::framework::Instruction const&)
15 paddle::framework::ProgramInterpreter::RunOperator(paddle::framework::Instruction const&)
16 paddle::framework::OperatorBase::DebugStringEx[abi:cxx11](paddle::framework::Scope const*) const
17 paddle::framework::Scope::FindVar(std::string const&) const
18 paddle::framework::Scope::FindVarInternal(std::string const&) const
19 paddle::framework::Scope::FindVarLocally(std::string const&) const
20 std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<paddle::framework::Variable, std::default_delete<paddle::framework::Variable> > >, std::allocator<std::pair<std::string const, std::unique_ptr<paddle::framework::Variable, std::default_delete<paddle::framework::Variable> > > >, std::__detail::_Select1st, std::equal_to<std::string >, paddle::framework::Scope::KeyHasher, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
Partial log:
LAUNCH INFO 2023-08-29 04:34:13,247 ------------------------- ERROR LOG DETAIL -------------------------
LAUNCH INFO 2023-08-29 04:34:13,247 Exit code -11
I0829 04:34:13.248173 46079 imperative.cc:2207] Tracer(0x2bf49630) set expected place Place(npu:0)
I0829 04:34:13.248396 46079 mmap_allocator.cc:321] PID: 46079, MemoryMapFdSet: set size - 0
I0829 04:34:13.255030 46079 mmap_allocator.cc:321] PID: 46079, MemoryMapFdSet: set size - 0
I0829 04:33:18.297309 46252 graph_helper.h:104] adj matmul_v20x3c743000 -> scale0x3c743dc0 via matmul_v2_81.tmp_00x3c743ac0
I0829 04:33:18.297315 46252 graph_helper.h:104] adj split0x3c605110 -> concat0x3c609590 via split_33.tmp_10x3c6091d0
I0829 04:33:18.297322 46252 graph_helper.h:104] adj split0x3c7b9ef0 -> concat0x3c7c1830 via split_43.tmp_20x3c7be300
I0829 04:33:18.297328 46252 graph_helper.h:104] adj matmul_v20x3c66e020 -> c_allreduce_sum0x3c66f100 via c_allreduce_sum.tmp_700x3c66ed20
I0829 04:33:18.297334 46252 graph_helper.h:104] adj dropout0x3c4b5d80 -> matmul_v20x3c4b6c80 via dropout_77.tmp_00x3c4b6900
I0829 04:33:18.297339 46252 graph_helper.h:104] adj transpose20x3c4b1210 -> matmul_v20x3c4b6c80 via transpose_102.tmp_00x3c4b1b90
I0829 04:33:18.297345 46252 graph_helper.h:104] adj transpose20x3c4db5c0 -> scale0x3c4dd8e0 via transpose_104.tmp_00x3c4dbe60
I0829 04:33:18.297356 46252 graph_helper.h:104] adj elementwise_add0x3c541810 -> elementwise_add0x3c54ded0 via tmp_1020x3c542410
I0829 04:33:18.297361 46252 graph_helper.h:104] adj dropout0x3c54cfd0 -> elementwise_add0x3c54ded0 via dropout_88.tmp_00x3c54db50
I0829 04:33:18.297369 46252 graph_helper.h:104] adj elementwise_add0x3c5986b0 -> layer_norm0x3c599230 via tmp_1080x3c598fa0
I0829 04:33:18.297374 46252 graph_helper.h:104] adj transpose20x3c6ba880 -> scale0x3c6bcbd0 via transpose_148.tmp_00x3c6bb100
I0829 04:33:18.297380 46252 graph_helper.h:104] adj concat0x3c55d9b0 -> transpose20x3c55f8e0 via concat_12.tmp_00x3c55df40
I0829 04:33:18.297387 46252 graph_helper.h:104] adj elementwise_add0x3c577af0 -> dropout0x3c578760 via linear_119.tmp_10x3c5783c0
I0829 04:33:18.297394 46252 graph_helper.h:104] adj split0x3c7e5490 -> transpose20x3c7ed5e0 via split_44.tmp_00x3c7e9500
I0829 04:33:18.297399 46252 graph_helper.h:104] adj split0x3c4d3470 -> concat0x3c4d7a70 via split_26.tmp_10x3c4d76b0
I0829 04:33:18.297406 46252 graph_helper.h:104] adj matmul_v20x3c509b40 -> scale0x3c50a900 via matmul_v2_55.tmp_00x3c50a600
I0829 04:33:18.297412 46252 graph_helper.h:104] adj c_identity0x3c477220 -> matmul_v20x3c478f60 via c_identity.tmp_490x3c478cb0
I0829 04:33:18.297420 46252 graph_helper.h:104] adj elementwise_add0x3c5539b0 -> reshape20x3c554d20 via linear_116.tmp_10x3c554940
I0829 04:33:18.297425 46252 graph_helper.h:104] adj c_allreduce_sum0x3c470490 -> elementwise_add0x3c473300 via embedding_2.tmp_00x3c471db0
I0829 04:33:18.297430 46252 graph_helper.h:104] adj lookup_table_v20x3c472030 -> elementwise_add0x3c473300 via embedding_3.tmp_00x3c472ef0
I0829 04:33:18.297436 46252 graph_helper.h:104] adj elementwise_add0x3c51bc10 -> gelu0x3c51cf80 via linear_110.tmp_10x3c51cba0
I0829 04:33:18.297442 46252 graph_helper.h:104] adj scale0x3c81aee0 -> matmul_v20x3c81bbd0 via scale_90.tmp_00x3c81b940
I0829 04:33:18.297447 46252 graph_helper.h:104] adj transpose20x3c8196c0 -> matmul_v20x3c81bbd0 via transpose_181.tmp_00x3c81a020
I0829 04:33:18.297453 46252 graph_helper.h:104] adj softmax0x3c58f110 -> dropout0x3c58fe90 via softmax_33.tmp_00x3c58fb90
I0829 04:33:18.297461 46252 graph_helper.h:104] adj elementwise_add0x3c693d60 -> softmax0x3c694c80 via tmp_1250x3c694a10
I0829 04:33:18.297466 46252 graph_helper.h:104] adj elementwise_add0x3c4bec60 -> elementwise_add0x3c4cb320 via tmp_930x3c4bf860
I0829 04:33:18.297472 46252 graph_helper.h:104] adj dropout0x3c4ca420 -> elementwise_add0x3c4cb320 via dropout_79.tmp_00x3c4cafa0
I0829 04:33:18.297478 46252 graph_helper.h:104] adj c_identity0x3c6ad690 -> matmul_v20x3c6af410 via c_identity.tmp_750x3c6af190
I0829 04:33:18.297484 46252 graph_helper.h:104] adj elementwise_add0x3c5a2d50 -> dropout0x3c5a3b30 via linear_123.tmp_10x3c5a37d0
I0829 04:33:18.297482 46260 graph_pattern_detector.cc:125] layer_norm_0 can't find matched Node, early stop
I0829 04:33:18.297490 46252 graph_helper.h:104] adj elementwise_add0x3c4efed0 -> gelu0x3c4f1240 via linear_106.tmp_10x3c4f0e60
I0829 04:33:18.297500 46252 graph_helper.h:104] adj layer_norm0x3c475180 -> c_identity0x3c477220 via layer_norm_49.tmp_20x3c476e40
I0829 04:33:18.297506 46252 graph_helper.h:104] adj scale0x3c4b1e10 -> matmul_v20x3c4b25f0 via scale_50.tmp_00x3c4b23e0
I0829 04:33:18.297513 46252 graph_helper.h:104] adj transpose20x3c4b0610 -> matmul_v20x3c4b25f0 via transpose_101.tmp_00x3c4b0f90
[TimeInfo: *** Aborted at 1693254841 (unix time) try "date -d @1693254841" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0xb4b4) received by PID 46260 (TID 0xfffbf97fa170) from PID 46260 ***]