Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error "iter != spillPartitionSet_.end()" from probe side spilling #9325

Closed
zhztheplayer opened this issue Apr 1, 2024 · 4 comments
Closed
Assignees
Labels
bug Something isn't working triage Newly created issue that needs attention.

Comments

@zhztheplayer
Copy link
Contributor

zhztheplayer commented Apr 1, 2024

Bug description

The following error is reported using Velox's commit 70c9f6c

2024-04-01T07:03:21.9003801Z Executing SQL query from resource path /tpcds-queries/q67.sql...
2024-04-01T07:03:24.3875546Z W20240401 07:03:24.386922  8768 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:24.6263103Z W20240401 07:03:24.625738  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 240.75KB, node pool usage: 20.00MB
2024-04-01T07:03:27.1143938Z W20240401 07:03:27.113842  8768 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 3.00MB
2024-04-01T07:03:27.2642109Z W20240401 07:03:27.263653  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.38KB, node pool usage: 21.00MB
2024-04-01T07:03:31.0730661Z W20240401 07:03:31.072510  8773 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:31.4003666Z W20240401 07:03:31.399794  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 240.88KB, node pool usage: 20.00MB
2024-04-01T07:03:33.3298753Z W20240401 07:03:33.329300  8771 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:34.8949359Z W20240401 07:03:34.894352  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 992.75KB, node pool usage: 21.00MB
2024-04-01T07:03:36.0343120Z W20240401 07:03:36.033732  8768 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 992.75KB, node pool usage: 21.00MB
2024-04-01T07:03:36.4535174Z W20240401 07:03:36.452941  8764 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:38.1457062Z W20240401 07:03:38.145025  8770 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:38.6754924Z W20240401 07:03:38.674748  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[RUNNING], nonReclaimableSection_[0], op.10.0.0.HashProbe, usage: 241.00KB, node pool usage: 20.00MB
2024-04-01T07:03:42.1649884Z W20240401 07:03:42.164289  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:42.7976060Z W20240401 07:03:42.796917  8621 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 6.81MB
2024-04-01T07:03:43.5208389Z W20240401 07:03:43.520148  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:43.8806747Z W20240401 07:03:43.879967  8766 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 476.00KB
2024-04-01T07:03:45.0089599Z W20240401 07:03:45.008244  8766 HashBuild.cpp:1030] Can't reclaim from hash build operator, state_[WAIT_FOR_PROBE], nonReclaimableSection_[0], spiller_[cleared] op.10.3.0.HashBuild, usage: 3.00MB
2024-04-01T07:03:45.2284630Z W20240401 07:03:45.227731  8766 HashProbe.cpp:1570] Can't reclaim from hash probe operator, state_[FINISH], nonReclaimableSection_[0], op.6.0.0.HashProbe, usage: 944.50KB, node pool usage: 21.00MB
2024-04-01T07:03:45.3797499Z E20240401 07:03:45.378913  8766 Exceptions.h:69] Line: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/HashProbe.cpp:277, Function:maybeSetupSpillInputReader, Expression: iter != spillPartitionSet_.end() , Source: RUNTIME, ErrorCode: INVALID_STATE
2024-04-01T07:03:45.4701984Z 24/04/01 07:03:45 ERROR TaskResources: Task 48 failed by error: 
2024-04-01T07:03:45.4703353Z org.apache.gluten.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError
2024-04-01T07:03:45.4704939Z Error Source: RUNTIME
2024-04-01T07:03:45.4705549Z Error Code: INVALID_STATE
2024-04-01T07:03:45.4706104Z Retriable: False
2024-04-01T07:03:45.4706630Z Expression: iter != spillPartitionSet_.end()
2024-04-01T07:03:45.4707290Z Function: maybeSetupSpillInputReader
2024-04-01T07:03:45.4708655Z File: /__w/incubator-gluten/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/HashProbe.cpp
2024-04-01T07:03:45.4709643Z Line: 277
2024-04-01T07:03:45.4709980Z Stack trace:
2024-04-01T07:03:45.4710459Z # 0  _ZN8facebook5velox7process10StackTraceC1Ei
2024-04-01T07:03:45.4711627Z # 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
2024-04-01T07:03:45.4713439Z # 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorENS1_22CompileTimeEmptyStringEEEvRKNS1_18VeloxCheckFailArgsET0_
2024-04-01T07:03:45.4715426Z # 3  _ZN8facebook5velox4exec9HashProbe26maybeSetupSpillInputReaderERKSt8optionalINS1_16SpillPartitionIdEE
2024-04-01T07:03:45.4716807Z # 4  _ZN8facebook5velox4exec9HashProbe21asyncWaitForHashTableEv
2024-04-01T07:03:45.4717764Z # 5  _ZN8facebook5velox4exec9HashProbe17getOutputInternalEb
2024-04-01T07:03:45.4718595Z # 6  _ZN8facebook5velox4exec9HashProbe9getOutputEv
2024-04-01T07:03:45.4719774Z # 7  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
2024-04-01T07:03:45.4721212Z # 8  _ZN8facebook5velox4exec6Driver4nextERSt10shared_ptrINS1_13BlockingStateEE
2024-04-01T07:03:45.4722319Z # 9  _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
2024-04-01T07:03:45.4723273Z # 10 _ZN6gluten24WholeStageResultIterator4nextEv

Perhaps it's related to probe-side spilling support? Since the backtrace includes the call to function maybeSetupSpillInputReader which is added by the probe-side spilling PR.

@xiaoxmeng If you can get some information from the backtrace. Thanks.

System information

Velox System Info v0.0.2
Commit: eb2077d
CMake Version: 3.16.3
System: Linux-5.4.0-156-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Full log at https://github.com/apache/incubator-gluten/actions/runs/8504258236/job/23292003386?pr=5224 (Gluten)

@zhztheplayer zhztheplayer added bug Something isn't working triage Newly created issue that needs attention. labels Apr 1, 2024
@xiaoxmeng xiaoxmeng self-assigned this Apr 1, 2024
@zhztheplayer
Copy link
Contributor Author

zhztheplayer commented Apr 2, 2024

@amitkdutta
Copy link
Contributor

@xiaoxmeng We started observing this error in meta internal queries as well (e.g. 20240401_015934_01820_gbew7)

@zhztheplayer
Copy link
Contributor Author

Will be fixed by #9334

@xiaoxmeng
Copy link
Contributor

Verified by @zhztheplayer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Newly created issue that needs attention.
Projects
None yet
Development

No branches or pull requests

3 participants