HLT farm crash in run 379617 #44769
A new Issue was created by @jalimena. @Dr15Jones, @antoniovilela, @rappoccio, @makortel, @smuzaffar, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here.
assign hlt, heterogeneous
@cms-sw/tracking-pog-l2 @cms-sw/trk-dpg-l2 @AdrianoDee FYI
New categories assigned: hlt, heterogeneous. @Martin-Grunewald, @mmusich, @fwyzard, @makortel, you have been requested to review this Pull request/Issue and eventually sign. Thanks.
For the record, the same script as above was run on
From the stack trace, it looks to me like an exception happened while the system was handling another exception. When that happens, the C++ runtime aborts the job.
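For reference, a minimal standalone C++ illustration of that failure mode (not CMSSW code): when a second exception escapes while the stack is already unwinding from a first one, the runtime calls std::terminate(), which aborts the process.

#include <stdexcept>

// A destructor that throws; noexcept(false) is needed because destructors
// are implicitly noexcept since C++11.
struct Cleanup {
  ~Cleanup() noexcept(false) { throw std::runtime_error("error during cleanup"); }
};

int main() {
  try {
    Cleanup c;
    throw std::runtime_error("original error");  // starts stack unwinding
  } catch (std::exception const&) {
    // Never reached: the second exception, thrown from ~Cleanup() while the
    // stack is unwinding, makes the runtime call std::terminate() and abort.
  }
}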
The assertion referred to in the issue description is
Does this condition mean there are more hits than the allocated size of the output buffer? @AdrianoDee
assign reconstruction FYI @cms-sw/trk-dpg-l2
New categories assigned: reconstruction. @jfernan2, @mandrenguyen, you have been requested to review this Pull request/Issue and eventually sign. Thanks.
type trk
Sort of, but not entirely, Matti. The culprit is cmssw/RecoLocalTracker/SiPixelClusterizer/plugins/alpaka/SiPixelRawToClusterKernel.dev.cc, lines 509 to 513 in 6e82cc8,
where we cut the number of hits to "only":
and the event of the crash has:
The poor man's solution would be to raise the max to something safer, such as
So I have a proposal for a solution here that involves a slightly bigger code refactoring, in order to drop the fixed number of hits. That branch is for
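To make the pattern concrete, here is a minimal generic sketch (not the actual SiPixelRawToClusterKernel code; kMaxHits and storeHits are hypothetical names, and the real CMSSW maximum is different): an output buffer with a fixed compile-time capacity, an assert that fires when an event exceeds it, and the kind of silent cut that applies when assertions are disabled.

#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical compile-time capacity of the output buffer.
constexpr uint32_t kMaxHits = 48 * 1024;

// Copy the hits found in one event into a fixed-size output buffer.
// With assertions enabled, an event with more hits than the buffer capacity
// aborts the job; with NDEBUG, the excess hits are silently dropped.
uint32_t storeHits(std::vector<uint32_t> const& foundHits, uint32_t* outputBuffer) {
  uint32_t n = static_cast<uint32_t>(foundHits.size());
  assert(n <= kMaxHits && "more hits than the allocated output buffer");
  if (n > kMaxHits)
    n = kMaxHits;  // the "poor man's" cut when assertions are disabled
  for (uint32_t i = 0; i < n; ++i)
    outputBuffer[i] = foundHits[i];
  return n;
}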
proposed fixes:
I tested it as follows: I repacked all the error stream files in /eos/cms/store/group/tsg/FOG/debug/240417_run379617/, and then ran this script:
#!/bin/bash -ex
# CMSSW_14_0_5_patch2
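# dump the HLT menu used in run 379617 into hlt.py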
hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root > hlt.py
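# append job options: enable the trigger summary and run with a single thread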
cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF
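# run the menu once over each repacked error-stream file found on EOS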
for inputfile in $(eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$'); do
  outputfile="${inputfile%.root}"
  cp hlt.py hlt_toRun.py
  sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" hlt_toRun.py
  cmsRun hlt_toRun.py &> "${outputfile}.log"
done
I ran this on both CPU (…) and GPU.
Is there more information on the stuck jobs somewhere?
I don't have more information than what was reported by @fwyzard. We've been trying offline (unsuccessfully so far) to reproduce it using the streamer files from Andrea.
hi @makortel,
(I also have the full stack trace for all threads here) From the look of it:
From a first look I didn't spot anything suspicious - it really looked like the CUDA runtime was somehow hanging, or maybe our code was waiting for an event that would never complete. @mmusich has re-run over the files one job was processing when it got stuck, and he reproduced the same crash as the rest. So maybe the stuck jobs are a different symptom of the same underlying problem 🤷🏻‍♂️
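As a plain C++ analogue of that kind of hang (deliberately not using the CUDA runtime API; this is only an illustration, and it never terminates by design): a thread waiting for a completion flag that is never set stays parked inside the wait call, which is what a stuck job thread would look like.

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool workDone = false;  // never set, e.g. because the asynchronous work never finishes

void waitForCompletion() {
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [] { return workDone; });  // blocks forever, nobody calls cv.notify_*()
}

int main() {
  std::thread t(waitForCompletion);
  t.join();  // the job appears "stuck", with the thread parked inside a wait
}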
While reviewing the whole list of error streamer files in more detail, I've found another crash not cured by
Thanks @fwyzard for the details. Looking at the full stack trace
I was surprised to see 4 threads (31, 19, 15, 5) calling
I agree. Maybe thread 4 (and/or thread 3?) is holding the lock(s) that
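A generic sketch of that hypothesis (plain C++, not the actual CMSSW code, and again a program that never terminates by design): one thread blocks indefinitely while holding a mutex, so every other thread entering the same locked section piles up behind it, which could produce several threads showing the same frame in a stack trace.

#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

std::mutex resourceMutex;

void blockedHolder() {
  std::lock_guard<std::mutex> lock(resourceMutex);
  std::this_thread::sleep_for(std::chrono::hours(87600));  // stand-in for a wait that never completes
}

void regularCaller() {
  std::lock_guard<std::mutex> lock(resourceMutex);  // every caller blocks here behind the holder
}

int main() {
  std::thread holder(blockedHolder);
  std::vector<std::thread> callers;
  for (int i = 0; i < 4; ++i)
    callers.emplace_back(regularCaller);  // several threads end up waiting on the same lock
  holder.join();
  for (auto& t : callers) t.join();
}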
This behavior was caused by
+hlt
+heterogeneous
+1
This issue is fully signed and ready to be closed.
@cmsbuild, please close
Original issue description (by @jalimena):
Reporting the crashes in run 379617.
Seems to crash on CPU and GPU.
To reproduce:
triggers
@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI