Crash on multicore jobs when running on condor with CMSSW_11_3_0_pre6 #33466
A new Issue was created by @srimanob Phat Srimanobhas. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@srimanob Do I understand correctly that you get the stack trace (**) with a single-thread run? Does the job have any relevant printouts before the crash? On (***) I have no idea. The message comes from ROOT (that we convert to an exception by default). @pcanal, do you have any idea on that? @srimanob Do you have |
On (***) it might be interesting to see the stack trace of the exception. You can obtain it with gdb cmsRun
(gdb) catch throw
(gdb) run <your_config>
# wait until the breakpoint hits, possibly 'continue' if some exceptions are thrown before
(gdb) where |
Hi @makortel
No, I don't have it. This is the error I got back from condor. I can try to run again with printouts. The issue occurs with both nThreads = 1 and 8. The strange thing is that I can't reproduce it when I run the same script on lxplus.
I don't have Regarding Thanks very much. |
wrt (***) see https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Exit_code_6_with_Fatal_Root_Erro Apparently condor doesn't set the |
also wrt (**) there should have been some kind of error message just before the stack trace. You'll probably find it is something similar, as from the stack trace it looks to be in the initialization of libgui. In general, you shouldn't be linking to libgui in a batch job. |
Thanks very much, @dan131riley. I am trying to solve the ROOT issue and will let you know. I still have no idea about libgui, as I don't call it (at least not myself). The script basically runs the cmsDriver configs. I used the same submission script with 11_2 and everything worked out of the box. Any idea what actually changed in 11_3 that may cause this? |
By the way for the ROOT issue, does it mean RecoTauCleaner/pfTausProducerSansRefs module tries to use ROOT which makes reference to $HOME somehow? If that is the case, it should be fixed I think. |
I confirm that with $HOME set for the job, the RECO step works well. |
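For readers who want to apply the same workaround in a batch wrapper, here is a minimal sketch in Python. It assumes the job's working directory is an acceptable stand-in for HOME; the helper names `env_with_home` and `run_with_home` are hypothetical and not part of CMSSW or condor.

```python
import os
import subprocess


def env_with_home(env, fallback_dir):
    """Return a copy of env with HOME set; an existing non-empty HOME is kept."""
    patched = dict(env)
    if not patched.get("HOME"):
        # Assumption: any writable directory is good enough as a fallback HOME.
        patched["HOME"] = fallback_dir
    return patched


def run_with_home(command, fallback_dir=None):
    """Run command (a list of strings) with HOME guaranteed to be defined."""
    if fallback_dir is None:
        fallback_dir = os.getcwd()  # assumption: the job scratch directory
    completed = subprocess.run(command, env=env_with_home(os.environ, fallback_dir))
    return completed.returncode
```

A call such as `run_with_home(["cmsRun", "step4_RECO.py"])` would then start the RECO step with HOME defined even if the batch system did not export it.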
libgui is being loaded via the plugin mechanism:
so some CMSSW module needs it (or so it seems). |
How could we figure out which plugin would be pulling in the dependence on |
From the stack trace, I would say that examining the function parameter in those frames:
should give the answer. |
If you add process.add_(cms.Service("PrintLoadingPlugins")) it will print each time a plugin is loaded and will say what was requested. |
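Since the configs in this thread are generated by cmsDriver.py, the service can also be added through a customise function, in the same style as the `--customise Configuration/DataProcessing/Utils.addMonitoring` option used in the report. The following is a sketch only: the function name is hypothetical, and the optional `cms` argument is an assumption added so the function can be exercised outside a CMSSW environment.

```python
def customise_print_loading_plugins(process, cms=None):
    """Add the PrintLoadingPlugins service to process and return it."""
    if cms is None:
        # Only importable inside a CMSSW environment.
        import FWCore.ParameterSet.Config as cms
    # Same call as suggested in the comment above.
    process.add_(cms.Service("PrintLoadingPlugins"))
    return process
```

Alternatively, the single `process.add_(...)` line can simply be appended to the end of the generated configuration file.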
@srimanob Could you repeat (**) with the |
Hi All, thanks for all the information. I found that the GEN step in my local test also crashes in ROOT when the HOME env is not set. With it set, everything runs fine. Is this an expected change? Why does running a CMSSW job without any private modules need $HOME for ROOT? My condor script worked well before with 10_6 and 11_2 without the need to set HOME. I just tried running 10_6 with HOME unset interactively, and everything seems to work properly in all steps. |
Combined (or caused by?) the evidence for the job loading ROOT's |
Here is the log file from To reproduce the issue, one can use cmsDriver above (ZEE one), with CMSSW_11_3_0_pre6. unsetenv HOME first. |
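The `unsetenv HOME` step is csh syntax. For a shell-independent reproduction, a small Python launcher can strip the variable from the child environment instead. This is a sketch; `env_without` and `run_without_home` are hypothetical helper names.

```python
import os
import subprocess


def env_without(env, name):
    """Return a copy of env with the variable name removed, if present."""
    stripped = dict(env)
    stripped.pop(name, None)
    return stripped


def run_without_home(command):
    """Run command (a list of strings) in an environment that has no HOME."""
    completed = subprocess.run(command, env=env_without(os.environ, "HOME"))
    return completed.returncode
```

For example, `run_without_home(["cmsRun", "step1_ZEE_GEN_temp.py"])` would launch the GEN step the way the batch job sees it, assuming the rest of the CMSSW environment is already set up in the calling shell.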
Thanks @srimanob, quoting here the printouts before exception
(plus a similar stack trace as in the issue description). |
The
and
so Can the FYI @cms-sw/simulation-l2 @cms-sw/ctpps-dpg-l2 |
assign simulation |
New categories assigned: simulation @civanch,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks |
FYI @clemencia @mundim |
Odd, nothing seems to have changed. The hector Makefile does
and |
The selector to
and the args to
Stack trace (-70 lines of
|
And the argument type passed to |
So looking at the TypeWithDict code we can see that the string cmssw/FWCore/Reflection/src/TypeWithDict.cc Lines 342 to 343 in b218083
|
That would be bad :( But then again I am still a bit confused. The type given to
@dan131riley what is the content of |
|
And what about
|
So cmssw/CommonTools/Utils/src/findMethod.cc Line 168 in b218083
which calls cmssw/FWCore/Reflection/src/TypeWithDict.cc Lines 602 to 604 in b218083
cmssw/FWCore/Reflection/src/TypeWithDict.cc Lines 613 to 614 in b218083
|
So the final call to the constructor is here: cmssw/CommonTools/Utils/src/findMethod.cc Lines 84 to 85 in b218083
with the input cmssw/FWCore/Reflection/src/FunctionWithDict.cc Lines 98 to 103 in b218083
so it is just using |
|
And it seems that this is "surprisingly" failing. The question is (of course) why. So one additional piece of information would be the name of the TClass object this is called for, and a verification of the arguments. |
It should be :) but is it? If it isn't, then we need to figure out why; if it is, then it is even more puzzling (why would a search that is (intended to be) restricted to that class find something completely different?). |
For some reason gdb keeps insisting
despite including TClass.h, so I called
|
Here is what I did in ROOT
|
@Dr15Jones Cool :) or :( I should say. We need to open a github issue on the ROOT side for this :( |
Similarly
|
Thanks for all the inline debugging :) You can follow the eventual resolution at root-project/root#7955 |
@srimanob, it seems that the ROOT upgrade is in CMSSW, so is the issue fixed? |
The issue is not fixed in ROOT yet. On the other hand, maybe there isn't much point in keeping this issue open since there is an issue in ROOT and there is a workaround. |
+1 |
Yeah, agreed. As the issue is on the ROOT side, we can monitor it from there. And when the fixed ROOT is available in CMSSW, we can check again. Thanks all. |
+1 |
This issue is fully signed and ready to be closed. |
Here is the report from https://hypernews.cern.ch/HyperNews/CMS/get/edmFramework/3920.html
With the following cmsDriver commands (*), I hit the issue when running on condor. Note that everything runs fine on lxplus.
(*)
cmsDriver.py ZEE_14TeV_TuneCP5_cfi --mc --conditions auto:phase1_2021_realistic -n 500 --era Run3 --eventcontent FEVTDEBUG -s GEN --datatier GEN --geometry DB:Extended --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --python step1_ZEE_GEN_temp.py --no_exec --fileout file:step1_GEN.root --nThreads 1 --customise_commands "from IOMC.RandomEngine.RandomServiceHelper import RandomNumberServiceHelper ; randSvc = RandomNumberServiceHelper(process.RandomNumberGeneratorService) ; randSvc.populate()\n process.source.firstLuminosityBlock = cms.untracked.uint32(3)"
cmsDriver.py step2 --mc --conditions auto:phase1_2021_realistic -n -1 --era Run3 --eventcontent FEVTDEBUG -s SIM --datatier GEN-SIM --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --geometry DB:Extended --python step2_SIM_GFlashNo.py --no_exec --filein file:step1_GEN.root --fileout file:step2_SIM.root --nThreads 8 --customise_commands "from IOMC.RandomEngine.RandomServiceHelper import RandomNumberServiceHelper ; randSvc = RandomNumberServiceHelper(process.RandomNumberGeneratorService) ; randSvc.populate()" --customise Configuration/DataProcessing/Utils.addMonitoring
cmsDriver.py step3 --mc --conditions auto:phase1_2021_realistic -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --datatier GEN-SIM-DIGI-RAW -n -1 --geometry DB:Extended --era Run3 --eventcontent FEVTDEBUGHLT --python step3_DIGIL1HLT.py --no_exec --filein file:step2_SIM.root --fileout file:step3_DIGIL1HLT.root --nThreads 8 --customise_commands "from IOMC.RandomEngine.RandomServiceHelper import RandomNumberServiceHelper ; randSvc = RandomNumberServiceHelper(process.RandomNumberGeneratorService) ; randSvc.populate()"
cmsDriver.py step4 --mc --conditions auto:phase1_2021_realistic -n -1 --era Run3 --eventcontent MINIAODSIM,DQM -s RAW2DIGI,L1Reco,RECO,RECOSIM,EI,PAT,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM --datatier MINIAODSIM,DQMIO --geometry DB:Extended --python step4_RECO.py --no_exec --filein file:step3_DIGIL1HLT.root --fileout file:step4_RECO.root --nThreads 8
(**)
(***)