DQM: Optimize DQMIO memory use #30889
Conversation
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30889/17274 |
A new Pull Request was created by @schneiml (Marcel Schneider) for master. It involves the following packages:
DQMServices/FwkIO
@andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
The tests are being triggered in jenkins. |
-1
Tested at: 6128e00
CMSSW: CMSSW_11_2_X_2020-07-23-2300
I found the following errors while testing this PR:
Failed tests: UnitTests RelVals
I found errors in the following unit tests:
---> test TestDQMServicesFwkIOScripts had ERRORS
When I ran the RelVals I found an error in the following workflows:
4.22 step5: runTheMatrix-results/4.22_RunCosmics2011A+RunCosmics2011A+RECOCOSD+ALCACOSD+SKIMCOSD+HARVESTDC/step5_RunCosmics2011A+RunCosmics2011A+RECOCOSD+ALCACOSD+SKIMCOSD+HARVESTDC.log
8.0 step5: runTheMatrix-results/8.0_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS/step5_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS.log
7.3 step5: runTheMatrix-results/7.3_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18/step5_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18.log
140.53 step3: runTheMatrix-results/140.53_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI/step3_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI.log
9.0 step4: runTheMatrix-results/9.0_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST/step4_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST.log
25.0 step4: runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step4_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log
1306.0 step4: runTheMatrix-results/1306.0_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15/step4_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15.log
4.53 step4: runTheMatrix-results/4.53_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT/step4_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT.log
1330.0 step4: runTheMatrix-results/1330.0_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15/step4_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15.log
136.731 step4: runTheMatrix-results/136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2/step4_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2.log
158.0 step5: runTheMatrix-results/158.0_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO/step5_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO.log
1000.0 step4: runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step4_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log
10042.0 step4: runTheMatrix-results/10042.0_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step4_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log
10024.0 step4: runTheMatrix-results/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log
10824.0 step4: runTheMatrix-results/10824.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018.log
136.793 step4: runTheMatrix-results/136.793_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017/step4_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017.log
25202.0 step4: runTheMatrix-results/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25/step4_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25.log
11634.0 step4: runTheMatrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log
12434.0 step4: runTheMatrix-results/12434.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023.log
10224.0 step4: runTheMatrix-results/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU.log
136.874 step4: runTheMatrix-results/136.874_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM/step4_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM.log
23234.0 step3: runTheMatrix-results/23234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49.log
140.56 step3: runTheMatrix-results/140.56_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18/step3_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18.log
28234.0 step4: runTheMatrix-results/28234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D60_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D60+RecoFullGlobal_2026D60+HARVESTFullGlobal_2026D60/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D60_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D60+RecoFullGlobal_2026D60+HARVESTFullGlobal_2026D60.log
250202.181 step5: runTheMatrix-results/250202.181_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step5_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log |
Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped) |
Wow. It also looks like this causes a new memory leak in harvesting jobs: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=harvestwithreset.json&reference=harvestwithout.json&pid=_sum . Not sure whether that ultimately causes the crashes or whether it is unrelated. @dpiparo @pcanal, any hints on what I might be getting wrong? Edit: merging some random 2018 UL per-lumi saved data, I get pretty weird results. With this patch it seems to be a lot faster and to use less memory, but the results are probably incorrect. |
Reset() seems to damage the tree.
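A minimal standalone sketch of this failure mode (file and tree names here are hypothetical, not the DQMIO ones):

```cpp
// After TTree::Reset() the in-memory tree no longer matches the data on
// disk: the baskets and buffers are freed, but the entry count is also
// zeroed, so subsequent GetEntry() calls read nothing.
#include "TFile.h"
#include "TTree.h"
#include <iostream>

void resetDamagesTree(const char* filename) {
  TFile file(filename, "READ");
  TTree* tree = file.Get<TTree>("Events");   // hypothetical tree name
  std::cout << tree->GetEntries() << "\n";   // e.g. 1000 entries
  tree->Reset();                             // frees the basket memory...
  std::cout << tree->GetEntries() << "\n";   // ...but now reports 0: the tree
                                             // is unusable for further reads.
}
```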
The code-checks are being triggered in jenkins. |
Ok, I suspect the […]. So I added one of those; let's see if that fixes the issues. A test on some per-lumi saved data from 2018 UL looks promising: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=2018withreset.json&reference=2018base.json&pid=_sum Not huge savings, but still a fairly clear signal. In this test, two files were read, with a bunch of alternating reads (lumis from one file, then the other, then the first again).
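For context, the access pattern in that test looks roughly like the following sketch (file and tree names are hypothetical; this is not the actual DQMIO reader):

```cpp
// Lumis are read alternately from two files, so each tree's baskets are
// pulled back into memory on every pass; without freeing them in between,
// both trees keep their decompressed baskets alive at the same time.
#include "TFile.h"
#include "TTree.h"

void alternatingReads(const char* fileA, const char* fileB) {
  TFile a(fileA, "READ"), b(fileB, "READ");
  TTree* trees[2] = {a.Get<TTree>("Histograms"),  // hypothetical tree name
                     b.Get<TTree>("Histograms")};
  for (int pass = 0; pass < 4; ++pass) {
    TTree* t = trees[pass % 2];                   // file A, then B, then A...
    for (Long64_t i = 0; i < t->GetEntries(); ++i)
      t->GetEntry(i);                             // refills baskets each pass
  }
}
```
|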
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30889/17368 |
Pull request #30889 was updated. @andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please check and sign again. |
please test |
The tests are being triggered in jenkins. |
+1 |
Comparison job queued. |
Comparison is ready. Comparison Summary: |
I did another test with some 2016 UL Rereco data: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=2016eithreset.json&reference=2016base.json&pid=_sum There is a consistent saving in memory; not much, but better than nothing. It also seems to come at a cost in speed (maybe 10%? not really measurable in these tests, but it does seem real). So, overall, I think this is safe to go in. |
+1 |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
As suggested by @dpiparo around ROOT-10927, the way we use TTrees in DQMIO reading is not optimal from a memory-utilization point of view.
I suggested that we could create and destroy the trees as needed (since, in the end, we only do a fairly small number of sequential reads), but that led to segfaults everywhere (at least in my stand-alone test). However, a tactically placed TTree::Reset call might give us the same benefit for very little effort, so that is what I try here.
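The idea is roughly the following sketch; it is only an illustration, with hypothetical tree and branch names, not the actual DQMServices/FwkIO code:

```cpp
// Read a tree sequentially, then Reset() it so its baskets and buffers are
// freed immediately instead of staying alive until the TFile is closed.
#include "TFile.h"
#include "TTree.h"
#include "TH1F.h"

void readThenReset(TFile& file) {
  TTree* tree = file.Get<TTree>("MonitorElements");  // hypothetical tree name
  TH1F* histo = nullptr;
  tree->SetBranchAddress("Value", &histo);           // hypothetical branch
  for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
    tree->GetEntry(i);  // sequential read; baskets accumulate in memory
    // ... hand the histogram over to the DQM store ...
  }
  tree->Reset();  // reclaim the basket memory; note that after Reset() the
                  // tree can no longer be read, so this is only safe once
                  // we are completely done with it
}
```

PR validation: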
So far I only tried Phat's merge job sample (with the huge HGCAL MEs in it). Results look like this: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=withreset.json&reference=base.json&pid=_sum
There is a slight reduction in memory usage. It might be larger on a job where the TTree baskets make up a bigger fraction of the total memory; we might expect to save maybe 10 MB in each of 10 trees? Not sure.
Also, this needs more tests to see whether the slowdown is systematic or just random variation.