DQM: Optimize DQMIO memory use #30889
Conversation
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30889/17274 |
A new Pull Request was created by @schneiml (Marcel Schneider) for master. It involves the following packages:
DQMServices/FwkIO
@andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
please test |
The tests are being triggered in jenkins. |
-1
Tested at: 6128e00
CMSSW: CMSSW_11_2_X_2020-07-23-2300
I found the following errors while testing this PR:
Failed tests: UnitTests RelVals
I found errors in the following unit tests:
---> test TestDQMServicesFwkIOScripts had ERRORS
When I ran the RelVals I found an error in the following workflows:
4.22 step5: runTheMatrix-results/4.22_RunCosmics2011A+RunCosmics2011A+RECOCOSD+ALCACOSD+SKIMCOSD+HARVESTDC/step5_RunCosmics2011A+RunCosmics2011A+RECOCOSD+ALCACOSD+SKIMCOSD+HARVESTDC.log
8.0 step5: runTheMatrix-results/8.0_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS/step5_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS.log
7.3 step5: runTheMatrix-results/7.3_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18/step5_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18.log
140.53 step3: runTheMatrix-results/140.53_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI/step3_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI.log
9.0 step4: runTheMatrix-results/9.0_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST/step4_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST.log
25.0 step4: runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step4_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log
1306.0 step4: runTheMatrix-results/1306.0_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15/step4_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15.log
4.53 step4: runTheMatrix-results/4.53_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT/step4_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT.log
1330.0 step4: runTheMatrix-results/1330.0_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15/step4_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15.log
136.731 step4: runTheMatrix-results/136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2/step4_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2.log
158.0 step5: runTheMatrix-results/158.0_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO/step5_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO.log
1000.0 step4: runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step4_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log
10042.0 step4: runTheMatrix-results/10042.0_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step4_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log
10024.0 step4: runTheMatrix-results/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log
10824.0 step4: runTheMatrix-results/10824.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018.log
136.793 step4: runTheMatrix-results/136.793_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017/step4_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017.log
25202.0 step4: runTheMatrix-results/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25/step4_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25.log
11634.0 step4: runTheMatrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log
12434.0 step4: runTheMatrix-results/12434.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023.log
10224.0 step4: runTheMatrix-results/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU.log
136.874 step4: runTheMatrix-results/136.874_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM/step4_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM.log
23234.0 step3: runTheMatrix-results/23234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49/step3_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49.log
140.56 step3: runTheMatrix-results/140.56_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18/step3_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18.log
28234.0 step4: runTheMatrix-results/28234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D60_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D60+RecoFullGlobal_2026D60+HARVESTFullGlobal_2026D60/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D60_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D60+RecoFullGlobal_2026D60+HARVESTFullGlobal_2026D60.log
250202.181 step5: runTheMatrix-results/250202.181_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step5_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log |
Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped) |
Wow. It also looks like this causes a new memory leak in harvesting jobs: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=harvestwithreset.json&reference=harvestwithout.json&pid=_sum . Not sure whether that ultimately causes the crashes or whether it is unrelated. @dpiparo @pcanal, any hints on what I might be getting wrong? Edit: merging some random 2018 UL per-lumi saved data, I get pretty weird results. With this patch it seems to be a lot faster and to use less memory, but the results are probably incorrect. |
Reset() seems to damage the tree.
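A minimal standalone sketch of this failure mode (file and tree names here are hypothetical, not the DQMIO ones):

```cpp
// After TTree::Reset() the in-memory tree no longer matches the data on
// disk: the baskets and buffers are freed, but the entry count is also
// zeroed, so subsequent GetEntry() calls read nothing.
#include "TFile.h"
#include "TTree.h"
#include <iostream>

void resetDamagesTree(const char* filename) {
  TFile file(filename, "READ");
  TTree* tree = file.Get<TTree>("Events");   // hypothetical tree name
  std::cout << tree->GetEntries() << "\n";   // e.g. 1000 entries
  tree->Reset();                             // frees the basket memory...
  std::cout << tree->GetEntries() << "\n";   // ...but now reports 0: the tree
                                             // is unusable for further reads.
}
```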
The code-checks are being triggered in jenkins. |
Ok, I suspect the […]. So I added one of those; let's see if that fixes the issues. A test on some per-lumi saved data from 2018 UL looks promising: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=2018withreset.json&reference=2018base.json&pid=_sum Not huge savings, but still a fairly clear signal. In this test, two files were read, with a bunch of alternating reads (lumis from one file, then the other, then the first again).
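For context, the access pattern in that test looks roughly like the following sketch (file and tree names are hypothetical; this is not the actual DQMIO reader):

```cpp
// Lumis are read alternately from two files, so each tree's baskets are
// pulled back into memory on every pass; without freeing them in between,
// both trees keep their decompressed baskets alive at the same time.
#include "TFile.h"
#include "TTree.h"

void alternatingReads(const char* fileA, const char* fileB) {
  TFile a(fileA, "READ"), b(fileB, "READ");
  TTree* trees[2] = {a.Get<TTree>("Histograms"),  // hypothetical tree name
                     b.Get<TTree>("Histograms")};
  for (int pass = 0; pass < 4; ++pass) {
    TTree* t = trees[pass % 2];                   // file A, then B, then A...
    for (Long64_t i = 0; i < t->GetEntries(); ++i)
      t->GetEntry(i);                             // refills baskets each pass
  }
}
```
|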
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30889/17368 |
Pull request #30889 was updated. @andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please check and sign again. |
please test |
The tests are being triggered in jenkins. |
+1 |
Comparison job queued. |
Comparison is ready. Comparison Summary: |
I did another test with some 2016 UL Rereco data: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=2016eithreset.json&reference=2016base.json&pid=_sum There is a consistent saving in memory; not much, but better than nothing. It also seems to come at a cost in speed (maybe 10%? not really measurable in these tests, but it does seem real). So, overall, I think this is safe to go in. |
+1 |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
As suggested by @dpiparo around ROOT-10927, the way we use TTrees in DQMIO reading is not optimal from a memory-utilization point of view.
I suggested that we could create and destroy the trees as needed (since, in the end, we only do a fairly small number of sequential reads), but that led to segfaults everywhere (at least in my stand-alone test). However, a tactically placed TTree::Reset call might give us the same benefit for very little effort, so that is what I try here.
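The idea is roughly the following sketch; it is only an illustration, with hypothetical tree and branch names, not the actual DQMServices/FwkIO code:

```cpp
// Read a tree sequentially, then Reset() it so its baskets and buffers are
// freed immediately instead of staying alive until the TFile is closed.
#include "TFile.h"
#include "TTree.h"
#include "TH1F.h"

void readThenReset(TFile& file) {
  TTree* tree = file.Get<TTree>("MonitorElements");  // hypothetical tree name
  TH1F* histo = nullptr;
  tree->SetBranchAddress("Value", &histo);           // hypothetical branch
  for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
    tree->GetEntry(i);  // sequential read; baskets accumulate in memory
    // ... hand the histogram over to the DQM store ...
  }
  tree->Reset();  // reclaim the basket memory; note that after Reset() the
                  // tree can no longer be read, so this is only safe once
                  // we are completely done with it
}
```

PR validation: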
So far I only tried Phat's merge job sample (with the huge HGCAL MEs in it). Results look like this: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=withreset.json&reference=base.json&pid=_sum
There is a slight reduction in memory usage. It might be larger on a job where the TTree baskets make up a bigger fraction of the total memory; we might expect to save maybe 10 MB in each of 10 trees? Not sure.
Also, this needs more tests to see whether the slowdown is systematic or just random variation.