Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DQM: Optimize DQMIO memory use #30889

Merged
merged 2 commits into from
Jul 30, 2020
Merged

Conversation

schneiml
Copy link
Contributor

PR description:

As suggested by @dpiparo around ROOT-10927, the way we use TTrees in DQMIO reading is not very optimal from memory utilization point of view.

I suggested that we could create/destroy the trees as needed (since in the end, we only do a fairly small number of sequential reads), but that lead to segfaults everywhere (at least in my stand-alone test). But, a tactically placed TTree::Reset call might give us the same benefit, for very little effort, so that is what I try here.

PR validation:

So far I only tried Phat's merge job sample (with the huge HGCAL MEs in it). Results look like this: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=withreset.json&reference=base.json&pid=_sum

There is a slight reduction in memory usage. It might be more on a job where the TTree baskets are more significant compared to the total job; we expect maybe saving 10MB in each of 10 trees? Not sure.

Also, needs more tests to see if the slowdown is systematic or just a random variation.

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30889/17274

  • This PR adds an extra 16KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @schneiml (Marcel Schneider) for master.

It involves the following packages:

DQMServices/FwkIO

@andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please review it and eventually sign? Thanks.
@barvic this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

@andrius-k
Copy link

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 24, 2020

The tests are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

-1

Tested at: 6128e00

CMSSW: CMSSW_11_2_X_2020-07-23-2300
SCRAM_ARCH: slc7_amd64_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b601ea/8264/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests RelVals

  • Unit Tests:

I found errors in the following unit tests:

---> test TestDQMServicesFwkIOScripts had ERRORS

  • RelVals:

When I ran the RelVals I found an error in the following workflows:
4.22 step5

runTheMatrix-results/4.22_RunCosmics2011A+RunCosmics2011A+RECOCOSD+ALCACOSD+SKIMCOSD+HARVESTDC/step5_RunCosmics2011A+RunCosmics2011A+RECOCOSD+ALCACOSD+SKIMCOSD+HARVESTDC.log

8.0 step5
runTheMatrix-results/8.0_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS/step5_BeamHalo+BeamHalo+DIGICOS+RECOCOS+ALCABH+HARVESTCOS.log

7.3 step5
runTheMatrix-results/7.3_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18/step5_CosmicsSPLoose_UP18+CosmicsSPLoose_UP18+DIGICOS_UP18+RECOCOS_UP18+ALCACOS_UP18+HARVESTCOS_UP18.log

140.53 step3
runTheMatrix-results/140.53_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI/step3_RunHI2011+RunHI2011+RECOHID11+HARVESTDHI.log

9.0 step4
runTheMatrix-results/9.0_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST/step4_Higgs200ChargedTaus+Higgs200ChargedTaus+DIGI+RECO+HARVEST.log

25.0 step4
runTheMatrix-results/25.0_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT/step4_TTbar+TTbar+DIGI+RECOAlCaCalo+HARVEST+ALCATT.log

1306.0 step4
runTheMatrix-results/1306.0_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15/step4_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15.log

4.53 step4
runTheMatrix-results/4.53_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT/step4_RunPhoton2012B+RunPhoton2012B+HLTD+RECODR1reHLT+HARVESTDR1reHLT.log

1330.0 step4
runTheMatrix-results/1330.0_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15/step4_ZMM_13+ZMM_13+DIGIUP15+RECOUP15_L1TMuDQM+HARVESTUP15_L1TMuDQM+NANOUP15.log

136.731 step4
runTheMatrix-results/136.731_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2/step4_RunSinglePh2016B+RunSinglePh2016B+HLTDR2_2016+RECODR2_2016reHLT_skimSinglePh_HIPM+HARVESTDR2.log

158.0 step5
runTheMatrix-results/158.0_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO/step5_HydjetQ_B12_5020GeV_2018_ppReco+HydjetQ_B12_5020GeV_2018_ppReco+DIGIHI2018PPRECO+RECOHI2018PPRECO+ALCARECOHI2018PPRECO+HARVESTHI2018PPRECO.log

1000.0 step4
runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step4_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log

10042.0 step4
runTheMatrix-results/10042.0_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step4_ZMM_13+ZMM_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017.log

10024.0 step4
runTheMatrix-results/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017+NanoFull_2017/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018.log

10824.0 step4
runTheMatrix-results/10824.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018/step4_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2018_GenSimFull+DigiFull_2018+RecoFull_2018+HARVESTFull_2018+ALCAFull_2018+NanoFull_2018.log

136.793 step4
runTheMatrix-results/136.793_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017/step4_RunDoubleEG2017C+RunDoubleEG2017C+HLTDR2_2017+RECODR2_2017reHLT_skimDoubleEG_Prompt+HARVEST2017.log

25202.0 step4
runTheMatrix-results/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25/step4_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25+NANOUP15_PU25.log

11634.0 step4
runTheMatrix-results/11634.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2021_GenSimFull+DigiFull_2021+RecoFull_2021+HARVESTFull_2021+ALCAFull_2021.log

12434.0 step4
runTheMatrix-results/12434.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2023_GenSimFull+DigiFull_2023+RecoFull_2023+HARVESTFull_2023+ALCAFull_2023.log

10224.0 step4
runTheMatrix-results/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFull+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU+NanoFull_2017PU/step4_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM.log

136.874 step4
runTheMatrix-results/136.874_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM/step4_RunEGamma2018C+RunEGamma2018C+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Offline_L1TEgDQM+HARVEST2018_L1TEgDQM.log

23234.0 step3
runTheMatrix-results/23234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D49_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D49+RecoFullGlobal_2026D49+HARVESTFullGlobal_2026D49/step3_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18.log

140.56 step3
runTheMatrix-results/140.56_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18/step3_RunHI2018+RunHI2018+RECOHID18+HARVESTDHI18.log

28234.0 step4
runTheMatrix-results/28234.0_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D60_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D60+RecoFullGlobal_2026D60+HARVESTFullGlobal_2026D60/step4_TTbar_14TeV+TTbar_14TeV_TuneCP5_2026D60_GenSimHLBeamSpotFull14+DigiFullTrigger_2026D60+RecoFullGlobal_2026D60+HARVESTFullGlobal_2026D60.log

250202.181 step5
runTheMatrix-results/250202.181_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step5_TTbar_13UP18+TTbar_13UP18+PREMIXUP18_PU25+DIGIPRMXLOCALUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log

@cmsbuild
Copy link
Contributor

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

@schneiml
Copy link
Contributor Author

schneiml commented Jul 27, 2020

Wow, this Reset seems to have a lot more effects than I expected. From the docs it sounded more like it wouldn't have any visible impact on what we are doing...

It also looks like this causes a new memory leak in harvesting jobs: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=harvestwithreset.json&reference=harvestwithout.json&pid=_sum . Not sure if that ultimately causes the crashes or if it is unrelated.

@dpiparo @pcanal any hints what I might be getting wrong?

Edit: Merging some random 2018 UL per-lumi saved data I get pretty weird results. With this patch it seems to be a lot faster and use less memory, but probably the results are incorrect.

Since Reset() seems to damage the tree.
@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@schneiml
Copy link
Contributor Author

Ok, I suspect the Reset screws up the tree (resets number of entries?), but a quick skim over the docs got me to the Refresh call, which promises to re-initialize the tree from the file.

So I added one of those, let's see if that fixes the issues. A test on some per-lumi saved data from 2018 UL looks promising: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=2018withreset.json&reference=2018base.json&pid=_sum

Not huge savings, but still a fairly clear signal. In this test, two files were read and there were a bunch of alternating reads (lumis from one file, then the other, then the first again).

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30889/17368

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

Pull request #30889 was updated. @andrius-k, @kmaeshima, @schneiml, @cmsbuild, @jfernan2, @fioriNTU can you please check and sign again.

@schneiml
Copy link
Contributor Author

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 28, 2020

The tests are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+1
Tested at: c867554
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b601ea/8371/summary.html
CMSSW: CMSSW_11_2_X_2020-07-28-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b601ea/8371/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2525444
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2525390
  • DQMHistoTests: Total skipped: 47
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 144 log files, 17 edm output root files, 34 DQM output files

@schneiml
Copy link
Contributor Author

I did another test with some 2016 UL Rereco data: https://mschneid.web.cern.ch/mschneid/merge/mbGraph.html#?profile=2016eithreset.json&reference=2016base.json&pid=_sum

There is a consistent saving in memory, not much, but better than nothing. This also seems to come at a cost in speed (maybe 10%? not really measurable in these tests but it does seem real).

So, overall, I think this is save to go in.

@schneiml
Copy link
Contributor Author

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Copy link
Contributor

+1
cc: @srimanob @cms-sw/pdmv-l2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants