High memory usage in DQM harvesting job for Express #38976
A new Issue was created by @rvenditti . @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
could you post the igprof results please? |
(eg, the html version is more easily understandable) |
assign dqm |
New categories assigned: dqm @jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Hi @davidlange6 , unfortunately i am not able to forward the browsable version to my cern site. However the related sql3 file is here: /eos/user/r/rosma/www/cgi-bin/data/igreport_perf.sql3 |
I took a quick look. On the other hand, I'm not sure how useful the IgProf profile is for a harvesting job. We know that the job processes histograms, but the IgProf profile won't tell much about which histograms exactly take a lot of memory (beyond the type). Have you checked the histogram sizes from a DQMIO file? I think that would be a useful number, even if |
One thing I noticed is that the job is awfully slow under
Of these the |
The maps are initialized in cmssw/DQM/GEM/plugins/GEMDQMHarvester.cc Lines 282 to 298 in bbffe91
The profile indicates
I'm afraid the |
FYI @cms-sw/gem-dpg-l2 |
@slowmoyang @quark2 can you take a look at this. |
@jshlee @slowmoyang @quark2 any update on this issue? |
Hi @mmusich, Sorry for my late reply, I'm working on it, but I was sick, so the task was delayed... I'll make a fix soon. Best regards, |
Hi @quark2, cmssw/DQMOffline/Configuration/python/DQMOffline_SecondStep_cff.py Lines 34 to 37 in 4c7b44b
please clarify. Marco (ORM) |
Hi @mmusich, I made a fix, but I have no idea how to measure the memory consumption with the fix. Could you provide the way? Best regards, |
@quark2 the best way would be to run the igProf profiler and there are some instructions here: https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof. |
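For orientation, the recipe on that twiki boils down to something along these lines (a sketch only; the twiki page is authoritative for the exact options, and `harvesting_cfg.py` here stands in for the actual job configuration):

```shell
# Sketch: run the job under IgProf's memory profiler (-mp), then turn
# the compressed dump into a sorted MEM_LIVE report.
igprof -d -mp -z -o igprof.mp.gz cmsRun harvesting_cfg.py
igprof-analyse -d -v -g -r MEM_LIVE igprof.mp.gz > igreport_live.res
```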
Thanks, I'll use it! |
Hi, I've made a fix and its backport to 12_4_X. I've seen a reduction of the RSS. (Although it was still 1.9 GB; is that still too large?) |
How much is it reduced? |
The 1.9 GB amount is the overall usage of the harvesting job. The RSS before the fix was 2.8 GB, so about 900 MB is saved. |
Thanks, can you also provide what the number would be in case the harvesting job is executed without |
I executed it without |
Btw, does anyone know how to reproduce |
I see... I need to look at the geometry part more, and reproducing that file should give a good hint. |
There have recently been failures of the same type, e.g. for run 357479. I think the streamers for that file are still available:
you would need to generate the configuration via:

```shell
python RunExpressProcessing.py --scenario ppEra_Run3 --lfn /store/t0streamer/Data/Express/000/357/479/run357479_ls1012_streamExpress_StorageManager.dat --global-tag 101X_dataRun2_Express_v7 --fevt --dqmio
```

manually modifying the output configuration

```python
process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(-1)
)
process.source = cms.Source("NewEventStreamFileReader",
    fileNames = cms.untracked.vstring('<put here all your input streamer files>')
)
process.options = cms.untracked.PSet(
    numberOfStreams = cms.untracked.uint32(0),
    numberOfThreads = cms.untracked.uint32(8)
)
```

to get the correct source. |
@makortel we tried to run the igprof profiler with @sarafiorendi and we got these:
but we don't observe a substantial reduction from the GEM DQM code. |
thanks. Still, I am not sure I understand why there is no contribution from |
@mmusich @sarafiorendi Was this the part of the recipe you followed for the MEM_LIVE profile?
|
these are the commands I ran
as from so maybe the second one is not correct (?) |
Thanks. I'm a bit confused what dump exactly cmssw/Validation/Performance/python/IgProfInfo.py Lines 5 to 9 in 7654b67
causes a profile to be dumped after some events with the event entry number in the file name. Maybe it's the "end of process" dump, given that the recipe does not specify an output file ( -o ). In that case the MEM_LIVE would report memory leaks (which is not particularly relevant for this problem).
To see the memory usage of process.IgProfService.reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz") (where |
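Putting the pieces above together, an IgProfService configuration requesting such an end-of-run dump could look roughly like this (a sketch; the parameter names other than `reportToFileAtPostEndRun`, which appears above, should be checked against the actual IgProfService documentation before use):

```python
# Sketch of an IgProfService setup dumping a profile at the end of each
# run; %R is expanded to the run number. Verify the untracked parameter
# names against IgProfService's fillDescriptions.
process.IgProfService = cms.Service("IgProfService",
    reportFirstEvent = cms.untracked.int32(1),
    reportEventInterval = cms.untracked.int32(100),
    reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz"),
)
```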
Just for the record, this keeps happening also in 2023, despite an increase of the memory allowance for harvesting jobs at Tier-0. |
Is there a recipe to reproduce? (the link wasn't clear for me) |
Hi,
The job has 598 files to process. When running, the memory slowly increases as the files are opened, and then increases much faster at the end. It starts off at ~1.4 GB and is 1.7 GB after 500 files, but then jumps up. Logging every few seconds the number of opened files and the memory usage in kB, I got
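A minimal sketch of such a monitor (hypothetical, not the script actually used here) can read both numbers from `/proc` on Linux:

```python
import os

def process_usage(pid):
    """Return (rss_kB, n_open_fds) for a process, read from /proc (Linux only)."""
    rss_kb = 0
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])  # VmRSS is reported in kB
                break
    # Each entry under /proc/<pid>/fd is one open file descriptor.
    n_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return rss_kb, n_fds

if __name__ == "__main__":
    rss_kb, n_fds = process_usage(os.getpid())
    print(f"open fds = {n_fds}, RSS = {rss_kb} kB")
```

Polling this in a loop against the cmsRun PID gives the kind of log quoted above.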
And then there's lots of LogErrors in the form
and the memory usage grows further, at least up to I'm also testing igprof with the command lines and reportToFileAtPostEndRun suggested above |
If the job is let to run, the memory usage climbs all the way to 5 GB. I didn't manage to get anything out of igprof, despite trying to set in process.IgProfService
I don't get any of those dumps. I got some dumps from valgrind massif, which seem to imply that it's mostly histograms and associated root stuff (TLists, THashLists, ...), maybe as a consequence of the failed merging the job saves a lot of duplicate histograms? If anyone has more up-to-date instructions to profile the heap I can try them. For those interested in reproducing the problem more quickly, limiting the job to 5-10 files it runs much more quickly (few minutes) but still exhibits a large memory growth at the end. |
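For reference, the massif dumps mentioned above come from a recipe of roughly this shape (a sketch; exact options may vary, and `harvesting_cfg.py` is a placeholder for the actual job configuration):

```shell
# Sketch: profile heap usage with valgrind's massif tool, then render
# the snapshots in text form. Running under valgrind is very slow, so
# limiting the job to a few input files helps.
valgrind --tool=massif cmsRun harvesting_cfg.py
ms_print massif.out.<pid> | less
```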
The IgProf memory profiling setup has, unfortunately, become broken. @gartung Could you remind us the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without #40899)? I'll try out VTune. In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable? |
From a log it seems the input files are all held open until the end. Is that expected? |
That is how cmssw/DQMServices/FwkIO/plugins/DQMRootSource.cc Lines 474 to 480 in 2334fc9
(since #28622, IIUC; another PR related to DQMRootSource memory usage #30889)
|
Besides setting up the version of the jemalloc with profiling enabled |
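As a sketch of what that involves (assuming a jemalloc built with `--enable-prof` and preloaded into the job; the library path and configuration details are assumptions, not a verified CMSSW recipe):

```shell
# Sketch: enable jemalloc's heap profiler for a cmsRun job. Profiles
# are dumped as jeprof.out.*.heap files and inspected with jeprof.
export MALLOC_CONF="prof:true,prof_final:true,prof_prefix:jeprof.out"
LD_PRELOAD=/path/to/libjemalloc.so cmsRun harvesting_cfg.py
jeprof --text $(which cmsRun) jeprof.out.*.heap
```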
Hum. So some growth with run length is expected... Seems we (Chris) fixed this already once long ago
|
VTune wasn't very useful, but I managed to get the following total sizes by type
In addition, |
For the record, a new slimmed DQM sequence was defined in #41944 (backported to #42018 for data-taking operations). This was deployed in era Run2023D via dmwm/T0#4844 (Tier0 replay: dmwm/T0#4847). Despite this improvement, we are now again experiencing high memory usage in DQM harvesting jobs for Express in Tier0.
Two recent cases [*]:
* https://cmsweb.cern.ch/t0_reqmon/data/jobdetail/Express_Run385168_StreamExpress
* https://cmsweb.cern.ch/t0_reqmon/data/jobdetail/Express_Run385194_StreamExpress
Tarballs can be found at /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/vocms0314.cern.ch-3498656-3-log.tar.gz
@cms-sw/dqm-l2 can you have a look?
[*] Incidentally, both of them contain this concurrent cmssw exception in the logs:
----- Begin Fatal Exception 01-Sep-2024 14:13:09 UTC-----------------------
An exception of category 'InvalidCall' occurred while
[0] Processing end ProcessBlock
[1] Calling method for module EcalMEFormatter/'ecalMEFormatter'
Exception Message:
Electronics Mapping not initialized
----- End Fatal Exception -------------------------------------------------
which I find somewhat peculiar. |
cms-bot internal usage |
Assuming the exception being thrown originates from cmssw/DQM/EcalCommon/src/DQWorker.cc Lines 118 to 124 in 3548707
I see the edso_.electronicsMap is set in cmssw/DQM/EcalCommon/src/DQWorker.cc Lines 111 to 116 in 3548707
(note that this pattern of taking an address of an EventSetup data product in endLumi and de-referencing the pointer in endProcessBlock is not supported and not guaranteed to work @cms-sw/dqm-l2 @cms-sw/ecal-dpg-l2) In the log I see a printout
I guess the job didn't even start processing LuminosityBlocks (although I see begin/end Run messages in the log), and then the |
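The hazard described above (caching a handle to an EventSetup product at endLuminosityBlock and dereferencing it at endProcessBlock) can be modeled outside the framework. This toy sketch (all names hypothetical, not the actual CMSSW classes) shows why the cached handle is still unset if the job never processes a luminosity block:

```python
class ToyHarvester:
    """Toy model of a harvester that caches a conditions product at
    endLuminosityBlock and dereferences it at endProcessBlock."""

    def __init__(self):
        self.electronics_map = None  # cached pointer-like handle

    def end_luminosity_block(self, event_setup):
        # The cache is only filled here, i.e. only if at least one
        # luminosity block is actually processed.
        self.electronics_map = event_setup["EcalElectronicsMapping"]

    def end_process_block(self):
        if self.electronics_map is None:
            raise RuntimeError("Electronics Mapping not initialized")
        return self.electronics_map

h = ToyHarvester()
# A job with no luminosity blocks: end_luminosity_block never runs,
# so end_process_block sees an uninitialized cache and throws.
try:
    h.end_process_block()
except RuntimeError as e:
    print(e)  # Electronics Mapping not initialized
```

This reproduces the shape of the "Electronics Mapping not initialized" exception without any framework involvement.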
Is there any monitoring of memory requirements per release (slimmed things tend to grow…)?
|
Thanks for this analysis! Indeed adding this basic test:

```diff
diff --git a/DQM/EcalCommon/test/BuildFile.xml b/DQM/EcalCommon/test/BuildFile.xml
new file mode 100644
index 00000000000..67518e7f759
--- /dev/null
+++ b/DQM/EcalCommon/test/BuildFile.xml
@@ -0,0 +1,4 @@
+<bin file="testEcalCommon.cc" name="testEcalCommon">
+  <use name="FWCore/TestProcessor"/>
+  <use name="catch2"/>
+</bin>
diff --git a/DQM/EcalCommon/test/testEcalCommon.cc b/DQM/EcalCommon/test/testEcalCommon.cc
new file mode 100644
index 00000000000..d4eb9db6a08
--- /dev/null
+++ b/DQM/EcalCommon/test/testEcalCommon.cc
@@ -0,0 +1,44 @@
+#include "FWCore/TestProcessor/interface/TestProcessor.h"
+#include "FWCore/Utilities/interface/Exception.h"
+#include "FWCore/ServiceRegistry/interface/Service.h"
+#include <fmt/format.h>
+
+#define CATCH_CONFIG_MAIN
+#include "catch.hpp"
+
+// Function to run the catch2 tests
+//___________________________________________________________________________________________
+void runTestForAnalyzer(const std::string& baseConfig, const std::string& analyzerName) {
+  edm::test::TestProcessor::Config config{baseConfig};
+
+  SECTION(analyzerName + " base configuration is OK") { REQUIRE_NOTHROW(edm::test::TestProcessor(config)); }
+
+  SECTION("Run with no LuminosityBlocks") {
+    edm::test::TestProcessor tester(config);
+    REQUIRE_NOTHROW(tester.testRunWithNoLuminosityBlocks());
+  }
+}
+
+// Function to generate the base configuration string
+//___________________________________________________________________________________________
+std::string generateBaseConfig(const std::string& cfiName, const std::string& analyzerName) {
+  // Define a raw string literal
+  constexpr const char* rawString = R"_(from FWCore.TestProcessor.TestProcess import *
+from DQM.EcalCommon.{}_cfi import {}
+process = TestProcess()
+process.harvester = {}
+process.moduleToTest(process.harvester)
+process.add_(cms.Service('MessageLogger'))
+process.add_(cms.Service('JobReportService'))
+process.add_(cms.Service('DQMStore'))
+)_";
+
+  // Format the raw string literal using fmt::format
+  return fmt::format(rawString, cfiName, analyzerName, analyzerName);
+}
+
+//___________________________________________________________________________________________
+TEST_CASE("EcalMEFormatter tests", "[EcalMEFormatter]") {
+  const std::string baseConfig = generateBaseConfig("EcalMEFormatter", "ecalMEFormatter");
+  runTestForAnalyzer(baseConfig, "EcalMEFormatter");
+}
```

results in:
@cms-sw/ecal-dpg-l2 FYI |
not that I am aware of. |
There seems to be another failure of this kind in run number 385889. Here's the log: https://cmst0.web.cern.ch/CMST0/tier0/pausedJobs/data/vocms0313/wmagentJob_jobid988223.txt |
Hi, we'll have a look at this issue -- Kyungmin on behalf of ECAL DQM team. |
DQMHarvesting is exceeding maxPSS in Express reconstruction at T0 in runs 356381, 356615, 356719 link to cmsTalk.
We re-ran the job on lxplus and it completed successfully, albeit with some warnings (see below).
Looking inside the tarball and at locally run jobs, it seems that the output of cmsRun shows some warnings in HLTConfigProvider for the HLT-EGamma client and at the merging step for CTPPS.
In order to investigate these warnings two GitHub issues have been opened:
CTPPS ME merging problems in Harvesting step #38969
Problem in DQM Harvesting step with EgHLTOfflineClient #38970
However, looking at a similar issue observed in May, it seems that the warnings have been there for a long time and are not the root of the problem.
Running the IgProf tool for memory profiling, it seems that the main memory consumer is the DQMFileSaver::saveForOffline function (see /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and /afs/cern.ch/work/r/rosma/public/Run356615/)... but we don't know how reliable this check is, given that Harvesting is not running on events. Can software experts have a look and give us some suggestions? @makortel