-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrent fetching of condition data #36487
Comments
A new Issue was created by @smorovic Srecko Morovic. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core,db |
New categories assigned: core,db @Dr15Jones,@smuzaffar,@ggovi,@makortel,@francescobrivio,@malbouis,@tvami you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Here is another issue related to improving CondDB access performance #31084 that could be worked on now that esConsumes is enforced at run time. |
To make use of #36539 you'd change cmssw/CondCore/ESSources/interface/DataProxy.h Lines 28 to 33 in 1081695
to class DataProxy : public edm::eventsetup::ESSourceConcurrentDataProxyTemplate<DataT> {
public:
explicit DataProxy(std::shared_ptr<cond::persistency::PayloadProxy<DataT>> pdata)
: m_data{pdata} {} and make the appropriate changes to the ESSource when creating the proxies. |
Thanks, Chris! Could you please give us a high level description of how the concurrency feature in the PR works? In particular, during discussions you were concerned about threads being blocked by locks, leaving them unavailable to other tasks that could be using cores for other things. Does this PR avoid that? |
The new code just removes both the std::mutex and the 'serial task queue' which was being used to serialize access to the underlying ESSource code. This code assumes there are no locks being used by the ESSource (or at least not ones that take any appreciable amount of time) and therefore the bulk of the work can be done concurrently. Each of the DataProxy objects is replicated based on how many concurrent IOVs are being allowed. An individual DataProxy is guaranteed to never be called by more than one thread at a time. For PoolDBESSource, each DataProxy is associated to one https://github.com/cms-sw/cmssw/blob/master/CondCore/CondDB/interface/PayloadProxy.h So as long as different PayloadProxys can be run on different threads at the same time everything should be fine. |
The thread safety question I have is if different
can be called concurrently. If so, then I think PoolDBESSource will be fine to use the new DataProxy. If not then we can't do concurrent accesses and must use the code we have now. |
Hello, sorry for the latency. |
Hello, There have been no crashes or issues seen with either concurrent or non-concurrent version applied. |
@smorovic you can use the updated StallMonitor Service in CMSSW_12_3_0_pre5 (where the changes first appear) to look at concurrency in the EventSetup system. To use
Do not let the job run too long (say over 10 minutes) else you may gather more data the then simple script can handle (i.e. it will run out of memory). If you compare the resultant graphs before and after the change you should see more concurrency during the begin run period if things are working. |
We ran tests with CMSSW_12_3_0_pre5 (using ESSourceConcurrentDataProxyTemplate change suggested above), thread-safe version of frontier_client (with threadsafe=1) and a special version of Coral where locks protecting acess to frontier_client calls have been removed.
Note: @ggovi said that there is another level of locking in Coral and it is at present not thread-safe. We did not do any change to CondDBESSource (it is not clear if anything needs to be done, or new DataProxy will be automatically used). So far, this test points out that there is no attempt to run condition loading concurrently.
|
@smorovic thanks for doing the checks! So the backtrace you are showing is not when EventSetup data products are begin retried (which is what the Proxies do). Instead, you are seeing the request to the PoolDBESSource to get the IOVs. That happens serially. Fortunately, the PoolDBESSource knows what Records it is supposed to get so it would be possible to change |
It looks like we can confirm my hypothesis by turning on LogInfo for the DB cmssw/CondCore/ESSources/plugins/CondDBESSource.cc Lines 496 to 510 in 767a838
|
I believe the output can be added by doing process.MessageLogger.cerr.CondDBESSource = cms.untracked.PSet(
limit = cms.untracked.int32(10000000),
reportEvery = cms.untracked.int32(1)
) |
@Dr15Jones A fatal system signal has occurred: segmentation violation Fri Mar 11 08:50:38 CET 2022
Current Modules: Module: none (crashed) A fatal system signal has occurred: segmentation violation |
Hi @Dr15Jones, from the file it seems that conditions are loaded sequentially (no overlap in logs for different records). |
I had a run with the Stepper (and still the old version without any CondDBESSource.cc/h changes, only having the change in DataProxy.h), which was suggested at the meeting:
this continues for several seconds. Then, at the end of condition loading:
I.e. it turns out that that there are no logged steps around CondDBESSource actions. A question: is it possible to increase timestamp resolution with the MessageLogger? I woud like to have milliseconds for measuring which step takes most time. Here is the log file for the process above: |
@smorovic thanks for the log. I agree that it looks like no concurrency is happening within the data product requests. Can you point me to your code so I can double check that I didn't miss something? |
So I have a unit test designed to check that the new Proxy type does do concurrency: I turned on the Tracer for that job and I do see signs of concurrent running.
So now we need to figure out what is different for the PoolDBESSource implementation. |
Here is project area code I use: However I realize that potentially I did not recompile everything that could pull in this header indirectly (just |
To compile with all dependencies I had to also remove mutex and queue access from the code: According to logs, I still don't see concurrency in condition loading, so presumably this is due to CondDBESSource serializing it (I am using the unmodified version). Next I will try with Giacomo's changes. |
Hello, I've tried to implement a multi-thread pre-loading of condition, to be executed at run start. It needs to be discussed, as in the present implementation it does not provide any advantage. |
@ggovi I had a chance to look at the code. On a first pass through I noticed a few things.
|
@cms-sw/db-l2 Do you have plans to continue pursuing the work of #36487 (comment)? |
CMSSW currently limits fetching of condition data (from the ESSource) to one thread at the time.
With improvements to add thread safery to the frontier_client as well as the Coral library it has been discussed that implementation allowing parallel execution of requests could reduce time spent in beginning-of-run transitions (in HLT and possibly offline) since these jobs are multithreaded.
An important point to also assess if the limiting the parallelism is necessary for each online HLT as well as offline system. Main issue with this is to avoid loading Squid servers, therefore this option can allow tuning the amount of load.
This ticket is opened to track the issue and more details will (could) be added to it.
The text was updated successfully, but these errors were encountered: