Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent fetching of condition data #36487

Closed
smorovic opened this issue Dec 14, 2021 · 26 comments · Fixed by #36539
Closed

Concurrent fetching of condition data #36487

smorovic opened this issue Dec 14, 2021 · 26 comments · Fixed by #36539

Comments

@smorovic
Copy link
Contributor

smorovic commented Dec 14, 2021

CMSSW currently limits fetching of condition data (from the ESSource) to one thread at the time.

With improvements to add thread safery to the frontier_client as well as the Coral library it has been discussed that implementation allowing parallel execution of requests could reduce time spent in beginning-of-run transitions (in HLT and possibly offline) since these jobs are multithreaded.
An important point to also assess if the limiting the parallelism is necessary for each online HLT as well as offline system. Main issue with this is to avoid loading Squid servers, therefore this option can allow tuning the amount of load.

This ticket is opened to track the issue and more details will (could) be added to it.

@cmsbuild
Copy link
Contributor

A new Issue was created by @smorovic Srecko Morovic.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign core,db

@cmsbuild
Copy link
Contributor

New categories assigned: core,db

@Dr15Jones,@smuzaffar,@ggovi,@makortel,@francescobrivio,@malbouis,@tvami you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

Here is another issue related to improving CondDB access performance #31084 that could be worked on now that esConsumes is enforced at run time.

@Dr15Jones
Copy link
Contributor

To make use of #36539 you'd change

class DataProxy : public edm::eventsetup::ESSourceDataProxyTemplate<DataT> {
public:
explicit DataProxy(std::shared_ptr<cond::persistency::PayloadProxy<DataT>> pdata,
edm::SerialTaskQueue* iQueue,
std::mutex* iMutex)
: edm::eventsetup::ESSourceDataProxyTemplate<DataT>(iQueue, iMutex), m_data{pdata} {}

to

class DataProxy : public edm::eventsetup::ESSourceConcurrentDataProxyTemplate<DataT> {
public:
  explicit DataProxy(std::shared_ptr<cond::persistency::PayloadProxy<DataT>> pdata)
      : m_data{pdata} {}

and make the appropriate changes to the ESSource when creating the proxies.

@DrDaveD
Copy link

DrDaveD commented Dec 17, 2021

Thanks, Chris! Could you please give us a high level description of how the concurrency feature in the PR works? In particular, during discussions you were concerned about threads being blocked by locks, leaving them unavailable to other tasks that could be using cores for other things. Does this PR avoid that?

@Dr15Jones
Copy link
Contributor

The new code just removes both the std::mutex and the 'serial task queue' which was being used to serialize access to the underlying ESSource code.

This code assumes there are no locks being used by the ESSource (or at least not ones that take any appreciable amount of time) and therefore the bulk of the work can be done concurrently.

Each of the DataProxy objects is replicated based on how many concurrent IOVs are being allowed. An individual DataProxy is guaranteed to never be called by more than one thread at a time.

For PoolDBESSource, each DataProxy is associated to one PayloadProxy

https://github.com/cms-sw/cmssw/blob/master/CondCore/CondDB/interface/PayloadProxy.h

So as long as different PayloadProxys can be run on different threads at the same time everything should be fine.

@Dr15Jones
Copy link
Contributor

The thread safety question I have is if different Sessions created by the same ConnectionPool via the call

Session createReadOnlySession(const std::string& connectionString, const std::string& transactionId);

can be called concurrently. If so, then I think PoolDBESSource will be fine to use the new DataProxy. If not then we can't do concurrent accesses and must use the code we have now.

@ggovi
Copy link
Contributor

ggovi commented Feb 14, 2022

Hello, sorry for the latency.
The thread safety of ConnectionPool::createReadOnlySession brings to the implementation of the coral component framework, that is based on statics:
https://github.com/cms-externals/coral/blob/cms/CORAL_2_3_21py3/CoralKernel/CoralKernel/Context.h
So, there we have high level locks that seem to serialise the calls. At lower level we are back to the coral frontier plugins implementation, which should have been checked already...
Maybe we can discuss this further ( do we have specific test to run?) in the next meeting on Friday 18.

@smorovic
Copy link
Contributor Author

Hello,
I can add that in last 1-2 weeks I tested ESSourceConcurrentDataProxyTemplate with CMSSW_12_3_0_pre4 and a cosmic menu, prepared recently by the TSG group, in HLT setup, small scale cluster of 5 machines.

There have been no crashes or issues seen with either concurrent or non-concurrent version applied.
I also did not notice any difference in performance: all run-based condition data loads in span of 4-5 seconds for any of the tests.

@Dr15Jones
Copy link
Contributor

@smorovic you can use the updated StallMonitor Service in CMSSW_12_3_0_pre5 (where the changes first appear) to look at concurrency in the EventSetup system. To use

  1. Add process.add_(cms.Service("StallMonitor", fileName=cms.untracked.string("moduleTimings.stall"))) to your process (changing the file name to what ever is convenient for you)
  2. Run edmStreamStallGrapher.py -g -s moduleTimings.stall on the file to generate a pdf of what modules are running when.

Do not let the job run too long (say over 10 minutes) else you may gather more data the then simple script can handle (i.e. it will run out of memory).

If you compare the resultant graphs before and after the change you should see more concurrency during the begin run period if things are working.

@smorovic
Copy link
Contributor Author

We ran tests with CMSSW_12_3_0_pre5 (using ESSourceConcurrentDataProxyTemplate change suggested above), thread-safe version of frontier_client (with threadsafe=1) and a special version of Coral where locks protecting acess to frontier_client calls have been removed.

  • everything runs without crashes in any configuration. I tried with either 4 threads and streams per process or 12 threads and 6 streams, and also tried to saturate it with high rate of lumi-transitions and using most of available memory on the machine by running 48 processes per FU (5 HLT machines in total). This is with the menu that has both run-based and lumi-based conditions (a cosmic HLT menu).
  • time it takes to load lumi-based conditions is the same as before. I didn't test the standard setup now, but previously it was 5 seconds with the current menu, thus it is the same.
  • I tried several time to attach a debugger and dump stack trace while loading run-based conditions at the beginning of the run. Each time only Thread1 appears to be reading frontier (typically in frontier_getRawData). Thus no evidence of parallel access so far. Others threads are idle (not even trying the access).
  • For the record: tests above use thread-safe version of frontier with threadsafe=1, Coral with disabled frontier connection locking and DataProxy using the ESSourceConcurrentDataProxyTemplate.

Note: @ggovi said that there is another level of locking in Coral and it is at present not thread-safe.

We did not do any change to CondDBESSource (it is not clear if anything needs to be done, or new DataProxy will be automatically used). So far, this test points out that there is no attempt to run condition loading concurrently.
Stack trace of one of the processes (4 threads/streams):


Thread 7 (Thread 0x7f6d59957700 (LWP 308968)):
#0  0x00007f6d7adfb6fd in read () from /lib64/libpthread.so.0
#1  0x00007f6d747df22f in full_read.constprop.0 () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2  0x00007f6d747dfb0f in edm::service::InitRootHandlers::stacktraceHelperThread() ()
   from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3  0x00007f6d7b3f4f90 in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007f6d7adf4dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f6d7ab1dead in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f6d32fdb700 (LWP 309114)):
#0  0x00007f6d7adf8965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6d7b3efa7c in __gthread_cond_wait (__mutex=<optimized out>, __cond=<optimized out>)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre2-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre2-build/BUILD/slc7_amd64_gcc10/external/gcc/10.3.0-84898dea653199466402e67d73657f10/gcc-10.3.0/obj/x86_64-unknown-linux-gnu/libstdc++-v3/include/x86_64-unknown-linux-gnu/bits/gthr-default.h:865
#2  std::condition_variable::wait (this=<optimized out>, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:53
#3  0x00007f6d7369c17d in FedRawDataInputSource::readWorker(unsigned int) ()
   from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libEventFilterUtilities.so
#4  0x00007f6d7b3f4f90 in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#5  0x00007f6d7adf4dd5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f6d7ab1dead in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f6d25568700 (LWP 309538)):
#0  0x00007f6d7aae4e2d in nanosleep () from /lib64/libc.so.6
#1  0x00007f6d7aae4cc4 in sleep () from /lib64/libc.so.6
#2  0x00007f6d7368a5e2 in evf::FastMonitoringService::snapshotRunner() ()
   from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libEventFilterUtilities.so
#3  0x00007f6d7b3f4f90 in execute_native_thread_routine () at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4  0x00007f6d7adf4dd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f6d7ab1dead in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f6d247ff700 (LWP 309539)):
#0  0x00007f6d7ab181c9 in syscall () from /lib64/libc.so.6
#1  0x00007f6d7bbe42b7 in futex_wait (comparand=2, futex=0x7f6d751f20ac)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/semaphore.h:290
#2  tbb::detail::r1::binary_semaphore::P (this=0x7f6d751f20ac)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730---Type <return> to continue, or q <return> to quit---
601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/semaphore.h:290
#3  0x00007f6d7bbf68b8 in commit_wait (c=..., this=0x7f6d751f20a0)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/rml_thread_monitor.h:242
#4  run (this=0x7f6d751f2080)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/private_server.cpp:274
#5  tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f6d751f2080)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/private_server.cpp:221
#6  0x00007f6d7adf4dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f6d7ab1dead in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f6d23dfe700 (LWP 309540)):
#0  0x00007f6d7ab181c9 in syscall () from /lib64/libc.so.6
#1  0x00007f6d7bbe42b7 in futex_wait (comparand=2, futex=0x7f6d751f212c)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/semaphore.h:290
#2  tbb::detail::r1::binary_semaphore::P (this=0x7f6d751f212c)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/semaphore.h:290
#3  0x00007f6d7bbf68b8 in commit_wait (c=..., this=0x7f6d751f2120)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/rml_thread_monitor.h:242
#4  run (this=0x7f6d751f2100)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/private_server.cpp:274
#5  tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f6d751f2100)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/private_server.cpp:221
#6  0x00007f6d7adf4dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f6d7ab1dead in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f6d22bff700 (LWP 309541)):
#0  0x00007f6d7ab181c9 in syscall () from /lib64/libc.so.6
#1  0x00007f6d7bbe42b7 in futex_wait (comparand=2, futex=0x7f6d751f202c)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/semaphore.h:290
---Type <return> to continue, or q <return> to quit---
#2  tbb::detail::r1::binary_semaphore::P (this=0x7f6d751f202c)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/semaphore.h:290
#3  0x00007f6d7bbf68b8 in commit_wait (c=..., this=0x7f6d751f2020)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/rml_thread_monitor.h:242
#4  run (this=0x7f6d751f2000)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/private_server.cpp:274
#5  tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7f6d751f2000)
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/private_server.cpp:221
#6  0x00007f6d7adf4dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f6d7ab1dead in clone () from /lib64/libc.so.6


Thread 1 (Thread 0x7f6d78c8d540 (LWP 305643)):
#0  0x00007f6d7ab1320d in poll () from /lib64/libc.so.6
#1  0x00007f6d39bd85e7 in frontier_read () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#2  0x00007f6d39bd5879 in http_read () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#3  0x00007f6d39bd58c3 in read_char () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#4  0x00007f6d39bd59a1 in read_line () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#5  0x00007f6d39bd62b1 in read_connection () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#6  0x00007f6d39bd66ff in frontierHttpClnt_post ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#7  0x00007f6d39bd660d in frontierHttpClnt_get ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#8  0x00007f6d39be2d74 in get_data () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#9  0x00007f6d39be35ba in frontier_postRawData ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#10 0x00007f6d39be320c in frontier_getRawData ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#11 0x00007f6d39bdab6b in frontier::Session::getData(std::vector<frontier::Request const*, std::allocator<frontier::Request const*> > const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/libfrontier_client.so.2
#12 0x00007f6d39c44327 in coral::FrontierAccess::Statement::execute(coral::AttributeList const&, int) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/liblcg_FrontierAccess.so
#13 0x00007f6d39c69d32 in coral::FrontierAccess::Query::execute() ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/external/slc7_amd64_gcc10/lib/liblcg_FrontierAccess.so
#14 0x00007f6d50755842 in cond::persistency::TAG::Table::select(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cond::TimeType&, ---Type <return> to continue, or q <return> to quit---
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, cond::SynchronizationType&, unsigned long long&, unsigned long long&, int&) ()
   from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libCondCoreCondDB.so
#15 0x00007f6d50750871 in cond::persistency::IOVProxy::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::posix_time::ptime const&) () from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libCondCoreCondDB.so
#16 0x00007f6d5075092d in cond::persistency::IOVProxy::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libCondCoreCondDB.so
#17 0x00007f6d507813bf in cond::persistency::Session::readIov(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /opt/offline/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libCondCoreCondDB.so
#18 0x00007f6d503dbc6d in cond::DataProxyWrapperBase::loadTag(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libCondCoreESSources.so
#19 0x00007f6d503dc1c9 in cond::DataProxyWrapperBase::reload() ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libCondCoreESSources.so
#20 0x00007f6d39d17506 in CondDBESSource::setIntervalFor(edm::eventsetup::EventSetupRecordKey const&, edm::IOVSyncValue const&, edm::ValidityInterval&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/pluginCondCoreESSourcesPlugins.so
#21 0x00007f6d7d478931 in edm::EventSetupRecordIntervalFinder::findIntervalFor(edm::eventsetup::EventSetupRecordKey const&, edm::IOVSyncValue const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#22 0x00007f6d7d3f824c in edm::eventsetup::DependentRecordIntervalFinder::setIntervalFor(edm::eventsetup::EventSetupRecordKey const&, edm::IOVSyncValue const&, edm::ValidityInterval&) () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#23 0x00007f6d7d478931 in edm::EventSetupRecordIntervalFinder::findIntervalFor(edm::eventsetup::EventSetupRecordKey const&, edm::IOVSyncValue const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#24 0x00007f6d7d479618 in edm::eventsetup::EventSetupRecordProvider::setValidityIntervalFor(edm::IOVSyncValue const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#25 0x00007f6d7d3f8314 in edm::eventsetup::DependentRecordIntervalFinder::setIntervalFor(edm::eventsetup::EventSetupRecordKey const&, edm::IOVSyncValue const&, edm::ValidityInterval&) () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#26 0x00007f6d7d478931 in edm::EventSetupRecordIntervalFinder::findIntervalFor(edm::eventsetup::EventSetupRecordKey const&, edm::IOVSyncValue const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#27 0x00007f6d7d479618 in edm::eventsetup::EventSetupRecordProvider::setValidityIntervalFor(edm::IOVSyncValue const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#28 0x00007f6d7d4650bf in edm::eventsetup::EventSetupProvider::setAllValidityIntervals(edm::IOVSyncValue const&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#29 0x00007f6d7d48034f in edm::eventsetup::EventSetupsController::eventSetupForInstanceAsync(edm::IOVSyncValue const&, edm::WaitingTaskHolder const&, edm::WaitingTaskList&, std::vector<std::shared_ptr<edm::EventSetupImpl const>, std::allocator<std::shared_ptr<edm::EventSetupImpl const> > >&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#30 0x00007f6d7d480617 in edm::eventsetup::synchronousEventSetupForInstance(edm::IOVSyncValue const&, tbb::detail::d1::task_group&, edm::eventsetup::EventSetupsController&) () from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#31 0x00007f6d7d42765e in edm::EventProcessor::beginRun(edm::Hash<2> const&, unsigned int, bool&, bool&) ()
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#32 0x00007f6d7d4286f5 in edm::EventProcessor::runToCompletion() ()
---Type <return> to continue, or q <return> to quit---
   from /cmsnfshome0/nfshome0/smorovic/bench22b/cmssw/CMSSW_12_3_0_pre5/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#33 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#34 0x00007f6d7bbeb15b in tbb::detail::r1::task_arena_impl::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) ()
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0_pre5-slc7_amd64_gcc10/build/CMSSW_12_3_0_pre5-build/BUILD/slc7_amd64_gcc10/external/tbb/v2021.4.0-75e6d730601d8461f20893321f4f7660/tbb-v2021.4.0/src/tbb/arena.cpp:698
#35 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#36 0x000000000040971c in main ()

@Dr15Jones
Copy link
Contributor

@smorovic thanks for doing the checks! So the backtrace you are showing is not when EventSetup data products are begin retried (which is what the Proxies do). Instead, you are seeing the request to the PoolDBESSource to get the IOVs. That happens serially.

Fortunately, the PoolDBESSource knows what Records it is supposed to get so it would be possible to change CondDBESSource::setIntervalFor to ask for all the IOVs for all the Records in parallel and then cache the results. That would be an internal detail of CondDBESSource and not something the framework would have to deal with.

@Dr15Jones
Copy link
Contributor

It looks like we can confirm my hypothesis by turning on LogInfo for the DB

edm::LogInfo("CondDBESSource") << "Reconnecting to \"" << connStr << "\" for getting new payload for record \""
<< recordname << "\" and label \"" << pmIter->second->label()
<< "\" from IOV tag \"" << tcIter->second.tagName() << "\" to be consumed by "
<< iTime.eventID() << ", timestamp: " << iTime.time().value()
<< "; from CondDBESSource::setIntervalFor";
pmIter->second->session() = iSess->second.first;
pmIter->second->reload();
m_stats.nReconnect++;
} else {
edm::LogInfo("CondDBESSource") << "Refreshing IOV sequence labeled by tag \"" << tcIter->second.tagName()
<< "\" for getting new payload for record \"" << recordname << "\" and label \""
<< pmIter->second->label() << "\" to be consumed by " << iTime.eventID()
<< ", timestamp: " << iTime.time().value()
<< "; from CondDBESSource::setIntervalFor";
pmIter->second->reload();

@Dr15Jones
Copy link
Contributor

I believe the output can be added by doing

process.MessageLogger.cerr.CondDBESSource = cms.untracked.PSet(
               limit = cms.untracked.int32(10000000),
               reportEvery = cms.untracked.int32(1)
)

@ggovi
Copy link
Contributor

ggovi commented Mar 11, 2022

@Dr15Jones
I've started a (probably too) naive implementation of what you suggest above concerning the DataProxy. It is here:
https://github.com/ggovi/cmssw/tree/condcore-conddb-essources-concurrency-0-123X
A first 'blind' run of one of the integration tests fails:
[govi@cmsdev23 src]$ cmsRun CondCore/ESSources/test/python/loadall_from_prodglobaltag_cfg.py
%MSG-w CondDBESSource: (NoModuleName) 11-Mar-2022 08:50:36 CET pre-events
Plugin for Record PGeometricDetExtraRcd has not been found.
%MSG

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Fri Mar 11 08:50:38 CET 2022
Thread 2 (Thread 0x7f6a7ed22700 (LWP 17770) "cmsRun"):
#0 0x00007f6aa0c531d9 in waitpid () from /lib64/libpthread.so.0
#1 0x00007f6a9a63ffd7 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_
12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f6a9a640b4a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc
10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f6aa124af90 in std::execute_native_thread_routine (__p=0x7f6a908f1520) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
#4 0x00007f6aa0c4bea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f6aa0973b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f6a9eac6540 (LWP 17751) "cmsRun"):
#0 0x00007f6aa0968ddd in poll () from /lib64/libc.so.6
#1 0x00007f6a9a64028f in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-
1100/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#2 0x00007f6a9a640c1c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10
/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#3 0x00007f6a9a64356b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-
09-1100/lib/slc7_amd64_gcc10/pluginFWCoreServicesPlugins.so
#4
#5 0x00007f6aa34f90ac in edm::SerialTaskQueue::pushAndGetNextTask(edm::SerialTaskQueue::TaskBase*) () from /cvmfs/cms-ib.cern.ch/nweek-027
23/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#6 0x00007f6aa34f9129 in edm::SerialTaskQueue::pushTask(edm::SerialTaskQueue::TaskBase*) () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_am
d64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreConcurrency.so
#7 0x00007f6aa3259cda in edm::eventsetup::ESSourceDataProxyNonConcurrentBase::prefetchAsyncImpl(edm::WaitingTaskHolder, edm::eventsetup::E
ventSetupRecordImpl const&, edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, edm::ServiceToken const&, edm::ESParentContext con
st&) () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreFramew
ork.so
#8 0x00007f6aa3244898 in edm::eventsetup::DataProxy::prefetchAsync(edm::WaitingTaskHolder, edm::eventsetup::EventSetupRecordImpl const&, e
dm::eventsetup::DataKey const&, edm::EventSetupImpl const*, edm::ServiceToken const&, edm::ESParentContext const&) const () from /cvmfs/cms
-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#9 0x00007f6aa32c7c9c in edm::eventsetup::EventSetupRecordImpl::prefetchAsync(edm::WaitingTaskHolder, edm::ESProxyIndex, edm::EventSetupIm
pl const*, edm::ServiceToken const&, edm::ESParentContext) const () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW
_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#10 0x00007f6aa33934bf in edm::Worker::esPrefetchAsync(edm::WaitingTaskHolder, edm::EventSetupImpl const&, edm::Transition, edm::ServiceTok
en const&) () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCore
Framework.so
#11 0x00007f6aa329e5ee in void edm::Worker::prefetchAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::Waiting
TaskHolder, edm::ServiceToken const&, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Transi
tionInfoType const&, edm::Transition) () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib
/slc7_amd64_gcc10/libFWCoreFramework.so
#12 0x00007f6aa329e917 in void edm::Worker::doWorkAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::WaitingTa
skHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::TransitionInfoType const&, edm::ServiceToken const&, edm::Str
eamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib
.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#13 0x00007f6aa329ec83 in void edm::GlobalSchedule::processOneGlobalAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0

(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::TransitionInfoType&, edm::ServiceToken cons
t&, bool) () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreF
ramework.so
#14 0x00007f6aa3276ed4 in edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::AutoExceptionHandler<edm::EventProcessor::
beginRun(edm::Hash<2> const&, unsigned int, bool&, bool&)::{lambda(auto:1)#4}>, edm::waiting_task::detail::AutoExceptionHandler<edm::EventP
rocessor::beginRun(edm::Hash<2> const&, unsigned int, bool&, bool&)::{lambda(auto:1)#3}> >::runLast(edm::WaitingTaskHolder) () from /cvmfs/
cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#15 0x00007f6aa32771ad in edm::waiting_task::detail::WaitingTaskChain<edm::waiting_task::detail::Conditional<edm::waiting_task::detail::Aut
oExceptionHandler<edm::EventProcessor::beginRun(edm::Hash<2> const&, unsigned int, bool&, bool&)::{lambda(auto:1)#5}> >, edm::waiting_task:
:detail::AutoExceptionHandler<edm::EventProcessor::beginRun(edm::Hash<2> const&, unsigned int, bool&, bool&)::{lambda(auto:1)#4}>, edm::wai
ting_task::detail::AutoExceptionHandler<edm::EventProcessor::beginRun(edm::Hash<2> const&, unsigned int, bool&, bool&)::{lambda(auto:1)#3}>
::runLast(edm::WaitingTaskHolder) () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/s
lc7_amd64_gcc10/libFWCoreFramework.so
#16 0x00007f6aa3278842 in edm::EventProcessor::beginRun(edm::Hash<2> const&, unsigned int, bool&, bool&) () from /cvmfs/cms-ib.cern.ch/nwee
k-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#17 0x00007f6aa3279745 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc10/cms/cmssw/CMSSW
_12_3_X_2022-03-09-1100/lib/slc7_amd64_gcc10/libFWCoreFramework.so
#18 0x000000000040a266 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() co
nst ()
#19 0x00007f6aa1a4215b in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/
slc7_amd64_gcc10/external/tbb/v2021.4.0-d0152ca29055e3a1bbf629673f6e97c4/tbb-v2021.4.0/src/tbb/arena.cpp:698
#20 0x000000000040b094 in main::{lambda()#1}::operator()() const ()
#21 0x000000000040971c in main ()

Current Modules:

Module: none (crashed)

A fatal system signal has occurred: segmentation violation
Segmentation fault

@smorovic
Copy link
Contributor Author

Hi @Dr15Jones,
I added the logging parameter and ran again. The full log file of one process:
/afs/cern.ch/user/s/smorovic/public/old_hlt_run2100000231_pid304187.log

from the file it seems that conditions are loaded sequentially (no overlap in logs for different records).

@smorovic
Copy link
Contributor Author

I had a run with the Stepper (and still the old version without any CondDBESSource.cc/h changes, only having the change in DataProxy.h), which was suggested at the meeting:

++++ finished: begin stream for module: stream = 3 label = 'RPCMONOutput' id = 56
++++ starting: begin process block
++++ finished: begin process block
++++ starting: source run
++++ finished: source run
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Getting data for record "DTMtimeRcd" to be consumed on Run: 2100000232 - Lumiblock:0 - Timestamp: 1; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Preparing refresh for record "DTMtimeRcd" for run: 2100000232 lumi: 0 event: 0, timestamp: 1; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Processing record "DTMtimeRcd" and label "" for run: 2100000232 lumi: 0 event: 0, timestamp: 1; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Checking if the session must be closed and re-opened for getting correct conditions; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Re-opening the session with connection string frontier://FrontierProd/CMS_CONDITIONS and new transaction Id 2100000232; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Reconnecting to "frontier://FrontierProd/CMS_CONDITIONS" for getting new payload for record "DTMtimeRcd" and label "" from IOV tag "DT_vDrift_cosmic2009_V01_hlt" to be consumed by run: 2100000232 lumi: 0 event: 0, timestamp: 1; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Validity coming from IOV sequence for record "DTMtimeRcd" and label "": (317686, 18446744073709551615) for time (type: runnumber) 2100000232; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Setting validity for record "DTMtimeRcd" and corresponding label(s): starting at run: 317686 lumi: 0 event: 0, timestamp: 0, ending at run: 2100000232 lumi: 4294967295 event: 18446744073709551615, timestamp: 0, for run: 2100000232 lumi: 0 event: 0, timestamp: 1; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:44 CET  PostBeginProcessBlock
Getting data for record "DTTtrigRcd" to be consumed on Run: 2100000232 - Lumiblock:0 - Timestamp: 1; from CondDBESSource::setIntervalFor
%MSG

this continues for several seconds. Then, at the end of condition loading:

%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:53 CET  PostBeginProcessBlock
Validity coming from IOV sequence for record "ESGainRcd" and label "": (348714, 18446744073709551615) for tim
e (type: runnumber) 2100000232; from CondDBESSource::setIntervalFor
%MSG
%MSG-i CondDBESSource:  AfterSource 11-Mar-2022 18:44:53 CET  PostBeginProcessBlock
Setting validity for record "ESGainRcd" and corresponding label(s): starting at run: 348714 lumi: 0 event: 0,
 timestamp: 0, ending at run: 2100000232 lumi: 4294967295 event: 18446744073709551615, timestamp: 0, for run: 2100000232 lumi: 0 event: 0, timestamp: 1; from CondDBESSource::setIntervalFor
%MSG
++++ starting: global begin run 2100000232 : time = 1

I.e. it turns out that that there are no logged steps around CondDBESSource actions.

A question: is it possible to increase timestamp resolution with the MessageLogger? I woud like to have milliseconds for measuring which step takes most time.
In case it's not doable by configuration, I'll add high-res timestamp to all reevant logs.

Here is the log file for the process above:
/afs/cern.ch/user/s/smorovic/public/old_hlt_run2100000232_pid364333.log

@Dr15Jones
Copy link
Contributor

@smorovic thanks for the log. I agree that it looks like no concurrency is happening within the data product requests.

Can you point me to your code so I can double check that I didn't miss something?

@Dr15Jones
Copy link
Contributor

So I have a unit test designed to check that the new Proxy type does do concurrency:

https://github.com/cms-sw/cmssw/blob/master/FWCore/Integration/test/testConcurrentIOVsESConcurrentSource_cfg.py

I turned on the Tracer for that job and I do see signs of concurrent running.

++++++++ starting: processing esmodule: label = 'testESSource' type = TestESConcurrentSource in record = ESTestRecordI
Start TestESConcurrentSource::busyWait getImpl other
++++++++++ starting: processing esmodule: label = 'testESSource' type = TestESConcurrentSource in record = ESTestRecordI
Start TestESConcurrentSource::busyWait getImpl 
++++++++ finished: prefetching before processing event for module: stream = 3 label = 'busy1' id = 3
++++++++ starting: processing event for module: stream = 3 label = 'busy1' id = 3
Stop TestESConcurrentSource::busyWait getImpl  3.14159e-07
TestESConcurrentSourceTestProxy::getImpl '' startIOV = 1 endIOV = 3 IOV index = 0 cache identifier = 2
++++++++++ finished: processing esmodule: label = 'testESSource' type = TestESConcurrentSource in record = ESTestRecordI

So now we need to figure out what is different for the PoolDBESSource implementation.

@smorovic
Copy link
Contributor Author

Here is project area code I use:
/afs/cern.ch/user/s/smorovic/public/cond/src/
And I only changed this file as instructed:
/afs/cern.ch/user/s/smorovic/public/cond/src/CondCore/ESSources/interface/DataProxy.h

However I realize that potentially I did not recompile everything that could pull in this header indirectly (just FWCore/Framework and CondCore/ESSources ). I will try to do a full recompile and then try again.

@smorovic
Copy link
Contributor Author

To compile with all dependencies I had to also remove mutex and queue access from the code:
smorovic@813afb8

According to logs, I still don't see concurrency in condition loading, so presumably this is due to CondDBESSource serializing it (I am using the unmodified version). Next I will try with Giacomo's changes.
@govi please let me know where to find your version.

@ggovi
Copy link
Contributor

ggovi commented Apr 8, 2022

Hello, I've tried to implement a multi-thread pre-loading of condition, to be executed at run start.
Here is the prototype:
https://github.com/ggovi/cmssw/tree/condcore-conddb-essources-concurrency-1-123X

It needs to be discussed, as in the present implementation it does not provide any advantage.

@tvami
Copy link
Contributor

tvami commented Apr 8, 2022

Hi @ggovi @smorovic do you think opening a draft PR would be helpful for the discussion?

@Dr15Jones
Copy link
Contributor

@ggovi I had a chance to look at the code. On a first pass through I noticed a few things.

  1. Why was CondDBESSource.m_proxies data structure changed? From my limited understanding of the code, the change doesn't appear to have anything to do with concurrency. If so, it would be easier to just change one thing at a time. If it does have to do with concurrency, could you please let me know its purpose?
  2. Should CondDBESSource.m_lastRecordRuns be updated in preloadForRun? That should keep the setIntervalFor from calling reload() again for the proxy.

@makortel
Copy link
Contributor

@cms-sw/db-l2 Do you have plans to continue pursuing the work of #36487 (comment)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants