
Higher simulation time in NEURON 9 compared to NEURON 8.2.4 #2787

Closed
SteMasoli opened this issue Mar 19, 2024 · 23 comments · Fixed by #2966

@SteMasoli

Context

Testing NEURON 9 on a recent CPU shows a longer simulation time than NEURON 8.2.4.

Overview of the issue

Brand-new installation of Ubuntu 23.10 on a 16-core Zen 4 AMD processor.
With the same model, code, and number of cores, NEURON 9 completes the simulation in 90 s whereas NEURON 8.2.4 takes 37 s.

Expected result/behavior

Similar or better performance when changing from NEURON 8 to NEURON 9.

NEURON setup

  • Version: VERSION 9.0a-176-g17be061e0 master (17be061) 2024-03-13 and NEURON 8.2.4

  • Installation method: CMake from source in both cases with:
    -DCMAKE_BUILD_TYPE=RelWithDebInfo
    -DNRN_ENABLE_INTERVIEWS=ON
    -DNRN_ENABLE_MPI=ON
    -DNRN_ENABLE_PYTHON=ON
    -DNRN_ENABLE_RX3D:BOOL=ON
    -DCORENRN_ENABLE_NMODL=OFF
    -DNRN_ENABLE_CORENEURON=OFF

  • OS + Version: Ubuntu 23.10 latest upgrades

  • Compiler + Version: gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

Minimal working example - MWE

https://modeldb.science/266806?tab=1
Use the first model, "Morphology_1".
Edit protocols/01_SS.py to use p.change_nthread(12,1) and h.tstop = 5000.
nrnivmodl mod_files
time nrniv -python protocols/01_SS.py


@SteMasoli SteMasoli added the bug label Mar 19, 2024
@pramodk pramodk added this to the Release v9.0 milestone Mar 19, 2024
@nrnhines nrnhines self-assigned this Mar 19, 2024
@nrnhines
Member

Thanks for the notice. I'm seeing pretty much the same thing; it will need to be looked into in greater detail and resolved before the v9 release. On my machine (also 16 cores), but with tstop = 1000:

nthread      v8.2.4        master
  1             46s         87s
  2             21          47
  4             17          29
  8             12          24
 16             10          21

@mgeplf
Collaborator

mgeplf commented Mar 27, 2024

If I did it right, a bisect seems to point to 6c529a4:

2c3a21d89: 49.67user 2.12system 0:11.58elapsed 447%CPU (0avgtext+0avgdata 67344maxresident)k
a3bbaab7c: 50.71user 2.00system 0:12.08elapsed 436%CPU (0avgtext+0avgdata 93484maxresident)k
936c7d3ae: 85.93user 1.98system 0:20.43elapsed 430%CPU (0avgtext+0avgdata 80288maxresident)k
522c86681: 46.77user 1.94system 0:10.98elapsed 443%CPU (0avgtext+0avgdata 93448maxresident)k
67a672a2c: 46.29user 2.03system 0:11.16elapsed 432%CPU (0avgtext+0avgdata 93764maxresident)k
baf3b145a: 85.80user 1.98system 0:20.13elapsed 435%CPU (0avgtext+0avgdata 80360maxresident)k
fb1712738: 85.58user 2.00system 0:20.18elapsed 434%CPU (0avgtext+0avgdata 80544maxresident)k
dfdee8442: 46.50user 2.15system 0:11.24elapsed 432%CPU (0avgtext+0avgdata 93196maxresident)k
d2d811040: 46.34user 2.15system 0:11.41elapsed 424%CPU (0avgtext+0avgdata 93560maxresident)k
6c529a415: 86.48user 2.06system 0:20.35elapsed 434%CPU (0avgtext+0avgdata 80244maxresident)k
b4ade5509: 46.42user 2.06system 0:11.39elapsed 425%CPU (0avgtext+0avgdata 93228maxresident)k

The good ones are those with roughly 50 s of user time or less; the ones well above that I marked bad.

@ramcdougal
Member

Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

@nrnhines
Member

nrnhines commented Mar 27, 2024

Edited after fixing the Caliper region for state-update in version 8.2

Thanks! That is a great hint. Meanwhile, I tried Caliper to see where the performance differs most. Here is what seems to me to be the relevant selection:

master
fadvance                0.242032 78.523456   0.304608  98.825198 
  timestep              0.476875 78.281423   0.600168  98.520590 
    deliver-events      0.210294  0.285367   0.264664   0.359147 
      check-threshold   0.075073  0.075073   0.094483   0.094483 
    setup-tree-matrix   2.034998  5.770918   2.561134   7.262953 
...
    matrix-solver       0.126477  0.126477   0.159177   0.159177 
    second-order-cur    0.069821  0.069821   0.087873   0.087873 
    update              0.083985  0.083985   0.105698   0.105698 
    state-update        0.917186 71.467980   1.154318  89.945574 
      state-cdp5StCmod 49.011724 49.011724  61.683394  61.683394 
      state-Cav2_3      0.225994  0.225994   0.284424   0.284424 
      state-Cav3_1      0.353412  0.353412   0.444785   0.444785 
      state-GRC_CA      0.997766  0.997766   1.255732   1.255732 
      state-GRC_KM      0.106439  0.106439   0.133958   0.133958 
      state-HCN1        0.080375  0.080375   0.101155   0.101155 
      state-HCN2        0.078690  0.078690   0.099035   0.099035 
      state-Kca1_1      4.351285  4.351285   5.476282   5.476282 
      state-Kca2_2      1.352222  1.352222   1.701831   1.701831 
      state-Kca3_1      0.074793  0.074793   0.094131   0.094131 
      state-Kv1_1       0.074700  0.074700   0.094013   0.094013 
      state-Kv3_4       0.242558  0.242558   0.305269   0.305269 
      state-Kv4_3       0.089487  0.089487   0.112623   0.112623 
      state-Leak        0.070016  0.070016   0.088118   0.088118 
      state-Nav1_6     13.441334 13.441334  16.916505  16.916505 

real	1m21.576s
8.2
fadvance                0.132501 50.456262   0.259746  98.911012 
  timestep              0.470033 50.323761   0.921421  98.651267 
    deliver-events      0.208092  0.278673   0.407930   0.546292 
      check-threshold   0.070581  0.070581   0.138362   0.138362 
    setup-tree-matrix   2.143723  5.734394   4.202409  11.241314 
...
    matrix-solver       0.169878  0.169878   0.333018   0.333018 
    second-order-cur    0.070763  0.070763   0.138719   0.138719 
    update              0.109914  0.109914   0.215469   0.215469 
    state-update        0.906806 43.490105   1.777640  85.255034 
      state-cdp5StCmod 24.232049 24.232049  47.502855  47.502855 
      state-Cav2_3      0.229292  0.229292   0.449488   0.449488 
      state-Cav3_1      0.348932  0.348932   0.684023   0.684023 
      state-GRC_CA      0.518979  0.518979   1.017370   1.017370 
      state-GRC_KM      0.093657  0.093657   0.183599   0.183599 
      state-HCN1        0.078251  0.078251   0.153397   0.153397 
      state-HCN2        0.077570  0.077570   0.152064   0.152064 
      state-Kca1_1      4.129362  4.129362   8.094920   8.094920 
      state-Kca2_2      1.515642  1.515642   2.971161   2.971161 
      state-Kca3_1      0.072181  0.072181   0.141499   0.141499 
      state-Kv1_1       0.072002  0.072002   0.141148   0.141148 
      state-Kv3_4       0.235797  0.235797   0.462241   0.462241 
      state-Kv4_3       0.078983  0.078983   0.154832   0.154832 
      state-Leak        0.068475  0.068475   0.134233   0.134233 
      state-Nav1_6     10.832127 10.832127  21.234563  21.234563 

real	0m52.622s

I interpret, for example, fadvance 0.242032 78.523456 ... to mean that the exclusive time within fadvance was 0.24 s while the inclusive time between entry and exit was 78.5 s. The last two numbers on each line are the corresponding percentages of total profiled time.
From this perspective, the largest master lines

    state-update        0.917186 71.467980   1.154318  89.945574 
      state-cdp5StCmod 49.011724 49.011724  61.683394  61.683394 
      state-Kca1_1      4.351285  4.351285   5.476282   5.476282
      state-Nav1_6     13.441334 13.441334  16.916505  16.916505

suggest for example that state integration is dominated by cdp5StCmod.
The equivalent lines for 8.2 are

    state-update        0.906806 43.490105   1.777640  85.255034 
      state-cdp5StCmod 24.232049 24.232049  47.502855  47.502855 
      state-Kca1_1      4.129362  4.129362   8.094920   8.094920
      state-Nav1_6     10.832127 10.832127  21.234563  21.234563

I conclude that the largest part of the master performance problem is with the KINETIC block of cdp5StCmod and I'll continue by focusing my attention there.

@nrnhines
Member

Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

I hadn't realized that yet when I wrote my comment above. Since that PR introduced SoA and data handles, it may be that many of the intermediate commits will not result in a working system.

@nrnhines
Member

Now might be a good time to mention that I'm using a slightly modified version of [email protected]:ModelDBRepository/266806
for 8.2, master, and CoreNEURON. Those mods are:

~/models/modeldb/266806/Morphology_1$ git diff --staged
diff --git a/Morphology_1/mod_files/Hcn1.mod b/Morphology_1/mod_files/Hcn1.mod
index 1df9e10..f985669 100755
--- a/Morphology_1/mod_files/Hcn1.mod
+++ b/Morphology_1/mod_files/Hcn1.mod
@@ -15,7 +15,7 @@ NEURON {
        NONSPECIFIC_CURRENT ih
        RANGE Q10_diff,Q10_channel,gbar_Q10, ic
        RANGE o_fast_inf, o_slow_inf, tau_f, tau_s, Erev
-       RANGE gbar,r,g, o
+       RANGE gbar,g, o
 }       
         
 UNITS {
diff --git a/Morphology_1/mod_files/cdp5StCmod.mod b/Morphology_1/mod_files/cdp5StCmod.mod
index bbdd811..9be5b05 100755
--- a/Morphology_1/mod_files/cdp5StCmod.mod
+++ b/Morphology_1/mod_files/cdp5StCmod.mod
@@ -123,9 +123,10 @@ ASSIGNED {
        mgi     (mM)
        vrat    (1)     
         icazz (nA)
+        cao (mM)
 }
 
-CONSTANT { cao = 2     (mM) }
+:CONSTANT { cao = 2    (mM) }
 
 STATE {
        : ca[0] is equivalent to cai
@@ -246,10 +247,11 @@ PROCEDURE factors() {
 }
 
 
-LOCAL dsq, dsqvol  : can't define local variable in KINETIC block
+: LOCAL dsq, dsqvol  : can't define local variable in KINETIC block
                    :   or use in COMPARTMENT statement
 
 KINETIC state {
+  LOCAL dsq, dsqvol
   COMPARTMENT diam*diam*vrat {ca mg Buff1 Buff1_ca Buff2 Buff2_ca BTC BTC_ca DMNPE DMNPE_ca PV PV_ca PV_mg}
   COMPARTMENT (1e10)*parea {pump pumpca}

and the temporary changes (there is currently an array bug in NMODL)

~/models/modeldb/266806/Morphology_1$ git diff
diff --git a/Morphology_1/mod_files/Cav2_3.mod b/Morphology_1/mod_files/Cav2_3.mod
index 0d10ca1..db53b82 100755
--- a/Morphology_1/mod_files/Cav2_3.mod
+++ b/Morphology_1/mod_files/Cav2_3.mod
@@ -10,7 +10,7 @@ NEURON {
          THREADSAFE
          USEION ca READ eca WRITE ica
          RANGE gcabar, m, h, g, gmax, ica
-         RANGE inf, tau
+         RANGE inf0, inf1, tau0, tau1
 }
 
 UNITS {
@@ -29,8 +29,10 @@ STATE {      m h }            : unknown activation and inactivation parameters to be
 
 ASSIGNED {               : parameters needed to solve DE
          ica    (mA/cm2)
-    inf[2]
-         tau[2] (ms)
+    inf0
+         tau0 (ms)
+    inf1
+         tau1 (ms)
     g      (mho/cm2)
     gmax   (mho/cm2)
 }
@@ -46,8 +48,8 @@ BREAKPOINT {
 
 INITIAL {
     mhn(v)
-    m = inf[0]
-    h = inf[1]
+    m = inf0
+    h = inf1
     g = gcabar*m*m*m*h
     ica = g*(v - eca) : initial Ca++ current value
     gmax = g
@@ -55,8 +57,8 @@ INITIAL {
 
 DERIVATIVE states {
          mhn(v)
-         m' =  (inf[0] - m)/tau[0]
-         h' =  (inf[1] - h)/tau[1]
+         m' =  (inf0 - m)/tau0
+         h' =  (inf1 - h)/tau1
 }      
 
 FUNCTION varss(v (mV), i) {
@@ -79,11 +81,11 @@ FUNCTION vartau(v (mV), i) (ms) {
 }      
 
 PROCEDURE mhn(v (mV)) {LOCAL a, b :rest = -70
-    TABLE inf, tau DEPEND celsius FROM -100 TO 100 WITH 200
-       FROM i=0 TO 1 {
-             tau[i] = vartau(v,i)
-                   inf[i] = varss(v,i)
-         }
+:    TABLE inf, tau DEPEND celsius FROM -100 TO 100 WITH 200
+             tau0 = vartau(v,0)
+                   inf0 = varss(v,0)
+             tau1 = vartau(v,1)
+                   inf1 = varss(v,1)
 }
 
 
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index 4e5cc4b..ff0e622 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -21,9 +21,10 @@ Fixed_step.active(0)
 cpu = multiprocessing.cpu_count()
 h.load_file("parcom.hoc")
 p = h.ParallelComputeTool()
-p.change_nthread(cpu,1)
-p.multisplit(1)
-print(cpu)
+if h.mycpu > 1.0:
+    p.change_nthread(h.mycpu,1)
+    p.multisplit(1)
+print(h.mycpu)
 
 #Voltage graph
 h('load_file("vm.ses")')
@@ -37,13 +38,23 @@ h.celsius = 32
 h.tstop = 1000
 h.v_init = -65
 
+pc = h.ParallelContext()
+pc.set_gid2node(0, 0)
+pc.cell(0, cell.nc_spike)
+pc.set_maxstep(10)
+from neuron import coreneuron
+coreneuron.enable = 0
+
 #Initialization 
 def initialize():
     h.finitialize()
     h.run()
+#    h.stdinit()
+#    pc.psolve(h.tstop)
 
 initialize()
 
+'''
 #Save the results into an image
 fig, ax = plt.subplots()
 ax.plot(np.array(cell.time_vector), np.array(cell.vm), 'b', label='spikes')
@@ -59,4 +70,4 @@ plt.ylabel("membrane voltage (mv) ")
 
 plt.savefig('01_SS_trace.eps')
 plt.close()
-
+'''

@mgeplf
Collaborator

mgeplf commented Mar 28, 2024

@ramcdougal > Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

I gave it a quick shot, but many of the commits don't even build; one of the only ones I got to build was 1bb5bf1, and it was slow, so that at least partly brackets which of the many commits it was. However, out of the 19 other commits I randomly tried, none of them built without error.

@nrnhines > Those mods are:
I did something similar; I haven't compared the output between the fast commit and the regression, though.

@mgeplf
Collaborator

mgeplf commented Mar 28, 2024

A couple of other short notes:

perf stat for:
b4ade5509: (fast commit)

     Performance counter stats for 'python 01_SS.py':

             48,928.26 msec task-clock                #    4.263 CPUs utilized
               631,731      context-switches          #    0.013 M/sec
                    29      cpu-migrations            #    0.001 K/sec
                28,020      page-faults               #    0.573 K/sec
       234,176,288,198      cycles                    #    4.786 GHz                      (31.06%)
       555,016,050,075      instructions              #    2.37  insn per cycle           (38.92%)
        70,189,788,913      branches                  # 1434.545 M/sec                    (38.80%)
            99,889,950      branch-misses             #    0.14% of all branches          (38.56%)
       224,807,810,920      L1-dcache-loads           # 4594.642 M/sec                    (38.56%)
         1,585,023,985      L1-dcache-load-misses     #    0.71% of all L1-dcache hits    (38.43%)
           329,502,640      LLC-loads                 #    6.734 M/sec                    (30.57%)
             1,118,627      LLC-load-misses           #    0.34% of all LL-cache hits     (30.47%)
       <not supported>      L1-icache-loads
           625,499,048      L1-icache-load-misses                                         (30.37%)
       225,842,474,219      dTLB-loads                # 4615.788 M/sec                    (30.51%)
               784,591      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.62%)
            45,054,175      iTLB-loads                #    0.921 M/sec                    (30.96%)
            28,229,053      iTLB-load-misses          #   62.66% of all iTLB cache hits   (31.07%)
       <not supported>      L1-dcache-prefetches
       <not supported>      L1-dcache-prefetch-misses

          11.476448534 seconds time elapsed

          46.780273000 seconds user
           3.108108000 seconds sys

6c529a415: Regression:

     Performance counter stats for 'python 01_SS.py':

             88,420.11 msec task-clock                #    4.289 CPUs utilized
               700,525      context-switches          #    0.008 M/sec
                    38      cpu-migrations            #    0.000 K/sec
                23,665      page-faults               #    0.268 K/sec
       424,431,007,847      cycles                    #    4.800 GHz                      (30.92%)
     1,157,209,550,388      instructions              #    2.73  insn per cycle           (38.58%)
       168,992,981,295      branches                  # 1911.251 M/sec                    (38.30%)
           112,577,199      branch-misses             #    0.07% of all branches          (38.41%)
       474,884,904,576      L1-dcache-loads           # 5370.779 M/sec                    (38.33%)
         2,880,849,265      L1-dcache-load-misses     #    0.61% of all L1-dcache hits    (38.44%)
           464,556,719      LLC-loads                 #    5.254 M/sec                    (30.96%)
               943,476      LLC-load-misses           #    0.20% of all LL-cache hits     (30.84%)
       <not supported>      L1-icache-loads
           884,842,762      L1-icache-load-misses                                         (30.84%)
       476,144,529,914      dTLB-loads                # 5385.025 M/sec                    (30.63%)
             1,151,626      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.74%)
            62,166,895      iTLB-loads                #    0.703 M/sec                    (30.76%)
            31,708,346      iTLB-load-misses          #   51.01% of all iTLB cache hits   (30.83%)
       <not supported>      L1-dcache-prefetches
       <not supported>      L1-dcache-prefetch-misses

          20.613380152 seconds time elapsed

          85.971388000 seconds user
           3.464232000 seconds sys

It looks like many more instructions are being executed rather than more cache or branch misses: roughly 1.16e12 instructions vs 5.55e11 (about 2.1x), while IPC even improved slightly, so the roughly 1.8x increase in cycles and elapsed time tracks the extra instructions.

perf record / perf report:

b4ade5509: (fast commit)

Overhead  Command  Shared Object                                     Symbol
  23.57%  python   libnrnmech.so                                     [.] _nrn_state__cdp5StCmod
  17.57%  python   libnrnmech.so                                     [.] state__cdp5StCmod
   7.86%  python   libnrnmech.so                                     [.] _nrn_state__Nav1_6
   6.57%  python   libnrnmech.so                                     [.] neuron::scopmath::_nrn_thread_getelm
   6.31%  python   libnrnmech.so                                     [.] activation__Nav1_6
   5.81%  python   libnrnmech.so                                     [.] neuron::scopmath::sparse_thread<double*, int (*)(void*, double*, double*, Datum*, Datum*, NrnThread*), int*, double*&, Datum*&, Datum*&, NrnThread*&>
   5.43%  python   libm-2.31.so                                      [.] __ieee754_pow_fma
   3.41%  python   libm-2.31.so                                      [.] __ieee754_exp_fma
   3.14%  python   libnrnmech.so                                     [.] activation__Kca1_1

6c529a415: Regression:

Overhead  Command  Shared Object                                     Symbol
  18.32%  python   libnrnmech.so                                     [.] state__cdp5StCmod
  15.51%  python   libnrnmech.so                                     [.] _nrn_state__cdp5StCmod
  14.35%  python   libnrnmech.so                                     [.] neuron::scopmath::_nrn_thread_getelm
  11.44%  python   libnrnmech.so                                     [.] _ZNK6neuron9container19generic_data_handlecvNS0_11data_handleIT_EEIdEEv
   4.42%  python   libnrnmech.so                                     [.] _nrn_state__Nav1_6
   3.43%  python   libnrnmech.so                                     [.] activation__Nav1_6
   3.20%  python   libnrnmech.so                                     [.] neuron::container::data_handle<double>::get_ptr_helper<neuron::container::data_handle<double> >
   2.86%  python   libm-2.31.so                                      [.] __ieee754_pow_fma
   2.21%  python   libnrnmech.so                                     [.] _nrn_state__Kca1_1

Since it's a statistical profiler, we'll miss things, but it seems that _ZNK6neuron9container19generic_data_handlecvNS0_11data_handleIT_EEIdEEv (the generic_data_handle conversion to data_handle<double>) might be something to look into.

@nrnhines
Member

@mgeplf I like that tool (perf?) you are using, but I'm not familiar with it. (I was going to use gprof to dig further into the mod-file functions but was wondering if that itself would change the profile considerably.) Please send a link for me to get started. The fragments above that report % seem more indirect than actual time. Does one just multiply % by total time?

I see that _nrn_thread_getelm went from 7% to 14%, but that seems too small to account for half the time. I'm likely misinterpreting, though.

@mgeplf
Collaborator

mgeplf commented Mar 28, 2024

@nrnhines > @mgeplf I like that tool (perf?)

It's this one: https://en.wikipedia.org/wiki/Perf_(Linux)

It's like a console version of VTune (or whatever it's called now), but it is nowhere near as slick. Whenever I turn to it, I use this to remember some useful examples:
https://www.brendangregg.com/perf.html

Unfortunately, it seems the DWARF debug info isn't fully propagated into libnrnmech.so, so the (very useful) --call-graph dwarf can't be used with perf record; I haven't spent the time to figure out how to get it to show up.

Does one just multiply % by total time?

That may be a loose heuristic, but the nature of sampling profiling means that it won't be accurate.
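As a rough back-of-the-envelope application of that heuristic to the numbers above (mixing the perf stat task-clock with the perf record percentages, and ignoring sampling error):

    _nrn_thread_getelm:              6.57% of ~49 s CPU ≈ 3.2 s (fast)   vs   14.35% of ~88 s CPU ≈ 12.7 s (regression)
    generic_data_handle conversion:  not visible (fast)                  vs   11.44% of ~88 s CPU ≈ 10.1 s (regression)

The extra CPU time is roughly 88 - 49 ≈ 39 s, so those two symbols alone would account for about half of the regression.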

I'd probably try valgrind's callgrind if I wanted accurate timings, which would probably illuminate things better. I mainly used perf to check that the hardware counters weren't showing red flags like cache misses or branch misses.

I see that _nrn_thread_getelm went from 7% to 14%

That seems suspicious, too.

@nrnhines
Member

Here is an interesting observation. The gap between 8.2 and master narrows considerably (running under perf):

57.909s master
47.478s 8.2

merely by introducing an ASSIGNED variable d, setting d = diam in the INITIAL block, and replacing all uses of diam with d in the KINETIC state block of cdp5StCmod.mod.

This change was suggested by a comparison of the 8.2 and master perf results

8.2
# Samples: 195K of event 'cycles:Pu'
# Event count (approx.): 178697138584
#
# Overhead       Samples  Command  Shared Object                                      Symbol                                          >
# ........  ............  .......  .................................................  ................................................>
#
    26.40%         49954  nrniv    libnrniv.so                                        [.] matsol
    17.62%         33393  nrniv    libnrniv.so                                        [.] _nrn_thread_getelm
    14.13%         26805  nrniv    libnrnmech.so                                      [.] state__cdp5StCmod
    11.72%         22198  nrniv    libnrniv.so                                        [.] sparse_thread
     5.55%         10506  nrniv    libnrnmech.so                                      [.] _nrn_thread_getelm@plt
     3.62%          6861  nrniv    libnrnmech.so                                      [.] activation__Nav1_6
     3.07%          5820  nrniv    libm.so.6                                          [.] __ieee754_pow_fma
     2.83%          5357  nrniv    libm.so.6                                          [.] __ieee754_exp_fma
     1.90%          3596  nrniv    libm.so.6                                          [.] exp@@GLIBC_2.29
     1.77%          3368  nrniv    libnrnmech.so                                      [.] activation__Kca1_1
     1.07%          2022  nrniv    libnrnmech.so                                      [.] rates__Nav1_6.constprop.0.isra.0
     1.01%          1915  nrniv    libnrniv.so                                        [.] hoc_Exp
     0.81%          1539  nrniv    libnrnmech.so                                      [.] kin__Kca2_2
     0.62%          1180  nrniv    libm.so.6                                          [.] pow@@GLIBC_2.29

with the master perf result using diam:

master with diam
# Samples: 310K of event 'cycles:Pu'
# Event count (approx.): 305901402865
#
# Overhead       Samples  Command  Shared Object                                      Symbol                                                                                                  >
# ........  ............  .......  .................................................  ........................................................................................................>
#
    31.96%         97514  nrniv    libnrniv.so                                        [.] double* neuron::container::generic_data_handle::get<double*>() const
    15.67%         47797  nrniv    libnrnmech.so                                      [.] neuron::scopmath::detail::sparse_thread::matsol(neuron::scopmath::SparseObj*)
    12.22%         37287  nrniv    libnrnmech.so                                      [.] state__cdp5StCmod(void*, double*, neuron::cache::MechanismRange<65ul, 8ul>*, unsigned long, neuron::>
     9.18%         27980  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)
     4.95%         15139  nrniv    libnrnmech.so                                      [.] _nrn_state__cdp5StCmod(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     3.39%         10338  nrniv    libnrnmech.so                                      [.] activation__Nav1_6(void*, double*, neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron:>
     3.03%          9227  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)@plt
     2.19%          6667  nrniv    libm.so.6                                          [.] __ieee754_pow_fma
     1.66%          5073  nrniv    libm.so.6                                          [.] __ieee754_exp_fma
     1.54%          4699  nrniv    libnrnmech.so                                      [.] _nrn_state__Nav1_6(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     1.32%          4038  nrniv    libnrnmech.so                                      [.] activation__Kca1_1(void*, double*, neuron::cache::MechanismRange<53ul, 5ul>*, unsigned long, neuron:>
     1.15%          3502  nrniv    libnrnmech.so                                      [.] rates__Nav1_6(neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::container::generic_d>
     1.11%          3393  nrniv    libnrnmech.so                                      [.] _nrn_state__Kca1_1(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     0.99%          3026  nrniv    libm.so.6                                          [.] exp@@GLIBC_2.29
     0.66%          2025  nrniv    libnrnmech.so                                      [.] double* neuron::container::generic_data_handle::get<double*>() const@plt
     0.57%          1724  nrniv    libnrnmech.so                                      [.] states__GRC_CA(neuron::cache::MechanismRange<30ul, 3ul>*, unsigned long, neuron::container::generic_>
     0.56%          1711  nrniv    libnrniv.so                                        [.] hoc_Exp(double)

Note that get<double*>() is now the top entry.
Avoiding that with d = diam in the INITIAL block, the perf results become:

master with d=diam
# Samples: 215K of event 'cycles:Pu'
# Event count (approx.): 209229806356
#
# Overhead       Samples  Command  Shared Object                                      Symbol                                                                                                         >
# ........  ............  .......  .................................................  ...............................................................................................................>
#
    23.10%         48383  nrniv    libnrnmech.so                                      [.] neuron::scopmath::detail::sparse_thread::matsol(neuron::scopmath::SparseObj*)
    18.91%         39595  nrniv    libnrnmech.so                                      [.] state__cdp5StCmod(void*, double*, neuron::cache::MechanismRange<66ul, 8ul>*, unsigned long, neuron::contain>
    13.94%         29208  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)
     6.72%         14087  nrniv    libnrnmech.so                                      [.] _nrn_state__cdp5StCmod(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     5.02%         10527  nrniv    libnrnmech.so                                      [.] activation__Nav1_6(void*, double*, neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::contai>
     4.52%          9464  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)@plt
     3.31%          6940  nrniv    libm.so.6                                          [.] __ieee754_pow_fma
     2.43%          5103  nrniv    libm.so.6                                          [.] __ieee754_exp_fma
     2.32%          4873  nrniv    libnrnmech.so                                      [.] _nrn_state__Nav1_6(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     1.91%          4000  nrniv    libnrnmech.so                                      [.] activation__Kca1_1(void*, double*, neuron::cache::MechanismRange<53ul, 5ul>*, unsigned long, neuron::contai>
     1.71%          3579  nrniv    libnrnmech.so                                      [.] rates__Nav1_6(neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::container::generic_data_han>
     1.45%          3038  nrniv    libm.so.6                                          [.] exp@@GLIBC_2.29
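Schematically, and only as an illustration with hypothetical names (not NEURON's actual generated code), the d = diam change moves a handle-mediated read out of the per-timestep KINETIC evaluation:

// Hypothetical stand-in for neuron::container::generic_data_handle; the real
// class does more work to resolve the current location of the value.
struct generic_handle {
    double* ptr;
    double* resolve() const { return ptr; }
};

// Before: diam is fetched through the handle every time the KINETIC rate terms
// are evaluated, once per instance per time step (the get<double*>() hot spot).
double volume_term_before(const generic_handle& diam_handle, double vrat) {
    const double diam = *diam_handle.resolve();
    return diam * diam * vrat;  // the diam*diam*vrat COMPARTMENT factor
}

// After: d was copied from diam once in INITIAL, so the hot loop performs an
// ordinary load of a plain double.
double volume_term_after(double d, double vrat) {
    return d * d * vrat;
}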

@nrnhines
Member

Just for my reference.
Install perf

sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`

Avoid the error

$ perf record nrniv -c 'quit()'
Error:
Access to performance monitoring and observability operations is limited.
...

with

sudo sysctl kernel.perf_event_paranoid=2

I find the sample counts helpful, so am using

perf report --stdio -n

It would be helpful to collect useful idioms in https://nrn.readthedocs.io/en/latest/install/debug.html#profiling-and-performance-benchmarking

@mgeplf
Collaborator

mgeplf commented Apr 17, 2024

I ran this under valgrind --tool=callgrind and kcachegrind seems to point to neuron::container::generic_data_handle:
[kcachegrind screenshot: 2787-callgrind]

@nrnhines
Member

I forked [email protected]:ModelDBRepository/266806 and my changes are in [email protected]:nrnhines/266806 in branch hines/v9

@nrnhines
Member

Another performance data point... I installed Intel oneAPI in hopes of using VTune (I have not yet figured out how to focus on the data-handle performance), but here are the results with CC=icx CXX=icpx for master and 8.2. For the model, I'm using the hines/v9 branch mentioned above.

      icx compiler
nthread  master  v8.2
1     32.9031    38.4446
2     19.5026    22.4375
4     13.5151    14.2496
8     11.7145    8.35472
16    11.6525    6.88371

@1uc
Collaborator

1uc commented Jun 21, 2024

To check if the recent improvements are sufficient (now all merged), I used this version of ModelDB:
nrnhines/266806@c57abdb

When compiled with:

cmake -DCMAKE_INSTALL_PREFIX=build/install -DNRN_ENABLE_TESTS=ON -DNRN_ENABLE_PYTHON=ON -DNRN_ENABLE_DOCS=OFF -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_PROFILING=OFF -DNRN_ENABLE_MPI_DYNAMIC=OFF -DCORENRN_ENABLE_OPENMP=OFF -DCMAKE_BUILD_TYPE=Release -B build .

I get the following runtimes for 9.0:

ncores:   [  2  ,  4  ,  8   ]
time [s]: [ 28.8, 19.1, 15.6 ]
          [ 30.2, 20.9, 16.0 ]
          [       20.4, 15.8 ]
          [           , 15.9 ]

and for 8.2.4:

ncores:   [  2  ,  4  ,  8   ]
time [s]: [ 28.9, 17.7, 15.8 ]
          [ 25.3, 16.9, 15.7 ]
          [ 26.4, 17.5, 15.9 ]
          [           , 16.0 ]

It seems most of the regression is fixed, but there's still a performance gap. We have a couple of options:

  1. Consider it good enough and close the issue.
  2. Keep the issue open, but only as a reminder that there is still a performance gap. Development efforts would move on, and we don't consider it blocking for 9.0.
  3. Keep working on the issue.

@nrnhines
Member

Thought I'd look at the present performance status on my Apple M1.

ncore  8.2.4-10-g4ad8f84d2  9.0a-269-g72034bb09
1    54.26    44.81
2    30.78    27.03
4    18.41    18.45
8    16.85    17.79

Note that the times above come from

hines@michaels-macbook-pro-2 Morphology_1 % git diff
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index d134bcc..4486a37 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -47,8 +47,11 @@ pc.set_maxstep(10)
 #coreneuron.file_mode = 0
 #Initialization 
 def initialize():
+    x = h.startsw()
     h.finitialize()
     h.run()
+    x = h.startsw() - x
+    print("nthread=%d  time=%g" % (pc.nthread(), x))
 #    h.stdinit()
 #    pc.psolve(h.tstop)

I launched with commands like

hines@michaels-macbook-pro-2 Morphology_1 % time nrniv -c 'mycpu=8' protocols/01_SS.py

but noticed a puzzling requirement nowadays for the master that I needed

hines@michaels-macbook-pro-2 Morphology_1 % export PYTHONPATH=$PYTHONPATH:`pwd`

in order for python to find Golgi2020_morpho_1.py from protocols/01_SS.py

I believe @pramodk has demonstrated (please correct me if I'm mistaken) on x86_64 that the performance differences disappear when the Intel compiler is used, and it would be nice to document that here.

I vote to close this issue, or perhaps have a discussion about it at the next developer meeting. The thread here is getting long and there remain many avenues for further exploration, e.g. good developer documentation on how to diagnose performance issues, multisplit, and KINETIC memory-access ordering.

@nrnhines
Member

It occurred to me that my last performance results

ncore  8.2.4-10-g4ad8f84d2  9.0a-269-g72034bb09
1    54.26    44.81
2    30.78    27.03
4    18.41    18.45
8    16.85    17.79

might be due to master suffering from cache-line sharing between threads. To look into this, I started an experimental branch, hines/thread-padding, to see whether it would be easy to take advantage of the permutability of the SoA storage to add some padding rows between threads. The first commit added a padding field (default 0) to NrnThread and Memb_list. The second commit added some debugging printfs to indicate the data padding that would be needed in the backing std::vector<double> for each range variable. Launching the model of this issue with 1 and 4 threads gives:

Morphology_1 % nrniv -c 'mycpu=1' protocols/01_SS.py
NEURON -- VERSION 9.0a-270-gac8d660b2+ hines/thread-padding (ac8d660b2+) 2024-06-24
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits

loading membrane mechanisms from arm64/.libs/libnrnmech.so
Additional mechanisms from files
 "mod_files/Cav12.mod" "mod_files/Cav13.mod" "mod_files/Cav2_3.mod" "mod_files/Cav3_1.mod" "mod_files/GOLGI_Ampa_mossy_det_vi.mod" "mod_files/GOLGI_Ampa_pf_aa_det_vi.mod" "mod_files/GRC_CA.mod" "mod_files/GRC_KM.mod" "mod_files/Hcn1.mod" "mod_files/Hcn2.mod" "mod_files/Kca11.mod" "mod_files/Kca22.mod" "mod_files/Kca31.mod" "mod_files/Kv11.mod" "mod_files/Kv34.mod" "mod_files/Kv43.mod" "mod_files/Leak.mod" "mod_files/Nav16.mod" "mod_files/PC_NMDA_NR2B.mod" "mod_files/cdp5StCmod.mod"
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=1
  ZZZ nodes id=0 end=0 padding=0

5389 lines read
1.0
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=1
  ZZZ nodes id=0 end=545 padding=7
  ZZZ capacitance id=0 count=317 padding=3
  ZZZ na_ion id=0 count=317 padding=3
  ZZZ k_ion id=0 count=317 padding=3
  ZZZ ca_ion id=0 count=317 padding=3
  ZZZ Cav2_3 id=0 count=123 padding=5
  ZZZ Cav3_1 id=0 count=124 padding=4
  ZZZ GRC_CA id=0 count=90 padding=6
  ZZZ GRC_KM id=0 count=5 padding=3
  ZZZ HCN1 id=0 count=5 padding=3
  ZZZ HCN2 id=0 count=5 padding=3
  ZZZ Kca1_1 id=0 count=213 padding=3
  ZZZ Kca2_2 id=0 count=207 padding=1
  ZZZ Kca3_1 id=0 count=1 padding=7
  ZZZ Kv1_1 id=0 count=1 padding=7
  ZZZ Kv3_4 id=0 count=105 padding=7
  ZZZ Kv4_3 id=0 count=1 padding=7
  ZZZ Leak id=0 count=317 padding=3
  ZZZ Nav1_6 id=0 count=317 padding=3
  ZZZ nrvc_ion id=0 count=317 padding=3
  ZZZ cdp5StCmod id=0 count=317 padding=3

and

Morphology_1 % nrniv -c 'mycpu=4' protocols/01_SS.py
NEURON -- VERSION 9.0a-270-gac8d660b2+ hines/thread-padding (ac8d660b2+) 2024-06-24
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2022
See http://neuron.yale.edu/neuron/credits

loading membrane mechanisms from arm64/.libs/libnrnmech.so
Additional mechanisms from files
 "mod_files/Cav12.mod" "mod_files/Cav13.mod" "mod_files/Cav2_3.mod" "mod_files/Cav3_1.mod" "mod_files/GOLGI_Ampa_mossy_det_vi.mod" "mod_files/GOLGI_Ampa_pf_aa_det_vi.mod" "mod_files/GRC_CA.mod" "mod_files/GRC_KM.mod" "mod_files/Hcn1.mod" "mod_files/Hcn2.mod" "mod_files/Kca11.mod" "mod_files/Kca22.mod" "mod_files/Kca31.mod" "mod_files/Kv11.mod" "mod_files/Kv34.mod" "mod_files/Kv43.mod" "mod_files/Leak.mod" "mod_files/Nav16.mod" "mod_files/PC_NMDA_NR2B.mod" "mod_files/cdp5StCmod.mod"
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=1
  ZZZ nodes id=0 end=0 padding=0

5389 lines read
ZZZ nrn_ensure_model_data_are_sorted needs to sort. nrn_nthread=4
  ZZZ nodes id=0 end=127 padding=1
  ZZZ nodes id=1 end=152 padding=0
  ZZZ nodes id=2 end=129 padding=7
  ZZZ nodes id=3 end=157 padding=3
  ZZZ capacitance id=0 count=72 padding=0
  ZZZ capacitance id=1 count=85 padding=3
  ZZZ capacitance id=2 count=70 padding=2
  ZZZ capacitance id=3 count=90 padding=6
  ZZZ na_ion id=0 count=72 padding=0
  ZZZ na_ion id=1 count=85 padding=3
  ZZZ na_ion id=2 count=70 padding=2
  ZZZ na_ion id=3 count=90 padding=6
  ZZZ k_ion id=0 count=72 padding=0
  ZZZ k_ion id=1 count=85 padding=3
  ZZZ k_ion id=2 count=70 padding=2
  ZZZ k_ion id=3 count=90 padding=6
  ZZZ ca_ion id=0 count=72 padding=0
  ZZZ ca_ion id=1 count=85 padding=3
  ZZZ ca_ion id=2 count=70 padding=2
  ZZZ ca_ion id=3 count=90 padding=6
  ZZZ Cav2_3 id=0 count=33 padding=7
  ZZZ Cav2_3 id=1 count=28 padding=4
  ZZZ Cav2_3 id=2 count=50 padding=6
  ZZZ Cav2_3 id=3 count=12 padding=4
  ZZZ Cav3_1 id=0 count=33 padding=7
  ZZZ Cav3_1 id=1 count=29 padding=3
  ZZZ Cav3_1 id=2 count=50 padding=6
  ZZZ Cav3_1 id=3 count=12 padding=4
  ZZZ GRC_CA id=0 count=39 padding=1
  ZZZ GRC_CA id=1 count=5 padding=3
  ZZZ GRC_CA id=2 count=20 padding=4
  ZZZ GRC_CA id=3 count=26 padding=6
  ZZZ GRC_KM id=3 count=5 padding=3
  ZZZ HCN1 id=3 count=5 padding=3
  ZZZ HCN2 id=3 count=5 padding=3
  ZZZ Kca1_1 id=0 count=72 padding=0
  ZZZ Kca1_1 id=1 count=33 padding=7
  ZZZ Kca1_1 id=2 count=70 padding=2
  ZZZ Kca1_1 id=3 count=38 padding=2
  ZZZ Kca2_2 id=0 count=72 padding=0
  ZZZ Kca2_2 id=1 count=32 padding=0
  ZZZ Kca2_2 id=2 count=70 padding=2
  ZZZ Kca2_2 id=3 count=33 padding=7
  ZZZ Kca3_1 id=1 count=1 padding=7
  ZZZ Kv1_1 id=1 count=1 padding=7
  ZZZ Kv3_4 id=1 count=53 padding=3
  ZZZ Kv3_4 id=3 count=52 padding=4
  ZZZ Kv4_3 id=1 count=1 padding=7
  ZZZ Leak id=0 count=72 padding=0
  ZZZ Leak id=1 count=85 padding=3
  ZZZ Leak id=2 count=70 padding=2
  ZZZ Leak id=3 count=90 padding=6
  ZZZ Nav1_6 id=0 count=72 padding=0
  ZZZ Nav1_6 id=1 count=85 padding=3
  ZZZ Nav1_6 id=2 count=70 padding=2
  ZZZ Nav1_6 id=3 count=90 padding=6
  ZZZ nrvc_ion id=0 count=72 padding=0
  ZZZ nrvc_ion id=1 count=85 padding=3
  ZZZ nrvc_ion id=2 count=70 padding=2
  ZZZ nrvc_ion id=3 count=90 padding=6
  ZZZ cdp5StCmod id=0 count=72 padding=0
  ZZZ cdp5StCmod id=1 count=85 padding=3
  ZZZ cdp5StCmod id=2 count=70 padding=2
  ZZZ cdp5StCmod id=3 count=90 padding=6
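For reference, the printed padding values above are consistent with rounding each per-thread count up to a whole 64-byte cache line, i.e. a multiple of 8 doubles; a minimal sketch of that computation (hypothetical helper name, not the branch's actual code):

#include <cstddef>

// Hypothetical sketch: pad each per-thread row count up to a multiple of one
// 64-byte cache line (8 doubles), so adjacent threads never share a line.
std::size_t cacheline_padding(std::size_t count) {
    constexpr std::size_t rows_per_line = 64 / sizeof(double);  // 8
    return (rows_per_line - count % rows_per_line) % rows_per_line;
}
// e.g. cacheline_padding(545) == 7, cacheline_padding(317) == 3 and
// cacheline_padding(72) == 0, matching the ZZZ output above.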

With that amount of cache-line sharing it seems reasonable to pursue this. One of the purposes of the SoA/DataHandle design was to make it easy to do permutation experiments for performance, and a case in point is the present segregation of backing data into threads, which is merely a permutation. However, I have to admit that introducing "holes" of unused padding data into the backing store makes me nervous, even though it is still a kind of permutation. It seems like it ought to be conceptually straightforward to create the padding data at the tail end of the storage and then permute it into place at the end of each thread region. What is a bit foggy at the moment is whether unused node and mechanism data, which computationally never participates in a simulation and, at least from some perspectives, is not available to the interpreter, can be created/destroyed/managed with a small amount of code. Also worrying are the different item sizes in the std::vector for each range variable when the items are arrays (SoAoS).

@pramodk @1uc I'd be happy to have your opinions/suggestions.

@1uc
Collaborator

1uc commented Jun 24, 2024

I don't know how we can easily add "false" instances of a mechanism to achieve the required padding. I'll let you know if I think of something.

There are three reasons I'm not particularly inclined to blame false sharing:

  1. Only the last cache line of thread i is shared with the first cache line of thread i+1. For the costly mechanisms (cdp5StCmod, Nav1_6, Kca1_1 and Kca2_2) there are typically 70+ rows per thread, while the L1 cache on this machine holds only about 40-50 rows of cdp5StCmod per core. Hence, unless the threads are really out of sync, by the time thread i needs the cache line it shares with i+1, thread i+1 is likely done with that line and has likely already evicted it.

  2. The most expensive mechanism, cdp5StCmod, is KINETIC with a 24x24 matrix; the other three are also KINETIC, with slightly smaller matrices. Each of these requires one or more linear solves per instance.

  3. If the uncertainty in the runtime is similar to what I was seeing on my device, then at 4 and 8 threads I'm not convinced we can see a performance difference between 9.0 and 8.2.4.

A counter to this argument is that only about 50% of the runtime is spent in the "expensive" mechanisms, meaning the tail is heavy, and optimizations that don't seem important for the heavy mechanisms might still matter.

It feels like one should be able to measure some of these things, e.g. using LIKWID; that seems like something interesting to try out. LIKWID specifically will likely not work on laptops and desktops, only on server hardware such as BB5. VTune should also be able to measure certain metrics, but I'm not sure we can tell it to measure per mechanism.

@pramodk
Member

pramodk commented Jun 27, 2024

I don't have answers to the above questions but quickly want to provide some additional timing numbers that I got on my x86_64 desktop:

  • I used the model from my fork here. I have reverted the diam-related change in the MOD file and made some other small changes to measure timing (all trivial): https://github.com/pramodk/266806/tree/hines/v9
  • I built three versions: 8.2.4, the latest master, and presoa, which is the commit just before Olli's SoA PR.
  • I built these versions in the standard way, i.e. with -DCMAKE_BUILD_TYPE=RelWithDebInfo:
cmake .. -DNRN_ENABLE_MOD_COMPATIBILITY=OFF -DCORENRN_ENABLE_OPENMP=OFF \
-DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_RX3D=OFF \
-DCMAKE_INSTALL_PREFIX=`pwd`/install -DNRN_ENABLE_CORENEURON=OFF \
-DNRN_ENABLE_TESTS=OFF -DNRN_ENABLE_PROFILING=OFF \
-DNRN_PROFILER=caliper -DCMAKE_BUILD_TYPE=RelWithDebInfo
  • The comparison is between GCC 11.4.0 and Intel 2023.2.0

GCC Compiler

Version         Threads   Repetition 1   Repetition 2
master-latest         1           3.99           4.10
master-latest         2           2.22           2.27
master-latest         4           1.26           1.32
master-latest         8           1.15           1.15
presoa                1           2.77           2.71
presoa                2           1.46           1.42
presoa                4           0.78           0.77
presoa                8           0.72           0.74
8.2.4                 1           3.48           3.30
8.2.4                 2           1.75           1.75
8.2.4                 4           0.97           0.98
8.2.4                 8           0.88           0.87

Intel Compiler

Version         Threads   Repetition 1   Repetition 2
master-latest         1           2.77           2.76
master-latest         2           1.55           1.56
master-latest         4           0.92           0.92
master-latest         8           0.84           0.84
presoa                1           2.29           2.26
presoa                2           1.23           1.23
presoa                4           0.68           0.68
presoa                8           0.58           0.61
8.2.4                 1           3.37           3.38
8.2.4                 2           1.83           1.83
8.2.4                 4           1.00           1.08
8.2.4                 8           0.80           0.81

@1uc and I briefly looked at presoa vs master-latest difference. I can add more info here once we generate & compare the profile outputs.

pramodk added a commit that referenced this issue Jul 8, 2024
…gression

* See discussion in #2787
* Not all compilers will inline sparse_thread.hpp routines, especially
  if generated functions contain many many calls.
* From the testing of models in #2787, I don't see any issue if I force
  inlining with clang, gcc and Intel compiler.

fixes #2787
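For context, a minimal sketch of the kind of forced inlining the commit message refers to (illustrative only; the actual macro name and the routines in sparse_thread.hpp may differ):

// Illustrative sketch: ask GCC/Clang (and the clang-based Intel icpx) to inline
// a hot helper even when the surrounding generated function contains many calls.
#if defined(__GNUC__) || defined(__clang__)
#define NRN_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define NRN_ALWAYS_INLINE inline
#endif

// Hypothetical helper of the kind the generated sparse-solver code calls many
// times per instance; without forced inlining some compilers emit a real call
// here, which is related to the @plt entries visible in the profiles above.
NRN_ALWAYS_INLINE double* getelm_like(double** row_start, int col) {
    return row_start[col];
}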
@pramodk
Member

pramodk commented Jul 8, 2024

Finally got dedicated time and a "working" Linux desktop machine without hiccups to look into detailed profiling.

Compiler   Build Type           Threads   Average Time (seconds)
gcc        master + PR #2966          1                     2.78
gcc        master + PR #2966          4                     0.95
gcc        master + PR #2966          8                     0.82
gcc        8.2.4                      1                     3.32
gcc        8.2.4                      4                     1.06
gcc        8.2.4                      8                     0.89
clang      master + PR #2966          1                     2.92
clang      master + PR #2966          4                     0.99
clang      master + PR #2966          8                     0.88
clang      8.2.4                      1                     3.34
clang      8.2.4                      4                     1.04
clang      8.2.4                      8                     0.83
intel      master + PR #2966          1                     2.60
intel      master + PR #2966          4                     0.89
intel      master + PR #2966          8                     0.81
intel      8.2.4                      1                     3.37
intel      8.2.4                      4                     1.02
intel      8.2.4                      8                     0.81

With all 3 compilers (GCC, Intel and Clang), master with #2966 is equal or better than v8.2.4.


I should expand these notes into dev docs, but here are brief points already:

  • For VTune profiling with a Python script, itt-python is handy for pausing/resuming around the solver part:
$ git diff
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index bbdeff7..2dd1111 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -12,6 +12,9 @@ import numpy as np
 import sys
 import os

+import itt
+itt.pause()
+
 mycpu = multiprocessing.cpu_count()
 mycpu = int(os.environ['NRN_THREADS']) if 'NRN_THREADS' in os.environ else mycpu

@@ -53,9 +56,11 @@ pc.set_maxstep(10)
 def run():
     import time
     h.stdinit()
+    itt.resume()
     t0 = time.time()
     pc.psolve(h.tstop)
     t1 = time.time()
+    itt.pause()
     print("NEURON RUN with %d threads took %f " % (mycpu, t1-t0))

VTune profiling is as usual:

    analysis=uarch-exploration
    result_dir=vtune_uarch
    time taskset -c 0-7 vtune -collect $analysis -start-paused -result-dir=${result_dir} $compile_dir/x86_64/special -python protocols/01_SS.py
  • VTune samples a lot of functions, and the generated profile data is a bit cluttered if we want to use the "profile comparison" functionality. We can focus on specific parts like sparse_thread() by modifying nocmodl as follows:
+++ b/src/nmodl/deriv.cpp
@@ -19,6 +19,8 @@ void copylist(List*, Item*);
 List* massage_list_;
 List* netrec_cnexp;

+extern int requires_itt_notify;
+
 /* SmallBuf size */
 #undef SB
 #define SB 256
@@ -183,6 +185,7 @@ void solv_diffeq(Item* qsol,
                 fun->name,
                 listnum);
             vectorize_substitute(qsol, buf);
+            requires_itt_notify = 1;
         }
     }
     dtsav_for_nrn_state = 1;
diff --git a/src/nmodl/noccout.cpp b/src/nmodl/noccout.cpp
index a2ceef6b9..0670e558f 100644
--- a/src/nmodl/noccout.cpp
+++ b/src/nmodl/noccout.cpp
@@ -33,6 +33,8 @@ extern Symbol* cvode_nrn_cur_solve_;
 extern Symbol* cvode_nrn_current_solve_;
 extern List* state_discon_list_;

+int requires_itt_notify = 0;
+
 /* VECTORIZE has not been optional for years. We leave the define there but */
 /* we no longer update the #else clauses. */
 extern int vectorize;
@@ -532,6 +534,9 @@ void c_out_vectorize() {
     /* things which must go first and most declarations */
     P("/* VECTORIZED */\n#define NRN_VECTORIZED 1\n");
     P("#include <stdio.h>\n#include <stdlib.h>\n#include <math.h>\n#include \"mech_api.h\"\n");
+    if (requires_itt_notify) {
+        P("#include <ittnotify.h>\n");
+    }
     P("#undef PI\n");
     P("#define nil 0\n");
     P("#define _pval pval\n");  // due to some old models using _pval
@@ -775,6 +780,9 @@ void c_out_vectorize() {
             P("double _dtsav = dt;\n"
               "if (secondorder) { dt *= 0.5; }\n");
         }
+        if (requires_itt_notify) {
+            P("__itt_resume();\n");
+        }
         P("#if CACHEVEC\n");
         P("    _ni = _ml->_nodeindices;\n");
         P("#endif\n");
@@ -797,6 +805,9 @@ void c_out_vectorize() {
         if (dtsav_for_nrn_state && nrnstate) {
             P(" dt = _dtsav;");
         }
+    if (requires_itt_notify) {
+        P("__itt_pause();\n");
+    }

i.e. in this case, we are only recording nrn_state functions that use sparse solvers.

  • When comparing the profiles with VTune, using the custom view to compare timings per file might be a good first step. This is helpful because functions and namespaces changed between 8.2.4 and master, which makes comparison a bit difficult.
    [VTune screenshots: per-file timing comparison]

  • In the bottom-right corner, one can toggle the display of inlined functions in the profile view. This also helps to quickly find out which functions are or are not inlined:

    [VTune screenshot: inlined-function toggle]

  • If we compare Intel vs GCC profiles for the same master version, we see the following:

    [VTune screenshots: Intel vs GCC profile comparison]

i.e.

  • the total instructions executed are significantly higher with GCC

  • the same is reflected in sparse_thread.hpp or cdp5StCmod.cpp

  • looking at the Event Count tab, we can see the same

  • note that functions are inlined in the Intel version, and hence their counts appear empty there

  • The issue is reproducible if we install NEURON with the Intel compiler and then run nrnivmodl with the GCC compiler, i.e.

           // install neuron with icx icpx
           // compile now mod files with gcc. Link to libirc for intel functions like fast copy
           CXX=g++ nrnivmodl -loadflags "<home>/compiler/2023.2.1/linux/compiler/lib/intel64_lin/libirc.a" mod_files

    Alternatively, compiling just cdp5StCmod.cpp with g++ is sufficient. As the issue is reproducible this way, it points to a code-generation/compilation issue.

  • Looking at the assembly view for the two executions, we can see that in one case there are explicit function calls but not in the other (i.e., an inlining issue):

    [VTune assembly-view screenshots]

@pramodk pramodk closed this as completed in f927107 Jul 9, 2024
@pramodk
Member

pramodk commented Jul 9, 2024

This got closed when I merged the PR, but I am happy to clarify if there are any further comments/questions. I will take some time to document some of the instructions above.

@SteMasoli: I tested up to 8 threads, and the results are consistent now. When you have time, could you pull the latest master branch and check if the results are as expected on your machine? That would be great!

@1uc
Collaborator

1uc commented Jul 16, 2024

We tried NEURON+NMODL (and with a small tweak we get):

ncores:   [  2  ,  4  ,  8  , 16   ]
time [s]: [ 34.2, 20.0, 13.8, 23.4 ]
          [ 34.5, 20.2, 14.2, 24.4 ]
          [ 34.6, 19.5, 14.2,  --  ]
          [ 34.7, 20.2, 14.2,  --  ]

compared to NEURON master + NOCMODL:

ncores:   [  2  ,  4  ,  8  , 16   ]
time [s]: [ 34.9, 20.3, 14.0, 22.3 ]
          [ 34.7, 20.2, 14.2, 22.0 ]
          [ 35.6, 19.6, 13.6,  --  ]
          [ 34.6, 19.9, 14.5,  --  ]

(I'm on a different device from the previous measurements.)
