Higher simulation time in NEURON 9 compared to NEURON 8.2.4 #2787
Thanks for the notice. I'm seeing pretty much the same thing. It will need to be looked into in greater detail and resolved before the v9 release. On my machine (also 16 cores), but with tstop=1000:
If I did it right, a bisect seems to point to 6c529a4:
where the good runs are < 50s, and the ones above that I marked bad.
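A bisect like this is the kind of thing `git bisect run` can automate. Below is a hedged sketch of such a helper, using the "< 50s is good" threshold mentioned above; the build command, binary path, and script name are illustrative assumptions, not the exact ones used in this thread.

```python
#!/usr/bin/env python3
"""Helper for `git bisect run`: rebuild at the commit git bisect
checked out, time the benchmark, and report good/bad via exit code.
Paths and build commands are illustrative, not the thread's exact setup."""
import subprocess
import sys
import time

THRESHOLD_S = 50.0  # good runs are < 50 s


def classify(runtime_s, threshold=THRESHOLD_S):
    """Exit code for git bisect: 0 = good (fast), 1 = bad (slow)."""
    return 0 if runtime_s < threshold else 1


def main():
    # Rebuild NEURON at the currently checked-out commit (hypothetical paths).
    subprocess.run(["cmake", "--build", "build", "-j"], check=True)
    # Time the benchmark run.
    t0 = time.time()
    subprocess.run(
        ["build/bin/nrniv", "-python", "protocols/01_SS.py"], check=True
    )
    return classify(time.time() - t0)


if __name__ == "__main__" and "--run" in sys.argv:
    sys.exit(main())
```

Saved as e.g. `bisect_time.py`, the workflow would be `git bisect start <bad> <good>` followed by `git bisect run python3 bisect_time.py --run`.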
Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?
Edited after fixing the Caliper region for state-update in version 8.2.

Thanks! That is a great hint. Meanwhile I tried Caliper to see where the performance is very different. Here is what seems to me to be the relevant selection:
I interpret these to suggest, for example, that state integration is dominated by cdp5StCmod.
I conclude that the largest part of the master performance problem is in the KINETIC block of cdp5StCmod, and I'll continue by focusing my attention there.
I hadn't realized that yet when I wrote my comment above. Since that commit introduced SoA and data handles, it may be the case that many of the intermediate commits will not result in a working system.
Now might be a good time to mention that I'm using a slightly modified version of [email protected]:ModelDBRepository/266806, plus some temporary changes (there is currently an array bug in NMODL).
@ramcdougal

> Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

I gave it a quick shot, but many of the commits don't even build; one of the only ones I got to build was 1bb5bf1, and it was slow, so that at least partly brackets which of the many commits it was. However, out of the 19 other commits I randomly tried, none of them built without error.

@nrnhines

> Those mods are:
A couple other short notes:
It looks like many more instructions are being executed, rather than more cache misses or branch mispredictions.
Since it's a statistical profiler we'll miss things, but it seems that
@mgeplf I like that tool (perf?) you are using, but I'm not familiar with it. (I was going to use gprof to delve more into the mod file functions, but was wondering if that itself would change the profile considerably.) Please send a link for me to get started. In your fragments above, I see that
@nrnhines

> @mgeplf I like that tool (perf?)

It's this one: https://en.wikipedia.org/wiki/Perf_(Linux) It's like a console version of VTune (or whatever it's called now), but is nowhere near as slick. Whenever I turn to it, I use this to remember some useful examples: Unfortunately, it seems the DWARF debug info isn't propagated fully into
That may be a loose heuristic, but the nature of sampling profiling means that it won't be accurate. I'd probably try
That seems suspicious, too.
Here is an interesting observation. The time between 8.2 and master narrows considerably (running perf) merely by introducing an ASSIGNED variable d and setting it. This change was suggested by a comparison of the 8.2 and master perf results, with the master perf result using diam.
Note the top use of
Just for my reference.
Avoid the error
with
I find the sample counts helpful, so am using
It would be helpful to collect useful idioms in https://nrn.readthedocs.io/en/latest/install/debug.html#profiling-and-performance-benchmarking
I forked [email protected]:ModelDBRepository/266806 and my changes are in [email protected]:nrnhines/266806 in branch hines/v9
Another performance data point... I installed Intel oneAPI in hopes of using VTune (I have not yet figured out how to focus on the data handle performance), but here are the results for
To check if the recent improvements (now all merged) are sufficient, I used this version of ModelDB. When compiled with:
I get the following runtimes for 9.0:
and for 8.2.4:
It seems most of the regression is fixed, but there's still a performance gap. We have a couple of options:
Thought I'd look at the present performance status on my Apple M1.
Note that the times above come from
I launched with commands like
but noticed a puzzling requirement nowadays for master: I needed
in order for Python to find

I believe @pramodk has demonstrated (please correct me if I'm mistaken) on x86_64 that the performance differences disappear when the Intel compiler is used, and it would be nice to document that here. I vote to close this issue, or perhaps have a discussion about it at the next developer meeting. The thread here is getting long and there remain many avenues for further exploration, e.g. good developer documentation for how to focus on performance diagnosis, multisplit, and KINETIC memory access ordering.
It occurred to me that my last performance results
might be due to the master suffering from cacheline sharing between threads. To look into this, I started an experimental branch, hines/thread-padding, to see if it would be easy to take advantage of the permutability of the SoA storage to add some padding rows between the threads. The first commit added a
and
With that amount of cacheline sharing it seems reasonable to pursue this. One of the purposes of the SoA/DataHandle work was to be able to easily do permutation experiments on performance, and a case in point is the present segregation of backing data into threads, which is merely a permutation. However, I have to admit that the introduction of "holes" of unused padding data into the backing store makes me nervous, even though it is still a kind of permutation.

It seems like it ought to be conceptually straightforward to create the padding data at the tail end of the storage and then permute it into place at the end of each thread region. What is a bit foggy at the moment is whether unused node and mechanism data, which computationally never participates in a simulation and, at least from some perspectives, is not available to the interpreter, can be created/destroyed/managed with a small amount of code. Also worrying are the different item sizes in the std::vector for each range variable when the items are arrays (SoAoS). @pramodk @1uc I'd be happy to have your opinions/suggestions.
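The padding idea above is mostly arithmetic: round each thread's region of an SoA column up to a cacheline boundary so no line straddles two threads. Here is a hedged sketch of that computation; the function name, 64-byte cachelines, and 8-byte doubles are illustrative assumptions, not NEURON's actual storage code.

```python
# Sketch: pad each thread's region of an SoA column of doubles so that
# no cacheline is shared between threads.  Assumes 64-byte cachelines
# and 8-byte values (8 values per line); names here are illustrative.
CACHELINE_BYTES = 64
VALUE_BYTES = 8
VALUES_PER_LINE = CACHELINE_BYTES // VALUE_BYTES  # 8


def padded_counts(rows_per_thread):
    """Round each thread's row count up to a cacheline multiple.

    Returns (padded_counts, total_padding_rows); the padding rows are
    the "holes" of unused backing data the comment above worries about.
    """
    padded = [
        -(-n // VALUES_PER_LINE) * VALUES_PER_LINE  # ceil to multiple of 8
        for n in rows_per_thread
    ]
    return padded, sum(padded) - sum(rows_per_thread)


# Example: three threads with uneven row counts.
padded, holes = padded_counts([1000, 1003, 997])
# padded -> [1000, 1008, 1000]; holes -> 8 unused padding rows in total.
```

The cost of the "holes" is small (here 8 rows out of 3000), which is part of why padding per thread region looks attractive despite the bookkeeping concerns.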
I don't know how we can easily add "false" instances of a mechanism to achieve the required padding. I'll let you know if I think of something. There are three reasons I'm not particularly leaning towards false sharing:
A counter to this argument is that only about 50% of the runtime is spent in the "expensive" mechanisms, meaning the tail is heavy, and optimizations that don't seem important for the heavy mechanisms might still matter. It feels like one should be able to measure some of these things, e.g. using LIKWID; that seems like something interesting to try out. LIKWID specifically will likely not work on laptops and desktops, only on server hardware like BB5. VTune should also be able to measure certain metrics, but I'm not sure if we can tell it to measure per mechanism.
I don't have answers to the above questions but quickly want to provide some additional timing numbers that I got on my x86_64 desktop:
GCC Compiler
Intel Compiler
@1uc and I briefly looked at
Finally got dedicated time and a "working" Linux desktop machine without hiccups to look into detailed profiling.
With all 3 compilers (GCC, Intel and Clang), master with #2966 is equal to or better than v8.2.4. I should expand these notes into dev docs, but here are some brief points already:
$ git diff
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index bbdeff7..2dd1111 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -12,6 +12,9 @@ import numpy as np
import sys
import os
+import itt
+itt.pause()
+
mycpu = multiprocessing.cpu_count()
mycpu = int(os.environ['NRN_THREADS']) if 'NRN_THREADS' in os.environ else mycpu
@@ -53,9 +56,11 @@ pc.set_maxstep(10)
def run():
import time
h.stdinit()
+ itt.resume()
t0 = time.time()
pc.psolve(h.tstop)
t1 = time.time()
+ itt.pause()
    print("NEURON RUN with %d threads took %f " % (mycpu, t1-t0))

VTune profiling is as usual:

analysis=uarch-exploration
result_dir=vtune_uarch
time taskset -c 0-7 vtune -collect $analysis -start-paused -result-dir=${result_dir} $compile_dir/x86_64/special -python protocols/01_SS.py
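The patch above assumes the `itt` Python bindings are importable. To run the same protocol script on a machine without VTune/ITT installed, a no-op fallback keeps it working; a sketch (the `_IttStub` name is made up here):

```python
# Optional-dependency pattern for the `itt` calls added in the patch
# above: fall back to no-ops when the ITT Python bindings are not
# installed, so the same script runs with or without VTune.
try:
    import itt  # Intel ITT API bindings used by VTune
except ImportError:
    class _IttStub:
        """No-op stand-in for itt.pause()/itt.resume()."""

        @staticmethod
        def pause():
            pass

        @staticmethod
        def resume():
            pass

    itt = _IttStub()

itt.pause()   # safe in both cases: real call under VTune, no-op otherwise
itt.resume()
```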
+++ b/src/nmodl/deriv.cpp
@@ -19,6 +19,8 @@ void copylist(List*, Item*);
List* massage_list_;
List* netrec_cnexp;
+extern int requires_itt_notify;
+
/* SmallBuf size */
#undef SB
#define SB 256
@@ -183,6 +185,7 @@ void solv_diffeq(Item* qsol,
fun->name,
listnum);
vectorize_substitute(qsol, buf);
+ requires_itt_notify = 1;
}
}
dtsav_for_nrn_state = 1;
diff --git a/src/nmodl/noccout.cpp b/src/nmodl/noccout.cpp
index a2ceef6b9..0670e558f 100644
--- a/src/nmodl/noccout.cpp
+++ b/src/nmodl/noccout.cpp
@@ -33,6 +33,8 @@ extern Symbol* cvode_nrn_cur_solve_;
extern Symbol* cvode_nrn_current_solve_;
extern List* state_discon_list_;
+int requires_itt_notify = 0;
+
/* VECTORIZE has not been optional for years. We leave the define there but */
/* we no longer update the #else clauses. */
extern int vectorize;
@@ -532,6 +534,9 @@ void c_out_vectorize() {
/* things which must go first and most declarations */
P("/* VECTORIZED */\n#define NRN_VECTORIZED 1\n");
P("#include <stdio.h>\n#include <stdlib.h>\n#include <math.h>\n#include \"mech_api.h\"\n");
+ if (requires_itt_notify) {
+ P("#include <ittnotify.h>\n");
+ }
P("#undef PI\n");
P("#define nil 0\n");
P("#define _pval pval\n"); // due to some old models using _pval
@@ -775,6 +780,9 @@ void c_out_vectorize() {
P("double _dtsav = dt;\n"
"if (secondorder) { dt *= 0.5; }\n");
}
+ if (requires_itt_notify) {
+ P("__itt_resume();\n");
+ }
P("#if CACHEVEC\n");
P(" _ni = _ml->_nodeindices;\n");
P("#endif\n");
@@ -797,6 +805,9 @@ void c_out_vectorize() {
if (dtsav_for_nrn_state && nrnstate) {
P(" dt = _dtsav;");
}
+ if (requires_itt_notify) {
+ P("__itt_pause();\n");
+ }

i.e. in this case, we are only recording the region between the generated __itt_resume() and __itt_pause() calls.
This got closed when I merged the PR, but I am happy to clarify if there are any further comments/questions. I will take some time to document some of the instructions above. @SteMasoli: I tested up to 8 threads, and the results are consistent now. When you have time, could you pull the latest master branch and check if the results are as expected on your machine? That would be great!
We tried NEURON+NMODL (and with a small tweak we get):
compared to NEURON master + NOCMODL:
(I'm on a different device from the previous measurements.)
Context
Testing NEURON 9 on a recent CPU shows longer simulation time compared to NEURON 8.2.4
Overview of the issue
Brand new installation of Ubuntu 23.10 on a 16 core Zen 4 AMD processor.
With the same model, same code, same number of cores, etc., NEURON 9 does a simulation in 90s whereas NEURON 8.2.4 does it in 37s.
Expected result/behavior
Similar or better performance changing from NEURON 8 to NEURON 9.
NEURON setup
Version: VERSION 9.0a-176-g17be061e0 master (17be061) 2024-03-13 and NEURON 8.2.4
Installation method: CMake from source in both cases with:
-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DNRN_ENABLE_INTERVIEWS=ON
-DNRN_ENABLE_MPI=ON
-DNRN_ENABLE_PYTHON=ON
-DNRN_ENABLE_RX3D:BOOL=ON
-DCORENRN_ENABLE_NMODL=OFF
-DNRN_ENABLE_CORENEURON=OFF
OS + Version: Ubuntu 23.10 latest upgrades
Compiler + Version: gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)
Minimal working example - MWE
https://modeldb.science/266806?tab=1
Use the first model "Morphology_1".
Change protocol 01_SS.py with "pc.change_nthread(12,1)" and h.tstop = 5000.
nrnivmodl mod_files
time nrniv -python protocols/01_SS.py
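For reference, the runtimes quoted throughout this thread are wall-clock timings around pc.psolve, as in the patched run() function shown earlier. A generic, NEURON-free sketch of that measurement pattern (time.sleep stands in for the actual solver call):

```python
import time


def timed_run(solve, *args):
    """Wall-clock a solver call; in the real protocol script one would
    pass pc.psolve and h.tstop instead of the stand-ins used below."""
    t0 = time.time()
    solve(*args)
    return time.time() - t0


# Stand-in for pc.psolve(h.tstop):
elapsed = timed_run(time.sleep, 0.01)
print("NEURON RUN took %f s" % elapsed)
```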