Add documentation about using Intel Vtune with NEURON #2971

Merged on Jul 10, 2024 (4 commits)
5 changes: 5 additions & 0 deletions docs/README.md
@@ -28,6 +28,11 @@ pip3 install -r docs/docs_requirements.txt --upgrade

Also, make sure to have `Doxygen` and `pandoc` installed, and the dependencies listed in [conda_environment.yml](conda_environment.yml)
Note that this conda environment file is tailored to the online ReadTheDocs setup (but it lists out all desired requirements, so make sure to check it out).
On Mac, you can install these packages using `brew`:

```bash
brew install doxygen pandoc
```

#### Anaconda environment

213 changes: 212 additions & 1 deletion docs/install/debug.md
@@ -253,7 +253,7 @@ To enable a profiler, one needs to rebuild NEURON with the appropriate flags set

```bash
mkdir build && cd build
cmake .. -DNRN_ENABLE_PROFILING=ON -DNRN_PROFILER=caliper -DCMAKE_PREFIX_PATH=/path/to/caliper/share/cmake/caliper -DNRN_ENABLE_TESTS=ON
cmake .. -DNRN_ENABLE_PROFILING=ON -DNRN_PROFILER=caliper -DCMAKE_PREFIX_PATH=/path/to/caliper/install/prefix -DNRN_ENABLE_TESTS=ON
cmake --build . --parallel
```
or if you are building CoreNEURON standalone:
@@ -369,3 +369,214 @@ Notes:
small objects to dramatically reduce the number of calls to the CUDA runtime
API to allocate managed memory. It is, therefore, highly recommended to make
Boost available when using GPU profiling tools.


#### Profiling With Intel VTune

Intel VTune is a powerful performance analysis tool and the preferred choice for debugging on-node performance issues
on Intel platforms. VTune is especially handy when the exact issue and the type of analysis needed
are unclear. For example, should we focus on hotspots, examine thread imbalance, analyze memory accesses, or
collect some hardware counters? Intel VTune offers a variety of such detailed analysis types, making it easy to
switch between them as needed. Additionally, VTune has a performance profile comparison feature, simplifying
the analysis of different runs.

There is no single recipe for using Intel VTune, as different projects have different needs. Depending on the
issue to analyze, one might need to configure the experiment in a specific way. However, here are some general
instructions and tips to get started with using Intel VTune with NEURON.


###### Using Caliper VTune Service

Suppose we have installed NEURON with a standard build configuration (e.g. `-DCMAKE_BUILD_TYPE=RelWithDebInfo`) and run
VTune with a standard analysis like:

```bash
vtune -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result x86_64/special -python model.py
```

and then visualize the results using:

```bash
vtune-gui nrn_vtune_result &
```

In the `Bottom-up` analysis, we should see output like this:

![](images/nrn_vtune_uarch_default_bottom_up.png)

This is the expected output, but there are some issues or inconveniences with using the profile in this way:

- The profile will include information from both the model building phase and the simulation phase.
While one can `Zoom and Filter by Selection`, the UI is not fluid or user-friendly for such selections.
- As VTune is sampling-based, the profile contains functions from the entire execution, making it inconvenient to find
the context in which they are called. Importantly, as we have seen in the Caliper profile, we would
like to have them grouped using NEURON's terminology and according to the hierarchy in which they are called, e.g.
"psolve", "timestep", "spike-exchange", "state-update", "hh-state", etc.

This is where Caliper can help us! It uses the Instrumentation and Tracing Technology API, i.e., `ittnotify.h`,
which we typically use to pause/resume profiling, and additionally uses the [Task API](https://github.com/LLNL/Caliper/blob/releases/v2.11.0/src/services/vtune/VTuneBindings.cpp)
to mark all our Caliper instrumentation regions with `__itt_task_begin()` and `__itt_task_end()`.
Because of this, all our Caliper instrumentation regions appear nicely in Intel VTune views such as `Summary`, `Bottom-up`, etc.

![](images/nrn_vtune_uarch_caliper_step1_bottom_up.png)

So let's look at how to build NEURON for VTune analysis with Caliper and how to profile the execution with various analysis types.


###### Building NEURON for VTune Analysis with Caliper

The first step is to install Caliper with Intel VTune support. You can install [Caliper](https://github.com/LLNL/Caliper)
without Spack, but if you are a Spack user, note that the `vtune` variant of Caliper was added in May 2024. Therefore,
ensure you are using a sufficiently recent version of Spack and that the `+vtune` variant is activated, as sketched below.
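
For example, with a recent Spack checkout, an installation along these lines should work (a minimal sketch; the exact spec, compiler and version constraints depend on your system):

```bash
# install Caliper with the Intel VTune (ittnotify) bindings enabled
spack install caliper +vtune

# the resulting install prefix can then be passed to CMake via CMAKE_PREFIX_PATH
spack location -i caliper
```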

Once Caliper is available with VTune support, we can build NEURON in the regular way:

```bash
cmake .. \
-DNRN_ENABLE_PROFILING=ON \
-DNRN_PROFILER=caliper \
-DCMAKE_PREFIX_PATH=/path/to/caliper/install/prefix \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" \
-DCMAKE_INSTALL_PREFIX=/path/to/install
make -j && make install
```
We have additionally specified the `-fno-omit-frame-pointer` CXX flag, which helps the profiler accurately track function calls and execution paths.
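
If you want to double-check that the flag was picked up, one simple option is to inspect the CMake cache (a minimal sketch, run from the build directory):

```bash
# the configured C++ flags should include -fno-omit-frame-pointer
grep CMAKE_CXX_FLAGS CMakeCache.txt
```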

###### Running NEURON With VTune And Caliper

Once NEURON is installed, we can run profiling with [Caliper's VTune service](https://software.llnl.gov/Caliper/ThirdPartyTools.html#intel-vtune)
by setting the `CALI_SERVICES_ENABLE=vtune` environment variable:

```bash
export CALI_SERVICES_ENABLE=vtune
```

and then launch VTune profiling as:

```bash
vtune -start-paused -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result x86_64/special -python model.py
vtune-gui nrn_vtune_result &
```

Note that we have used the `-start-paused` CLI option so that VTune profiling stays disabled until our Caliper instrumentation
activates the recording for the simulation phase (i.e., ignoring the model building part). The highlighted `pause`
region should indicate this time range. When you open an analysis view such as `Bottom-up`, make sure to change `Grouping` to
`Task Type / Function / Call Stack` as shown below:

![](images/nrn_vtune_uarch_caliper_step3_bottom_up.png)


###### Typical VTune Analyses

From the available analysis types, we might be interested in `performance-snapshot`, `hotspots`, `uarch-exploration`, `memory-access`,
`threading` and `hpc-performance`. We can get information about a specific collection type using:

```bash
vtune -help collect

# or, for a specific analysis type:
vtune -help collect uarch-exploration
```

**Identifying Hotspots In A Model**

Hotspots analysis is one of the most basic and essential types of analysis in Intel VTune. It helps pinpoint performance bottlenecks
by highlighting sections of the code where the application spends a significant amount of time.
Although Caliper's output already aids in identifying hotspots, there are instances where we might encounter unknown performance issues.
In such cases, hotspot analysis is invaluable. By employing sampling-based mechanisms, it provides insights into what else might be
contributing to the application's runtime.

In order to run this analysis, we use the same CLI syntax as before:

```bash
vtune -start-paused -collect hotspots --no-summary -result-dir=nrn_vtune_result x86_64/special -python model.py
```
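
Once the collection finishes, the result can be opened with `vtune-gui` as before, or a quick textual report can be generated directly on the command line (a minimal sketch; report names and options may vary slightly between VTune versions):

```bash
# print the top hotspots from the collected result on the terminal
vtune -report hotspots -result-dir nrn_vtune_result
```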

**Comparing Runtime of Two Different Executions**

In certain situations, we want to compare two changesets/commits/versions or just two different builds to find out what is causing
the difference in execution time. See the performance regression discussed in [#2787](https://github.com/neuronsimulator/nrn/issues/2787).

In such situations, we first need to select the analysis type that is relevant for comparison. For example, if we need to
find out which functions are causing the difference in execution time, then hotspot analysis will be sufficient. But,
if we want to determine why a particular function in one build or commit is slower than in another, then we
might need to dive deeper into hardware counter analysis (e.g., to compare cache misses, instructions retired, etc.). In this
case, we would select the `uarch-exploration` analysis type.

To perform such a comparison, we profile the two builds separately and write the results into two different directories:

```bash
vtune -start-paused -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result_build1 build1/x86_64/special -python model.py
vtune -start-paused -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result_build2 build2/x86_64/special -python model.py
```

Once we have profile results, we can open them in comparison view using `vtune-gui`:

```bash
vtune-gui nrn_vtune_result_build1 nrn_vtune_result_build2
```
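
If you prefer to stay on the command line, a first rough comparison can also be made from textual summaries of the two results (a minimal sketch, assuming the `summary` report is sufficient for a first look):

```bash
# generate a textual summary for each run and compare them side by side
vtune -report summary -result-dir nrn_vtune_result_build1 > summary_build1.txt
vtune -report summary -result-dir nrn_vtune_result_build2 > summary_build2.txt
diff -u summary_build1.txt summary_build2.txt
```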

As we have NEURON code regions and mechanism functions annotated, they will appear as tasks, and the comparison is already helpful
for giving a first impression of the differences between the two executions:

![](images/nrn_vtune_uarch_caliper_step8_summary_compare.png)

In this example, we can see the differences in execution time, and observing the difference in instruction count is already helpful. The Top Tasks
region shows which tasks/functions are contributing to the biggest difference.

We can then dive deeper into the `Bottom-up` analysis view and look into various hardware counter details. We can do the same in `Event Count`
view. Note that VTune comes with hundreds of counters, which can be a bit overwhelming! What you want to look at depends on the problem you
are trying to investigate. In the example below, we are simply comparing the instructions retired. Notice the `Grouping` that we mentioned earlier:

![](images/nrn_vtune_uarch_caliper_step9_bottom_up_compare.png)

We can further expand a specific `task`. As our grouping includes `Call Stack` context, we should be able to see everything that is happening inside that task:

![](images/nrn_vtune_uarch_caliper_step10_bottom_up_expand.png)

Importantly, note that we can change the `Grouping` according to the view we would like to see. The screenshot below shows how to do that:

![](images/nrn_vtune_uarch_caliper_step11_grouping_expand.png)

**Identifying False Sharing**

False sharing is the situation where multiple threads update distinct elements that reside on the same cache line, causing performance
degradation due to repeated cache invalidations.

For quite some time, Intel VTune has documented how to identify such situations. You can see the Cookbook example
[here](https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/false-sharing.html) or older tutorials like
[this one](https://cdrdv2-public.intel.com/671363/vtune-tutorial-linux-identifying-false-sharing.pdf). In short, we need to find memory
objects whose access latencies are significantly higher. To facilitate such investigations, VTune provides an analysis type `memory-access`.
We can configure this analysis using the CLI as follows:

```bash
taskset -c 0-7 vtune -collect memory-access \
-knob analyze-mem-objects=true -knob mem-object-size-min-thres=1 -knob dram-bandwidth-limits=true \
-start-paused -result-dir=nrn_vtune_result x86_64/special -python model.py
```

Here we specify the additional parameters `analyze-mem-objects=true` and `mem-object-size-min-thres=1`
to track and analyze even very small memory objects (minimum size threshold of 1 byte). Additionally, we use `taskset -c 0-7`
to pin threads to specific cores for consistent profiling results; the core list should be adapted to your machine, as sketched below.
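
A quick way to inspect the machine topology and choose distinct physical cores for the `taskset` core list is the following (a minimal sketch using standard Linux tools; core numbering differs between systems):

```bash
# list logical CPUs together with their physical core and socket,
# so that the -c list passed to taskset targets distinct physical cores
lscpu --extended
```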

Once we have profile data, we can examine access latencies for different memory objects. In the
example below, we view the `Bottom-up` perspective with `Grouping` set to `Memory Object / Function / Call Stack`:

![](images/nrn_vtune_uarch_caliper_step11_grouping_expand.png)

Here, we observe significantly higher access latencies in the `bksub()` function from `sparse_thread.hpp`.
Double-clicking the function takes us to the corresponding code:

![](images/nrn_vtune_memacs_caliper_step13_acs_lat_code.png)

The high latencies are attributed to `SparseObj` objects. This suggests a need to revisit how
`SparseObj` objects are allocated, stored and updated during runtime.

It's important to note that higher latencies do not necessarily indicate false sharing. For
instance, indirect memory accesses with a strided pattern could also lead to increased latencies.
One should examine how the memory object is used within a function to determine if false sharing
is a potential issue. Additionally, comparing access latencies with scenarios like single-threaded
execution or versions without such issues can provide further insights.