Add documentation about using Intel Vtune with NEURON #2971

Merged on Jul 10, 2024 (4 commits)
5 changes: 5 additions & 0 deletions docs/README.md
@@ -28,6 +28,11 @@ pip3 install -r docs/docs_requirements.txt --upgrade

Also, make sure to have `Doxygen` and `pandoc` installed, and the dependencies listed in [conda_environment.yml](conda_environment.yml)
Note that this conda environment file is tailored to the online ReadTheDocs setup (but it lists out all desired requirements, so make sure to check it out).
On Mac, you can install these packages using `brew`:

```bash
brew install doxygen pandoc
```

#### Anaconda environment

213 changes: 212 additions & 1 deletion docs/install/debug.md
@@ -253,7 +253,7 @@ To enable a profiler, one needs to rebuild NEURON with the appropriate flags set

```bash
mkdir build && cd build
cmake .. -DNRN_ENABLE_PROFILING=ON -DNRN_PROFILER=caliper -DCMAKE_PREFIX_PATH=/path/to/caliper/share/cmake/caliper -DNRN_ENABLE_TESTS=ON
cmake .. -DNRN_ENABLE_PROFILING=ON -DNRN_PROFILER=caliper -DCMAKE_PREFIX_PATH=/path/to/caliper/install/prefix -DNRN_ENABLE_TESTS=ON
cmake --build . --parallel
```
or if you are building CoreNEURON standalone:
@@ -369,3 +369,214 @@ Notes:
small objects to dramatically reduce the number of calls to the CUDA runtime
API to allocate managed memory. It is, therefore, highly recommended to make
Boost available when using GPU profiling tools.


#### Profiling With Intel VTune

Intel VTune is a powerful performance analysis tool and the preferred choice for debugging on-node performance issues
on Intel platforms. VTune is especially handy when the exact issue and the type of analysis needed
are unclear. For example, should we focus on hotspots, examine thread imbalance, analyze memory accesses, or
collect some hardware counters? Intel VTune offers a variety of such detailed analysis types, making it easy to
switch between them as needed. Additionally, VTune has a performance profile comparison feature, simplifying
the analysis of different runs.

There is no single recipe for using Intel VTune, as different projects have different needs. Depending on the
issue to analyze, one might need to configure the experiment in a specific way. However, here are some general
instructions and tips to get started with using Intel VTune with NEURON.


###### Using Caliper VTune Service

Suppose we have installed NEURON with a standard build configuration (e.g. `-DCMAKE_BUILD_TYPE=RelWithDebInfo`) and run
VTune with a standard analysis like:

```bash
vtune -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result x86_64/special -python model.py
```

and then visualize the results using:

```bash
vtune-gui nrn_vtune_result &
```

In the `Bottom-up` analysis, we should see output like this:

![](images/nrn_vtune_uarch_default_bottom_up.png)

This is the expected output, but there are some issues or inconveniences with using the profile in this way:

- The profile will include information from both the model building phase and the simulation phase.
While one can `Zoom and Filter by Selection`, the UI is not fluid or user-friendly for such selections.
- As VTune is sampling-based, the profile contains functions from the entire execution, making it inconvenient to find
the context in which they are called. Importantly, as we have seen in the Caliper profile, we would
like to have them grouped using NEURON's terminology and according to the hierarchy in which they are called, e.g.
"psolve", "timestep", "spike-exchange", "state-update", "hh-state", etc.

This is where Caliper can help us! It uses the Instrumentation and Tracing Technology API, i.e., `ittnotify.h`,
which we typically use to pause/resume profiling, and additionally uses the [Task API](https://github.com/LLNL/Caliper/blob/releases/v2.11.0/src/services/vtune/VTuneBindings.cpp)
to mark all our Caliper instrumentation regions with `__itt_task_begin()` and `__itt_task_end()`.
Because of this, all our Caliper instrumentation regions appear nicely in Intel VTune views such as `Summary`, `Bottom-up`, etc.

![](images/nrn_vtune_uarch_caliper_step1_bottom_up.png)

So let's look at how to build NEURON for VTune analysis with Caliper and how to profile the execution with various analysis types.


###### Building NEURON for VTune Analysis with Caliper

The first step is to install Caliper with Intel VTune support. You can install [Caliper](https://github.com/LLNL/Caliper)
without Spack, but if you are a Spack user, note that the `vtune` variant of Caliper was added in May 2024. Therefore,
ensure you are using a sufficiently recent version of Spack and that the `+vtune` variant is activated, as sketched below.
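
For example, with a recent Spack checkout, an installation along these lines should work (a minimal sketch; the exact spec, compiler and version constraints depend on your system):

```bash
# install Caliper with the Intel VTune (ittnotify) bindings enabled
spack install caliper +vtune

# the resulting install prefix can then be passed to CMake via CMAKE_PREFIX_PATH
spack location -i caliper
```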

Once Caliper is available with VTune support, we can build NEURON in the regular way:

```bash
cmake .. \
-DNRN_ENABLE_PROFILING=ON \
-DNRN_PROFILER=caliper \
-DCMAKE_PREFIX_PATH=/path/to/caliper/install/prefix \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer" \
-DCMAKE_INSTALL_PREFIX=/path/to/install
make -j && make install
```
We have additionally specified the `-fno-omit-frame-pointer` CXX flag, which helps the profiler accurately track function calls and execution paths.
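
If you want to double-check that the flag was picked up, one simple option is to inspect the CMake cache (a minimal sketch, run from the build directory):

```bash
# the configured C++ flags should include -fno-omit-frame-pointer
grep CMAKE_CXX_FLAGS CMakeCache.txt
```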

###### Running NEURON With VTune And Caliper

Once NEURON is installed, we can run profiling with [Caliper's VTune service](https://software.llnl.gov/Caliper/ThirdPartyTools.html#intel-vtune)
by setting the `CALI_SERVICES_ENABLE=vtune` environment variable:

```bash
export CALI_SERVICES_ENABLE=vtune
```

and then launch VTune profiling as:

```bash
vtune -start-paused -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result x86_64/special -python model.py
vtune-gui nrn_vtune_result &
```

Note that we have used the `-start-paused` CLI option so that VTune profiling stays disabled until our Caliper instrumentation
activates the recording for the simulation phase (i.e., ignoring the model building part). The highlighted `pause`
region should indicate this time range. When you open an analysis view such as `Bottom-up`, make sure to change `Grouping` to
`Task Type / Function / Call Stack` as shown below:

![](images/nrn_vtune_uarch_caliper_step3_bottom_up.png)


###### Typical VTune Analyses

From the available analysis types, we might be interested in `performance-snapshot`, `hotspots`, `uarch-exploration`, `memory-access`,
`threading` and `hpc-performance`. We can get information about a specific collection type using:

```bash
vtune -help collect

# or, for a specific analysis type:
vtune -help collect uarch-exploration
```

**Identifying Hotspots In A Model**

Hotspots analysis is one of the most basic and essential types of analysis in Intel VTune. It helps pinpoint performance bottlenecks
by highlighting sections of the code where the application spends a significant amount of time.
Although Caliper's output already aids in identifying hotspots, there are instances where we might encounter unknown performance issues.
In such cases, hotspot analysis is invaluable. By employing sampling-based mechanisms, it provides insights into what else might be
contributing to the application's runtime.

In order to run this analysis, we use the same CLI syntax as before:

```bash
vtune -start-paused -collect hotspots --no-summary -result-dir=nrn_vtune_result x86_64/special -python model.py
```
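
Once the collection finishes, the result can be opened with `vtune-gui` as before, or a quick textual report can be generated directly on the command line (a minimal sketch; report names and options may vary slightly between VTune versions):

```bash
# print the top hotspots from the collected result on the terminal
vtune -report hotspots -result-dir nrn_vtune_result
```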

**Comparing Runtime of Two Different Executions**

In certain situations, we want to compare two changesets/commits/versions or just two different builds to find out what is causing
the difference in execution time. See the performance regression discussed in [#2787](https://github.com/neuronsimulator/nrn/issues/2787).

In such situations, we first need to select the analysis type that is relevant for comparison. For example, if we need to
find out which functions are causing the difference in execution time, then hotspot analysis will be sufficient. But,
if we want to determine why a particular function in one build or commit is slower than in another, then we
might need to dive deeper into hardware counter analysis (e.g., to compare cache misses, instructions retired, etc.). In this
case, we would select the `uarch-exploration` analysis type.

To perform such a comparison, we profile the two builds separately and write the results into two different directories:

```bash
vtune -start-paused -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result_build1 build1/x86_64/special -python model.py
vtune -start-paused -collect uarch-exploration --no-summary -result-dir=nrn_vtune_result_build2 build2/x86_64/special -python model.py
```

Once we have profile results, we can open them in comparison view using `vtune-gui`:

```bash
vtune-gui nrn_vtune_result_build1 nrn_vtune_result_build2
```
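
If you prefer to stay on the command line, a first rough comparison can also be made from textual summaries of the two results (a minimal sketch, assuming the `summary` report is sufficient for a first look):

```bash
# generate a textual summary for each run and compare them side by side
vtune -report summary -result-dir nrn_vtune_result_build1 > summary_build1.txt
vtune -report summary -result-dir nrn_vtune_result_build2 > summary_build2.txt
diff -u summary_build1.txt summary_build2.txt
```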

As we have NEURON code regions and mechanism functions annotated, they will appear as tasks, and the comparison is already helpful
for giving a first impression of the differences between the two executions:

![](images/nrn_vtune_uarch_caliper_step8_summary_compare.png)

In this example, we can see the differences in execution time, and observing the difference in instruction count is already helpful. The Top Tasks
region shows which tasks/functions are contributing to the biggest difference.

We can then dive deeper into the `Bottom-up` analysis view and look into various hardware counter details. We can do the same in `Event Count`
view. Note that VTune comes with hundreds of counters, which can be a bit overwhelming! What you want to look at depends on the problem you
are trying to investigate. In the example below, we are simply comparing the instructions retired. Notice the `Grouping` that we mentioned earlier:

![](images/nrn_vtune_uarch_caliper_step9_bottom_up_compare.png)

We can further expand a specific `task`. As our grouping includes `Call Stack` context, we should be able to see everything that is happening inside that task:

![](images/nrn_vtune_uarch_caliper_step10_bottom_up_expand.png)

Importantly, note that we can change the `Grouping` according to the view we would like to see. The screenshot below shows how to do that:

![](images/nrn_vtune_uarch_caliper_step11_grouping_expand.png)

**Identifying False Sharing**

False sharing is the situation where multiple threads update distinct elements that reside on the same cache line, causing performance
degradation due to repeated cache invalidations.

For quite some time, Intel VTune has documented how to identify such situations. You can see the Cookbook example
[here](https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/false-sharing.html) or older tutorials like
[this one](https://cdrdv2-public.intel.com/671363/vtune-tutorial-linux-identifying-false-sharing.pdf). In short, we need to find memory
objects whose access latencies are significantly higher. To facilitate such investigations, VTune provides an analysis type `memory-access`.
We can configure this analysis using the CLI as follows:

```bash
taskset -c 0-7 vtune -collect memory-access \
-knob analyze-mem-objects=true -knob mem-object-size-min-thres=1 -knob dram-bandwidth-limits=true \
-start-paused -result-dir=nrn_vtune_result x86_64/special -python model.py
```

Here we specify the additional parameters `analyze-mem-objects=true` and `mem-object-size-min-thres=1`
to track and analyze even very small memory objects (minimum size threshold of 1 byte). Additionally, we use `taskset -c 0-7`
to pin threads to specific cores for consistent profiling results; the core list should be adapted to your machine, as sketched below.
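
A quick way to inspect the machine topology and choose distinct physical cores for the `taskset` core list is the following (a minimal sketch using standard Linux tools; core numbering differs between systems):

```bash
# list logical CPUs together with their physical core and socket,
# so that the -c list passed to taskset targets distinct physical cores
lscpu --extended
```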

Once we have profile data, we can examine access latencies for different memory objects. In the
example below, we view the `Bottom-up` perspective with `Grouping` set to `Memory Object / Function / Call Stack`:

![](images/nrn_vtune_uarch_caliper_step11_grouping_expand.png)

Here, we observe significantly higher access latencies in the `bksub()` function from `sparse_thread.hpp`.
Double-clicking the function takes us to the corresponding code:

![](images/nrn_vtune_memacs_caliper_step13_acs_lat_code.png)

The high latencies are attributed to `SparseObj` objects. This suggests a need to revisit how
`SparseObj` objects are allocated, stored and updated during runtime.

It's important to note that higher latencies do not necessarily indicate false sharing. For
instance, indirect memory accesses with a strided pattern could also lead to increased latencies.
One should examine how the memory object is used within a function to determine if false sharing
is a potential issue. Additionally, comparing access latencies with scenarios like single-threaded
execution or versions without such issues can provide further insights.