All the events and metrics are split between several groups which are described in details sub-sections below. In addition most of the groups are further split into 2 variants:
-
RET - for non-speculative events counted at retirement. For example, INST.RET counts retired instructions.
-
SPEC - for speculative events. For example, INST.SPEC counts instructions that are issued to the backend pipeline, regardless of whether they retire.
Note
|
In general, RET events are more useful for performance analysis, since they are consistent with software’s view of the instruction flow. But they can be significantly more expensive to implement, as they require event data to be staged along with the associated instruction to retirement. It is up to implementations to decide whether to support the RET, SPEC, or both variants of an event. |
This group contains general events not specific to a particular part of CPU pipeline.
Retirement control flow group which contains events and metrics for counting all control transfer instructions, or just those that were mispredicted, including breakdowns by transfer type.
This group contains events and metrics for data and instruction caches (all levels) counted at retirement.
This group contains events and metrics for data and instruction caches (all levels) counted speculatively.
This group contains events and metrics for data and instruction TLB caches (all levels) counted at retirement.
This group contains events and metrics for data and instruction TLB caches (all levels) counted speculatively.
This group contains events and metrics related for Top-down Microarchitecture Analysis (TMA) methodology.
TMA is an industry-standard methodology introduced by Intel in characterizing the performance of SPEC CPU2006 on Intel CPUs, and since used to characterize HPC workloads, GPU workloads, microarchitecture changes, pre-silicon performance validation failures, and more.
TMA allows even developers with minimal microarchitecture knowledge to understand, for a given workload, where bottlenecks reside. It does so by accounting for the utilization of each pipeline "slot" in the microarchitecture. As an example, for a 4-wide implementation, there are 4 slots to account for each cycle. When the hardware is utilized with optimal efficiency, each slot is occupied by an instruction or micro-operation (uop) that will go on to execute and retire. When bottlenecks occur, due perhaps to a cache miss, branch misprediction, or any number of other microarchitectural conditions, some slots may be either unused or thrown away, which results in inefficiency and reduced performance. TMA is able to identify these wasted slots, and the stalls, clears, misses, or other events that cause them. This enables developers to make informed decisions when tuning their code.
TMA accomplishes this by defining a set of hierarchical states into which each slot is categorized. Each cycle, the frontend of the processor (responsible for instruction fetch and decode) can issue some implementation-defined number (N) of instructions/uops to the backend (instruction execution and retire). Hence there are N issue slots to be categorized per cycle. At the top level of the TMA hierarchy, issue slots are categorized as described below.
-
Frontend Bound - The frontend did not issue a uop to the backend for execution. Example causes include stalls that result from cache or TLB misses during instruction fetch.
-
Backend Bound - The backend could not consume a uop from the frontend. Example causes include backpressure that results from cache or TLB misses on data (load/store) accesses, or from oversubscribed execution units.
-
Bad Speculation - The uop was dropped, as a result of a pipeline clear. Example clears include branch/jump mispredictions, or memory ordering clears. This category also includes any pipeline clear recovery cycles during which issue slots go unfilled.
-
Retiring - The uop retired. Ideally the majority of slots fall into this state.
Many of the top-level states listed above include further breakdown at the 2nd and 3rd levels of the TMA hierarchy, as illustrated below.
Note
|
Some imprecision within the event hierarchy is allowed and even expected. The standard L2 and L3 events may not sum precisely to the parent L1 or L2 events, respectively, as it is expected that there will be some additional sources of bottlenecks beyond those represented by the standard events. The exception is the Backend Bound L2 events (Core Bound and Memory Bound), which ideally should sum to the Backend Bound event total. Because of this possible imprecision, it is recommended that lower level TMA events are examined only when the parent event count or rate is higher than expected. This avoids spending time on misleading L2 or L3 events that may be implemented by imprecise event formulas rather than precise hardware events. Implementations may opt to add custom L2 or L3 events, to identify additional bottlenecks specific to the microarchitecture. |
The events which follow count slots for each of the states listed above, while the metrics express the slots per state value as a percentage of total slots.
This group contains events and metrics related to vectorized operations counted at retirement.
This group contains events and metrics related to vectorized operations counted speculatively.