lo2s - Linux OTF2 sampling
- lo2s
-
[-q | -v] [-m PAGES] [-k CLOCKID] [--[no-]instruction-sampling] [-e EVENT] [-c N] [-i MSEC] [--[no-]disassemble] [--[no-]kernel] [-t TRACEPOINT] [-E EVENT] [--userspace-metric-event EVENT] [--standard-metrics] [--metric-leader EVENT] [--metric-count N | --metric-frequency HZ] [-x KNOB] [-X] [-s SYSCALL] [--accel ACCEL] { PROCESS_MONITORING | SYSTEM_MONITORING }
- PROCESS_MONITORING := { COMMAND | -- COMMAND [ARGS...] | -p PID }
- SYSTEM_MONITORING := { -a [ PROCESS_MONITORING ] }
lo2s creates OTF2-traces of uninstrumented processes and systems using the Linux perf(1) infrastructure, in particular the perf_event_open(2) system call.
It offers two modes of operation: process-monitoring mode and system-monitoring mode. In process-monitoring mode, lo2s will record information about a single process and its descendants. If COMMAND is given, lo2s acts as a prefix command, launching COMMAND and monitoring it until it exits. Passing -- before COMMAND stops option parsing in lo2s, allowing options to be passed to the command. In case you want to monitor a process which is already running, specify -p PID instead, where PID is the process identifier of interest. Monitoring will last until PID exits.
System-monitoring mode is enabled by passing the option -a. In this mode, lo2s will monitor processes on all available CPUs indefinitely. Optionally, if either COMMAND or PID is given, system monitoring will stop when their respective processes exit.
At any time, monitoring can be interrupted safely by sending SIGINT to lo2s.
Note that in order to access certain features, your system must be configured to grant additional permissions. The central point of configuration for perf events is the /proc/sys/kernel/perf_event_paranoid file. The value in this file, called the paranoid level, represents the level of access granted to perf features: the lower the value, the more features are available. Use sysctl(8) to change the level if needed:
# sysctl kernel.perf_event_paranoid=<level>
See perf_event_open(2) and https://docs.kernel.org/admin-guide/perf-security.html for a more detailed description. Options in OPTIONS are annotated with the paranoid level they require.
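As a quick orientation, the thresholds annotated in OPTIONS can be checked against the current paranoid level. The following is a minimal sketch, not part of lo2s; the function name lo2s_features_for_level and the feature labels are made up for illustration, and only the two thresholds stated on this page (kernel-space events at level 1 or lower, system-monitoring mode at level 0 or lower) are encoded.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with lo2s): print which of the features
# annotated in this manual page a given paranoid level permits.
lo2s_features_for_level() {
    lo2s_level=$1
    lo2s_features=""
    # --kernel: reading kernel-space events needs a level of at most 1.
    [ "$lo2s_level" -le 1 ] && lo2s_features="kernel-events"
    # -a: system-monitoring mode needs a level of at most 0.
    [ "$lo2s_level" -le 0 ] && lo2s_features="$lo2s_features system-monitoring"
    echo "${lo2s_features:-none}"
}

# Report for the running system if the sysctl file is readable.
lo2s_paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$lo2s_paranoid_file" ]; then
    lo2s_features_for_level "$(cat "$lo2s_paranoid_file")"
fi
```

Calling lo2s_features_for_level 2 prints "none", meaning that neither of the two annotated features is available at that level.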
- --help
-
Show a help message.
- --version
-
Print version information.
- -q, --quiet
-
Suppress all output, except for error messages.
- -v, --verbose
-
Verbose output. If specified multiple times, increase verbosity for each occurrence.
- NOTE:
-
this option takes precedence over -q.
- -o, --output-trace ARG (default: lo2s_trace_{DATE})
-
Save the generated trace in a directory specified by ARG. The argument to this option supports a simple templating mechanism. If it contains sequences in the form of {...}, each sequence is substituted before the argument is interpreted as a filesystem path. Substituted sequences are:
- {DATE}
-
The current date in the strftime(3) format of %Y-%m-%d_%H-%M-%S.
- {HOSTNAME}
-
The hostname(7) of the current system.
- {ENV=VAR}
-
Substitute the contents of environment variable VAR.
For example, running
$ lo2s -o /tmp/trace_{ENV=USER}_{HOSTNAME} -- ...
might save the resulting trace file in the directory /tmp/trace_hal3000_discovery-one.
- NOTE:
-
If --output-trace is not specified but the environment variable LO2S_OUTPUT_TRACE is set and not empty, its contents will be used to determine the trace path instead, including variable substitution.
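To preview what a template might resolve to before starting a measurement, the substitution rules listed above can be imitated in the shell. This is a rough sketch under stated assumptions and not the code lo2s uses: the function name expand_trace_template is invented, unknown {...} sequences are left untouched (how lo2s treats them is not documented here), and {ENV=VAR} is looked up among shell variables rather than strictly in the environment.

```shell
#!/bin/sh
# Hypothetical re-implementation of the documented template rules.
expand_trace_template() {
    ett_rest=$1
    ett_out=
    ett_ob='{'
    ett_cb='}'
    while :; do
        case $ett_rest in
            *"$ett_ob"*"$ett_cb"*)
                ett_out=$ett_out${ett_rest%%"$ett_ob"*}  # text before first '{'
                ett_rest=${ett_rest#*"$ett_ob"}          # drop through that '{'
                ett_key=${ett_rest%%"$ett_cb"*}          # sequence name
                ett_rest=${ett_rest#*"$ett_cb"}          # drop through the '}'
                case $ett_key in
                    DATE) ett_val=$(date +%Y-%m-%d_%H-%M-%S) ;;
                    HOSTNAME) ett_val=$(uname -n) ;;
                    ENV=*)
                        ett_name=${ett_key#ENV=}
                        case $ett_name in
                            ''|[0-9]*|*[!A-Za-z0-9_]*) ett_val= ;;  # not a valid name
                            *) eval "ett_val=\${$ett_name-}" ;;
                        esac ;;
                    # Assumption: unknown sequences are kept verbatim.
                    *) ett_val=$ett_ob$ett_key$ett_cb ;;
                esac
                ett_out=$ett_out$ett_val
                ;;
            *) break ;;
        esac
    done
    printf '%s\n' "$ett_out$ett_rest"
}
```

For instance, expand_trace_template '/tmp/trace_{ENV=USER}_{HOSTNAME}' prints the path that the example above would be expected to use for the calling user on the current host.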
- -p, --pid PID
-
Attach to a running process with process ID PID instead of launching COMMAND.
- -u, --drop-root
-
Launch COMMAND as the user that invoked sudo. Requires lo2s itself to be run via sudo.
- -U, --as-user USERNAME
-
Launch COMMAND as USERNAME. The caller needs to have CAP_SETUID. Will take priority over -u if specified together.
- -m, --mmap-pages N (default: 16)
-
Allocate N pages for each internal buffer shared between lo2s and the kernel. Higher values may reduce the number of lost samples at high sampling frequencies. The maximum amount of mappable memory per system is configured by /proc/sys/kernel/perf_event_mlock_kb.
- -i, --readout-interval MSEC (default: 100)
-
Wake up interval-based monitors (i.e. x86_adapt, x86_energy, sensors) every MSEC milliseconds to read event buffers.
- -I, --perf-readout-interval MSEC (default: 0)
-
Wake up perf-based monitors (i.e. sampling, metrics, tracepoints) at least every MSEC milliseconds to read event buffers. If MSEC is 0, interval-based readouts are disabled. Lower values should lead to more synchronous readouts but might increase the perturbation of your measurements by lo2s. Use in conjunction with --mmap-pages, --count and --metric-count to minimize lo2s's overhead for your measurements.
- -k, --clockid CLOCKID
-
Set the internal reference clock used as a source of timestamps. See --list-clockids for a list of supported arguments.
Beyond standard clocks, lo2s also supports a special "pebs" clock, which gives the same timestamps as "monotonic-raw" but is set up in a slightly different way to support the large PEBS feature of newer (Skylake+) Intel processors.
- --cgroup NAME
-
If set, only perf events for processes in the NAME cgroup are recorded.
- --list-clockids
-
List the names of clocks that can be used as CLOCKID argument.
- NOTE:
-
Available clocks are determined heuristically at compile time. Not all clocks may be usable when running lo2s on a system that is not the build system.
- --list-events
-
List available metric and sampling events. Listed events can be used as EVENT-arguments.
- --list-tracepoints
-
List available tracepoint events. Listed events can be used as TRACEPOINT-arguments.
- --list-knobs
-
List available x86_adapt.h(3) CPU configuration items. Use where KNOB is required.
- -a, --all-cpus
-
Start lo2s in system-monitoring mode. Running lo2s in this mode requires a paranoid level of at most 0.
- -A, --all-cpus-sampling
-
Shorthand option, equivalent to -a --instruction-sampling.
- --[no-]instruction-sampling
-
Enable or disable recording of instruction samples.
- -e, --event EVENT (default: instructions)
-
Set instruction sampling interrupt source event.
- -c, --count N (default: 11010113)
-
Record an instruction sample each time N instruction sampling interrupt source events have occurred (as specified by --event). The default value is chosen to be a prime number to avoid aliasing effects on repetitive instruction execution in tight loops.
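Since one sample is taken per N events, the resulting sample rate is roughly the event rate divided by N. The sketch below illustrates that arithmetic; the function name estimate_sample_rate and the figure of 3000000000 instructions per second are illustrative assumptions, not measurements.

```shell
#!/bin/sh
# Hypothetical helper: approximate samples per second for a given event rate.
# $1: events per second (assumed or measured by the user)
# $2: value passed to --count (defaults to the lo2s default of 11010113)
estimate_sample_rate() {
    esr_events_per_second=$1
    esr_count=${2:-11010113}
    # Integer division: one sample is recorded per esr_count events.
    echo $(( esr_events_per_second / esr_count ))
}

# A core assumed to retire 3 * 10^9 instructions per second with the default -c:
estimate_sample_rate 3000000000
```

With these assumed numbers the default yields about 272 samples per second per busy CPU. Halving N roughly doubles both the sample rate and the buffer pressure.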
- -g, --call-graph
-
Record call stack of instruction samples.
- --[no-]disassemble
-
Enable or disable augmentation of samples with disassembled instructions. Enabled by default if supported.
- --[no-]kernel
-
Enable or disable recording events happening in kernel space. Enabled by default. Reading events from kernel space requires a paranoid level of at most 1.
- -E, --metric-event EVENT
-
Record metrics for this perf event. May be specified multiple times to record metrics for more than one event. Try --userspace-metric-event if EVENT is not openable.
- --userspace-metric-event EVENT
-
This is a more compatible but slower version of -E.
- --standard-metrics
-
Enable a set of default events for metric recording.
- --metric-leader EVENT
-
The leading metric event when using event-count-based metric recording. Use in conjunction with --metric-count to control the number of events that have to elapse before a metric read is performed.
- --metric-count N
-
Controls the number of events that have to elapse before a metric read is performed by the kernel when using event-count-based metric recording with --metric-leader. Higher values reduce the overhead incurred by lo2s, but may lead to the kernel buffers overflowing, in which case lo2s will miss some metric events. Use in conjunction with -m to increase internal buffer sizes to reduce overhead and the risk of missing events. This option can only be used in conjunction with --metric-leader.
- --metric-frequency HZ
-
Set the frequency for time-interval-based metric recording, i.e. one readout every 1/HZ seconds. Cannot be used in conjunction with --metric-leader.
- --syscall SYSCALLS
-
Record syscall activity for the given syscall or "all" to record all syscalls. Can be given multiple times to record multiple syscalls at once. Argument may either be a syscall name, like "read", or a syscall number. Note that due to the high event-rate of many syscalls it is advised to keep the number of recorded syscalls limited.
This option is only available in system-monitoring mode.
- -x, --x86-adapt-knob KNOB
-
Record the x86_adapt.h(3) knob KNOB. See --list-knobs for a list of available arguments.
KNOB may be suffixed with #SUFFIX to indicate an OTF2 metric mode, where SUFFIX is one of the following:
absolute_point | point | p
-
OTF2_METRIC_ABSOLUTE_POINT
absolute_last | last | l
-
OTF2_METRIC_ABSOLUTE_LAST
accumulated_start | accumulated
-
OTF2_METRIC_ACCUMULATED_START
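The suffix aliases above can be read as a simple lookup. The following sketch mirrors that table in the shell; it is an illustration only, the function name knob_metric_mode and the knob name SOME_KNOB are placeholders, and the behavior for a missing suffix is not specified on this page, so the sketch merely reports that case.

```shell
#!/bin/sh
# Hypothetical lookup of the OTF2 metric mode selected by a KNOB#SUFFIX argument.
knob_metric_mode() {
    kmm_knob=$1
    kmm_hash='#'
    case $kmm_knob in
        *"$kmm_hash"*) kmm_suffix=${kmm_knob##*"$kmm_hash"} ;;
        # Assumption: the default mode is not documented here, so only report it.
        *) echo "no-suffix"; return 0 ;;
    esac
    case $kmm_suffix in
        absolute_point|point|p) echo OTF2_METRIC_ABSOLUTE_POINT ;;
        absolute_last|last|l) echo OTF2_METRIC_ABSOLUTE_LAST ;;
        accumulated_start|accumulated) echo OTF2_METRIC_ACCUMULATED_START ;;
        *) echo "unknown suffix: $kmm_suffix" >&2; return 1 ;;
    esac
}

# SOME_KNOB is a placeholder; see --list-knobs for real names.
knob_metric_mode 'SOME_KNOB#last'
```

When passing such an argument to lo2s on the command line, quote it (for example -x 'SOME_KNOB#last') so that the shell does not treat the # as special.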
- -X
-
Record x86_energy.h(3) values.
- --block-io
-
Record block I/O events using the block:block_rq_insert tracepoint for begin events and block:block_rq_complete tracepoint for end events specifically.
- --block-io-cache-size NUM
-
Size of the per-CPU cache in number-of-events. A larger cache size might increase performance but comes at the cost of a higher memory footprint.
- -S, --sensors
-
Record measurements for each sensor found by sensors(1).
- --accel ACCEL
-
Record activity events (instruction samples or kernel execution information) for the given accelerator. Usable accelerators are "nec" for NEC SX-Aurora and "nvidia" for NVIDIA CUDA accelerators.
- --nec-readout-interval USEC
-
Set the interval (in microseconds) between NEC SX-Aurora instruction samples.
- --nec-check-interval MSEC
-
Set the interval (in milliseconds) between checks for new NEC SX-Aurora processes.
- EVENT
-
The name of a perf event. Format is one of the following:
name
-
A predefined event.
pmu/event/ or pmu:event
-
A kernel PMU event. Kernel PMU events can be found under /sys/bus/event_source/devices/<pmu>/events/<event>.
rNNNN
-
A raw event, where NNNN is the hexadecimal identifier of the event.
See --list-events for a list of events available.
- TRACEPOINT
-
The name of a kernel tracepoint event. Format is one of
group:name or group/name
Tracepoint events can be found under /sys/kernel/debug/tracing/events/<group>/<name>. Use --list-tracepoints to get a list of tracepoint events.
Using perf-probe(1), it is possible to define dynamic tracepoints for use with lo2s. Consider the C function
void __attribute__((optimize("O0"))) my_marker(int some_variable) { ... }
that is compiled into a.out. Running
# perf probe -x ./a.out my_marker some_variable
will create a dynamic tracepoint for this function available for use with lo2s:
$ lo2s -t probe_a:my_marker ...
- PERMISSIONS:
-
Recording tracepoint events usually requires both read and execute permissions on /sys/kernel/debug.
To mount the debugfs execute:
# mount -t debugfs none /sys/kernel/debug
Non-root access requires you to change the ownership of the debugfs and execute with kernel.perf_event_paranoid set to -1:
# chown -R myusername /sys/kernel/debug
# sysctl kernel.perf_event_paranoid=-1
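Both TRACEPOINT spellings name the same directory in the tracing events tree described above. A small sketch of that mapping follows; the function name tracepoint_path is invented for illustration and the script is not part of lo2s.

```shell
#!/bin/sh
# Hypothetical helper: map a TRACEPOINT argument (group:name or group/name)
# to its directory under /sys/kernel/debug/tracing/events.
tracepoint_path() {
    tp_spec=$1
    case $tp_spec in
        *:*) tp_group=${tp_spec%%:*}; tp_name=${tp_spec#*:} ;;
        */*) tp_group=${tp_spec%%/*}; tp_name=${tp_spec#*/} ;;
        *) echo "not a tracepoint: $tp_spec" >&2; return 1 ;;
    esac
    echo "/sys/kernel/debug/tracing/events/$tp_group/$tp_name"
}

tracepoint_path block:block_rq_insert
```

Testing the printed path with [ -r ... ] before a measurement is one way to tell whether the permissions described above are in place.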
- CLOCKID
-
The name of a system clock. Use --list-clockids to get a list of clocks.
- LO2S_OUTPUT_TRACE
-
See --output-trace.
- LO2S_OUTPUT_LINK
-
If this variable is set, its contents specify the path at which a symbolic link to the generated trace directory will be created. If the path does not exist, a new symbolic link to the latest trace is created. If it exists and is a symbolic link, the link is updated to point to the latest trace directory. Otherwise, a warning is issued.
Setting this variable might be useful should you find yourself repeatedly switching between generating traces and analyzing them. Instructing your preferred trace analysis software to open the trace from the symlinked directory allows you to quickly view the latest version by simply reloading it.
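The three cases described for LO2S_OUTPUT_LINK can be pictured with the following shell sketch. It only imitates the documented behavior and is not taken from lo2s; the function name update_trace_link, the warning text, and the non-zero return status in the warning case are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical imitation of the documented LO2S_OUTPUT_LINK handling.
# $1: link path, $2: trace directory the link should point to
update_trace_link() {
    utl_link=$1
    utl_target=$2
    if [ -L "$utl_link" ]; then
        # Existing symbolic link (possibly dangling): point it at the new trace.
        rm -f "$utl_link" && ln -s "$utl_target" "$utl_link"
    elif [ ! -e "$utl_link" ]; then
        # Nothing there yet: create a fresh link.
        ln -s "$utl_target" "$utl_link"
    else
        # A regular file or directory is in the way: warn and leave it alone.
        echo "warning: $utl_link exists and is not a symbolic link" >&2
        return 1
    fi
}
```

The symbolic-link test comes first on purpose: a dangling link does not "exist" for [ -e ], but should still be re-pointed rather than treated as a missing path.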
- LO2S_METRIC_PLUGINS
-
A comma separated list of metric plugins to load.
- LO2S_METRIC_PLUGIN or LO2S_METRIC_PLUGIN_PLUGIN
-
A comma separated list of metric events to record for the plugin PLUGIN.
- NOTE:
-
For compatibility with scorep(1), all environment variables starting with LO2S_METRIC_ may be prefixed with SCOREP instead of LO2S. Plugins themselves may use additional environment variables for configuration that allow only the SCOREP prefix.
Performance problems in lo2s may lead to information missing from the trace due to event loss, and to skewed results due to excessive lo2s activity perturbing the recorded metrics. lo2s contains several knobs that may be used to optimize its performance.
--mmap-pages governs the amount of memory that is allocated to each perf event-reading buffer. For measuring modes that use the perf interface to collect data (e.g. sampling, metrics, tracepoints or block I/O), this is the most effective tuning knob, as larger buffers result in a longer time until buffer overflow and less lo2s activity during measurement.
Because pages used by perf need to be locked into memory (and thus cannot be swapped out), the values that can be used for --mmap-pages are limited by RLIMIT_MEMLOCK and perf_event_mlock_kb.
perf_event_mlock_kb governs the amount of memory a user can lock per-CPU specifically for use with perf. The current perf_event_mlock_kb limit can be read or set from /proc/sys/kernel/perf_event_mlock_kb.
If the user wants to lock more memory than the perf_event_mlock_kb limit allows, the pool RLIMIT_MEMLOCK is used, which is shared between all processes that lock memory. RLIMIT_MEMLOCK can be read or set via the ulimit -l command.
Privileged users, users possessing the CAP_IPC_LOCK capability, and users running on a system with perf_event_paranoid set to -1 can lock unlimited amounts of memory and are thus not limited in the value that can be set for --mmap-pages.
As the number of active perf buffers can vary widely between different lo2s use cases, no general rule for adjusting --mmap-pages according to the RLIMIT_MEMLOCK and perf_event_mlock_kb limits can be given. The user is advised to discover the ideal value for --mmap-pages through trial and error, as lo2s will report failures related to mmap buffer creation early during startup.
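As a starting point for that trial and error, the memory needed by a single buffer can be estimated from the page count and compared with the configured limit. This is a rough sketch, not a lo2s facility: the function name mmap_buffer_kib is made up, and the estimate ignores the extra metadata page that the perf ring-buffer layout uses, since this page does not state whether -m counts it.

```shell
#!/bin/sh
# Hypothetical helper: approximate size in KiB of one buffer of N pages.
# $1: value passed to --mmap-pages
# $2: page size in bytes (defaults to the system page size)
mmap_buffer_kib() {
    mbk_pages=$1
    mbk_page_size=${2:-$(getconf PAGESIZE)}
    echo $(( mbk_pages * mbk_page_size / 1024 ))
}

# Compare the default of 16 pages with the current limit, if it is readable.
mbk_limit_file=/proc/sys/kernel/perf_event_mlock_kb
if [ -r "$mbk_limit_file" ] && command -v getconf >/dev/null 2>&1; then
    echo "one buffer with -m 16: $(mmap_buffer_kib 16) KiB"
    echo "perf_event_mlock_kb:   $(cat "$mbk_limit_file") KiB"
fi
```

With 4096-byte pages, the default of 16 pages comes to 64 KiB per buffer; the total depends on how many buffers the chosen measurement modes open.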
Block I/O events are cached per-CPU before they are written into a global block I/O cache. --block-io-cache-size governs the number of elements the per-CPU block I/O cache holds. Increasing the number of elements in the block I/O cache reduces the number of lo2s wake-ups spent on merging block I/O caches.