diff --git a/docs/conceptual/omnitrace-feature-set.rst b/docs/conceptual/omnitrace-feature-set.rst index 7ea52a212..4a8aceafb 100644 --- a/docs/conceptual/omnitrace-feature-set.rst +++ b/docs/conceptual/omnitrace-feature-set.rst @@ -7,7 +7,7 @@ The Omnitrace feature set and use cases *************************************** `Omnitrace `_ is designed to be highly extensible. -Internally, it leverages the `timemory performance analysis toolkit `_ +Internally, it leverages the `Timemory performance analysis toolkit `_ to manage extensions, resources, data, and other items. It supports the following features, modes, metrics, and APIs. diff --git a/docs/how-to/configuring-runtime-options.rst b/docs/how-to/configuring-runtime-options.rst index 0303042e4..d4d458249 100644 --- a/docs/how-to/configuring-runtime-options.rst +++ b/docs/how-to/configuring-runtime-options.rst @@ -26,7 +26,7 @@ use the ``omnitrace-avail -G ~/.omnitrace.cfg --all`` option for a verbose configuration file with descriptions, categories, and additional information. Modify ``${HOME}/.omnitrace.cfg`` as required. For example, enable `Perfetto `_, -`timemory `_, sampling, and process-level sampling by default +`Timemory `_, sampling, and process-level sampling by default and tweak the default sampling values. .. code-block:: shell @@ -62,7 +62,7 @@ accepts a case insensitive match for nearly all common Boolean logic expressions Exploring components ----------------------------------- -Omnitrace uses `timemory `_ extensively to provide +Omnitrace uses `Timemory `_ extensively to provide various capabilities and manage data and resources. By default, with ``OMNITRACE_PROFILE=ON``, Omnitrace only collects wall-clock timing values. However, by modifying the ``OMNITRACE_TIMEMORY_COMPONENTS`` setting, @@ -94,7 +94,7 @@ Omnitrace supports hardware counter collection via PAPI and ROCm. Generally, PAPI is used to collect CPU-based hardware counters and ROCm is used to collect GPU-based hardware counters. Although it is possible to install PAPI with ROCm support and use it to collect GPU-based hardware counters, this is not recommended because PAPI -cannot simultaneously collect CPU hardware counters. +cannot simultaneously collect CPU and GPU hardware counters. To view all possible hardware counters and their descriptions, use the following command: diff --git a/docs/how-to/configuring-validating-environment.rst b/docs/how-to/configuring-validating-environment.rst index acbe0cd7a..c3b6a9a43 100644 --- a/docs/how-to/configuring-validating-environment.rst +++ b/docs/how-to/configuring-validating-environment.rst @@ -9,6 +9,11 @@ Configuring and validating the environment After installing the `Omnitrace `_ application, some additional steps are required to set up and validate the environment. +.. note:: + + The following instructions use the installation path ``/opt/omnitrace``. If + Omnitrace is installed elsewhere, substitute the actual installation path. + Configuring the environment ======================================== diff --git a/docs/how-to/sampling-call-stack.rst b/docs/how-to/sampling-call-stack.rst index c3a351e94..84e6d1ea4 100644 --- a/docs/how-to/sampling-call-stack.rst +++ b/docs/how-to/sampling-call-stack.rst @@ -6,10 +6,10 @@ Sampling the call stack **************************************************** -`Omnitrace `_ call-stack sampling can be activated -with either a binary instrumented by the ``omnitrace`` executable -or by using the ``omnitrace-sample`` executable. -All of the following commands are effectively equivalent: +`Omnitrace `_ can use call-stack sampling +on a binary instrumented with either the ``omnitrace`` executable +or the ``omnitrace-sample`` executable. +For example, all of the following commands are effectively equivalent: * Binary rewrite with only the instrumentation necessary to start and stop sampling @@ -39,8 +39,10 @@ does is wrap the ``main`` of the executable with initialization before ``main`` starts and finalization after ``main`` ends. This can be accomplished without instrumentation through a ``LD_PRELOAD`` of a library containing a dynamic symbol wrapper around ``__libc_start_main``. -As a result, whenever binary instrumentation is unnecessary, the use of ``omnitrace-sample`` -is recommended over ``omnitrace-instrument -M sampling`` for several reasons: + +The use of ``omnitrace-sample`` is **recommended** over +``omnitrace-instrument -M sampling`` when binary instrumentation +is not necessary. This is for a number of reasons: * ``omnitrace-sample`` provides command-line options for controlling the Omnitrace feature set instead of requiring configuration files or environment variables @@ -51,10 +53,11 @@ is recommended over ``omnitrace-instrument -M sampling`` for several reasons: In the best-case scenario when the target binary is relatively small, instrumented-sampling has a slightly slower launch time, but in the worst case scenarios it requires a significant amount of time and memory to launch. -* ``omnitrace-sample`` is fully compatible with MPI, for example in the command ``mpirun -n 2 omnitrace-sample -- foo``, +* ``omnitrace-sample`` is fully compatible with MPI. For example, + the command ``mpirun -n 2 omnitrace-sample -- foo`` is valid, whereas ``mpirun -n 2 omnitrace-instrument -M sampling -- foo`` - is incompatible with some MPI distributions (particularly OpenMPI) because of - MPI restrictions against forking within an MPI rank + is incompatible with some MPI distributions (particularly OpenMPI). This is because + MPI prohibits forking within an MPI rank. * When MPI and binary instrumentation are both involved, two steps are required: performing a binary rewrite of the executable and then using the instrumented executable @@ -69,6 +72,7 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option: $ omnitrace-sample --help [omnitrace-sample] Usage: omnitrace-sample [ --help (count: 0, dtype: bool) + --version (count: 0, dtype: bool) --monochrome (max: 1, dtype: bool) --debug (max: 1, dtype: bool) --verbose (count: 1) @@ -79,9 +83,15 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option: --flat-profile (max: 1, dtype: bool) --host (max: 1, dtype: bool) --device (max: 1, dtype: bool) + --wait (count: 1) + --duration (count: 1) --trace-file (count: 1, dtype: filepath) --trace-buffer-size (count: 1, dtype: KB) --trace-fill-policy (count: 1) + --trace-wait (count: 1) + --trace-duration (count: 1) + --trace-periods (min: 1) + --trace-clock-id (count: 1) --profile-format (min: 1) --profile-diff (min: 1) --process-freq (count: 1) @@ -90,8 +100,8 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option: --cpus (count: unlimited, dtype: int or range) --gpus (count: unlimited, dtype: int or range) --freq (count: 1) - --wait (count: 1) - --duration (count: 1) + --sampling-wait (count: 1) + --sampling-duration (count: 1) --tids (min: 1) --cputime (min: 0) --realtime (min: 0) @@ -101,98 +111,121 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option: --gpu-events (count: unlimited) --inlines (max: 1, dtype: bool) --hsa-interrupt (count: 1, dtype: int) - ] - + ] Options: - -h, -?, --help Shows this page - - [DEBUG OPTIONS] - - --monochrome Disable colorized output - --debug Debug output - -v, --verbose Verbose output - - [GENERAL OPTIONS] - - -c, --config Configuration file - -o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix - -T, --trace Generate a detailed trace (perfetto output) - -P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) - -F, --flat-profile Generate a flat profile (conflicts with --profile) - -H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. - -D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. - - [TRACING OPTIONS] - - --trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. - --trace-buffer-size Size limit for the trace output (in KB) + -h, -?, --help Shows this page (count: 0, dtype: bool) + --version Prints the version and exit (count: 0, dtype: bool) + + [DEBUG OPTIONS] + + --monochrome Disable colorized output (max: 1, dtype: bool) + --debug Debug output (max: 1, dtype: bool) + -v, --verbose Verbose output (count: 1) + + [GENERAL OPTIONS] These are options which are ubiquitously applied + + -c, --config Configuration file (min: 0, dtype: filepath) + -o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1) + -T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool) + -P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool) + -F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool) + -H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool) + -D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool) + -w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options. + (count: 1) + -d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two + options. (count: 1) + + [TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event) + + --trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1, + dtype: filepath) + --trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB) --trace-fill-policy [ discard | ring_buffer ] - + Policy for new data when the buffer size limit is reached: - discard : new data is ignored - - ring_buffer : new data overwrites oldest data - - [PROFILE OPTIONS] - + - ring_buffer : new data overwrites oldest data (count: 1) + --trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is + in seconds of realtime but that can changed via --trace-clock-id. (count: 1) + --trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of + realtime but that can changed via --trace-clock-id. (count: 1) + --trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: :, + ::, and/or :::. (min: 1) + --trace-clock-id [ 0 (realtime|CLOCK_REALTIME) + 1 (monotonic|CLOCK_MONOTONIC) + 2 (cputime|CLOCK_PROCESS_CPUTIME_ID) + 4 (monotonic_raw|CLOCK_MONOTONIC_RAW) + 5 (realtime_coarse|CLOCK_REALTIME_COARSE) + 6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE) + 7 (boottime|CLOCK_BOOTTIME) ] + Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be + scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would + equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request + for omnitrace to auto-scale based on the number of threads. (count: 1) + + [PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary) + --profile-format [ console | json | text ] - Data formats for profiling results - --profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters - corresponding to the input path and the input prefix - + Data formats for profiling results (min: 1) + --profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters + corresponding to the input path and the input prefix (min: 1) + [HOST/DEVICE (PROCESS SAMPLING) OPTIONS] - - - --process-freq Set the default host/device sampling frequency (number of interrupts per second) - --process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) - --process-duration Set the duration of the host/device sampling (in seconds of realtime) - --cpus CPU IDs for frequency sampling. Supports integers and/or ranges - --gpus GPU IDs for SMI queries. Supports integers and/or ranges - - [GENERAL SAMPLING OPTIONS] - - -f, --freq Set the default sampling frequency (number of interrupts per second) - -w, --wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock - of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime - -d, --duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time - delay that exceeds the real-time duration... resulting in zero samples being taken - -t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target - application is assigned an atomically incrementing value. - - [SAMPLING TIMER OPTIONS] - + Process sampling is background measurements for resources available to the entire process. These samples are not tied + to specific lines/regions of code + + --process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1) + --process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1) + --process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1) + --cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int or range) + --gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int or range) + + [GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread + + -f, --freq Set the default sampling frequency (number of interrupts per second) (count: 1) + --sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock + of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1) + --sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time + delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1) + -t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target + application is assigned an atomically incrementing value. (min: 1) + + [SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample + --cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments: - 1. Enables sampling based on CPU-clock timer. - 2. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time. - 3. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample. + 0. Enables sampling based on CPU-clock timer. + 1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time. + 2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample. 3+ Thread IDs to target for sampling, starting at 0 (the main thread). May be specified as index or range, e.g., '0 2-4' will be interpreted as: - sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads + sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads (min: 0) --realtime Sample based on a real-clock timer. Accepts zero or more arguments: - 1. Enables sampling based on real-clock timer. - 2. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime. - 3. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample. + 0. Enables sampling based on real-clock timer. + 1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime. + 2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample. 3+ Thread IDs to target for sampling, starting at 0 (the main thread). May be specified as index or range, e.g., '0 2-4' will be interpreted as: sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle" to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples) - whereas the CPU-clock time does not. - - [BACKEND OPTIONS] (These options control region information captured w/o sampling or instrumentation) - + whereas the CPU-clock time does not. (min: 0) + + [BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation + -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ] - Include data from these backends + Include data from these backends (count: unlimited) -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ] - Exclude data from these backends - - [HARDWARE COUNTER OPTIONS] - - -C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) - -G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) - - [MISCELLANEOUS OPTIONS] - - -i, --inlines Include inline info in output when available + Exclude data from these backends (count: unlimited) + + [HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H + + -C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (count: unlimited) + -G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (count: unlimited) + + [MISCELLANEOUS OPTIONS] + + -i, --inlines Include inline info in output when available (max: 1, dtype: bool) --hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable. ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to @@ -200,7 +233,7 @@ View the help menu of ``omnitrace-sample`` with the ``-h`` / ``--help`` option: performance. Values: 0 avoid triggering the bug, potentially at the cost of reduced performance - 1 do not modify how ROCm is notified about kernel completion + 1 do not modify how ROCm is notified about kernel completion (count: 1, dtype: int) The general syntax for separating Omnitrace command-line arguments from the following application arguments @@ -218,110 +251,115 @@ establishes the precedence of environment variable values over values specified in the configuration files. This enables you to configure the Omnitrace runtime to your preferred default behavior in a file such as ``~/.omnitrace.cfg`` and then easily override -those settings using a command like ``OMNITRACE_ENABLED=OFF omnitrace-sample -- foo``. +those settings in the command line, for example, ``OMNITRACE_ENABLED=OFF omnitrace-sample -- foo``. Similarly, the command-line arguments passed to ``omnitrace-sample`` take precedence over environment variables. All of the command-line options above correlate to one or more configuration settings, for example, ``--cpu-events`` correlates to the ``OMNITRACE_PAPI_EVENTS`` configuration variable. -After the command-line arguments to ``omnitrace-sample`` have been processed but -before the target application runs, ``omnitrace-sample`` creates a log -showing which environment variables were set or modified: - -The snippet below shows the environment updates when ``omnitrace-sample`` is invoked with no arguments - -.. code-block:: shell - - $ omnitrace-sample -- ./parallel-overhead-locks 30 4 100 - - HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 - HSA_TOOLS_REPORT_LOAD_FAILURE=1 - LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 - OMNITRACE_USE_PROCESS_SAMPLING=false - OMNITRACE_USE_SAMPLING=true - OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 - ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1 - -The snippet below shows the environment updates when ``omnitrace-sample`` enables -profiling, tracing, host process-sampling, device process-sampling, and all the available backends: - -.. code-block:: shell - - $ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100 - - HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 - HSA_TOOLS_REPORT_LOAD_FAILURE=1 - KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1 - LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 - OMNITRACE_CPU_FREQ_ENABLED=true - OMNITRACE_TRACE_THREAD_LOCKS=true - OMNITRACE_TRACE_THREAD_RW_LOCKS=true - OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true - OMNITRACE_USE_KOKKOSP=true - OMNITRACE_USE_MPIP=true - OMNITRACE_USE_OMPT=true - OMNITRACE_TRACE=true - OMNITRACE_USE_PROCESS_SAMPLING=true - OMNITRACE_USE_RCCLP=true - OMNITRACE_USE_ROCM_SMI=true - OMNITRACE_USE_ROCPROFILER=true - OMNITRACE_USE_ROCTRACER=true - OMNITRACE_USE_ROCTX=true - OMNITRACE_USE_SAMPLING=true - OMNITRACE_PROFILE=true - OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 - ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1 - ... - -The snippet below shows the environment updates when ``omnitrace-sample`` enables -profiling, tracing, host process-sampling, and device process-sampling, -sets the output path to ``omnitrace-output`` and the output prefix to ``%tag%``, and disables -all the available backends: - -.. code-block:: shell - - $ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100 - - LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 - OMNITRACE_CPU_FREQ_ENABLED=true - OMNITRACE_OUTPUT_PATH=omnitrace-output - OMNITRACE_OUTPUT_PREFIX=%tag% - OMNITRACE_TRACE_THREAD_LOCKS=false - OMNITRACE_TRACE_THREAD_RW_LOCKS=false - OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false - OMNITRACE_USE_KOKKOSP=false - OMNITRACE_USE_MPIP=false - OMNITRACE_USE_OMPT=false - OMNITRACE_TRACE=true - OMNITRACE_USE_PROCESS_SAMPLING=true - OMNITRACE_USE_RCCLP=false - OMNITRACE_USE_ROCM_SMI=false - OMNITRACE_USE_ROCPROFILER=false - OMNITRACE_USE_ROCTRACER=false - OMNITRACE_USE_ROCTX=false - OMNITRACE_USE_SAMPLING=true - OMNITRACE_PROFILE=true - ... +``omnitrace-sample`` processes the arguments and outputs a summary of its configuration +before running the target application. + +The following snippets show how ``omnitrace-sample`` runs with various environment updates. + +* This snippet shows the environment updates when ``omnitrace-sample`` is invoked with no arguments: + + .. code-block:: shell + + $ omnitrace-sample -- ./parallel-overhead-locks 30 4 100 + + HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + HSA_TOOLS_REPORT_LOAD_FAILURE=1 + LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + OMNITRACE_USE_PROCESS_SAMPLING=false + OMNITRACE_USE_SAMPLING=true + OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1 + +* The next snippet shows the environment updates when ``omnitrace-sample`` enables + profiling, tracing, host process-sampling, device process-sampling, and all the available backends: + + .. code-block:: shell + + $ omnitrace-sample -PTDH -I all -- ./parallel-overhead-locks 30 4 100 + + HSA_TOOLS_LIB=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + HSA_TOOLS_REPORT_LOAD_FAILURE=1 + KOKKOS_PROFILE_LIBRARY=/opt/omnitrace/lib/libomnitrace.so.1.7.1 + LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + OMNITRACE_CPU_FREQ_ENABLED=true + OMNITRACE_TRACE_THREAD_LOCKS=true + OMNITRACE_TRACE_THREAD_RW_LOCKS=true + OMNITRACE_TRACE_THREAD_SPIN_LOCKS=true + OMNITRACE_USE_KOKKOSP=true + OMNITRACE_USE_MPIP=true + OMNITRACE_USE_OMPT=true + OMNITRACE_TRACE=true + OMNITRACE_USE_PROCESS_SAMPLING=true + OMNITRACE_USE_RCCLP=true + OMNITRACE_USE_ROCM_SMI=true + OMNITRACE_USE_ROCPROFILER=true + OMNITRACE_USE_ROCTRACER=true + OMNITRACE_USE_ROCTX=true + OMNITRACE_USE_SAMPLING=true + OMNITRACE_PROFILE=true + OMP_TOOL_LIBRARIES=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + ROCP_TOOL_LIB=/opt/omnitrace/lib/libomnitrace.so.1.7.1 + ... + +* The final snippet shows the environment updates when ``omnitrace-sample`` enables + profiling, tracing, host process-sampling, and device process-sampling, + sets the output path to ``omnitrace-output`` and the output prefix to ``%tag%``, and disables + all the available backends: + + .. code-block:: shell + + $ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100 + + LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + OMNITRACE_CPU_FREQ_ENABLED=true + OMNITRACE_OUTPUT_PATH=omnitrace-output + OMNITRACE_OUTPUT_PREFIX=%tag% + OMNITRACE_TRACE_THREAD_LOCKS=false + OMNITRACE_TRACE_THREAD_RW_LOCKS=false + OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false + OMNITRACE_USE_KOKKOSP=false + OMNITRACE_USE_MPIP=false + OMNITRACE_USE_OMPT=false + OMNITRACE_TRACE=true + OMNITRACE_USE_PROCESS_SAMPLING=true + OMNITRACE_USE_RCCLP=false + OMNITRACE_USE_ROCM_SMI=false + OMNITRACE_USE_ROCPROFILER=false + OMNITRACE_USE_ROCTRACER=false + OMNITRACE_USE_ROCTX=false + OMNITRACE_USE_SAMPLING=true + OMNITRACE_PROFILE=true + ... An ``omnitrace-sample`` example ======================================== +Here is the full output from the previous +``omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -- ./parallel-overhead-locks 30 4 100`` command: + .. code-block:: shell $ omnitrace-sample -PTDH -E all -o omnitrace-output %tag% -c -- ./parallel-overhead-locks 30 4 100 - LD_PRELOAD=/opt/omnitrace/lib/libomnitrace-dl.so.1.7.1 + LD_PRELOAD=/home/gliff/code/omnitrace/build-release/lib/libomnitrace-dl.so.1.11.3 OMNITRACE_CONFIG_FILE= OMNITRACE_CPU_FREQ_ENABLED=true OMNITRACE_OUTPUT_PATH=omnitrace-output OMNITRACE_OUTPUT_PREFIX=%tag% + OMNITRACE_PROFILE=true + OMNITRACE_TRACE=true OMNITRACE_TRACE_THREAD_LOCKS=false OMNITRACE_TRACE_THREAD_RW_LOCKS=false OMNITRACE_TRACE_THREAD_SPIN_LOCKS=false OMNITRACE_USE_KOKKOSP=false OMNITRACE_USE_MPIP=false OMNITRACE_USE_OMPT=false - OMNITRACE_TRACE=true OMNITRACE_USE_PROCESS_SAMPLING=true OMNITRACE_USE_RCCLP=false OMNITRACE_USE_ROCM_SMI=false @@ -329,21 +367,16 @@ An ``omnitrace-sample`` example OMNITRACE_USE_ROCTRACER=false OMNITRACE_USE_ROCTX=false OMNITRACE_USE_SAMPLING=true - OMNITRACE_PROFILE=true - - [omnitrace][omnitrace_init_tooling] Instrumentation mode: Sampling - - - ______ .___ ___. .__ __. __ .___________..______ ___ ______ _______ + [omnitrace][dl][1785877] omnitrace_main + [omnitrace][1785877][omnitrace_init_tooling] Instrumentation mode: Sampling + ______ .___ ___. .__ __. __ .___________..______ ___ ______ _______ / __ \ | \/ | | \ | | | | | || _ \ / \ / || ____| - | | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__ - | | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __| - | `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____ + | | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__ + | | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __| + | `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____ \______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______| - - - [759.689] perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: "" - + omnitrace v1.11.2 (rev: 2586b74db8bf335742600010b8d9f1ce8da9cf89, compiler: GNU v11.4.1, rocm: v6.1.x) + [988.958] perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: "" [parallel-overhead-locks] Threads: 4 [parallel-overhead-locks] Iterations: 100 [parallel-overhead-locks] fibonacci(30)... @@ -351,30 +384,21 @@ An ``omnitrace-sample`` example [2] number of iterations: 100 [3] number of iterations: 100 [4] number of iterations: 100 - [parallel-overhead-locks] fibonacci(30) x 4 = 394644873 + [parallel-overhead-locks] fibonacci(30) x 4 = 409221992 [parallel-overhead-locks] number of mutex locks = 400 - [omnitrace][107157][0][omnitrace_finalize] - [omnitrace][107157][0][omnitrace_finalize] finalizing... - [omnitrace][107157][0][omnitrace_finalize] - [omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157 : 0.610427 sec wall_clock, 2.248 MB peak_rss, 2.265 MB page_rss, 2.560000 sec cpu_clock, 419.4 % cpu_util [laps: 1] - [omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/0 : 0.608866 sec wall_clock, 0.000677 sec thread_cpu_clock, 0.1 % thread_cpu_util, 2.248 MB peak_rss [laps: 1] - [omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/1 : 0.608237 sec wall_clock, 0.603553 sec thread_cpu_clock, 99.2 % thread_cpu_util, 2.204 MB peak_rss [laps: 1] - [omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/2 : 0.601430 sec wall_clock, 0.598378 sec thread_cpu_clock, 99.5 % thread_cpu_util, 1.156 MB peak_rss [laps: 1] - [omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/3 : 0.570223 sec wall_clock, 0.568713 sec thread_cpu_clock, 99.7 % thread_cpu_util, 0.772 MB peak_rss [laps: 1] - [omnitrace][107157][0][omnitrace_finalize] omnitrace/process/107157/thread/4 : 0.557637 sec wall_clock, 0.557198 sec thread_cpu_clock, 99.9 % thread_cpu_util, 0.156 MB peak_rss [laps: 1] - [omnitrace][107157][0][omnitrace_finalize] - [omnitrace][107157][0][omnitrace_finalize] Finalizing perfetto... - [omnitrace][107157][perfetto]> Outputting '/home/user/data/omnitrace-output/2022-10-19_02.46/parallel-overhead-locksperfetto-trace-107157.proto' (842.90 KB / 0.84 MB / 0.00 GB)... Done - [omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.json' - [omnitrace][107157][trip_count]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockstrip_count-107157.txt' - [omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.json' - [omnitrace][107157][sampling_percent]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_percent-107157.txt' - [omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.json' - [omnitrace][107157][sampling_cpu_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_cpu_clock-107157.txt' - [omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.json' - [omnitrace][107157][sampling_wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockssampling_wall_clock-107157.txt' - [omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.json' - [omnitrace][107157][wall_clock]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-lockswall_clock-107157.txt' - [omnitrace][107157][metadata]> Outputting 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksmetadata-107157.json' and 'omnitrace-output/2022-10-19_02.46/parallel-overhead-locksfunctions-107157.json' - [omnitrace][107157][0][omnitrace_finalize] Finalized - [761.584] perfetto.cc:57382 Tracing session 1 ended, total sessions:0 + [omnitrace][1785877][0][omnitrace_finalize] finalizing... + [omnitrace][1785877][0][omnitrace_finalize] + [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877 : 0.294342 sec wall_clock, 4.776 MB peak_rss, 3.170 MB page_rss, 0.990000 sec cpu_clock, 336.3 % cpu_util [laps: 1] + [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/0 : 0.291535 sec wall_clock, 0.002619 sec thread_cpu_clock, 0.9 % thread_cpu_util, 4.776 MB peak_rss [laps: 1] + [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/1 : 0.271353 sec wall_clock, 0.222572 sec thread_cpu_clock, 82.0 % thread_cpu_util, 4.200 MB peak_rss [laps: 1] + [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/2 : 0.238218 sec wall_clock, 0.206405 sec thread_cpu_clock, 86.6 % thread_cpu_util, 3.432 MB peak_rss [laps: 1] + [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/3 : 0.209459 sec wall_clock, 0.193415 sec thread_cpu_clock, 92.3 % thread_cpu_util, 2.472 MB peak_rss [laps: 1] + [omnitrace][1785877][0][omnitrace_finalize] omnitrace/process/1785877/thread/4 : 0.212029 sec wall_clock, 0.211694 sec thread_cpu_clock, 99.8 % thread_cpu_util, 1.152 MB peak_rss [laps: 1] + [omnitrace][1785877][0][omnitrace_finalize] + [omnitrace][1785877][0][omnitrace_finalize] Finalizing perfetto... + [omnitrace][1785877][perfetto]> Outputting '/home/gliff/code/omnitrace/build-release/omnitrace-output/2024-07-15_16.21/parallel-overhead-locksperfetto-trace-1785877.proto' (39.12 KB / 0.04 MB / 0.00 GB)... Done + [omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.json' + [omnitrace][1785877][wall_clock]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-lockswall_clock-1785877.txt' + [omnitrace][1785877][metadata]> Outputting 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksmetadata-1785877.json' and 'omnitrace-output/2024-07-15_16.21/parallel-overhead-locksfunctions-1785877.json' + [omnitrace][1785877][0][omnitrace_finalize] Finalized: 0.054582 sec wall_clock, 0.000 MB peak_rss, -1.798 MB page_rss, 0.040000 sec cpu_clock, 73.3 % cpu_util + [989.312] perfetto.cc:60128 Tracing session 1 ended, total sessions:0 diff --git a/docs/how-to/understanding-omnitrace-output.rst b/docs/how-to/understanding-omnitrace-output.rst index d70f69315..b6ac40797 100644 --- a/docs/how-to/understanding-omnitrace-output.rst +++ b/docs/how-to/understanding-omnitrace-output.rst @@ -263,7 +263,7 @@ Output prefix keys Output prefix keys have many uses but are most helpful when dealing with multiple profiling runs or large MPI jobs. -Their are included in Omnitrace because they were introduced into timemory +They are included in Omnitrace because they were introduced into Timemory for `compile-time-perf `_. They are needed to create different output files for a generic wrapper around compilation commands while still @@ -350,7 +350,7 @@ Use ``omnitrace-avail --components --filename`` to view the base filename for ea |---------------------------------|---------------|------------------------| With the settings ``OMNITRACE_COLLAPSE_THREADS=ON`` and ``OMNITRACE_COLLAPSE_PROCESSES=ON``, which is only valid -with full MPI support, the timemory output +with full MPI support, the Timemory output combines the per-thread and/or per-rank data, which have identical call stacks. The ``OMNITRACE_FLAT_PROFILE`` setting removes all call stack hierarchy. @@ -360,7 +360,7 @@ min/max measurements regardless of the calling context. The ``OMNITRACE_TIMELINE_PROFILE`` setting (with ``OMNITRACE_FLAT_PROFILE=OFF``) effectively generates similar data to that found in Perfetto. Enabling timeline and flat profiling effectively generates -similar data to ``strace``. However, while timemory in general +similar data to ``strace``. However, while Timemory in general requires significantly less memory than Perfetto, this is not the case in timeline mode, so use this setting with caution. diff --git a/docs/reference/development-guide.rst b/docs/reference/development-guide.rst index e77be20da..ef212a4a3 100644 --- a/docs/reference/development-guide.rst +++ b/docs/reference/development-guide.rst @@ -230,7 +230,7 @@ Component member functions There are no real restrictions or requirements on the member functions a component needs to provide. Unless the component is being used directly, the invocation of component member functions via a "component bundler" -(provided by timemory) makes extensive use of template metaprogramming concepts. This finds the best match, if any, +(provided by Timemory) makes extensive use of template metaprogramming concepts. This finds the best match, if any, for calling a component's member function. This is a bit easier to demonstrate using an example: .. code-block:: cpp @@ -295,14 +295,14 @@ Memory model Collected data is generally handled in one of the three following ways: * It is handed directly to, and stored by, Perfetto -* It is managed implicitly by timemory and accessed as needed +* It is managed implicitly by Timemory and accessed as needed * As thread-local data In general, only instrumentation for relatively simple data is directly passed to -Perfetto and/or timemory during runtime. +Perfetto and/or Timemory during runtime. For example, the callbacks from binary instrumentation, user API instrumentation, and roctracer directly invoke -calls to Perfetto or timemory's storage model. Otherwise, the data is stored +calls to Perfetto or Timemory's storage model. Otherwise, the data is stored by Omnitrace in the thread-data model which is more persistent than simply using ``thread_local`` static data, which gets deleted when the thread stops. @@ -330,7 +330,7 @@ Currently, most thread data is effectively stored in a static ``OMNITRACE_MAX_THREADS`` is a value defined a compile-time and set to ``2048`` for release builds. During finalization, Omnitrace iterates through the thread-data and transforms that data -into something that can be passed along to Perfetto and/or timemory. +into something that can be passed along to Perfetto and/or Timemory. The downside of the current model is that if the user exceeds ``OMNITRACE_MAX_THREADS``, a segmentation fault occurs. To fix this issue, a new model is being adopted which has all the benefits of this model @@ -339,7 +339,7 @@ but permits dynamic expansion. Sampling model ======================================== -The general structure for the sampling is within timemory (``source/timemory/sampling``). +The general structure for the sampling is within Timemory (``source/timemory/sampling``). Currently, all sampling is done per-thread via POSIX timers. Omnitrace supports both a real-time timer and a CPU-time timer. Both have adjustable frequencies, delays, and durations.