knob

Set configuration options for the specified analysis type or collector type.

GUI Equivalent

Analysis Type window

Syntax

-knob | -k <knob-name>=<knob-value>

Arguments

knob-name

An analysis type or collector type may have one or more configuration options (knobs) that provide additional instructions for performing the specified type of analysis. To use a knob, you must specify the knob name and knob value.

Multiple knob options are allowed and can be followed by additional action-options, as well as global-options, if needed.

knob-value

Each knob accepts a specific set of values. Most knobs are Boolean; for these, specify <knob-name>=true to enable the knob.
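The syntax above can be sketched with a small shell helper that emits one -knob option per name=value pair. This is only an illustration: the function name build_collect_cmd is hypothetical, and the helper prints the command line for review rather than running the collector. The two knobs shown are ones the table below lists for the hotspots analysis type.

```shell
# Hypothetical helper (not part of amplxe-cl): print a collect command
# line with one -knob option per name=value pair. Nothing is executed.
build_collect_cmd() {
    analysis=$1
    shift
    cmd="amplxe-cl -collect $analysis"
    for pair in "$@"; do
        cmd="$cmd -knob $pair"
    done
    printf '%s\n' "$cmd"
}

# Enable a Boolean knob and set a numeric knob in the same command.
build_collect_cmd hotspots enable-user-tasks=true sampling-interval=5
```

Each knob is passed as its own -knob option, so adding or removing a setting does not affect the others.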

Note

Knob behavior may vary depending on the analysis type or collector type.

<knob-name>	<knob-value>	Supported Analysis	Description
`enable-user-sync`	`true \| false`. Default: `false`	`concurrency`, `locksandwaits`, `runss`	Collect synchronization data via the User-Defined Synchronization API.
`enable-user-tasks`	`true \| false`. Default: `false`	`hotspots`, `advanced-hotspots`, `concurrency`, `locksandwaits`, `runss`, `general-exploration`, `sgx-hotspots`, `tsx-exploration`, `tsx-hotspots`, `runsa`	Analyze tasks, events, and counters specified in your application via the Task API. This option causes higher overhead and increases result size.
`analyze-openmp`	`true \| false`. Default: `true` for the HPC Performance Characterization analysis; `false` for other analysis types.	`hotspots`, `advanced-hotspots`, `concurrency`, `hpc-performance`, `memory-access`, `general-exploration`, `runsa`	Instrument the OpenMP* runtimes in your application to group performance data by regions/work-sharing constructs and detect inefficiencies such as imbalance, lock contention, or overhead on performing scheduling, reduction, and atomic operations. Using this option may cause higher overhead and increase the result size.
`enable-gpu-usage`	`true \| false`. Default: `false`	`runss`, `runsa`	Analyze frame rate and usage of Intel HD Graphics and Intel® Iris® Graphics engines and identify whether your application is GPU or CPU bound.
`enable-gpu-runtimes`	`true \| false`. Default for `gpu-hotspots`: `true`, for `runss`: `false`.	`gpu-hotspots`, `runss`, `runsa`	Analyze execution of OpenCL™ kernels and Intel® Media SDK programs on Intel HD Graphics and Intel® Iris® Graphics. This option may affect the performance of your application on the CPU side. Note: OpenCL kernel analysis is currently supported for Windows and Linux target systems with Intel HD Graphics and Intel Iris Graphics. Intel Media SDK program analysis is supported for Linux targets only.
`gpu-sampling-interval`	A number (in milliseconds) between 0.1 and 1000. Default: 1.	`gpu-hotspots`, `runss`, `runsa`	Specify an interval between GPU samples.
`gpu-counters-mode`	`none` (default for `runss`), `overview` (default for `gpu-hotspots`), `global-local-accesses`, `compute-extended`, `full-compute`	`gpu-hotspots`, `runss`, `runsa`	Analyze performance data from Intel HD Graphics and Intel Iris Graphics based on the preset counter sets: `overview` - track general GPU memory accesses such as Memory Read/Write Bandwidth, GPU L3 Misses, Sampler Busy, Sampler Is Bottleneck, and GPU Memory Texture Read Bandwidth. These metrics can be useful for both graphics and compute-intensive applications. `global-local-accesses` - include metrics that distinguish accesses to different types of data on a GPU: Untyped Memory Read/Write Bandwidth, Typed Memory Read/Write Transactions, SLM Read/Write Bandwidth, Render/GPGPU Command Streamer Loaded, and GPU EU Array Usage. These metrics are useful for compute-intensive workloads on the GPU. `compute-extended` - analyze GPU activity on the Intel processor code name Broadwell. This metrics set is disabled for other systems. `full-compute` - collect both `overview` and `compute-basic` metrics with the `allow-multiple-runs` option enabled to analyze all types of EU array stalled/idle issues in the same view.
`gpu-profiling-mode`	`bblatency` (default), `memlatency`	`gpu-profiling`, `runsa`	Select a profiling mode to identify basic blocks latency due to algorithm inefficiencies, or memory latency due to memory access issues.
`kernels-to-profile`	`kernel:1:1:4294967293`	`gpu-profiling`, `runsa`	Specify a comma-separated list of GPU kernel names and invocations in the following format: `kernel_name[:start_idx:step:stop_idx]`, where `kernel_name` is the name of the GPU kernel, `start_idx` is the number of the first invocation, and `stop_idx` is the number of the last invocation to be profiled.
`sampling-interval`	For user-mode sampling and tracing types: a number (in milliseconds) between 1 and 1000. Default: 10. For hardware event-based sampling types: a number (in milliseconds) between 0.01 and 1000. Default: 1.	`hotspots`, `runss`, `advanced-hotspots`, `concurrency`, `locksandwaits`, `runsa`, `system-overview`, `memory-access`, `sgx-hotspots`, `hpc-performance`	Specify a sampling interval (in milliseconds) between CPU samples.
`collection-detail`	`hotspots-sampling` (default)	`advanced-hotspots`, `system-overview`	Identify application hotspots based on such basic hardware events as Clockticks and Instructions Retired.
	`stack-sampling`		Identify hardware hotspots, explore statistically reconstructed call flow of your program and analyze thread scheduling.
	`stack-and-callcount`		Identify hardware hotspots, analyze thread scheduling, explore call stacks and statistically approximated number of calls to sampled functions. This value is used for advanced-hotspots only.
	`stack-call-and-tripcount`		Extend the `stack-and-callcount` collection with an analysis of loop trip count statistically estimated using the hardware events. This value is used for advanced-hotspots only.
`enable-stack-collection`	`true \| false`. Default: `false`	`tsx-hotspots`, `hpc-performance`, `gpu-hotspots`, `runsa`	Enable Hardware Event-based Sampling Collection with Stacks.
`dram-bandwidth-limits`	`true \| false`. Default: `true` for the HPC Performance Characterization and General Exploration analysis with `collect-memory-bandwidth` knob enabled; `true` for the Memory Access analysis.	`memory-access`, `general-exploration`, `hpc-performance`, `runsa`	Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds.
`collect-memory-bandwidth`	`true \| false`. Default: `false`	`general-exploration`, `hpc-performance`	Collect data to identify where your application is generating significant bandwidth to DRAM. To view collected data in the GUI, enable the Analyze memory bandwidth option.
`analyze-mem-objects`	`true \| false`. Default: `false`	`memory-access`	Enable the instrumentation of memory allocation/de-allocation and map hardware events to memory objects. This option is supported only for Linux targets running on the Intel microarchitecture code name Sandy Bridge (or later).
`mem-object-size-min-thres`	Default: 1024 bytes	`memory-access`	Specify a minimal size of memory allocations to analyze. This option helps reduce runtime overhead of the instrumentation. This option is supported only for Linux targets running on the Intel microarchitecture code name Sandy Bridge (or later).
`event-mode`	`all \| user \| os`. Default: `all`	`advanced-hotspots`, `runsa`	Limit event-based sampling collection to OS or USER mode.
`analysis-step`	`cycles \| aborts`. Default: `cycles`	`tsx-exploration`	Specify a step for analyzing Intel Transactional Synchronization Extensions behavior. Typically, you start with measuring transactional success (`cycles`) and then, if the abort rate is high, you run the TSX Exploration analysis again to analyze `aborts`. Note: This knob is available only for the TSX Exploration analysis for the Intel microarchitecture code name Haswell.
`analyze-loops`	`true \| false`. Default: `false`	`runss`, `runsa`	Extend loop analysis to collect advanced loops information such as instruction set usage and display analysis results by loops and functions.
`mrte-type`	`java,dotnet \| java,dotnet,python \| python`. Default: `java,dotnet`	`runss`, `runsa`	Specify a type of managed runtime to analyze. Available values: combined .NET* and Java* analysis, combined Java, .NET and Python* analysis, and Python only.
`io-mode`	`off \| stack \| nostack`. Default: `off`	`runss`, `runsa`	Enable to identify where threads are waiting or compute thread concurrency. The collector instruments APIs, which causes higher overhead and increases result size.
`ftrace-config`	Available events are `freq, idle, sched, disk, filesystem, irq, kvm, workq, softirq, sync`. Default for Linux targets: `sched,freq,idle,workq,irq,softirq`. Default for Android targets: `sched,freq,idle,workq,filesystem,irq,softirq,sync,disk`.	`runsa`, `runss`	Collect Linux Ftrace* framework events. This option is supported for Linux target systems only. On some systems, Linux Ftrace events collection is possible only for the root user.
`stackwalk-mode`	`online \| offline`. Default: `offline`	`runss`	Choose between online (during collection) and offline (after collection) modes to analyze stacks. Offline mode reduces analysis overhead and is typically recommended.
`stack-stitching`	`true \| false`. Default: `true`	`runss`	For Intel TBB-based applications, restructure the call flow to attach stacks to a point introducing a parallel workload.
`cpu-samples-mode`	`off \| stack \| nostack`. Default: `off`	`runss`	Enable to periodically sample the application. Samples can be collected with or without stacks.
`accurate-cpu-time-detection` (Windows only)	`true \| false`. Default: `true`	`runss`	Collect more accurate CPU time data. This option requires additional disk space and post-processing time. Administrator privileges are required.
`waits-mode`	`off \| stack \| nostack`. Default: `off`	`runss`	Enable to identify where threads are waiting or compute thread concurrency. The collector instruments APIs, which causes higher overhead and increases result size.
`signals-mode`	`off \| objects \| stack \| nostack`. Default: `off`	`runss`	Enable to view synchronization transitions in the timeline and signalling call stacks for associated waits. The collector instruments signalling APIs, which causes higher overhead and increases result size.
`no-altstack`	`true \| false`. Default: `false`	`runss`	Disable using alternative stacks for signal handlers. Consider this option for profiling standard Python 3 code on Linux.
`collect-io-waits`	`true \| false`. Default: `false`	`runsa`	Analyze the percentage of time each thread and CPU spends in I/O wait state.
`stack-size`	A number between 0 and 2147483647. Default is 0 (unlimited stack size).	`runsa`	Reduce the collection overhead and limit the stack size (in bytes) processed by the VTune Amplifier.
`stack-type`	`software \| lbr`. Default: `software`	`runsa`	Choose between software stack and hardware LBR-based stack types. Software stacks have no depth limitations and provide more data while hardware stacks introduce less overhead. Typically, software stack type is recommended unless the collection overhead becomes significant. Note that hardware LBR stack type may not be available on all platforms.
`enable-call-counts`	`true \| false`. Default: `false`	`runsa`	Obtain statistical estimation of call counts based on hardware events.
`enable-trip-counts`	`true \| false`. Default: `false`	`runsa`	Obtain statistical estimation of loop trip counts based on hardware events.
`event-config`	`<event_name1>,<event_name2>,...`	`runsa`	Configure PMU events to collect with the hardware event-based sampling collector. Multiple events can be specified as a comma-separated list (no spaces). Note: To display a list of events available on the target PMU, enter: `$ amplxe-cl -collect-with runsa -knob event-config=? <target>` The command returns names and short descriptions of available events. For more information on the events, see the Intel Processor Events Reference.
`chipset-event-config`	`"event1,event2,..."`	`runsa`	Specify a comma-separated list of Android chipset events (up to 5 events) to monitor with the hardware event-based sampling collector.
`enable-system-cswitch`	`true \| false`. Default: `false`	`runsa`	Analyze detailed scheduling layout for all threads on the system and identify the nature of context switches for a thread (preemption or synchronization).
`atrace-config`	Available events are `gfx, input, view, webview, wm, am, audio, video, camera, hal, res, dalvik`.	`runsa`	Collect Android framework events from Systrace*.
`collect-tsx-cycles`	`true \| false`. Default: `false`	`runsa`	Collect the events required to analyze transactional success.
`enable-context-switches`	`true \| false`. Default: `false`	`runsa`	Analyze detailed scheduling layout for all threads in your application, explore time spent on a context switch and identify the nature of context switches for a thread (preemption or synchronization).
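
The `kernels-to-profile` value format from the table can be illustrated with a short shell sketch. The helper name kernel_entry and the kernel names scale and blur are hypothetical. The table does not define `step`; it is assumed here to be the stride between profiled invocations. The sketch only assembles and prints the option; it does not start a collection.

```shell
# Hypothetical helper: format one kernels-to-profile entry as
# kernel_name:start_idx:step:stop_idx.
kernel_entry() {
    printf '%s:%s:%s:%s' "$1" "$2" "$3" "$4"
}

# Join two entries with a comma and no spaces.
# Kernel names and index values are placeholders for illustration.
value="$(kernel_entry scale 1 1 10),$(kernel_entry blur 5 2 25)"
printf '%s\n' "-knob kernels-to-profile=$value"
```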

Actions Modified

collect, collect-with

Description

Use the knob action-option to configure knob settings for a collect (predefined analysis types) or collect-with (custom analysis types) action where the analysis type supports one or more knobs. Each analysis type or collector type supports a specific set of knobs, and each knob requires a value. In most cases the knob value is Boolean, so you specify <knob-name>=true to enable the knob.

To see all knobs available for a predefined analysis type:

> amplxe-cl -help collect <analysis_type>

To see knobs for a custom analysis type:

> amplxe-cl -help collect-with <analysis_type>

Example

This example returns a list of knobs for the Locks and Waits analysis type:

> amplxe-cl -help collect locksandwaits

This example runs a custom event-based sampling data collection on an Android system, enabling collection of Android framework and chipset events:

> amplxe-cl -collect-with runss -target-system=android -knob sampling-interval=2 -knob cpu-samples-mode=stack -knob ftrace-config=gfx,dalvik -knob chipset-event-config="GMCH_PARTIAL_WR_DRAM.ANY,GMCH_CORE_CLKS" --target-process com.intel.tbb.example.tachyon

This example configures and runs a custom event-based sampling data collection with the stack size limited to 8192 bytes:

> amplxe-cl -collect-with runsa -knob enable-stack-collection=true -knob stack-size=8192 -knob enable-call-counts=true -knob event-config=CPU_CLK_UNHALTED.REF_TSC:sa=1800000,CPU_CLK_UNHALTED
