
OneCCLvars

CCL_LOG_LEVEL

Set this environment variable to control the logging level.

The CCL_LOG_LEVEL environment variable controls the level of detail in the logging output generated by the oneCCL library.

"<value>": "error", "warn", "info", "debug", "trace"

By-default: "warn"

CCL_WORKER_COUNT

Set to specify the number of oneCCL worker threads.

"<value>" - The number of worker threads for oneCCL rank By-default: "1"

CCL_WORKER_AFFINITY

Set to specify CPU affinity for oneCCL worker threads.

"<value>": "auto", "<cpulist>":
"auto" - Workers are automatically pinned to last cores of pin domain. Pin domain depends from process launcher. If mpirun from oneCCL package is used then pin domain is MPI process pin domain. Otherwise, pin domain is all cores on the node.
"<cpulist>" - A comma-separated list of core numbers and/or ranges of core numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th core in the list. For example 'a','b'-'c'defines list of cores contaning core with number 'a' and range of cores with numbers from 'b' to 'c'. The number should not exceed the number of cores available on the system. By-default: "not-specified"

CCL_WORKER_MEM_AFFINITY

Set to specify memory affinity for oneCCL worker threads.

"<value>": "auto", "<nodelist>"

"auto" - Workers are automatically pinned to the NUMA nodes that correspond to the CPU affinity of the workers.

"<nodelist>" - A comma-separated list of NUMA node numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th NUMA node in the list. The numbers should not exceed the number of NUMA nodes available on the system.

By-default: "not-specified"

CCL_KVS_MODE

Select the mechanism used to collect rank information while creating a communicator.

"<value>":
"0" - use the default, socket-based KVS implementation
"1" - use MPI

By-default: "0"

CCL_KVS_CONNECTION_TIMEOUT

Set the timeout for establishing connections during KVS initialization.

"<timeout>" - Timeout in seconds used for setting up sockets during KVS initialization

By-default: "120"
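A minimal sketch of combining the two KVS variables above (values are illustrative):

```bash
# Collect rank information through MPI instead of sockets, and allow
# up to 5 minutes for connection setup on large or slow-starting jobs.
export CCL_KVS_MODE=1
export CCL_KVS_CONNECTION_TIMEOUT=300
```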

CCL_ATL_SHM

Set this environment variable to enable the OFI shared memory provider for communication of host (CPU) buffers between ranks on the same node.

Syntax
CCL_ATL_SHM="<value>"

Arguments
"<value>" Description

  • 0 Disables OFI shared memory provider (default).

  • 1 Enables OFI shared memory provider.


Description
Set this environment variable to enable the OFI shared memory provider for communication of host (CPU) buffers between ranks on the same node.

By-default: "0"
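For example, a sketch that turns on the shared memory provider for a CPU run, assuming the OFI transport is selected (CCL_ATL_TRANSPORT is the transport selector that also appears under CCL_PROCESS_LAUNCHER below):

```bash
# Enable the OFI shared memory provider for intra-node communication
# of host (CPU) buffers over the OFI transport.
export CCL_ATL_TRANSPORT=ofi
export CCL_ATL_SHM=1
```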

CCL_ALLGATHER

Set allgather algorithm.

ALLGATHER algorithms

  • direct Based on MPI_Iallgather

  • naive Send to all, receive from all

  • ring Ring-based algorithm

  • flat Alltoall-based algorithm

  • multi_bcast Series of broadcast operations with different root ranks

  • topo Topo scaleup algorithm

By-default: "topo" if sycl and l0 are enabled, otherwise "naive" for ofi or "direct" for mpi; "ring" is used as a fallback

CCL_ALLGATHERV

Set allgatherv algorithm.

ALLGATHERV algorithms

  • direct Based on MPI_Iallgatherv

  • naive Send to all, receive from all

  • ring Ring-based algorithm

  • flat Alltoall-based algorithm

  • multi_bcast Series of broadcast operations with different root ranks

  • topo Topo scaleup algorithm

By-default: "topo" if sycl and l0 are enabled, otherwise "naive" for ofi or "direct" for mpi; "ring" is used as a fallback

CCL_ALLREDUCE

Set allreduce algorithm.

ALLREDUCE algorithms

  • direct Based on MPI_Iallreduce

  • rabenseifner Rabenseifner’s algorithm

  • nreduce May be beneficial for imbalanced workloads

  • ring Reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on reduce_scatter phase.

  • double_tree Double-tree algorithm

  • recursive_doubling Recursive doubling algorithm

  • 2d Two-dimensional algorithm (reduce_scatter + allreduce + allgather). Only available for Host (CPU) buffers.

  • topo Topo scaleup algorithm (available if sycl and l0 are enabled)

By-default: "topo" if sycl and l0 are enabled, otherwise "ring"
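For instance, forcing a specific allreduce algorithm is a single export. Some oneCCL releases also document size-based selection of the main algorithm ("<algo>:<byte-range>" pairs separated by ';'); the second line below is a sketch under the assumption that your build supports that form:

```bash
# Force the ring algorithm unconditionally.
export CCL_ALLREDUCE=ring

# Assumed size-based form: rabenseifner for small messages,
# ring for everything else ('max' as an open upper bound).
export CCL_ALLREDUCE="rabenseifner:0-8192;ring:8193-max"
```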

CCL_ALLTOALL

Set alltoall algorithm.

ALLTOALL algorithms

  • direct Based on MPI_Ialltoall

  • naive Send to all, receive from all

  • scatter Scatter-based algorithm

  • topo Topo scaleup algorithm (available if sycl and l0 are enabled)

By-default: "topo" if sycl and l0 are enabled, otherwise "scatter"

CCL_ALLTOALLV

Set alltoallv algorithm.

ALLTOALLV algorithms

  • direct Based on MPI_Ialltoallv

  • naive Send to all, receive from all

  • scatter Scatter-based algorithm

  • topo Topo scaleup algorithm (available if sycl and l0 are enabled)

By-default: "topo" if sycl and l0 are enabled, otherwise "scatter"

CCL_BARRIER

Set barrier algorithm.

BARRIER algorithms

  • direct Based on MPI_Ibarrier

  • ring Ring-based algorithm

Note: BARRIER does not support the CCL_BARRIER_SCALEOUT environment variable. To change the algorithm for scaleout, use CCL_BARRIER.

By-default: "direct"

CCL_BCAST

Set broadcast algorithm.

BCAST algorithms

  • direct Based on MPI_Ibcast

  • ring Ring

  • double_tree Double-tree algorithm

  • naive Send to all from root rank

Note: The BCAST algorithm does not yet support the CCL_BCAST_SCALEOUT environment variable. To change the algorithm for BCAST, use CCL_BCAST.

By-default: "direct"

CCL_BCASTEXT

Set broadcastExt algorithm (send_buf, recv_buf).

BCAST algorithms

  • direct Based on MPI_Ibcast

  • ring Ring

  • double_tree Double-tree algorithm

  • naive Send to all from root rank

Note: The BCASTEXT algorithm does not yet support the CCL_BCAST_SCALEOUT environment variable. To change the algorithm for BCASTEXT, use CCL_BCASTEXT.

By-default: "direct"

CCL_REDUCE

Set reduce algorithm.

REDUCE algorithms

  • direct Based on MPI_Ireduce

  • rabenseifner Rabenseifner’s algorithm

  • ring Ring algorithm

  • tree Tree algorithm

  • double_tree Double-tree algorithm

  • topo Topo scaleup algorithm (available if sycl and l0 are enabled)

By-default: "topo" if sycl and l0 are enabled, otherwise "tree" for ofi transport or "direct" for mpi

CCL_REDUCE_SCATTER

Set reduce-scatter algorithm.

REDUCE_SCATTER algorithms

  • direct Based on MPI_Ireduce_scatter_block

  • naive Send to all, receive and reduce from all

  • ring Ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining.

  • topo Topo algorithm (available if sycl and l0 are enabled, scaleup only)

By-default: "topo" if sycl and l0 are enabled, otherwise "naive" for ofi transport or "direct" for mpi

CCL_RECV

Set recv algorithm.

RECV algorithms

  • direct Uses prepost (d2h-h2d) copies to get host buffers and then invokes mpi/ofi->recv()

  • topo Topo scale-up algorithm (available if sycl and l0 are enabled)

  • offload Passes device buffers directly to the mpi/ofi layer, skipping the prepost d2h/h2d copies. Used by default for scale-out. Sets extra MPI environment variables to get better performance (available if sycl and l0 are enabled)

By-default: "topo" if sycl and l0 are enabled, otherwise "offload" for ofi/mpi transport

CCL_SEND

Set send algorithm.

SEND algorithms

  • direct Uses prepost (d2h-h2d) copies to get host buffers and then invokes mpi/ofi->send()

  • topo Topo scale-up algorithm (available if sycl and l0 are enabled)

  • offload Passes device buffers directly to the mpi/ofi layer, skipping the prepost d2h/h2d copies. Used by default for scale-out. Sets extra MPI environment variables to get better performance (available if sycl and l0 are enabled)

By-default: "topo" if sycl and l0 are enabled, otherwise "offload" for ofi/mpi transport

CCL_ALLGATHER_SCALEOUT

Set scaleout allgather algorithm.

ALLGATHER algorithms

  • direct Based on MPI_Iallgather

  • naive Send to all, receive from all

  • ring Ring-based algorithm

  • flat Alltoall-based algorithm

  • multi_bcast Series of broadcast operations with different root ranks

By-default: "naive" for ofi or "direct" for mpi; "ring" used as fallback

CCL_ALLGATHERV_SCALEOUT

Set scaleout allgatherv algorithm.

ALLGATHERV algorithms

  • direct Based on MPI_Iallgatherv

  • naive Send to all, receive from all

  • ring Ring-based algorithm

  • flat Alltoall-based algorithm

  • multi_bcast Series of broadcast operations with different root ranks

By-default: "naive" for ofi or "direct" for mpi; "ring" used as fallback

CCL_ALLREDUCE_SCALEOUT

Set allreduce scaleout algorithm.

ALLREDUCE algorithms

  • direct Based on MPI_Iallreduce

  • rabenseifner Rabenseifner’s algorithm

  • nreduce May be beneficial for imbalanced workloads

  • ring Reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on reduce_scatter phase.

  • double_tree Double-tree algorithm

  • recursive_doubling Recursive doubling algorithm

  • 2d Two-dimensional algorithm (reduce_scatter + allreduce + allgather). Only available for Host (CPU) buffers.

By-default: "ring"

CCL_ALLTOALL_SCALEOUT

Set alltoall scaleout algorithm.

ALLTOALL algorithms

  • direct Based on MPI_Ialltoall

  • naive Send to all, receive from all

  • scatter Scatter-based algorithm

By-default: "scatter"

CCL_ALLTOALLV_SCALEOUT

Set alltoallv scaleout algorithm.

ALLTOALLV algorithms

  • direct Based on MPI_Ialltoallv

  • naive Send to all, receive from all

  • scatter Scatter-based algorithm

By-default: "scatter"

CCL_REDUCE_SCALEOUT

Set reduce scaleout algorithm.

REDUCE algorithms

  • direct Based on MPI_Ireduce

  • rabenseifner Rabenseifner’s algorithm

  • ring Ring algorithm

  • tree Tree algorithm

  • double_tree Double-tree algorithm

By-default: "double_tree"

CCL_REDUCE_SCATTER_SCALEOUT

Set reduce-scatter scaleout algorithm.

REDUCE_SCATTER algorithms

  • direct Based on MPI_Ireduce_scatter_block

  • naive Send to all, receive and reduce from all

  • ring Ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining.

By-default: "naive"

CCL_ZE_TMP_BUF_SIZE

Specifies the size of the intermediate buffer used by oneCCL for collective operations.

The CCL_ZE_TMP_BUF_SIZE environment variable controls the size of the temporary buffers used by collective operations in 'topo' algorithms. It has no effect on other algorithms. Smaller values can reduce memory usage at the expense of performance for 'topo' algorithms.

Syntax
CCL_ZE_TMP_BUF_SIZE="<value>"

Arguments
"<value>" Description

  • SIZE The size of the buffer in bytes.

By-default: "536870912"
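A sketch of trading performance for memory by shrinking the default 512 MiB buffer to 128 MiB:

```bash
# Affects 'topo' algorithms only; other algorithms ignore this variable.
export CCL_ZE_TMP_BUF_SIZE=$((128 * 1024 * 1024))   # 134217728 bytes
```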

CCL_RS_CHUNK_COUNT

Set to specify the maximum number of chunks for the reduce_scatter phase in ring allreduce.

"<count>" - Maximum number of chunks for the reduce_scatter phase in ring allreduce

By-default: "1"

CCL_RS_MIN_CHUNK_SIZE

Set to specify the minimum number of bytes in a chunk for the reduce_scatter phase in ring allreduce.

"<size>" - Minimum number of bytes in a chunk for the reduce_scatter phase in ring allreduce. Affects the actual value of CCL_RS_CHUNK_COUNT.

By-default: "65536"

CCL_REDUCE_SCATTER_TOPO_READ

Set this environment variable to select read or write based device-to-device data copy during the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.

Syntax
CCL_REDUCE_SCATTER_TOPO_READ="<value>"

Arguments
"<value>" Description

  • 1 Uses read based copy to transfer data across GPUs for the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives (default).

  • 0 Uses write based copy to transfer data across GPUs for the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives.

Description
Set this environment variable to select read or write based device-to-device data copy during the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.

By-default: "1"

CCL_ZE_DEPS_SYNC

Set this environment variable to 1 to enable synchronous dependencies processing for oneCCL operations.

Syntax
CCL_ZE_DEPS_SYNC="<value>"

Arguments
"<value>" Description

  • 1 Dependencies of oneCCL operations are processed synchronously.

  • 0 Dependencies of oneCCL operations are processed asynchronously (default), meaning that further L0 submissions are being done while dependencies are in progress. Dependencies are signaling when processed.

Description
Set this environment variable to 1 to make oneCCL block the thread until previous SYCL/L0 submissions have finished.

By-default: "0"

CCL_REDUCE_SCATTER_MONOLITHIC_KERNEL

Set this environment variable to enable compute kernels for Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.

Syntax
CCL_REDUCE_SCATTER_MONOLITHIC_KERNEL="<value>"

Arguments
"<value>" Description

  • 1 Uses compute kernels to transfer data across GPUs for Allreduce, Reduce, and Reduce-Scatter collectives

  • 0 Uses copy engines to transfer data across GPUs for Allreduce, Reduce, and Reduce-Scatter collectives (default).

Description
Set this environment variable to enable compute kernels for Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.

By-default: "0"

CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL

Set this environment variable to enable compute kernels for Allgatherv collectives using device (GPU) buffers.

Syntax
CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL="<value>"

Arguments
"<value>" Description

  • 1 Uses compute kernels to transfer data across GPUs for Allgatherv collectives

  • 0 Uses copy engines to transfer data across GPUs for Allgatherv collectives (default)

Description
Set this environment variable to enable compute kernels for Allgatherv collectives using device (GPU) buffers.

By-default: "0"

CCL_ALLTOALLV_MONOLITHIC_KERNEL

Set this environment variable to enable compute kernels for Alltoall and Alltoallv collectives using device (GPU) buffers.

Syntax
CCL_ALLTOALLV_MONOLITHIC_KERNEL="<value>"

Arguments
"<value>" Description

  • 1 Uses compute kernels to transfer data across GPUs for AlltoAll and Alltoallv collectives (default)

  • 0 Uses copy engines to transfer data across GPUs for AlltoAll and Alltoallv collectives

Description
Set this environment variable to enable compute kernels for Alltoall and Alltoallv collectives using device (GPU) buffers.

By-default: "1"

CCL_ALLGATHERV_PIPE_CHUNK_COUNT

Set this environment variable to enable the pipelining implementation for Allgatherv collectives using device (GPU) buffers.

Syntax
CCL_ALLGATHERV_PIPE_CHUNK_COUNT="<value>"

Arguments
"<value>" Description

  • 0: (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code

  • 1: Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.

  • 2 or higher: Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow for several phases of the algorithm to run in parallel, as long as they don't use the same physical resource. Effectively, this should increase performance.

Description
Set this environment variable to control how many chunks are used for Allgatherv pipeline-based collectives using device (GPU) buffers.

By-default: "0"
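A minimal sketch of enabling the pipelining path (the same pattern applies to the Allreduce, Reduce_Scatter, and Reduce variants below):

```bash
# Split large Allgatherv messages into 4 logical chunks so phases of
# the algorithm can overlap across different physical resources.
export CCL_ALLGATHERV_PIPE_CHUNK_COUNT=4
```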

CCL_ALLREDUCE_PIPE_CHUNK_COUNT

Set this environment variable to enable the pipelining implementation for Allreduce collectives using device (GPU) buffers.

Syntax
CCL_ALLREDUCE_PIPE_CHUNK_COUNT="<value>"

Arguments
"<value>" Description

  • 0: (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code

  • 1: Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.

  • 2 or higher: Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow for several phases of the algorithm to run in parallel, as long as they don't use the same physical resource. Effectively, this should increase performance.

Description
Set this environment variable to control how many chunks are used for Allreduce pipeline-based collectives using device (GPU) buffers.

By-default: "0"

CCL_REDUCE_SCATTER_PIPE_CHUNK_COUNT

Set this environment variable to enable the pipelining implementation for Reduce_Scatter collectives using device (GPU) buffers.

Syntax
CCL_REDUCE_SCATTER_PIPE_CHUNK_COUNT="<value>"

Arguments
"<value>" Description

  • 0: (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code

  • 1: Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.

  • 2 or higher: Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow for several phases of the algorithm to run in parallel, as long as they don't use the same physical resource. Effectively, this should increase performance.

Description
Set this environment variable to control how many chunks are used for Reduce_Scatter pipeline-based collectives using device (GPU) buffers.

By-default: "0"

CCL_REDUCE_PIPE_CHUNK_COUNT

Set this environment variable to enable the pipelining implementation for Reduce collectives using device (GPU) buffers.

Syntax
CCL_REDUCE_PIPE_CHUNK_COUNT="<value>"

Arguments
"<value>" Description

  • 0: (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code

  • 1: Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.

  • 2 or higher: Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow for several phases of the algorithm to run in parallel, as long as they don't use the same physical resource. Effectively, this should increase performance.

Description
Set this environment variable to control how many chunks are used for Reduce pipeline-based collectives using device (GPU) buffers.

By-default: "0"

CCL_LOCAL_RANK

Set this environment variable to specify the rank number of the current process on the local host.

Syntax
CCL_LOCAL_RANK="<value>"

Arguments
"<value>" Description

  • RANK Rank number of the current process in the local host

Description
Set this environment variable to specify the rank number of the current process on the local host.

By-default: N/A; a job/process launcher (CCL_PROCESS_LAUNCHER) needs to be used if the variable is not specified

CCL_LOCAL_SIZE

Set this environment variable to specify the total number of ranks on the local host.

Syntax
CCL_LOCAL_SIZE="<value>"

Arguments
"<value>" Description

  • SIZE Total number of ranks on the local host.

Description
Set this environment variable to specify the total number of ranks on the local host.

By-default: N/A; a job/process launcher (CCL_PROCESS_LAUNCHER) needs to be used if the variable is not specified

CCL_PROCESS_LAUNCHER

Set this environment variable to specify the job launcher to use.

Syntax
CCL_PROCESS_LAUNCHER="<value>"

Arguments
"<value>" Description

  • hydra Uses the MPI hydra job launcher (default)

  • torch Uses the torch job launcher

  • pmix Used with the PALS job launcher, which uses the pmix API; your mpiexec command should look something like this: CCL_PROCESS_LAUNCHER=pmix CCL_ATL_TRANSPORT=mpi mpiexec -np 2 -ppn 2 --pmi=pmix ...

  • none No job launcher is used. In this case, the user needs to specify the values for CCL_LOCAL_SIZE and CCL_LOCAL_RANK

Description
Set this environment variable to specify the job launcher to use.

By-default: "hydra"
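Putting the last three variables together, a sketch of launching two local ranks by hand, without any job launcher (./app is a placeholder binary):

```bash
export CCL_PROCESS_LAUNCHER=none
export CCL_LOCAL_SIZE=2

# Each process must be told its own local rank.
CCL_LOCAL_RANK=0 ./app &
CCL_LOCAL_RANK=1 ./app &
wait
```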

CCL_ZE_CACHE_OPEN_IPC_HANDLES

Set this environment variable to enable or disable the caching of IPC handles opened with zeMemOpenIpcHandle().

This controls whether IPC handles opened with zeMemOpenIpcHandle() are cached on the receiver's side. When enabled, opened IPC handles are cached, which can improve performance in certain scenarios. See https://spec.oneapi.io/level-zero/latest/core/PROG.html#memory-1

Syntax
CCL_ZE_CACHE_OPEN_IPC_HANDLES="<value>"

Arguments
"<value>" Description

  • 0 Disables the caching of opened IPC handles.

  • 1 Enables the caching of opened IPC handles (default).

By-default: "1"

CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD

Set this environment variable to specify the per process threshold for caching IPC handles opened with zeMemOpenIpcHandle().

This specifies the per-process threshold for caching opened IPC handles on the receiver's side. When the number of open IPC handles exceeds this threshold, the cache starts evicting handles via LRU.

Syntax
CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD="<value>"

Arguments
"<value>" Description

  • SIZE The threshold value for caching open IPC handles.

By-default: "1000"

CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD

Set this environment variable to specify the per-process threshold for caching IPC handles obtained with zeMemGetIpcHandle().

This environment variable specifies the threshold for caching IPC handles obtained with zeMemGetIpcHandle() on the sender's side. When the number of such handles exceeds this threshold, the cache starts evicting handles via LRU.

Syntax
CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD="<value>"

Arguments
"<value>" Description

  • SIZE The threshold value for caching get IPC handles.

By-default: "1000"

CCL_ZE_CACHE_GET_IPC_HANDLES

Set this environment variable to enable or disable the caching of IPC handles obtained with zeMemGetIpcHandle().

This controls whether IPC handles obtained with zeMemGetIpcHandle() are cached on the sender's side. When enabled, these IPC handles are cached, which can improve performance in certain scenarios. By default, the caching of get IPC handles is enabled. See https://spec.oneapi.io/level-zero/latest/core/PROG.html#memory-1

Syntax
CCL_ZE_CACHE_GET_IPC_HANDLES="<value>"

Arguments
"<value>" Description

  • 0 Disables the caching of get IPC handles.

  • 1 Enables the caching of get IPC handles (default).

By-default: "1"

CCL_ZE_ENABLE_OVERSUBSCRIPTION_FALLBACK

Set to enable oversubscription in topo fallback stage for all collectives.

This environment variable enables or disables the oversubscription fallback from the topo algorithm to a copy in/out based algorithm.

"<value>": "0", "1"

By-default: "1"

CCL_ZE_ENABLE_OVERSUBSCRIPTION_THROW

Set to enable oversubscription throw for all collectives.

This environment variable enables or disables the oversubscription throw check.

"<value>": "0", "1"

By-default: "1"

CCL_DRMFD_DEV_RENDER_DIR_PATH

Set the directory path for DRM render devices.

This environment variable specifies the directory path where DRM render devices are located. Example value: "/custom/path/to/devices/"

By-default: "/dev/dri/by-path/"

CCL_DRMFD_DEV_RENDER_SUFFIX

Set the suffix for DRM render device names.

This environment variable specifies the suffix to be used when searching for DRM render device names. Example value: "-customsuffix"

By-default: "-render"

ExpOneCCLvars

Experimental OneCCL Environment Variables: the functionality of these variables has not been (fully) tested and, therefore, cannot be supported nor guaranteed.

CCL_REDUCE_SCATTER_MONOLITHIC_PIPELINE_KERNEL

Set to specify the monolithic pipeline approach for the reduce_scatter phase in allreduce and reduce collectives.

This environment variable has the advantage of forming a seamless pipeline that hides the data transfer time across MDFI. This way, a process reads the data from its peer tile on the same GPU, performs the reduction, and writes to a temporary buffer located on a different GPU. This covers the time for transferring the data through XeLinks during the reduce_scatter phase in allreduce and reduce collectives.

"<value>": "0", "1"

By-default: "1"

CCL_ZE_IPC_EXCHANGE

Set to specify the mechanism to use for Level Zero IPC exchange.


"drmfd" - Uses a the DRM mechanism for Level Zero IPC exchange. This is an experimental mechanism that is used with OS kernels previous to SP4. To use the DRM mechanism, the libdrm and drm headers must be available on a system.
"pidfd" - Uses pidfd mechanism for Level Zero IPC exchange. It requires OS kernel SP4 or above as it requires Linux 5.6 kernel or above
"sockets" - Uses socket mechanism for Level Zero IPC exchange. It is usually slower than the other two mechanisms, but can be used for debugging as it is usually available on most systems "<value>": "drmfd", "pidfd", "sockets" By-default: "drmfd"

CCL_ZE_DRM_BDF_SUPPORT

Use BDF support for mapping logical to physical devices.

To obtain the physical device ID based on the BDF, oneCCL gets and then parses the BDF values; using those values, it identifies the particular device by referencing the appropriate fields in the PCI configuration space of PCI devices. This is used to map logical devices to their corresponding physical devices.

"<value>": "0", "1"

By-default: "1"

CCL_REDUCE_SCATTER_FALLBACK_ALGO

Use the fallback algorithm for reduce_scatter.

The fallback algorithm performs a full allreduce and then copies a subset of its output to the recv buffer. Currently, the fallback algorithm is used for scaleout, whereas scaleup uses an optimized algorithm.

"<value>": "0", "1"

By-default: "0"

CCL_ZE_AUTO_TUNE_PORTS

Automatically tune algorithm protocols based on port count.

Use the number of ports to detect 12-port systems and use write protocols for collectives on such systems. Users can disable this automatic detection and select the protocols manually.

"<value>": "0", "1"

By-default: "1"

CCL_ZE_PT2PT_READ

Enable switching of read and write protocols for pt2pt topo algorithm.

Control the pt2pt read/write protocols.

Read protocol: the SEND side exchanges its handle with the RECV side, and the copy operation is executed on the RECV side, where the destination buffer is the local buffer and the source buffer is the remote buffer.

Write protocol: the RECV side exchanges its handle with the SEND side, and the copy operation is executed on the SEND side, where the destination buffer is the remote buffer and the source buffer is the local buffer.

"<value>": "0", "1"

By-default: "1"

CCL_ZE_TYPE2_TUNE_PORTS

Tunable value for collectives to adjust copy engine indexes.

Use copy engine indexes 2, 4, and 6 on hosts with 6 ports for allreduce, reduce, and allgatherv.

"<value>":
"on" - always use write mode with calculated indexes
"off" - always disabled
"detected" - determined by the detection logic
"undetected" - the default value, used before the detection logic runs

By-default: "undetected"

CCL_BARRIER_SYNC

Switch ccl::barrier() host-sync / host-async options.

Historically, ccl::barrier() was always synchronous, which does not match the oneCCL asynchronous concept. Like other collectives, ccl::barrier() should be host-asynchronous if possible. As that would be too much to change at once, this experimental variable introduces the option to make barrier host-asynchronous. Use CCL_BARRIER_SYNC=0 to achieve that.

By-default: "1 (SYNC)"

CCL_ENABLE_SYCL_KERNELS

Enable SYCL kernels.

Setting this environment variable to 1 enables the SYCL kernel-based implementation for allgatherv, allreduce, and reduce_scatter. Support includes all message sizes, some data types (int32, fp32, fp16, and bf16), the sum operation, and a single node. oneCCL falls back to other implementations when SYCL kernel support is not available, so the user can safely set this environment variable.

"<value>": "0", "1"

By-default: "0 (disabled)"
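A minimal sketch; since oneCCL falls back automatically where SYCL kernel support is missing, opting in is low-risk:

```bash
# Try the SYCL kernel path for allgatherv, allreduce, and reduce_scatter.
export CCL_ENABLE_SYCL_KERNELS=1
```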

CCL_SYCL_ALLGATHERV_TMP_BUF

Enable the use of persistent temporary buffer in allgatherv.

Setting this environment variable to 1 enables the use of a persistent temporary buffer to perform the allgatherv operation. This implementation makes the collective fully asynchronous but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer.

"<value>": "0", "1"

By-default: "0 (disabled)"

CCL_SYCL_ALLGATHERV_SMALL_THRESHOLD

Specify the threshold for the small size algorithm in allgatherv.

Set the threshold in bytes to specify the small size algorithm in the allgatherv collective.

"<value>": ">=0"

By-default: "131072"

CCL_SYCL_ALLGATHERV_MEDIUM_THRESHOLD

Specify the threshold for the medium size algorithm in allgatherv.

Set the threshold in bytes to specify the medium size algorithm in the allgatherv collective.

"<value>": ">=0"

By-default: "2097152"

CCL_SYCL_ALLREDUCE_TMP_BUF

Enable the use of persistent temporary buffer in allreduce.

Setting this environment variable to 1 enables the use of a persistent temporary buffer to perform the allreduce operation. This implementation makes the collective fully asynchronous but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer.

"<value>": "0", "1"

By-default: "0 (disabled)"

CCL_SYCL_ALLREDUCE_SMALL_THRESHOLD

Specify the threshold for the small size algorithm in allreduce.

Set the threshold in bytes to specify the small size algorithm in the allreduce collective.

"<value>": ">=0"

By-default: "524288"

CCL_SYCL_ALLREDUCE_MEDIUM_THRESHOLD

Specify the threshold for the medium size algorithm in allreduce.

Set the threshold in bytes to specify the medium size algorithm in the allreduce collective.

"<value>": ">=0"

By-default: "16777216"

CCL_SYCL_REDUCE_SCATTER_TMP_BUF

Enable the use of persistent temporary buffer in reduce_scatter.

Setting this environment variable to 1 enables the use of a persistent temporary buffer to perform the reduce_scatter operation. This implementation makes the collective fully asynchronous but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer.

"<value>": "0", "1"

By-default: "0 (disabled)"

CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD

Specify the threshold for the small size algorithm in reduce_scatter.

Set the threshold in bytes to specify the small size algorithm in the reduce_scatter collective.

"<value>": ">=0"

By-default: "2097152"

CCL_SYCL_REDUCE_SCATTER_MEDIUM_THRESHOLD

Specify the threshold for the medium size algorithm in reduce_scatter.

Set the threshold in bytes to specify the medium size algorithm in the reduce_scatter collective.

"<value>": ">=0"

By-default: "67108864"
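A sketch of shifting the small/medium windows for reduce_scatter (threshold values are illustrative; the same pattern applies to the allgatherv and allreduce thresholds above):

```bash
# Use the small-size algorithm up to 4 MiB and the medium-size
# algorithm up to 128 MiB.
export CCL_SYCL_REDUCE_SCATTER_SMALL_THRESHOLD=$((4 * 1024 * 1024))
export CCL_SYCL_REDUCE_SCATTER_MEDIUM_THRESHOLD=$((128 * 1024 * 1024))
```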
