Set this environment variable to control logging level.
The CCL_LOG_LEVEL environment variable can be set to control the level of detail in the logging output generated by the CCL library.
"<value>": "error", "warn", "info", "debug", "trace"
By-default: "warn"
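For example, verbose logging can be turned on for a single run (the launch command and application name are illustrative only):
  CCL_LOG_LEVEL=debug mpiexec -n 2 ./my_app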
Set to specify the number of oneCCL worker threads.
"<value>" - The number of worker threads for oneCCL rank By-default: "1"
Set to specify cpu affinity for oneCCL worker threads.
"<value>": "auto", "<cpulist>":
"auto" - Workers are automatically pinned to last cores of pin domain. Pin domain depends from process launcher. If mpirun from oneCCL package is used then pin domain is MPI process pin domain. Otherwise, pin domain is all cores on the node.
"<cpulist>" - A comma-separated list of core numbers and/or ranges of core numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th core in the list. For example 'a','b'-'c'defines list of cores contaning core with number 'a' and range of cores with numbers from 'b' to 'c'. The number should not exceed the number of cores available on the system.
By-default: "not-specified"
Set to specify memory affinity for oneCCL worker threads.
"<value>": "auto", "<nodelist>":
"auto" - Workers are automatically pinned to NUMA nodes that correspond to the CPU affinity of the workers.
"<nodelist>" - A comma-separated list of NUMA node numbers for all local workers, one number per worker. The i-th local worker is pinned to the i-th NUMA node in the list. The numbers should not exceed the number of NUMA nodes available on the system.
By-default: "not-specified"
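For example, assuming the standard oneCCL variable names CCL_WORKER_COUNT, CCL_WORKER_AFFINITY, and CCL_WORKER_MEM_AFFINITY (they are not spelled out above), two workers per rank could be pinned explicitly; the core and NUMA node numbers below are illustrative only:
  CCL_WORKER_COUNT=2 CCL_WORKER_AFFINITY=3,4 CCL_WORKER_MEM_AFFINITY=0,0 mpiexec -n 2 ./my_app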
Select the mechanism to collect ranks while creating a communicator.
"<value>":
"0" - use default implementation using sockets
"1" - use mpi
By default, the KVS implementation with sockets is used to collect the rank information while creating a communicator.
By-default: "0"
Set the timeout for setting up connections during KVS initialization.
"<timeout>" - Timeout in seconds for setting up sockets during KVS initialization
By-default: "120"
Set this environment variable to enable the OFI shared memory provider for communication between ranks on the same node for host (CPU) buffers.
Syntax
CCL_ATL_SHM="<value>"
Arguments
"<value>" Description
0 - Disables OFI shared memory provider (default).
1 - Enables OFI shared memory provider.
Description
Set this environment variable to enable the OFI shared memory provider for communication between ranks on the same node for host (CPU) buffers.
By-default: "0"
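For example, the OFI shared memory provider can be enabled for host buffers when all ranks run on one node (the launch command is illustrative only):
  CCL_ATL_SHM=1 CCL_ATL_TRANSPORT=ofi mpiexec -n 2 -ppn 2 ./my_app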
Set allgather algorithm.
ALLGATHER algorithms
direct - Based on MPI_Iallgather
naive - Send to all, receive from all
ring - Alltoall-based algorithm
flat - Alltoall-based algorithm
multi_bcast - Series of broadcast operations with different root ranks
topo - Topo scaleup algorithm
By-default: "topo", if sycl and l0 are enabled, otherwise "naive" for ofi or "direct" for mpi; "ring" used as fallback
Set allgatherv algorithm.
ALLGATHERV algorithms
direct - Based on MPI_Iallgatherv
naive - Send to all, receive from all
ring - Alltoall-based algorithm
flat - Alltoall-based algorithm
multi_bcast - Series of broadcast operations with different root ranks
topo - Topo scaleup algorithm
By-default: "topo", if sycl and l0 are enabled, otherwise "naive" for ofi or "direct" for mpi; "ring" used as fallback
Set allreduce algorithm.
ALLREDUCE algorithms
direct - Based on MPI_Iallreduce
rabenseifner - Rabenseifner’s algorithm
nreduce - May be beneficial for imbalanced workloads
ring - Reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on the reduce_scatter phase.
double_tree - Double-tree algorithm
recursive_doubling - Recursive doubling algorithm
2d - Two-dimensional algorithm (reduce_scatter + allreduce + allgather). Only available for host (CPU) buffers.
topo - Topo scaleup algorithm (available if sycl and l0 are enabled)
By-default: "topo" if sycl and l0 are enabled, otherwise "ring"
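For example, assuming the selector variable is named CCL_ALLREDUCE (the name is not spelled out above), the ring algorithm could be forced for a run:
  CCL_ALLREDUCE=ring mpiexec -n 4 ./my_app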
Set alltoall algorithm.
ALLTOALL algorithms
direct - Based on MPI_Ialltoall
naive - Send to all, receive from all
scatter - Scatter-based algorithm
topo - Topo scaleup algorithm (available if sycl and l0 are enabled)
By-default: "topo" if sycl and l0 are enabled, otherwise "scatter"
Set alltoallv algorithm.
ALLTOALLV algorithms
direct - Based on MPI_Ialltoallv
naive - Send to all, receive from all
scatter - Scatter-based algorithm
topo - Topo scaleup algorithm (available if sycl and l0 are enabled)
By-default: "topo" if sycl and l0 are enabled, otherwise "scatter"
Set barrier algorithm.
BARRIER algorithms
direct - Based on MPI_Ibarrier
ring - Ring-based algorithm
Note: BARRIER does not support the CCL_BARRIER_SCALEOUT environment variable. To change the algorithm for scaleout, use CCL_BARRIER.
By-default: "direct"
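For example, the ring-based barrier can be selected through CCL_BARRIER (the launch command is illustrative only):
  CCL_BARRIER=ring mpiexec -n 4 ./my_app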
Set broadcast algorithm.
BCAST algorithms
direct - Based on MPI_Ibcast
ring - Ring algorithm
double_tree - Double-tree algorithm
naive - Send to all from root rank
Note: The BCAST algorithm does not yet support the CCL_BCAST_SCALEOUT environment variable. To change the algorithm for BCAST, use CCL_BCAST.
By-default: "direct"
Set broadcastExt algorithm (send_buf, recv_buf).
BCAST algorithms
direct - Based on MPI_Ibcast
ring - Ring algorithm
double_tree - Double-tree algorithm
naive - Send to all from root rank
Note: The BCAST algorithm does not yet support the CCL_BCAST_SCALEOUT environment variable. To change the algorithm for BCAST, use CCL_BCAST.
By-default: "direct"
Set reduce algorithm.
REDUCE algorithms
direct - Based on MPI_Ireduce
rabenseifner - Rabenseifner’s algorithm
ring - Ring algorithm
tree - Tree algorithm
double_tree - Double-tree algorithm
topo - Topo scaleup algorithm (available if sycl and l0 are enabled)
By-default: "topo" if sycl and l0 are enabled, otherwise "tree" for ofi transport or "direct" for mpi
Set reduce-scatter algorithm.
REDUCE_SCATTER algorithms
direct - Based on MPI_Ireduce_scatter_block
naive - Send to all, receive and reduce from all
ring - Ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining.
topo - Topo algorithm (available if sycl and l0 are enabled, scaleup only)
By-default: "topo" if sycl and l0 are enabled, otherwise "naive" for ofi transport or "direct" for mpi
Set recv algorithm.
RECV algorithms
direct - Uses prepost (d2h-h2d) copies to get host buffers and invoke mpi/ofi->recv()
topo - Topo scale-up algorithm (available if sycl and l0 are enabled)
offload - Passes device buffers directly to the mpi/ofi layer, skipping the prepost d2h/h2d copies. Used by default for scale-out. Sets extra MPI environment variables for better performance (available if sycl and l0 are enabled)
By-default: "topo" if sycl and l0 are enabled, otherwise "offload" for ofi/mpi transport
Set send algorithm.
SEND algorithms
direct - Uses prepost (d2h-h2d) copies to get host buffers and invoke mpi/ofi->send()
topo - Topo scale-up algorithm (available if sycl and l0 are enabled)
offload - Passes device buffers directly to the mpi/ofi layer, skipping the prepost d2h/h2d copies. Used by default for scale-out. Sets extra MPI environment variables for better performance (available if sycl and l0 are enabled)
By-default: "topo" if sycl and l0 are enabled, otherwise "offload" for ofi/mpi transport
Set scaleout allgather algorithm.
ALLGATHER algorithms
direct - Based on MPI_Iallgather
naive - Send to all, receive from all
ring - Alltoall-based algorithm
flat - Alltoall-based algorithm
multi_bcast - Series of broadcast operations with different root ranks
By-default: "naive" for ofi or "direct" for mpi; "ring" used as fallback
Set scaleout allgatherv algorithm.
ALLGATHERV algorithms
direct - Based on MPI_Iallgatherv
naive - Send to all, receive from all
ring - Alltoall-based algorithm
flat - Alltoall-based algorithm
multi_bcast - Series of broadcast operations with different root ranks
By-default: "naive" for ofi or "direct" for mpi; "ring" used as fallback
Set allreduce scaleout algorithm.
ALLREDUCE algorithms
direct - Based on MPI_Iallreduce
rabenseifner - Rabenseifner’s algorithm
nreduce - May be beneficial for imbalanced workloads
ring - Reduce_scatter + allgather ring. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining on the reduce_scatter phase.
double_tree - Double-tree algorithm
recursive_doubling - Recursive doubling algorithm
2d - Two-dimensional algorithm (reduce_scatter + allreduce + allgather). Only available for host (CPU) buffers.
By-default: "ring"
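For example, assuming the scaleout selector is named CCL_ALLREDUCE_SCALEOUT (the name is not spelled out above), the scaleout phase of allreduce could be switched to Rabenseifner's algorithm while leaving the scaleup part at its default:
  CCL_ALLREDUCE_SCALEOUT=rabenseifner mpiexec -n 8 ./my_app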
Set alltoall scaleout algorithm.
ALLTOALL algorithms
direct - Based on MPI_Ialltoall
naive - Send to all, receive from all
scatter - Scatter-based algorithm
By-default: "scatter"
Set alltoallv scaleout algorithm.
ALLTOALLV algorithms
direct - Based on MPI_Ialltoallv
naive - Send to all, receive from all
scatter - Scatter-based algorithm
By-default: "scatter"
Set reduce scaleout algorithm.
REDUCE algorithms
direct - Based on MPI_Ireduce
rabenseifner - Rabenseifner’s algorithm
ring - Ring algorithm
tree - Tree algorithm
double_tree - Double-tree algorithm
By-default: "double_tree"
Set reduce-scatter scaleout algorithm.
REDUCE_SCATTER algorithms
direct - Based on MPI_Ireduce_scatter_block
naive - Send to all, receive and reduce from all
ring - Ring-based algorithm. Use CCL_RS_CHUNK_COUNT and CCL_RS_MIN_CHUNK_SIZE to control pipelining.
By-default: "naive"
Specifies the size of the intermediate buffer used by oneCCL for collective operations.
The CCL_ZE_TMP_BUF_SIZE environment variable controls the size of the temporary buffer used by collective operations in the 'topo' algorithms. It has no effect on other algorithms. Smaller values can reduce memory usage at the expense of performance for 'topo' algorithms.
Syntax
CCL_ZE_TMP_BUF_SIZE="<value>"
Arguments
"<value>" Description
SIZE - The size of the buffer in bytes.
By-default: "536870912"
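For example, the temporary buffer could be halved to 256 MiB to reduce device memory usage for 'topo' algorithms (the value is illustrative only):
  CCL_ZE_TMP_BUF_SIZE=268435456 mpiexec -n 2 ./my_app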
Set to specify maximum number of chunks for reduce_scatter phase in ring allreduce.
"<count>" - Maximum number of chunks for reduce_scatter phase in ring allreduce By-default: "1"
Set to specify minimum number of bytes in chunk for reduce_scatter phase in ring allreduce.
"<size>" - Minimum number of bytes in chunk for reduce_scatter phase in ring allreduce. Affects actual value of CCL_RS_CHUNK_COUNT. By-default: "65536"
Set this environment variable to select read or write based device-to-device data copy during the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.
Syntax
CCL_REDUCE_SCATTER_TOPO_READ="<value>"
Arguments
"<value>" Description
1 - Uses read-based copy to transfer data across GPUs for the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives (default).
0 - Uses write-based copy to transfer data across GPUs for the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives.
Description
Set this environment variable to select read- or write-based device-to-device data copy during the reduce_scatter stage of Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.
By-default: "1"
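For example, a write-based copy can be tried when tuning reduce_scatter behavior on GPU buffers (the launch command is illustrative only):
  CCL_REDUCE_SCATTER_TOPO_READ=0 mpiexec -n 2 ./my_app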
Set this environment variable to 1 to enable synchronous dependencies processing for oneCCL operations.
Syntax
CCL_ZE_DEPS_SYNC="<value>"
Arguments
"<value>" Description
1 - Dependencies of oneCCL operations are processed synchronously.
0 - Dependencies of oneCCL operations are processed asynchronously (default), meaning that further L0 submissions are made while dependencies are in progress. Dependencies signal when processed.
Description
Set this environment variable to 1 to make oneCCL block the thread until previous sycl/L0 submissions are finished.
By-default: "0"
Set this environment variable to enable compute kernels for Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.
Syntax
CCL_REDUCE_SCATTER_MONOLITHIC_KERNEL="<value>"
Arguments
"<value>" Description
1 - Uses compute kernels to transfer data across GPUs for Allreduce, Reduce, and Reduce-Scatter collectives.
0 - Uses copy engines to transfer data across GPUs for Allreduce, Reduce, and Reduce-Scatter collectives (default).
Description
Set this environment variable to enable compute kernels for Allreduce, Reduce, and Reduce-Scatter collectives using device (GPU) buffers.
By-default: "0"
Set this environment variable to enable compute kernels for Allgatherv collectives using device (GPU) buffers.
Syntax
CCL_ALLGATHERV_MONOLITHIC_PIPELINE_KERNEL="<value>"
Arguments
"<value>" Description
1 - Uses compute kernels to transfer data across GPUs for Allgatherv collectives.
0 - Uses copy engines to transfer data across GPUs for Allgatherv collectives (default).
Description
Set this environment variable to enable compute kernels for Allgatherv collectives using device (GPU) buffers.
By-default: "0"
Set this environment variable to enable compute kernels for Alltoall and Alltoallv collectives using device (GPU) buffers.
Syntax
CCL_ALLTOALLV_MONOLITHIC_KERNEL="<value>"
Arguments
"<value>" Description
1 - Uses compute kernels to transfer data across GPUs for Alltoall and Alltoallv collectives (default).
0 - Uses copy engines to transfer data across GPUs for Alltoall and Alltoallv collectives.
Description
Set this environment variable to enable compute kernels for Alltoall and Alltoallv collectives using device (GPU) buffers.
By-default: "1"
Set this environment variable to enable the pipelining implementation for Allgatherv collectives using device (GPU) buffers.
Syntax
CCL_ALLGATHERV_PIPE_CHUNK_COUNT="<value>"
Arguments
"<value>" Description
0 - (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code.
1 - Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.
2 or higher - Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow several phases of the algorithm to run in parallel, as long as they do not use the same physical resource. Effectively, this should increase performance.
Description
Set this environment variable to control how many chunks are used for Allgatherv pipeline-based collectives using device (GPU) buffers.
By-default: "0"
Set this environment variable to enable the pipelining implementation for Allreduce collectives using device (GPU) buffers.
Syntax
CCL_ALLREDUCE_PIPE_CHUNK_COUNT="<value>"
Arguments
"<value>" Description
0 - (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code.
1 - Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.
2 or higher - Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow several phases of the algorithm to run in parallel, as long as they do not use the same physical resource. Effectively, this should increase performance.
Description
Set this environment variable to control how many chunks are used for Allreduce pipeline-based collectives using device (GPU) buffers.
By-default: "0"
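For example, an Allreduce on large GPU buffers could be split into four pipelined chunks (the chunk count is illustrative only):
  CCL_ALLREDUCE_PIPE_CHUNK_COUNT=4 mpiexec -n 4 ./my_app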
Set this environment variable to enable the pipelining implementation for Reduce_Scatter collectives using device (GPU) buffers.
Syntax
CCL_REDUCE_SCATTER_PIPE_CHUNK_COUNT="<value>"
Arguments
"<value>" Description
0 - (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code.
1 - Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.
2 or higher - Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow several phases of the algorithm to run in parallel, as long as they do not use the same physical resource. Effectively, this should increase performance.
Description
Set this environment variable to control how many chunks are used for Reduce_Scatter pipeline-based collectives using device (GPU) buffers.
By-default: "0"
Set this environment variable to enable the pipelining implementation for Reduce collectives using device (GPU) buffers.
Syntax
CCL_REDUCE_PIPE_CHUNK_COUNT="<value>"
Arguments
"<value>" Description
0 - (default) Bypasses the chunking/pipelining code and directly calls the topology-aware code.
1 - Calls the pipelining code with a single chunk. Effectively, it has identical behavior and performance as with "0", but exercises the chunking code path with a single chunk.
2 or higher - Divides the message into as many logical parts, or chunks, as specified. Then, it executes the collective with each logical chunk. This should allow several phases of the algorithm to run in parallel, as long as they do not use the same physical resource. Effectively, this should increase performance.
Description
Set this environment variable to control how many chunks are used for Reduce pipeline-based collectives using device (GPU) buffers.
By-default: "0"
Set this environment variable to specify the rank number of the current process in the local host.
Syntax
CCL_LOCAL_RANK="<value>"
Arguments
"<value>" Description
RANK - Rank number of the current process in the local host.
Description
Set this environment variable to specify the rank number of the current process in the local host.
By-default: N/A; job/process launcher (CCL_PROCESS_LAUNCHER) needs to be used if variable not specified
Set this environment variable to specify the total number of ranks on the local host.
Syntax
CCL_LOCAL_SIZE="<value>"
Arguments
"<value>" Description
SIZE - Total number of ranks on the local host.
Description
Set this environment variable to specify the total number of ranks on the local host.
By-default: N/A; job/process launcher (CCL_PROCESS_LAUNCHER) needs to be used if variable not specified
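For example, when no supported launcher is used, a two-rank single-node run can describe each process explicitly; the command below is for the process acting as local rank 0, and the second process would set CCL_LOCAL_RANK=1:
  CCL_PROCESS_LAUNCHER=none CCL_LOCAL_SIZE=2 CCL_LOCAL_RANK=0 ./my_app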
Set this environment variable to specify the job launcher to use.
Syntax
CCL_PROCESS_LAUNCHER="<value>"
Arguments
"<value>" Description
hydra - Uses the MPI hydra job launcher (default).
torch - Uses the torch job launcher.
pmix - Used with the PALS job launcher, which uses the pmix API, so your mpiexec command should look something like this: CCL_PROCESS_LAUNCHER=pmix CCL_ATL_TRANSPORT=mpi mpiexec -np 2 -ppn 2 --pmi=pmix ...
none - No job launcher is used. In this case, the user needs to specify the values for CCL_LOCAL_SIZE and CCL_LOCAL_RANK.
Description
Set this environment variable to specify the job launcher to use.
By-default: "hydra"
Set this environment variable to enable or disable the caching of IPC handles opened with zeMemOpenIpcHandle().
This controls whether oneCCL caches IPC handles opened with zeMemOpenIpcHandle() on the receiver's side. When enabled, it caches opened IPC handles, which can improve performance in certain scenarios. See https://spec.oneapi.io/level-zero/latest/core/PROG.html#memory-1
Syntax
CCL_ZE_CACHE_OPEN_IPC_HANDLES="<value>"
Arguments
"<value>" Description
0 - Disables the caching of opened IPC handles.
1 - Enables the caching of opened IPC handles (default).
By-default: "1"
Set this environment variable to specify the per process threshold for caching IPC handles opened with zeMemOpenIpcHandle().
This specifies the threshold for caching opened IPC handles on the receiver's side. When the number of open IPC handles exceeds this threshold, the cache starts evicting handles via LRU.
Syntax
CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD="<value>"
Arguments
"<value>" Description
SIZE - The threshold value for caching open IPC handles.
By-default: "1000"
Set this environment variable to specify the per-process threshold for caching IPC handles obtained with zeMemGetIpcHandle().
This environment variable specifies the threshold for caching IPC handles obtained with zeMemGetIpcHandle() on the sender's side. When the number of such handles exceeds this threshold, the cache starts evicting handles via LRU.
Syntax
CCL_ZE_CACHE_GET_IPC_HANDLES_THRESHOLD="<value>"
Arguments
"<value>" Description
SIZE - The threshold value for caching get IPC handles.
By-default: "1000"
Set this environment variable to enable or disable the caching of IPC handles obtained with zeMemGetIpcHandle().
This controls whether oneCCL caches IPC handles obtained with zeMemGetIpcHandle() on the sender's side. When enabled, it caches IPC handles, which can improve performance in certain scenarios. By default, the caching of get IPC handles is enabled. See https://spec.oneapi.io/level-zero/latest/core/PROG.html#memory-1
Syntax
CCL_ZE_CACHE_GET_IPC_HANDLES="<value>"
Arguments
"<value>" Description
0 - Disables the caching of get IPC handles.
1 - Enables the caching of get IPC handles (default).
By-default: "1"
Set to enable oversubscription in topo fallback stage for all collectives.
This environment variable enables or disables the oversubscription fallback from the topo algorithm to copy in/out.
"<value>" : "0", "1"
By-default: "1"
Set to enable oversubscription throw for all collectives.
This environment variable enables or disables the oversubscription throw check.
"<value>" : "0", "1"
By-default: "1"
Set the directory path for DRM render devices.
This environment variable specifies the directory path where DRM render devices are located. Example value: "/custom/path/to/devices/" By-default: "/dev/dri/by-path/"
Set the suffix for DRM render device names.
This environment variable specifies the suffix to be used when searching for DRM render device names. Example value: "-customsuffix" By-default: "-render"
Set to specify the monolithic pipeline approach for the reduce_scatter phase in allreduce and reduce collectives.
This environment variable has the advantage of forming a seamless pipeline that conceals the data transfer time across MDFI. This way, a process reads the data from its peer tile on the same GPU, performs the reduction, and writes to a temporary buffer located on a different GPU. This covers the time for transferring the data through XeLinks during the reduce_scatter phase in allreduce and reduce collectives.
"<value>" : "0", "1"
By-default: "1"
Set to specify the mechanism to use for Level Zero IPC exchange.
"drmfd" - Uses a the DRM mechanism for Level Zero IPC exchange. This is an experimental mechanism that is used with OS kernels previous to SP4. To use the DRM mechanism, the libdrm and drm headers must be available on a system.
"pidfd" - Uses pidfd mechanism for Level Zero IPC exchange. It requires OS kernel SP4 or above as it requires Linux 5.6 kernel or above
"sockets" - Uses socket mechanism for Level Zero IPC exchange. It is usually slower than the other two mechanisms, but can be used for debugging as it is usually available on most systems "<value>": "drmfd", "pidfd", "sockets" By-default: "drmfd"
Use BDF support for mapping logical devices to physical devices.
To obtain the physical device id based on the BDF, oneCCL gets and parses the BDF values and then uses them to identify the particular device by referencing the appropriate fields in the PCI configuration space of PCI devices. This is used to map logical devices to their corresponding physical devices.
"<value>" : "0", "1"
By-default: "1"
Use the fallback algorithm for reduce_scatter.
The fallback algorithm performs a full allreduce and then copies a subset of its output to the recv buffer. Currently, the fallback algorithm is used for scaleout, whereas scaleup uses the optimized algorithm.
"<value>" : "0", "1"
By-default: "0"
Automatically tune algorithm protocols based on port count.
Uses the number of ports to detect 12-port systems and uses write protocols for collectives on such systems. Users can disable this automatic detection and select the protocols manually.
"<value>" : "0", "1"
By-default: "1"
Enable switching of read and write protocols for pt2pt topo algorithm.
Control pt2pt read/write protocols.
Read protocol:
The SEND side exchanges the handle with the RECV side. The copy operation is then executed on the RECV side, where the destination buffer is the local buffer and the source buffer is the remote buffer.
Write protocol:
The RECV side exchanges the handle with the SEND side. The copy operation is executed on the SEND side, where the destination buffer is the remote buffer and the source buffer is the local buffer.
"<value>" : "0", "1"
By-default: "1"
Tunable value for collectives to adjust copy engine indexes.
Use copy engine indexes 2, 4, 6 on hosts with 6 ports for allreduce, reduce, and allgatherv.
"<value>":
"on" - always use write mode with calculated indexes
"off" - always disabled
"detected" - determined by the detection logic
"undetected" - the default value, used before the detection logic has run
By-default: "undetected"
Switch ccl::barrier() host-sync / host-async options.
Historically, ccl::barrier() was always synchronous, which does not match the oneCCL asynchronous concept. Like other collectives, ccl::barrier() should be host-asynchronous if possible. As it would be too much to change at once, this experimental variable introduces the option to make barrier host-asynchronous. Use CCL_BARRIER_SYNC=0 to achieve that.
By-default: "1 (SYNC)"
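For example, the experimental host-asynchronous barrier can be tried as follows (the launch command is illustrative only):
  CCL_BARRIER_SYNC=0 mpiexec -n 4 ./my_app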
Enable SYCL kernels.
Setting this environment variable to 1 enables the SYCL kernel-based implementation for allgatherv, allreduce, and reduce_scatter. Support includes all message sizes and some data types (int32, fp32, fp16, and bf16), the sum operation, and single node. oneCCL falls back to other implementations when support is not available with SYCL kernels, so the user can safely set this environment variable.
"<value>" : "0", "1"
By-default: "0 (disabled)"
Enable the use of persistent temporary buffer in allgatherv.
Setting this environment variable to 1 enables the use of a persistent temporary buffer to perform the allgatherv operation. This implementation makes the collective fully asynchronous but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. "<value>" : "0", "1" By-default: "0 (disabled)"
Specify the threshold for the small size algorithm in allgatherv.
Set the threshold in bytes to specify the small size algorithm in the allgatherv collective. Default value is 131072.
"<value>" : ">=0"
Specify the threshold for the medium size algorithm in allgatherv.
Set the threshold in bytes to specify the medium size algorithm in the allgatherv collective. Default value is 2097152.
"<value>" : ">=0"
Enable the use of persistent temporary buffer in allreduce.
Setting this environment variable to 1 enables the use of a persistent temporary buffer to perform the allreduce operation. This implementation makes the collective fully asynchronous but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. "<value>" : "0", "1" By-default: "0 (disabled)"
Specify the threshold for the small size algorithm in allreduce.
Set the threshold in bytes to specify the small size algorithm in the allreduce collective. Default value is 524288.
"<value>" : ">=0"
Specify the threshold for the medium size algorithm in allreduce.
Set the threshold in bytes to specify the medium size algorithm in the allreduce collective. Default value is 16777216.
"<value>" : ">=0"
Enable the use of persistent temporary buffer in reduce_scatter.
Setting this environment variable to 1 enables the use of a persistent temporary buffer to perform the reduce_scatter operation. This implementation makes the collective fully asynchronous but adds some additional overhead due to the extra copy of the user buffer to a (persistent) temporary buffer. "<value>" : "0", "1" By-default: "0 (disabled)"
Specify the threshold for the small size algorithm in reduce_scatter.
Set the threshold in bytes to specify the small size algorithm in the reduce_scatter collective. Default value is 2097152.
"<value>" : ">=0"
Specify the threshold for the medium size algorithm in reduce_scatter.
Set the threshold in bytes to specify the medium size algorithm in the reduce_scatter collective. Default value is 67108864.
"<value>" : ">=0"
Experimental oneCCL Environment Variables
The functionality of these variables has not been (fully) tested and, therefore, cannot be supported nor guaranteed.