
Investigate performance delta between cuda.parallel and CuPy reduction #3213

Open
shwina opened this issue Dec 21, 2024 · 6 comments
@shwina (Contributor) commented Dec 21, 2024

In #2958, we note a constant delta between cuda.parallel and CuPy reduction for smaller input sizes. We should find and eliminate the reasons for that delta.

@gevtushenko (Collaborator)

Potentially related to #3574

@oleksandr-pavlyk (Contributor)

Using the following scripts:

# Filename: device_reduce.py
import numpy as np
import cupy as cp
import cuda.parallel.experimental.algorithms as algos


def add(a, b):
  return a + b


def device_reduce(x, reduce_axis, out_axis, out=None, keepdims=False):
  # (reduce_axis, out_axis and keepdims are unused by this benchmark helper)
  if out is None:
    out = cp.empty((1,), dtype=x.dtype)
  h_init = np.zeros(tuple(), dtype=x.dtype)
  reducer_f = algos.reduce_into(x, out, add, h_init)
  # First call with temp_storage=None only queries the required temporary
  # storage size; the second call performs the actual reduction.
  temp_storage_bytes = reducer_f(None, x, out, x.size, h_init)
  temp = cp.empty((temp_storage_bytes,), dtype=cp.int8)
  reducer_f(temp, x, out, x.size, h_init)
  return out
# Filename: bench.py
import cupy as cp
import time
from device_reduce import device_reduce


def reduce_cupy(input_arr, out_arr):
  return input_arr.sum(out=out_arr)


def reduce_cuda_parallel(input_arr, out_arr):
  zerod_shape = (0,)
  return device_reduce(input_arr, zerod_shape, zerod_shape, out=out_arr)


def bench(impl_func):

  # List of input sizes to benchmark
  input_sizes = [10, 100, 1000, 10000, 100000, 1000000, 10_000_000, 100_000_000]

  input_ = cp.random.rand(max(input_sizes))  # One random array at the largest size; smaller sizes are slices of it
  out_ = cp.empty(tuple(), dtype=input_.dtype)

  # Dictionary to store results (size, avg_time_with_first_run, avg_time_without_first_run)
  results = {}
  n_reps = 100


  for size in input_sizes:
    times = []

    arr = input_[:size]

    # Run the benchmark 10 times for each input size
    for _ in range(10):

      # Record the start time using host timer
      start_time = time.perf_counter()

      # Perform the sum operation
      for _ in range(n_reps):
        result = impl_func(arr, out_)
      cp.cuda.runtime.deviceSynchronize()

      # Record the end time and calculate the time taken
      end_time = time.perf_counter()

      # Calculate the time taken for this run and store it
      time_taken = (end_time - start_time) / n_reps
      times.append(time_taken)

    # Calculate the average time including the first run
    avg_time_with_first_run = sum(times) / len(times)

    # Calculate the average time excluding the first run
    avg_time_without_first_run = sum(times[1:]) / (len(times) - 1)

    # Store the results
    results[size] = (avg_time_with_first_run, avg_time_without_first_run)

  # Print out the results
  print("Benchmark Results (input size, average time with first run, average time without first run):")
  for size, (avg_with_first, avg_without_first) in results.items():
    print(f"Input size: {size:10d} | Avg time with first run: {avg_with_first:.8f} seconds | Avg time without first run: {avg_without_first:.8f} seconds")


if __name__ == "__main__":
  import argparse

  parser = argparse.ArgumentParser(
    prog="bench_reduction",
    description="Utility to benchmark issue cccl#3213"
  )
  parser.add_argument("--cupy", action="store_true")
  parser.add_argument("--cuda_parallel", action="store_true")
  args = parser.parse_args()

  if bool(args.cupy) != bool(args.cuda_parallel):
    if args.cupy:
      print("Benching reduce_cupy")
      bench(reduce_cupy)
    else:
      print("Benching reduce_cuda_parallel")
      bench(reduce_cuda_parallel)
  else:
    raise RuntimeError(
      "Please specify what to bench by using --cupy, or --cuda_parallel arguments"
    )

I confirm that for short arrays CuPy's reduction is about 3.5x faster than cuda.parallel's (roughly 11 µs vs. 36 µs per call, a near-constant gap of about 25 µs that disappears by 10^6 elements):

(build-py312) opavlyk@ee09c48-lcedt:~/repos/cccl/python/cuda_parallel/bench$ python bench.py --cupy
Benching reduce_cupy
Benchmark Results (input size, average time with first run, average time without first run):
Input size:         10 | Avg time with first run: 0.00003298 seconds | Avg time without first run: 0.00001065 seconds
Input size:        100 | Avg time with first run: 0.00001066 seconds | Avg time without first run: 0.00001065 seconds
Input size:       1000 | Avg time with first run: 0.00001069 seconds | Avg time without first run: 0.00001069 seconds
Input size:      10000 | Avg time with first run: 0.00001300 seconds | Avg time without first run: 0.00001296 seconds
Input size:     100000 | Avg time with first run: 0.00001534 seconds | Avg time without first run: 0.00001534 seconds
Input size:    1000000 | Avg time with first run: 0.00009608 seconds | Avg time without first run: 0.00009601 seconds
Input size:   10000000 | Avg time with first run: 0.00088291 seconds | Avg time without first run: 0.00088328 seconds
Input size:  100000000 | Avg time with first run: 0.00860134 seconds | Avg time without first run: 0.00860142 seconds
(build-py312) opavlyk@ee09c48-lcedt:~/repos/cccl/python/cuda_parallel/bench$ python bench.py --cuda_parallel
Benching reduce_cuda_parallel
Benchmark Results (input size, average time with first run, average time without first run):
Input size:         10 | Avg time with first run: 0.00111234 seconds | Avg time without first run: 0.00003598 seconds
Input size:        100 | Avg time with first run: 0.00003612 seconds | Avg time without first run: 0.00003609 seconds
Input size:       1000 | Avg time with first run: 0.00003611 seconds | Avg time without first run: 0.00003607 seconds
Input size:      10000 | Avg time with first run: 0.00003847 seconds | Avg time without first run: 0.00003836 seconds
Input size:     100000 | Avg time with first run: 0.00003848 seconds | Avg time without first run: 0.00003846 seconds
Input size:    1000000 | Avg time with first run: 0.00009521 seconds | Avg time without first run: 0.00009521 seconds
Input size:   10000000 | Avg time with first run: 0.00086744 seconds | Avg time without first run: 0.00086760 seconds
Input size:  100000000 | Avg time with first run: 0.00859974 seconds | Avg time without first run: 0.00860017 seconds

Using line_profiler and the following script to focus on the hot-spots of reducing short arrays with cuda.parallel:

# Filename: lprof_reduce_into.py
import cupy as cp
from device_reduce import device_reduce

def foo(input_, out_):
    for _ in range(10000):
        result = device_reduce(input_, (0,), (0,), out=out_)
    cp.cuda.runtime.deviceSynchronize()


if __name__ == "__main__":
    input_ = cp.ones(100, dtype=cp.float64)
    out_ = cp.empty(tuple(), dtype=input_.dtype)

    result = device_reduce(input_, (0,), (0,), out=out_)
    foo(input_, out_)

and decorating _Reducer.__call__ with @line_profiler.profile, I find that 46% of __call__ execution is spent in cccl.to_cccl_iter, 12.6% in cccl.to_cccl_value, and 7.5% in data-type validation:

Total time: 0.513869 s
File: /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/algorithms/reduce.py
Function: __call__ at line 93

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    93                                               @line_profiler.profile
    94                                               def __call__(
    95                                                   self,
    96                                                   temp_storage,
    97                                                   d_in,
    98                                                   d_out,
    99                                                   num_items: int,
   100                                                   h_init: np.ndarray | GpuStruct,
   101                                                   stream=None,
   102                                               ):
   103     20002     132727.8      6.6     25.8          d_in_cccl = cccl.to_cccl_iter(d_in)
   104     20002       5099.7      0.3      1.0          if d_in_cccl.type.value == cccl.IteratorKind.ITERATOR:
   105                                                       assert num_items is not None
   106                                                   else:
   107     20002       3371.5      0.2      0.7              assert d_in_cccl.type.value == cccl.IteratorKind.POINTER
   108     20002       2304.5      0.1      0.4              if num_items is None:
   109                                                           num_items = d_in.size
   110                                                       else:
   111     20002       2615.9      0.1      0.5                  assert num_items == d_in.size
   112     40004       9715.5      0.2      1.9          _dtype_validation(
   113     20002       1691.6      0.1      0.3              self._ctor_d_in_cccl_type_enum_name,
   114     20002      11441.3      0.6      2.2              cccl.type_enum_as_name(d_in_cccl.value_type.type.value),
   115                                                   )
   116     20002      31746.0      1.6      6.2          _dtype_validation(self._ctor_d_out_dtype, protocols.get_dtype(d_out))
   117     20002       6903.4      0.3      1.3          _dtype_validation(self._ctor_init_dtype, h_init.dtype)
   118     20002       8213.6      0.4      1.6          stream_handle = protocols.validate_and_get_stream(stream)
   119     20002       2934.9      0.1      0.6          bindings = get_bindings()
   120     20002       2451.6      0.1      0.5          if temp_storage is None:
   121     10001       1391.8      0.1      0.3              temp_storage_bytes = ctypes.c_size_t()
   122     10001        917.2      0.1      0.2              d_temp_storage = None
   123                                                   else:
   124     10001       2604.0      0.3      0.5              temp_storage_bytes = ctypes.c_size_t(temp_storage.nbytes)
   125                                                       # Note: this is slightly slower, but supports all ndarray-like objects as long as they support CAI
   126                                                       # TODO: switch to use gpumemoryview once it's ready
   127     10001       8458.1      0.8      1.6              d_temp_storage = temp_storage.__cuda_array_interface__["data"][0]
   128     20002     110095.5      5.5     21.4          d_out_cccl = cccl.to_cccl_iter(d_out)
   129     40004      49975.2      1.2      9.7          error = bindings.cccl_device_reduce(
   130     20002       1681.4      0.1      0.3              self.build_result,
   131     20002       1804.9      0.1      0.4              d_temp_storage,
   132     20002       3047.8      0.2      0.6              ctypes.byref(temp_storage_bytes),
   133     20002       1642.4      0.1      0.3              d_in_cccl,
   134     20002       1662.7      0.1      0.3              d_out_cccl,
   135     20002       3122.7      0.2      0.6              ctypes.c_ulonglong(num_items),
   136     20002      28228.0      1.4      5.5              self.op_wrapper.handle(),
   137     20002      64682.1      3.2     12.6              cccl.to_cccl_value(h_init),
   138     20002       1872.5      0.1      0.4              stream_handle,
   139                                                   )
   140     20002       3052.3      0.2      0.6          if error != enums.CUDA_SUCCESS:
   141                                                       raise ValueError("Error reducing")
   142                                           
   143     20002       8413.1      0.4      1.6          return temp_storage_bytes.value

  0.51 seconds - /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/algorithms/reduce.py:93 - __call__
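For reference, the per-line breakdowns here were obtained roughly as follows; the exact setup is my reconstruction, and it assumes a line_profiler release that provides the explicit line_profiler.profile decorator and the LINE_PROFILE environment switch:

# Sketch: decorate the function of interest in the installed copy of
# cuda/parallel/experimental/algorithms/reduce.py
import line_profiler

class _Reducer:
    # ...existing constructor and attributes elided...

    @line_profiler.profile  # pass-through unless line profiling is enabled
    def __call__(self, temp_storage, d_in, d_out, num_items, h_init, stream=None):
        ...

# Then run the driver script with profiling enabled, e.g.:
#   LINE_PROFILE=1 python lprof_reduce_into.py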

to_cccl_iter spends most of its time in _device_array_to_cccl_iter:

Total time: 0.253931 s
File: /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/_cccl.py
Function: to_cccl_iter at line 213

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   213                                           @line_profiler.profile
   214                                           def to_cccl_iter(array_or_iterator) -> Iterator:
   215     40006       6196.9      0.2      2.4      if isinstance(array_or_iterator, IteratorBase):
   216                                                   return _iterator_to_cccl_iter(array_or_iterator)
   217     40006     247734.5      6.2     97.6      return _device_array_to_cccl_iter(array_or_iterator)

  0.25 seconds - /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/_cccl.py:213 - to_cccl_iter

That function, in turn, has its hot-spots spread between checking whether the input array is contiguous, mapping the array dtype to its type info, and constructing the Iterator:

Total time: 0.264932 s
File: /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/_cccl.py
Function: _device_array_to_cccl_iter at line 142

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   142                                           @line_profiler.profile
   143                                           def _device_array_to_cccl_iter(array: DeviceArrayLike) -> Iterator:
   144     40006      96438.4      2.4     36.4      if not is_contiguous(array):
   145                                                   raise ValueError("Non-contiguous arrays are not supported.")
   146     40006      73963.7      1.8     27.9      info = _numpy_type_to_info(get_dtype(array))
   147     80012      36694.9      0.5     13.9      return Iterator(
   148     40006       4466.2      0.1      1.7          info.size,
   149     40006       4443.3      0.1      1.7          info.alignment,
   150     40006       4147.7      0.1      1.6          IteratorKind.POINTER,
   151     40006       5097.8      0.1      1.9          Op(),
   152     40006       4631.0      0.1      1.7          Op(),
   153     40006       3184.9      0.1      1.2          info,
   154                                                   # Note: this is slightly slower, but supports all ndarray-like objects
   155                                                   # as long as they support CAI
   156                                                   # TODO: switch to use gpumemoryview once it's ready
   157     40006      31863.7      0.8     12.0          array.__cuda_array_interface__["data"][0],
   158                                               )

  0.26 seconds - /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/_cccl.py:142 - _device_array_to_cccl_iter

The is_contiguous check spends most of its time in the get_shape and get_strides helpers:

Total time: 0.0951585 s
File: /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/_utils/protocols.py
Function: is_contiguous at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           @line_profiler.profile
    39                                           def is_contiguous(arr: DeviceArrayLike) -> bool:
    40     40006      87058.4      2.2     91.5      shape, strides = get_shape(arr), get_strides(arr)
    41                                           
    42     40006       3984.9      0.1      4.2      if strides is None:
    43     40006       4115.2      0.1      4.3          return True
    44                                           

After modifying the implementation so that is_contiguous reads __cuda_array_interface__ directly, I see that even that single access is fairly heavy:

Total time: 0.0422951 s
File: /home/opavlyk/miniforge/envs/build-py312/lib/python3.12/site-packages/cuda/parallel/experimental/_utils/protocols.py
Function: is_contiguous at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           @line_profiler.profile
    39                                           def is_contiguous(arr: DeviceArrayLike) -> bool:
    40     40006      30100.9      0.8     71.2      _cai = arr.__cuda_array_interface__
    41     40006       3955.0      0.1      9.4      strides = _cai["strides"]
    42                                           
    43     40006       3734.8      0.1      8.8      if strides is None:
    44     40006       4504.5      0.1     10.7          return True
    45                                           
    46                                               shape = _cai["shape"]
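For context, a complete CAI-based check along these lines might look as follows; this is a sketch of mine (with a hypothetical helper name) rather than the actual patch, and it tests C-contiguity only, whereas the real helper may also need to accept Fortran-ordered arrays:

# Sketch: a contiguity check that touches __cuda_array_interface__ only once.
import numpy as np

def is_contiguous_cai(arr) -> bool:
    cai = arr.__cuda_array_interface__
    strides = cai["strides"]
    if strides is None:
        # The CUDA Array Interface defines strides=None as C-contiguous.
        return True
    shape = cai["shape"]
    itemsize = np.dtype(cai["typestr"]).itemsize
    # C-contiguity: byte strides must match the row-major cumulative products.
    expected = itemsize
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent > 1 and stride != expected:
            return False
        expected *= extent
    return True

The point is simply that a single __cuda_array_interface__ access per call is enough to drive the whole check.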

@leofang (Member) commented Feb 3, 2025

Great investigation, Sasha! So this means we really should start considering integrating StridedMemoryView from cuda.core with cuda.parallel asap (which was the original plan anyway)... CAI is known to be a bottleneck and cuda.parallel does not support DLPack, so everything goes through CAI.
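A minimal sketch of what that integration might look like, assuming cuda.core's experimental StridedMemoryView interface (the exact import path, constructor signature, and attribute names here are assumptions and may differ):

# Sketch (assumed API): read pointer/shape/strides/dtype through a single
# StridedMemoryView instead of repeated __cuda_array_interface__ accesses.
import cupy as cp
from cuda.core.experimental.utils import StridedMemoryView

arr = cp.arange(100, dtype=cp.float64)

# stream_ptr=-1 is assumed to request a view without establishing stream ordering.
view = StridedMemoryView(arr, stream_ptr=-1)
ptr, shape, strides, dtype = view.ptr, view.shape, view.strides, view.dtype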

@shwina (Contributor, Author) commented Feb 3, 2025

Thank you, Sasha and Leo! Definitely interested to see if and how much StridedMemoryView can help us with these costs.

@shwina (Contributor, Author) commented Feb 4, 2025

In an offline sync today, it also came up that the choice of the type used to specify the num_items parameter can impact performance in a non-trivial way. CuPy uses a plain int to specify the number of items, whereas we always use an unsigned long long and the corresponding offset type.

It would be interesting to see whether using int or unsigned int here also yields any performance improvement.
