Investigate performance delta between cuda.parallel and CuPy reduction #3213
Potentially related to #3574.
Using the following scripts:

```python
# Filename: device_reduce.py
import numpy as np
import cupy as cp
import cuda.parallel.experimental.algorithms as algos


def add(a, b):
    return a + b


def device_reduce(x, reduce_axis, out_axis, out=None, keepdims=False):
    if out is None:
        out = cp.empty((1,), dtype=x.dtype)
    h_init = np.zeros(tuple(), dtype=x.dtype)
    reducer_f = algos.reduce_into(x, out, add, h_init)
    # First call queries the required temporary storage size
    temp_storage_bytes = reducer_f(None, x, out, x.size, h_init)
    temp = cp.empty((temp_storage_bytes,), dtype=cp.int8)
    # Second call performs the reduction
    reducer_f(temp, x, out, x.size, h_init)
    return out
```
```python
# Filename: bench.py
import time

import cupy as cp

from device_reduce import device_reduce


def reduce_cupy(input_arr, out_arr):
    return input_arr.sum(out=out_arr)


def reduce_cuda_parallel(input_arr, out_arr):
    zerod_shape = (0,)
    return device_reduce(input_arr, zerod_shape, zerod_shape, out=out_arr)


def bench(impl_func):
    # List of input sizes to benchmark
    input_sizes = [10, 100, 1000, 10000, 100000, 1000000, 10_000_000, 100_000_000]
    input_ = cp.random.rand(max(input_sizes))  # Random array covering the largest size
    out_ = cp.empty(tuple(), dtype=input_.dtype)
    # Results: size -> (avg time with first run, avg time without first run)
    results = {}
    n_reps = 100
    for size in input_sizes:
        times = []
        arr = input_[:size]
        # Run the benchmark 10 times for each input size
        for _ in range(10):
            # Record the start time using a host timer
            start_time = time.perf_counter()
            # Perform the sum operation n_reps times
            for _ in range(n_reps):
                result = impl_func(arr, out_)
            cp.cuda.runtime.deviceSynchronize()
            # Record the end time and compute the average time per call
            end_time = time.perf_counter()
            time_taken = (end_time - start_time) / n_reps
            times.append(time_taken)
        # Average time including the first (warm-up) run
        avg_time_with_first_run = sum(times) / len(times)
        # Average time excluding the first run
        avg_time_without_first_run = sum(times[1:]) / (len(times) - 1)
        results[size] = (avg_time_with_first_run, avg_time_without_first_run)
    # Print out the results
    print("Benchmark Results (input size, average time with first run, average time without first run):")
    for size, (avg_with_first, avg_without_first) in results.items():
        print(
            f"Input size: {size:10d} | Avg time with first run: {avg_with_first:.8f} seconds"
            f" | Avg time without first run: {avg_without_first:.8f} seconds"
        )


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        prog="bench_reduction",
        description="Utility to benchmark issue cccl#3213",
    )
    parser.add_argument("--cupy", action="store_true")
    parser.add_argument("--cuda_parallel", action="store_true")
    args = parser.parse_args()
    if bool(args.cupy) != bool(args.cuda_parallel):
        if args.cupy:
            print("Benching reduce_cupy")
            bench(reduce_cupy)
        else:
            print("Benching reduce_cuda_parallel")
            bench(reduce_cuda_parallel)
    else:
        raise RuntimeError(
            "Please specify what to bench by using the --cupy or --cuda_parallel argument"
        )
```

I confirm that CuPy's reduction is about 3.5x faster for short arrays than the one using `cuda.parallel`.
Using the following script:

```python
# Filename: lprof_reduce_into.py
import cupy as cp

from device_reduce import device_reduce


def foo(input_, out_):
    for _ in range(10000):
        result = device_reduce(input_, (0,), (0,), out=out_)
    cp.cuda.runtime.deviceSynchronize()


if __name__ == "__main__":
    input_ = cp.ones(100, dtype=cp.float64)
    out_ = cp.empty(tuple(), dtype=input_.dtype)
    # First call outside the profiled loop acts as a warm-up
    result = device_reduce(input_, (0,), (0,), out=out_)
    foo(input_, out_)
```

and decorating …

The …

And that function has its hot spots distributed between checking whether the input array is contiguous, processing the array data type, and constructing …

The check for …

Making changes to the implementation, I see that the call to …
Great investigation, Sasha! So this means we really should start considering integrating …
Thank you, Sasha and Leo! Definitely interested to see if and how much …
In an offline sync today, it also came up that the choice of the type used to specify the … It would be interesting to see if using …
In #2958, we note a constant delta between cuda.parallel and CuPy reduction for smaller input sizes. We should find and eliminate the reasons for that delta.
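A constant gap at small sizes is exactly what a fixed per-call host-side overhead predicts: total time is roughly `overhead + n / throughput`, so the overhead dominates for short arrays and washes out for long ones. The toy model below uses invented overhead and throughput numbers, chosen only to mimic a roughly 3.5x small-array gap; it is not measured data:

```python
# Toy latency model: t(n) = fixed_overhead + n / throughput.
# The overhead and throughput values are illustrative, not measured.
def model_time(n, overhead_s, throughput_elems_per_s=1e10):
    return overhead_s + n / throughput_elems_per_s


for n in (100, 1_000_000, 100_000_000):
    t_fast = model_time(n, overhead_s=5e-6)   # hypothetical lighter wrapper
    t_slow = model_time(n, overhead_s=17e-6)  # hypothetical heavier wrapper
    print(f"n={n:>11d}  slowdown={t_slow / t_fast:.2f}x")
```

With these made-up numbers the slowdown is about 3.4x at n=100 but essentially 1.0x at n=100,000,000, matching the qualitative shape of the benchmark results above: eliminating the delta means shrinking the fixed Python-side cost per call, not the kernel itself.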