You can see that, depending on the graphics card model, there are stride values to avoid - funnily enough, sometimes strides that are powers of two, which are fairly common numbers. If you ignore this, you can get into trouble and underutilize the memory channels available in hardware, in addition to increasing bank conflicts (so the hit is twofold). The fix is simply to pad the strides slightly to avoid these conflicts (the values to avoid are found in vendor/model documentation). This should be of use to any GPU implementation and likely some accelerators. It should not be done when the device type is CPU, or in code that does vector operations directly on memory where the hardware has alignment restrictions.
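For illustration, here is a minimal sketch of the padding idea (the helper name and the particular "bad stride" value are placeholders of mine - the real values to avoid come from the vendor's documentation):

def pad_pitch(width, itemsize, bad_stride=2048):
    # Packed row pitch in bytes.
    pitch = width * itemsize
    # If the pitch lands on an assumed conflicting stride (here: a multiple
    # of 2048 bytes), pad by one element to step off it.
    if pitch % bad_stride == 0:
        pitch += itemsize
    return pitch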
I tried to use such an array with pyopencl, but it doesn't work out very well for manipulation/printing/inspection:
def make_cl_image_buffer(queue, img_dims, img_dtype, linestep=None):
    img_dtype = dtype_of(img_dtype)
    if linestep is None:
        linestep = img_dims[1] * img_dtype.itemsize
        if queue.device.type == cl.device_type.GPU:
            # make sure to have an uneven line width for better memory channel
            # utilization, as per AMD recommendation
            if linestep % 2 == 0:
                linestep += 1
    buf = cl.Buffer(queue.context, cl.mem_flags.READ_WRITE, img_dims[0] * linestep)
    return buf, linestep

def make_cl_image(queue, img_dims, img_dtype, linestep=None):
    img_dtype = dtype_of(img_dtype)
    img_buf, linestep = make_cl_image_buffer(queue, img_dims, img_dtype, linestep)
    strides = (linestep, dtype_of(img_dtype).itemsize)
    img = clarray.Array(queue, img_dims, img_dtype, strides=strides, data=img_buf)
    return img
In [3]: img=com.make_cl_image(queue, (2044, 2044), np.uint32)
In [12]: img.shape
Out[12]: (2044, 2044)
In [13]: img.strides
Out[13]: (8177, 4)
/home/jason/tmp/pyopencl.git/pyopencl/array.py in get(self, queue, ary, async)
685 raise TypeError("'ary' has non-matching type")
686
--> 687 assert self.flags.forc, "Array in get() must be contiguous"
688
689 if self.size:
AssertionError: Array in get() must be contiguous
I searched around and found someone else curious about this last year - he also put some work into this: #54
I wouldn't mind adding some lines to the library to get good, no-surprises manipulation, but I was unsure of the best way to do it, since it seems everything relies on the element-wise kernels. I think the best/most compatible way of doing this is having the element-wise kernels take an additional stride parameter, which they use to calculate the offset of the element they process.
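Roughly, here is a sketch of the idea using a hand-written kernel (placeholder names, not the existing element-wise machinery): the kernel receives the row pitch in bytes as an extra argument and computes each element's offset from it instead of assuming a packed layout.

import numpy as np
import pyopencl as cl

_fill_pitched_src = """
__kernel void fill_pitched(__global uchar *buf,
                           uint rows, uint cols,
                           uint pitch,   /* row stride in bytes */
                           uint value)
{
    size_t gid = get_global_id(0);
    if (gid >= (size_t) rows * cols)
        return;
    size_t r = gid / cols;
    size_t c = gid % cols;
    /* offset computed from the pitch rather than from a packed layout;
       assumes the pitch keeps elements aligned to sizeof(uint) */
    __global uint *elem = (__global uint *) (buf + r * pitch + c * sizeof(uint));
    *elem = value;
}
"""

def fill_pitched(queue, buf, rows, cols, pitch, value):
    # Fill a pitched 2D uint32 buffer with a constant value.
    prg = cl.Program(queue.context, _fill_pitched_src).build()
    prg.fill_pitched(queue, (rows * cols,), None, buf,
                     np.uint32(rows), np.uint32(cols),
                     np.uint32(pitch), np.uint32(value))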
What do you think? Is it possible for you to add this? If you don't have the time, can you outline how to get it done?
ps. Tried bringing this up for discussion on the ML, got stuck in moderation.
loopy is a cool thing to have learned about - I see libraries like that pop up from academic labs in a few different contexts.
But what we're talking about on the kernel side probably means no code changes to anything that takes in pitches/strides in bytes - i.e., it already assumes an unpacked layout. If the stride is a bad value for certain GPUs, then one would want to set it to something >= the packed value, and this can greatly change memory bandwidth utilization / bank conflicts.
We just need a bit more support to make allocating those arrays not too painful - of course it's best if it works on everything that is currently supported. The most important things are (the host-copy side is sketched below):
- get matrix values
- set values from, or copy values to, a packed numpy array or another clarray
- set to a constant value
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472173
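For the get / set-from-packed items, here is a rough sketch of what seems doable today without touching the kernels at all, going through the rect-copy path (clEnqueueReadBufferRect / clEnqueueWriteBufferRect, OpenCL 1.1+) that pyopencl exposes via enqueue_copy - the helper names are placeholders:

import numpy as np
import pyopencl as cl

def set_from_packed(queue, img_buf, host_ary, linestep):
    # Write a packed (C-contiguous) numpy image into a pitched device buffer.
    rows, cols = host_ary.shape
    row_bytes = cols * host_ary.dtype.itemsize
    cl.enqueue_copy(queue, img_buf, host_ary,
                    buffer_origin=(0, 0, 0), host_origin=(0, 0, 0),
                    region=(row_bytes, rows, 1),
                    buffer_pitches=(linestep, 0),
                    host_pitches=(row_bytes, 0))

def get_to_packed(queue, img_buf, shape, dtype, linestep):
    # Read a pitched device buffer back into a freshly allocated packed array.
    rows, cols = shape
    out = np.empty(shape, dtype)
    row_bytes = cols * out.dtype.itemsize
    cl.enqueue_copy(queue, out, img_buf,
                    buffer_origin=(0, 0, 0), host_origin=(0, 0, 0),
                    region=(row_bytes, rows, 1),
                    buffer_pitches=(linestep, 0),
                    host_pitches=(row_bytes, 0),
                    is_blocking=True)
    return out

The constant fill would still need a small kernel, along the lines of the one sketched earlier in the thread.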