Kernel attributes #360
base: main
Conversation
/ok to test
I have a design question for any reviewers to weigh in on. There is another change in the works to add device properties to the `Device` class. The way I've implemented that is to have `device_instance.properties -> DeviceProperties`, where `DeviceProperties` lazily queries the properties and exposes them. In short, you would get a property like so:

```python
device = Device()
device.properties.property_a
```

The reason I put all of the properties in a subclass is that there are a lot of them, and adding them straight to `Device` would make it very bloated. The question is whether you think I should do the same thing here. Prior to making the device property change, I thought the current approach was the best way to implement it, but I am now leaning towards putting the attributes in a subclass so they would be accessed like:

```python
kernel.attributes.attribute_a = True
variable = kernel.attributes.attribute_b
```

One considerable difference is that all the device properties are read-only, while some of the kernel attributes are read/write. |
I really think this is the way to go! We definitely do not want to bloat the kernel/device instance when hitting tab. |
ok cool, I agree. Change made |
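A minimal sketch of the agreed pattern, assuming a hypothetical `KernelAttributes` helper; the `query` callable and the fixed value `1024` are stand-ins for the real driver attribute getter, not the PR's actual implementation:

```python
class KernelAttributes:
    """Lazily exposes kernel attributes without bloating the Kernel namespace."""

    def __init__(self, query):
        self._query = query  # stand-in for the driver attribute getter
        self._cache = {}

    @property
    def max_threads_per_block(self) -> int:
        # Query the backend once, then serve from the cache on later accesses.
        if "max_threads_per_block" not in self._cache:
            self._cache["max_threads_per_block"] = self._query("MAX_THREADS_PER_BLOCK")
        return self._cache["max_threads_per_block"]


class Kernel:
    @property
    def attributes(self) -> KernelAttributes:
        # Create the helper on first access so tab completion on Kernel stays clean.
        if not hasattr(self, "_attributes"):
            self._attributes = KernelAttributes(lambda name: 1024)  # fake value
        return self._attributes


k = Kernel()
print(k.attributes.max_threads_per_block)  # 1024
```

The helper keeps the attribute namespace off the `Kernel` instance itself, which was the main concern raised above.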
/ok to test
Updated the review to remove the setters on read/write properties, in line with the discussion about deadlock between properties and launch config, plus a couple of formatting improvements to the docs. |
/ok to test
```python
        This attribute is read-only."""
        return handle_return(
            driver.cuKernelGetAttribute(
                driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, self._handle, None
            )
        )
```
Q: Here and below, why is `device` `None`?
```python
    @property
    def shared_size_bytes(self) -> int:
        """int : The size in bytes of statically-allocated shared memory required by this function.

        This attribute is read-only."""
        return handle_return(
            driver.cuKernelGetAttribute(
                driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, self._handle, None
            )
        )
```
Could you do a little experiment for me: compare the average timing of getting an attribute via

- the current implementation, and
- a Python-layer cache, something like (can be improved):

```python
@property
def shared_size_bytes(self) -> int:
    out = self.cache.get("CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES")
    if out is None:
        out = handle_return(
            driver.cuKernelGetAttribute(
                driver.CUfunction_attribute.CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, self._handle, None
            )
        )
        self.cache["CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES"] = out
    return out
```

The idea is to check whether the C API call overhead is larger or smaller than the Python overhead.
```python
@pytest.fixture(scope="module")
def cuda_version():
    # WAR: this is a workaround for the fact that checking the version using
    # cuDriverGetVersion doesn't actually return the driver version, but rather
    # the latest CUDA version supported by the installed driver.
    version = handle_return(runtime.cudaRuntimeGetVersion())
    major_version = version // 1000
    minor_version = (version % 1000) // 10
    return major_version, minor_version
```
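The packed-integer decoding used by the fixture can be checked in isolation. A small sketch, where the sample values are illustrative rather than queried from any runtime:

```python
def decode_cuda_version(version: int) -> tuple[int, int]:
    # CUDA packs the version as major * 1000 + minor * 10, e.g. 12040 -> (12, 4).
    major = version // 1000
    minor = (version % 1000) // 10
    return major, minor

print(decode_cuda_version(12040))  # (12, 4)
print(decode_cuda_version(11080))  # (11, 8)
```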
This is not correct. In `cuda.bindings`, this refers to the CUDA Runtime version, which is static. (FWIW we also have `cudart.getLocalRuntimeVersion()`, but it's not usable here either; see below.)
```python
if cuda_version[0] < 12:
    pytest.skip("CUDA version is less than 12, and doesn't support kernel attribute access")
```
This is not working because `Kernel` encapsulates

- `CUkernel` starting with CUDA driver/bindings 12+, which needs `cuKernelGetAttribute` for the attribute getter, or
- `CUfunction` for CUDA 11, which needs `cuFuncGetAttribute` (notice the function signature is different).

I believe we can just reuse this check in the tests:

```python
self._backend_version = "new" if (_py_major_ver >= 12 and _driver_ver >= 12000) else "old"
```
(I feel we've done similar things for either `Program` or `Linker`? Something along the same lines.)
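A hedged sketch of the suggested dispatch, assuming the two stand-in getters below in place of the real `cuKernelGetAttribute` (CUDA 12+) and `cuFuncGetAttribute` (CUDA 11) calls, whose signatures differ in the actual bindings:

```python
def new_backend_get(attrib, handle, device):
    # Stand-in for cuKernelGetAttribute: takes a device argument.
    return ("cuKernelGetAttribute", attrib, device)

def old_backend_get(attrib, handle):
    # Stand-in for cuFuncGetAttribute: no device argument.
    return ("cuFuncGetAttribute", attrib)


class Kernel:
    def __init__(self, py_major_ver, driver_ver, handle=None):
        self._handle = handle
        # Same check as suggested above for picking the backend.
        self._backend_version = "new" if (py_major_ver >= 12 and driver_ver >= 12000) else "old"

    def get_attribute(self, attrib, device=None):
        # Route to the backend matching the encapsulated handle type.
        if self._backend_version == "new":
            return new_backend_get(attrib, self._handle, device)
        return old_backend_get(attrib, self._handle)


print(Kernel(12, 12040).get_attribute("MAX_THREADS_PER_BLOCK")[0])  # cuKernelGetAttribute
print(Kernel(11, 11080).get_attribute("MAX_THREADS_PER_BLOCK")[0])  # cuFuncGetAttribute
```

The same branch would let the tests choose which attribute path to exercise instead of skipping on CUDA 11.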
Adds getters and setters for the kernel attributes.
Closes #205