-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-device C/C++ sample #4
Conversation
f9818a5
to
92d35c5
Compare
FYI: the branch has conflicts because it's based on KhronosGroup/main instead of StreamHPC/main. |
Fixed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the implemented example a good draft, however still a bit undirected, maybe as a result of multiple people working on it. The amount of required boilerplate and accompanying code (e.g. the ubiquitous padding logic) is quite large compared to the showcased OpenCL functionality.
I think this sample would be a fine opportunity to show more interesting OpenCL features, such as device partitioning and/or sub-buffers.
Let's discuss the next steps!
samples/core/multi-device/main.cpp
Outdated
void host_convolution(std::vector<cl_float> in, std::vector<cl_float>& out, | ||
std::vector<cl_float> mask, size_t x_dim, size_t y_dim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention is to pass potentially expensive-to-copy arguments by (const) reference.
samples/core/multi-device/main.cpp
Outdated
cl::Context context2 = | ||
cl::sdk::get_context(triplets.at((triplets.size() >= 2))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implicit bool to int conversion seems to be quite unconventional. Can we have an explicit phrasing of the same logic here, e.g. a ternary operator?
samples/core/multi-device/main.cpp
Outdated
// Query device and runtime capabilities. | ||
auto d1_highest_device_opencl_c_is_2_x = | ||
cl::util::opencl_c_version_contains(dev1, "2."); | ||
auto d1_highest_device_opencl_c_is_3_x = | ||
cl::util::opencl_c_version_contains(dev1, "3."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole setup should be implemented in a for
loop. Especially that the current implementation is buggy: we use the same -cl-std
queried from the first device for both devices.
## Key APIs and Concepts | ||
The main idea behind this example is that a given kernel can be run simultaneously by two (or potentially more) devices, therefore reducing its execution time. One can essentially think of two strategies for this workflow: | ||
1. each device computes its proportional part of the solution at its own speed and the results are combined on the host's side when finished, and | ||
2. each device executes the kernel at its own speed but after each iteration there is P2P communication between the devices to share the partial results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is only possible, when the two devices are on the same context (if the copy is done via Buffers). I.e. they must be on the same platform as well.
samples/core/multi-device/main.cpp
Outdated
// Check that the WGSs can divide the global size (MacOS reports | ||
// CL_INVALID_WORK_GROUP_SIZE otherwise). If WGS is smaller than the x | ||
// dimension, then a NULL pointer will be used when initialising | ||
// cl::EnqueueArgs for enqueuing the kernels. | ||
if (pad_x_dim % wgs1 && pad_x_dim > wgs1) | ||
{ | ||
size_t div = pad_x_dim / wgs1; | ||
wgs1 = sqrt(div * wgs1); | ||
} | ||
|
||
if (pad_x_dim % wgs2 && pad_x_dim > wgs2) | ||
{ | ||
size_t div = pad_x_dim / wgs2; | ||
wgs2 = sqrt(div * wgs2); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm yet to understand why we go through all hassle with the local size, when the kernel itself doesn't use any work-group related function, nor local memory. Could you please give an input on that?
samples/core/multi-device/main.cpp
Outdated
// Fill with 0s the extra rows and columns added for padding. | ||
for (size_t j = 0; j < pad_x_dim; ++j) | ||
{ | ||
for (size_t i = 0; i < pad_y_dim; ++i) | ||
{ | ||
if (i == 0 || j == 0 || i == (pad_y_dim - 1) | ||
|| j == (pad_x_dim - 1)) | ||
{ | ||
h_input_grid[j + i * pad_x_dim] = 0; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the ubiquitous padding logic, the logic could be simplified by using Image2D
s for input and output. The read operation could be performed with a padded sampler. Obviously the padding still would be implemented in the host convolution, but it could be done on-the-fly, without regenerating the whole input array.
} | ||
} | ||
|
||
int main(int argc, char* argv[]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you think of a way to break up this very long function into meaningful subroutines? E.g. the generation of the input data, the command line parsing, the setup of the kernels, or the verification of the results could be such. This would help readers to follow the structure of the program, without going too deep into every section's details.
26c7bcc
to
2d62412
Compare
87a4154
to
419fe40
Compare
* Add BUILD_UTILITY_LIBRARIES option * Add whereami dependence * Add exe relative utilities * Update samples to use exe relative utilities * Improve diagnostic on missing file * Add missing default argument for error param * Add docs on file utilities * Add EOL * Fix typo Co-authored-by: Ronan Keryell <[email protected]> * Simplify byte size calculation Co-authored-by: Ronan Keryell <[email protected]> * Fix typo Co-authored-by: Ben Ashbaugh <[email protected]> * Fix formatting * Remove implicit narrowing conversions * No unnamed type on libSDK surface * warning: enumeration value x not handled in switch --------- Co-authored-by: Ronan Keryell <[email protected]> Co-authored-by: Ben Ashbaugh <[email protected]>
11c0765
to
b818605
Compare
* Implemented callback sample * Minor fixes from code review * Minor fixes from code review II.
b818605
to
b33c0e4
Compare
b33c0e4
to
0babaf4
Compare
0babaf4
to
b6475e4
Compare
Commits from DEB packaging and CI updates have been left out. Updated CI passing can be checked here.