
Multi-device C/C++ sample #4

Closed
wants to merge 4 commits into from

Conversation

Beanavil (Collaborator) commented:
Commits from DEB packaging and CI updates have been left out. The updated, passing CI can be checked here.

@Beanavil Beanavil force-pushed the multi-device-sample branch 3 times, most recently from f9818a5 to 92d35c5 Compare September 26, 2023 11:26
Beanavil (Collaborator, Author) commented:

FYI: the branch has conflicts because it's based on KhronosGroup/main instead of StreamHPC/main.

mfep commented Sep 27, 2023:

> FYI: the branch has conflicts because it's based on KhronosGroup/main instead of StreamHPC/main.

Fixed!

mfep left a comment:

I find the implemented example a good draft; however, it is still a bit undirected, perhaps as a result of multiple people working on it. The amount of required boilerplate and accompanying code (e.g. the ubiquitous padding logic) is quite large compared to the showcased OpenCL functionality.

I think this sample would be a fine opportunity to show more interesting OpenCL features, such as device partitioning and/or sub-buffers.

Let's discuss the next steps!

Comment on lines 137 to 138
void host_convolution(std::vector<cl_float> in, std::vector<cl_float>& out,
std::vector<cl_float> mask, size_t x_dim, size_t y_dim)

The convention is to pass potentially expensive-to-copy arguments by (const) reference.
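A minimal sketch of the suggested convention, with the expensive-to-copy inputs (`in`, `mask`) taken by const reference and only the output by mutable reference. The 3x3 convolution body here is illustrative, not the sample's actual implementation:

```cpp
#include <cstddef>
#include <vector>
using cl_float = float; // stand-in so the sketch compiles without OpenCL headers

// Read-only, expensive-to-copy arguments by const reference; output by
// mutable reference. Out-of-bounds neighbours read as zero (illustrative).
void host_convolution(const std::vector<cl_float>& in,
                      std::vector<cl_float>& out,
                      const std::vector<cl_float>& mask,
                      std::size_t x_dim, std::size_t y_dim)
{
    for (std::size_t y = 0; y < y_dim; ++y)
        for (std::size_t x = 0; x < x_dim; ++x)
        {
            cl_float acc = 0.0f;
            for (int my = -1; my <= 1; ++my)
                for (int mx = -1; mx <= 1; ++mx)
                {
                    const long sy = (long)y + my, sx = (long)x + mx;
                    if (sy < 0 || sx < 0 || sy >= (long)y_dim
                        || sx >= (long)x_dim)
                        continue; // out-of-bounds reads contribute 0
                    acc += in[(std::size_t)sy * x_dim + (std::size_t)sx]
                        * mask[(std::size_t)(my + 1) * 3 + (std::size_t)(mx + 1)];
                }
            out[y * x_dim + x] = acc;
        }
}
```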

Comment on lines 193 to 194
cl::Context context2 =
cl::sdk::get_context(triplets.at((triplets.size() >= 2)));
This implicit bool-to-int conversion seems quite unconventional. Can we have an explicit phrasing of the same logic here, e.g. a ternary operator?
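The explicit phrasing could look like the following sketch, where the index selection is factored out so the ternary reads clearly (the helper name is hypothetical, not the sample's code):

```cpp
#include <cstddef>

// Explicit version of `triplets.at(triplets.size() >= 2)`: use the second
// entry when at least two triplets are available, otherwise fall back to
// the first. The ternary makes the bool-to-index mapping visible.
std::size_t second_context_triplet_index(std::size_t triplet_count)
{
    return triplet_count >= 2 ? 1 : 0;
}
```

At the call site this would read `cl::sdk::get_context(triplets.at(second_context_triplet_index(triplets.size())))`.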

Comment on lines 216 to 220
// Query device and runtime capabilities.
auto d1_highest_device_opencl_c_is_2_x =
cl::util::opencl_c_version_contains(dev1, "2.");
auto d1_highest_device_opencl_c_is_3_x =
cl::util::opencl_c_version_contains(dev1, "3.");
The whole setup should be implemented in a for loop, especially since the current implementation is buggy: we use the same -cl-std queried from the first device for both devices.
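A sketch of the suggested refactor: derive the build option from each device's own OpenCL C version inside a loop, instead of duplicating the query and accidentally reusing device 1's -cl-std for device 2. `DeviceInfo` and the version strings are plain stand-ins for the real `cl::Device` and `cl::util::opencl_c_version_contains` queries:

```cpp
#include <string>
#include <vector>

// Stand-in for the queried device properties (hypothetical, for illustration).
struct DeviceInfo { std::string opencl_c_version; };

// One pass per device: each device gets a -cl-std matching ITS version.
std::vector<std::string>
build_options_per_device(const std::vector<DeviceInfo>& devices)
{
    std::vector<std::string> opts;
    for (const auto& dev : devices)
    {
        if (dev.opencl_c_version.find("3.") != std::string::npos)
            opts.push_back("-cl-std=CL3.0");
        else if (dev.opencl_c_version.find("2.") != std::string::npos)
            opts.push_back("-cl-std=CL2.0");
        else
            opts.push_back(""); // OpenCL C 1.x: use the default
    }
    return opts;
}
```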

## Key APIs and Concepts
The main idea behind this example is that a given kernel can be run simultaneously by two (or potentially more) devices, therefore reducing its execution time. One can essentially think of two strategies for this workflow:
1. each device computes its proportional part of the solution at its own speed and the results are combined on the host's side when finished, and
2. each device executes the kernel at its own speed but after each iteration there is P2P communication between the devices to share the partial results.
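Strategy 1 above can be sketched as a proportional split of the global work size, where each device is assigned a contiguous share and the host combines the results afterwards. The speed ratio here is a hypothetical input; real code would derive it from measurement or device queries:

```cpp
#include <cstddef>
#include <utility>

// Split `global_size` work-items between two devices according to the
// (assumed) fraction assigned to device 1. Device 1 covers [0, first),
// device 2 covers [first, global_size).
std::pair<std::size_t, std::size_t> split_work(std::size_t global_size,
                                               double dev1_share)
{
    const std::size_t first =
        static_cast<std::size_t>(global_size * dev1_share);
    return {first, global_size - first};
}
```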
I think it is only possible when the two devices are in the same context (if the copy is done via Buffers), i.e. they must be on the same platform as well.

Comment on lines 276 to 259
// Check that the WGSs can divide the global size (MacOS reports
// CL_INVALID_WORK_GROUP_SIZE otherwise). If WGS is smaller than the x
// dimension, then a NULL pointer will be used when initialising
// cl::EnqueueArgs for enqueuing the kernels.
if (pad_x_dim % wgs1 && pad_x_dim > wgs1)
{
size_t div = pad_x_dim / wgs1;
wgs1 = sqrt(div * wgs1);
}

if (pad_x_dim % wgs2 && pad_x_dim > wgs2)
{
size_t div = pad_x_dim / wgs2;
wgs2 = sqrt(div * wgs2);
}
I've yet to understand why we go through all the hassle with the local size when the kernel itself doesn't use any work-group-related function, nor local memory. Could you please give some input on that?
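One hypothetical alternative to the sqrt-based adjustment above, assuming the kernel can simply return early for out-of-range items: round the global size up to the next multiple of the queried work-group size, so the NDRange always divides evenly and CL_INVALID_WORK_GROUP_SIZE cannot occur. This is an illustration of the question, not the sample's code:

```cpp
#include <cstddef>

// Round `global` up to the next multiple of `wgs`, so global % wgs == 0.
// Extra work-items (beyond the original global size) would be masked off
// with an early return in the kernel.
std::size_t round_up_to_multiple(std::size_t global, std::size_t wgs)
{
    return ((global + wgs - 1) / wgs) * wgs;
}
```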

Comment on lines 309 to 289
// Fill with 0s the extra rows and columns added for padding.
for (size_t j = 0; j < pad_x_dim; ++j)
{
for (size_t i = 0; i < pad_y_dim; ++i)
{
if (i == 0 || j == 0 || i == (pad_y_dim - 1)
|| j == (pad_x_dim - 1))
{
h_input_grid[j + i * pad_x_dim] = 0;
}
}
}
Instead of the ubiquitous padding logic, the code could be simplified by using Image2Ds for input and output; the read operation could then be performed with a padded sampler. The padding would obviously still be implemented in the host convolution, but it could be done on the fly, without regenerating the whole input array.
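The "pad on the fly" idea on the host side can be sketched as a bounds-checked read that returns 0 outside the grid, mimicking a clamp-to-zero-border sampler, so no separately padded copy of the input is needed (the function name is a hypothetical suggestion):

```cpp
#include <cstddef>
#include <vector>

// Bounds-checked read: coordinates outside [0, x_dim) x [0, y_dim) read as
// zero, like the padded border rows/columns, but computed on the fly.
float read_padded(const std::vector<float>& grid, long x, long y,
                  std::size_t x_dim, std::size_t y_dim)
{
    if (x < 0 || y < 0 || x >= (long)x_dim || y >= (long)y_dim)
        return 0.0f; // border reads as zero
    return grid[(std::size_t)y * x_dim + (std::size_t)x];
}
```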

}
}

int main(int argc, char* argv[])
Can you think of a way to break up this very long function into meaningful subroutines? E.g. the generation of the input data, the command line parsing, the setup of the kernels, or the verification of the results could be such. This would help readers to follow the structure of the program, without going too deep into every section's details.
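A skeleton of that decomposition might look like the following, with each phase of the long main() pulled into a named subroutine. All names, defaults, and bodies here are hypothetical suggestions, not the sample's actual functions:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical option struct filled by command-line parsing.
struct Options { std::size_t x_dim = 64, y_dim = 64; };

// Phase 1: command-line parsing.
Options parse_cli(int argc, char** argv)
{
    Options opts;
    if (argc > 1) opts.x_dim = std::stoul(argv[1]);
    if (argc > 2) opts.y_dim = std::stoul(argv[2]);
    return opts;
}

// Phase 2: generation of the input data.
std::vector<float> generate_input(const Options& opts)
{
    return std::vector<float>(opts.x_dim * opts.y_dim, 1.0f);
}

// Phase 4: verification of the results.
bool verify(const std::vector<float>& device_result,
            const std::vector<float>& host_result)
{
    return device_result == host_result;
}

// main() reduces to a readable sequence of phases.
int run(int argc, char** argv)
{
    const Options opts = parse_cli(argc, argv);
    const auto input = generate_input(opts);
    // Phase 3: setup_kernels(...), enqueue work, read back (elided here).
    return verify(input, input) ? 0 : 1;
}
```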

@Beanavil Beanavil force-pushed the multi-device-sample branch 2 times, most recently from 26c7bcc to 2d62412 Compare October 27, 2023 12:50
@Beanavil Beanavil changed the title Multi-device sample Multi-device C/C++sample Oct 27, 2023
@Beanavil Beanavil changed the title Multi-device C/C++sample Multi-device C/C++ sample Oct 27, 2023
@Beanavil Beanavil force-pushed the multi-device-sample branch 5 times, most recently from 87a4154 to 419fe40 Compare October 30, 2023 11:09
* Add BUILD_UTILITY_LIBRARIES option

* Add whereami dependence

* Add exe relative utilities

* Update samples to use exe relative utilities

* Improve diagnostic on missing file

* Add missing default argument for error param

* Add docs on file utilities

* Add EOL

* Fix typo

Co-authored-by: Ronan Keryell <[email protected]>

* Simplify byte size calculation

Co-authored-by: Ronan Keryell <[email protected]>

* Fix typo

Co-authored-by: Ben Ashbaugh <[email protected]>

* Fix formatting

* Remove implicit narrowing conversions

* No unnamed type on libSDK surface

* warning: enumeration value x not handled in switch

---------

Co-authored-by: Ronan Keryell <[email protected]>
Co-authored-by: Ben Ashbaugh <[email protected]>
* Implemented callback sample

* Minor fixes from code review

* Minor fixes from code review II.
4 participants