
next #19

Merged
merged 7 commits into from
Aug 3, 2024

Conversation

@PhilipDeegan (Member) commented Feb 19, 2024

closes #18

Summary by CodeRabbit

  • New Features

    • Added new device functions for indexing and memory management to improve GPU operations.
    • Introduced a StreamEvent struct for effective management of CUDA events.
    • Implemented enhanced memory management functionalities, including cooperative kernel launch support and memory heap size limits.
    • Added functionality for managing stream events and improving synchronization capabilities across GPU operations.
    • Established a framework for managing concurrent execution of GPU and host functions.
  • Improvements

    • Enhanced structure for managing concurrent execution of GPU functions through a new StreamFunction hierarchy.
    • Improved performance through additional template parameters, allowing for better configurability in launch scenarios.

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 8b44c8c and 0482420.
Files ignored due to path filters (2)
  • mkn.yaml is excluded by: !**/*.yaml
  • res/mkn/hipcc.yaml is excluded by: !**/*.yaml
Files selected for processing (9)
  • inc/mkn/gpu.hpp (1 hunks)
  • inc/mkn/gpu/alloc.hpp (1 hunks)
  • inc/mkn/gpu/cuda.hpp (2 hunks)
  • inc/mkn/gpu/def.hpp (1 hunks)
  • inc/mkn/gpu/defines.hpp (1 hunks)
  • inc/mkn/gpu/device.hpp (1 hunks)
  • inc/mkn/gpu/rocm.hpp (3 hunks)
  • test/any/add.cpp (1 hunks)
  • test/any/managed.cpp (1 hunks)
Additional comments: 15
inc/mkn/gpu/def.hpp (1)
  • 5-10: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [6-17]

The conditional compilation logic to extend is_floating_point_v for non-CPU architectures by including _Float16 is a thoughtful addition. It aligns with the PR's objectives of enhancing compatibility across different GPU architectures. However, ensure that _Float16 is supported in all targeted environments to avoid potential compilation issues.

inc/mkn/gpu/defines.hpp (2)
  • 8-40: The use of __has_include for conditionally defining MKN_GPU_ROCM and MKN_GPU_CUDA and the subsequent logic to handle these definitions is well-implemented. It's a clean approach to adapt the codebase to different GPU environments. Consider adding comments to explain the logic behind these macros for future maintainability.
  • 44-48: The CompileFlags struct is a neat addition for encapsulating the compile-time GPU environment flags. This can be very useful for runtime checks and decisions based on the compilation environment. Good job on this design choice.
test/any/managed.cpp (1)
  • 24-24: Adding an assertion to check if a GPU pointer is a managed pointer before launching the kernel is a good practice for ensuring memory management correctness. This aligns with the PR's objectives of enhancing robustness. Make sure to thoroughly test this change to confirm it doesn't inadvertently restrict valid use cases.
inc/mkn/gpu.hpp (2)
  • 34-34: Replacing conditional compilation directives with the direct inclusion of defines.hpp is a clean approach to centralize the logic for determining the GPU environment. This change enhances code readability and maintainability.
  • 39-50: Simplifying the idx() function with conditional checks based on macros for different GPU types is a good improvement. It makes the function more readable and adaptable to different environments. Consider enhancing the error message for unknown GPU environments to guide developers more effectively.
test/any/add.cpp (3)
  • 34-35: Adding assertions to check if a pointer is a host pointer enhances the robustness of memory management validation. This is a good practice for catching potential issues early.
  • 38-39: Similarly, adding an assertion to check if a pointer is a device pointer before operations is a prudent measure for ensuring memory management correctness.
  • 43-43: Commenting out the assertion related to pointer data might indicate an unresolved issue or a change in testing strategy. Please clarify the reason for this change to ensure it aligns with the intended testing objectives.
inc/mkn/gpu/alloc.hpp (1)
  • 75-86: The modifications in the copy function, including the direct initialization of src_p and dst_p and the correction in the argument order for the take function, improve the clarity and correctness of the memory management operations. These changes are commendable for enhancing the code's readability and functionality.
inc/mkn/gpu/cuda.hpp (2)
  • 42-42: Reordering the include statements to place #include <cuda_runtime.h> below local includes is a subtle change that might have implications for compilation order or symbol resolution. It's generally a good practice to include system or library headers after local headers to avoid potential conflicts.
  • 56-67: Adding the idx() template function within the mkn::gpu::cuda namespace is a valuable enhancement that simplifies the calculation of thread indices in CUDA kernels. This function is likely to be widely used and appreciated for its utility. Consider adding comments to explain its usage and the rationale behind its implementation.
inc/mkn/gpu/rocm.hpp (3)
  • 40-40: Reordering of include statements is generally a good practice for clarity and maintainability, ensuring that related headers are grouped together. This change appears to be aligned with the PR's objectives of simplifying and organizing the codebase.
  • 55-68: The addition of the mkn::gpu::hip namespace and the idx() template function is a significant enhancement. It provides a structured way to calculate thread indices in HIP kernels, which is crucial for GPU programming. This addition aligns with the PR's goal of improving device interaction mechanisms. However, it's essential to ensure that the logic for calculating indices (x + (y * width) + (z * width * height)) correctly matches the intended use cases and doesn't introduce any off-by-one errors or other logical issues.
  • 95-116: The modifications in the Pointer struct, including additional checks and assignment of type, enhance robustness by ensuring that the pointer attributes are correctly initialized and by providing methods to query the type of memory (e.g., host, device, managed). These changes are in line with the PR's objectives of refining memory management practices. However, it's crucial to verify that the assert(t) statement in the constructor (line 96) aligns with the project's error handling strategy, as assertions may be disabled in release builds, potentially leading to unchecked null pointers.

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 0482420 and ffd31c5.
Files selected for processing (1)
  • .github/workflows/build.yml (1 hunks)
Additional comments: 4
.github/workflows/build.yml (4)
  • 10-10: Renaming the build job to cpu clarifies the job's purpose, aligning with the addition of cuda and rocm jobs for specific GPU environments.
  • 13-13: Updating the actions/checkout version to v3 is a good practice, ensuring the latest features and fixes are utilized.
  • 20-30: The addition of a cuda job with a specific NVIDIA CUDA container is a significant improvement for testing in CUDA environments. However, ensure that the CUDA version (12.3.0) and Ubuntu version (22.04) specified in the container tag are compatible with the project's requirements and the ROCm 6.0.2 environment.
  • 31-40: Similarly, the addition of a rocm job with a ROCm container is crucial for testing in ROCm environments. It's important to verify that the rocm/rocm-terminal:latest container is compatible with ROCm 6.0.2 and meets the project's specific requirements.

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between ffd31c5 and 3c8e53f.
Files selected for processing (1)
  • .github/workflows/build.yml (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/build.yml

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 0482420 and ebdf2bc.
Files selected for processing (4)
  • .github/workflows/build.yml (1 hunks)
  • inc/mkn/gpu/cpu.hpp (2 hunks)
  • inc/mkn/gpu/cuda.hpp (5 hunks)
  • inc/mkn/gpu/rocm.hpp (6 hunks)
Files skipped from review as they are similar to previous changes (3)
  • .github/workflows/build.yml
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp
Additional comments: 2
inc/mkn/gpu/cpu.hpp (2)
  • 132-132: The destroy function is now declared as inline. This is a good practice for small, frequently called functions that are defined in header files, as it can help reduce function call overhead and potentially improve performance by allowing the compiler to expand the function inline where it is called. However, ensure that this change does not lead to code bloat if the function is used extensively throughout the codebase.
  • 180-180: Similarly, declaring the sync function as inline is appropriate given its trivial implementation. This change is consistent with the goal of optimizing performance and reducing overhead in GPU-related operations. As with the destroy function, it's important to monitor for any negative impacts such as increased binary size, but given the simplicity of this function, such concerns are likely minimal.

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 0482420 and ee61695.
Files selected for processing (4)
  • .github/workflows/build.yml (1 hunks)
  • inc/mkn/gpu/cpu.hpp (2 hunks)
  • inc/mkn/gpu/cuda.hpp (5 hunks)
  • inc/mkn/gpu/rocm.hpp (6 hunks)
Files skipped from review as they are similar to previous changes (4)
  • .github/workflows/build.yml
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp

@PhilipDeegan force-pushed the doh branch 2 times, most recently from 5c7b28d to dde2116 on February 22, 2024 at 22:15
@coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 0482420 and 721500a.
Files selected for processing (4)
  • .github/workflows/build.yml (3 hunks)
  • inc/mkn/gpu/cpu.hpp (2 hunks)
  • inc/mkn/gpu/cuda.hpp (5 hunks)
  • inc/mkn/gpu/rocm.hpp (6 hunks)
Files skipped from review as they are similar to previous changes (3)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp
Additional comments: 3
.github/workflows/build.yml (3)
  • 1-1: The workflow name has been updated to ubuntu, which is more specific than the previous ubuntu-latest. This change is clear and straightforward.
  • 11-11: Updating the runs-on configuration to ubuntu-20.04 from ubuntu-latest ensures more predictable and stable build environments by using a fixed version of Ubuntu. This is a good practice to avoid unexpected issues when ubuntu-latest moves to a newer version.
  • 21-26: The addition of a job using Jimver/[email protected] for CUDA version 12.3.2 is a significant enhancement. It ensures that the project is tested against a specific CUDA version, which is crucial for compatibility and stability. However, it's important to ensure that the version of CUDA chosen is compatible with the project's requirements and other dependencies. Additionally, the build command for CUDA is consistent with the project's standards, using -std=c++17 and other flags. This step enhances the CI pipeline by ensuring that CUDA builds are explicitly tested.

@coderabbitai bot left a comment

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 0482420 and b9fe569.
Files selected for processing (4)
  • .github/workflows/build.yml (2 hunks)
  • inc/mkn/gpu/cpu.hpp (2 hunks)
  • inc/mkn/gpu/cuda.hpp (5 hunks)
  • inc/mkn/gpu/rocm.hpp (6 hunks)
Files skipped from review as they are similar to previous changes (3)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp
Additional comments: 2
.github/workflows/build.yml (2)
  • 11-11: The update from ubuntu-latest to ubuntu-20.04 is a good practice for ensuring a stable and predictable build environment by using a fixed version of Ubuntu. This change helps avoid potential issues that could arise from unexpected changes in the ubuntu-latest runner.
  • 20-25: The addition of a job using Jimver/[email protected] for CUDA version 12.3.2 is a significant improvement for ensuring compatibility with specific CUDA versions. This step allows for testing against a precise version of CUDA, enhancing the reliability of the build process for CUDA-dependent components. However, ensure that the version of the CUDA toolkit aligns with the project's requirements and that any potential compatibility issues with other dependencies are addressed.

.github/workflows/build.yml: outdated review comment (resolved)
@coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 0482420 and 351d210.
Files selected for processing (4)
  • .github/workflows/build.yml (2 hunks)
  • inc/mkn/gpu/cpu.hpp (2 hunks)
  • inc/mkn/gpu/cuda.hpp (6 hunks)
  • inc/mkn/gpu/rocm.hpp (6 hunks)
Files skipped from review as they are similar to previous changes (3)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp
Additional comments: 1
.github/workflows/build.yml (1)
  • 11-11: Updating the runs-on configuration to ubuntu-20.04 is a good practice to ensure the workflow runs on a specific, stable version of Ubuntu, rather than relying on the moving target of ubuntu-latest. This change enhances predictability and stability of the CI environment.

.github/workflows/build.yml: outdated review comment (resolved)
@coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 351d210 and b556a40.
Files selected for processing (1)
  • .github/workflows/build.yml (2 hunks)
Additional comments: 2
.github/workflows/build.yml (2)
  • 11-11: Updating the runs-on configuration to ubuntu-20.04 is a good practice for ensuring consistency and reliability in the build environment. This change helps avoid potential issues that could arise from using the ubuntu-latest tag, which automatically updates to the latest version.
  • 20-25: The commented-out steps for building with CUDA configuration using Jimver/[email protected] and specifying CUDA version 12.3.2 are prepared for future use. When uncommenting and implementing these steps, ensure that the CUDA toolkit action is from a trusted source and that the specified CUDA version aligns with the project's requirements. Additionally, consider adding error handling for the build steps to gracefully handle any failures.
Verification successful

The corrected script successfully fetched the necessary information about the Jimver/cuda-toolkit GitHub action. The repository was created on March 10, 2021, and has been actively maintained, with the last update on February 10, 2024, and the most recent push on February 25, 2024. It has garnered 121 stars and 38 forks, indicating a decent level of community interest and engagement.

Based on this information, the Jimver/cuda-toolkit action appears to be from a trusted source and is actively maintained, aligning with the review comment's advice to ensure the CUDA toolkit action is from a trusted source. The repository's activity and community interest further support its trustworthiness and relevance for future use in the project.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the trustworthiness and maintenance status of the Jimver/cuda-toolkit action
gh repo view Jimver/cuda-toolkit --json owner,createdAt,updatedAt,pushedAt,stargazersCount,forksCount

Length of output: 1223



Script:

#!/bin/bash
# Corrected script to verify the trustworthiness and maintenance status of the Jimver/cuda-toolkit action
gh repo view Jimver/cuda-toolkit --json owner,createdAt,updatedAt,pushedAt,stargazerCount,forkCount

Length of output: 298

.github/workflows/build.yml: outdated review comments (resolved)
@coderabbitai bot left a comment

Review Status

Actionable comments generated: 12

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between b556a40 and 576fa7b.
Files selected for processing (5)
  • inc/mkn/gpu/cpu.hpp (4 hunks)
  • inc/mkn/gpu/cuda.hpp (7 hunks)
  • inc/mkn/gpu/launchers.hpp (2 hunks)
  • inc/mkn/gpu/rocm.hpp (6 hunks)
  • test/any/async_streaming.cpp (1 hunks)
Additional comments: 9
test/any/async_streaming.cpp (1)
  • 42-43: The usage of the stream function within StreamDeviceFunction demonstrates a practical application of launching kernels on specific streams. This is a good example of how the new functionality can be utilized in real-world scenarios. However, ensure that the lambda capture [=, fn = fn] correctly captures the necessary variables and does not lead to unintended copies or memory issues. Additionally, consider adding comments to explain the lambda's purpose and how it interacts with the stream function.
inc/mkn/gpu/cpu.hpp (2)
  • 145-145: Changing the destroy function to be inline is a minor optimization that could improve performance in some cases. However, ensure that this change is consistent with the project's coding standards and performance goals. Additionally, consider whether other utility functions in this file and related files should also be inlined for consistency and performance optimization.
  • 193-193: The inline specification for the sync function is another minor optimization. Given that this function is a no-op in the CPU context, the impact on performance might be negligible. Still, it's good practice to maintain consistency in the treatment of utility functions across the project. Ensure that the inline specification aligns with the project's coding standards and performance optimization strategies.
inc/mkn/gpu/cuda.hpp (3)
  • 42-42: The reordering of includes and the addition of cuda_runtime.h are necessary changes to ensure that CUDA functionalities are accessible and correctly prioritized over other includes. This adjustment helps avoid potential conflicts and ensures that the CUDA API is available for subsequent code. It's good practice to keep system and third-party includes well-organized and separated from project-specific includes.
  • 149-149: Changing the destroy function to be inline is a minor optimization that aligns with the changes made in other parts of the project. This consistency in handling utility functions is beneficial. However, ensure that the CUDA API calls within these functions are correctly error-checked to prevent silent failures.
  • 206-207: The addition of sync functions with and without the stream parameter enhances the flexibility of synchronization operations in CUDA contexts. These functions are essential for ensuring that GPU operations are completed before proceeding with CPU operations. Ensure that these functions are used appropriately throughout the project to manage synchronization effectively.
inc/mkn/gpu/rocm.hpp (3)
  • 113-130: Modifications to the Pointer struct to include HIP pointer attributes and memory type checks are essential for managing memory in HIP contexts. These changes improve the robustness and flexibility of memory management. Ensure that the implementation correctly handles various memory types (host, device, managed) and that it's consistent with HIP's memory management paradigms.
  • 161-161: Changing the destroy function to be inline is consistent with similar changes in other parts of the project. This optimization might have a minor impact but contributes to consistency. Ensure that HIP API calls within these functions are correctly error-checked to prevent silent failures and resource leaks.
  • 217-218: The addition of sync functions with and without the stream parameter in the HIP context is a valuable enhancement, providing more control over synchronization operations. These functions are crucial for ensuring that GPU operations are completed before proceeding with CPU operations. Ensure that these functions are appropriately utilized throughout the project to manage synchronization effectively.

inc/mkn/gpu/launchers.hpp: outdated review comments (resolved)
Comment on lines 96 to 117
struct StreamEvent {
StreamEvent(Stream&) {}
~StreamEvent() {}

auto& operator()() { return event; };
void record() { ; }
bool finished() const { return true; }
void reset() {}

Stream stream;
std::size_t event = 0;
};

The introduction of the StreamEvent struct in the CPU context is an interesting addition, but it's essential to clarify its purpose and functionality, especially since the operations within the struct (e.g., record, finished, reset) are no-ops. If this struct is meant to provide a uniform interface across CPU and GPU backends, consider adding documentation to explain its intended use and the rationale behind the no-op implementations.

inc/mkn/gpu/cpu.hpp: outdated review comments (resolved)
inc/mkn/gpu/cuda.hpp: outdated review comments (resolved)
Comment on lines +55 to +68
namespace mkn::gpu::hip {

template <typename SIZE = uint32_t /*max 4294967296*/>
__device__ SIZE idx() {
SIZE width = hipGridDim_x * hipBlockDim_x;
SIZE height = hipGridDim_y * hipBlockDim_y;

SIZE x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
SIZE y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;
SIZE z = hipBlockDim_z * hipBlockIdx_z + hipThreadIdx_z;
return x + (y * width) + (z * width * height); // max 4294967296
}

The introduction of the idx function within the mkn::gpu::hip namespace is a critical utility for indexing in HIP kernels, similar to its CUDA counterpart. This function facilitates the calculation of thread indices in a straightforward manner. Ensure comprehensive testing across various grid and block configurations to validate its accuracy and reliability.

inc/mkn/gpu/rocm.hpp: outdated review comment (resolved)
@coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between 8b44c8c and b84a312.
Files ignored due to path filters (2)
  • mkn.yaml is excluded by: !**/*.yaml
  • res/mkn/hipcc.yaml is excluded by: !**/*.yaml
Files selected for processing (12)
  • .github/workflows/build.yml (1 hunks)
  • inc/mkn/gpu.hpp (1 hunks)
  • inc/mkn/gpu/alloc.hpp (1 hunks)
  • inc/mkn/gpu/cpu.hpp (4 hunks)
  • inc/mkn/gpu/cuda.hpp (7 hunks)
  • inc/mkn/gpu/def.hpp (1 hunks)
  • inc/mkn/gpu/defines.hpp (1 hunks)
  • inc/mkn/gpu/device.hpp (1 hunks)
  • inc/mkn/gpu/launchers.hpp (2 hunks)
  • inc/mkn/gpu/rocm.hpp (6 hunks)
  • test/any/add.cpp (1 hunks)
  • test/any/managed.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/launchers.hpp
Additional comments: 10
.github/workflows/build.yml (1)
  • 13-13: The update of actions/checkout to v4 is a good practice for staying up-to-date with action versions. Ensure to verify the workflow execution post-merge to confirm there are no compatibility issues.
inc/mkn/gpu/def.hpp (1)
  • 5-10: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [6-17]

The simplification of conditional compilation for floating-point type checks is a good improvement. Consider adding comments explaining the inclusion of _Float16 support under non-CPU platforms for future maintainability.

inc/mkn/gpu/defines.hpp (1)
  • 8-48: The dynamic definition of macros based on the availability of CUDA and ROCm headers, along with the introduction of the CompileFlags structure, significantly enhances the codebase's flexibility and maintainability. Consider adding documentation for the CompileFlags structure to clarify its purpose and usage within the codebase.
test/any/managed.cpp (1)
  • 24-24: The addition of the assertion to check for managed GPU pointers before kernel launch is a good practice for ensuring memory safety. Consider adding a comment explaining the rationale behind this assertion to aid future maintainability.
inc/mkn/gpu.hpp (1)
  • 34-50: The reorganization of conditional includes and the modification of the idx() function to use the new MKN_GPU_* defines improve code clarity and consistency. Consider adding unit tests for the idx() function to ensure its correct behavior across different GPU platforms.
test/any/add.cpp (1)
  • 34-43: The addition of assertions to check for host and device pointers before and after memory operations in the test_add1 function is a good practice for ensuring memory safety and correctness. Consider adding comments explaining the purpose of these assertions to aid future maintainability.
inc/mkn/gpu/alloc.hpp (1)
  • 75-86: Swapping the order of arguments in the take function call and adding assertions before creating Pointer instances correct potential issues and improve code robustness. Consider adding comments explaining the reason behind these changes to aid future maintainability.
inc/mkn/gpu/rocm.hpp (1)
  • 89-137: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [40-292]

The introduction of the mkn::gpu::hip namespace, modifications to the Pointer struct, addition of the StreamEvent struct, and updates to memory type checks and synchronization functions significantly enhance the codebase's functionality and robustness when working with ROCm. Consider adding comprehensive tests to ensure the correct behavior of these new functionalities and modifications.

inc/mkn/gpu/device.hpp (2)
  • 103-104: The addition of data() member functions to the DeviceMem struct is a good practice for providing controlled access to the internal data pointer. This allows for both mutable and immutable access, which can be very useful in various contexts.

Ensure that proper memory management practices are followed throughout the DeviceMem struct to prevent issues such as memory leaks or dangling pointers, especially considering the manual memory management operations performed in the constructor and destructor.

  • 100-108: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [1-250]

Ensure consistency and proper memory management practices across all components related to device and host memory management in this file. Given the manual memory management operations performed in various places, it's crucial to carefully manage resources to prevent memory leaks, dangling pointers, and other related issues.

@coderabbitai bot left a comment

Actionable comments posted: 4

Review Status

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR and between b84a312 and e35887d.
Files selected for processing (1)
  • test/any/async_streaming.cpp (1 hunks)
Additional Context Used
Additional comments not posted (1)
test/any/async_streaming.cpp (1)

63-70: The constructor of StreamLauncher initializes member variables and sets up streams and events. However, the destructor only synchronizes streams without cleaning up or releasing any resources explicitly. Ensure that all allocated resources are properly managed and released to prevent memory leaks or resource exhaustion.

Consider verifying if additional cleanup is required for the streams and events to ensure proper resource management.

Comment on lines +2 to +6
#include <cassert>
#include <chrono>
#include <iostream>
#include <thread>
#include <algorithm>
@coderabbitai bot commented Mar 31, 2024

Consider organizing the standard library includes in alphabetical order for better readability and maintainability.

#include <algorithm>
#include <cassert>
#include <chrono>
#include <iostream>
#include <thread>


@PhilipDeegan (Member, Author) commented May 18, 2024

I prefer to sort by line length first, and alphabetically second, which would look like

#include <chrono>
#include <thread>
#include <cassert>
#include <iostream>
#include <algorithm>


New Learnings Added
User: PhilipDeegan
PR: mkn/mkn.gpu#19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-05-18T14:17:54.861Z
Learning: @PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.

Note: Learnings are effective only in the context of specific code segments. To apply general coding guidelines, please configure review instructions.

test/any/async_streaming.cpp Outdated (resolved)
test/any/async_streaming.cpp Outdated (resolved)
.dev([=] __device__(auto i) { views[i][mkn::gpu::idx()] += 3; }) //
();

for (auto const& vec : vecs) std::cout << __LINE__ << " " << vec[0] << std::endl;

Printing the first element of each vector in vecs using std::cout is useful for debugging but might clutter the output in production or automated testing environments. Consider using a logging framework or conditional compilation to control the verbosity of output.

-  for (auto const& vec : vecs) std::cout << __LINE__ << " " << vec[0] << std::endl;
+  // For debugging purposes, consider using a logging framework or conditional compilation

Committable suggestion


Suggested change
for (auto const& vec : vecs) std::cout << __LINE__ << " " << vec[0] << std::endl;
// For debugging purposes, consider using a logging framework or conditional compilation

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e35887d and 1468f8b.

Files selected for processing (4)
  • inc/mkn/gpu.hpp (1 hunks)
  • inc/mkn/gpu/cli.hpp (1 hunks)
  • inc/mkn/gpu/cuda.hpp (9 hunks)
  • inc/mkn/gpu/rocm.hpp (8 hunks)
Additional comments not posted (8)
inc/mkn/gpu.hpp (2)

34-35: LGTM!


40-51: Ensure comprehensive testing for the idx function.

The idx function determines the thread index based on the GPU type. Ensure that this function is thoroughly tested for various grid and block configurations to guarantee its correctness and robustness.

Verification successful

Comprehensive testing for the idx function is confirmed.

The idx function is thoroughly tested across various files, including different grid and block configurations.

  • test/hip/async.cpp
  • test/cuda/async.cpp
  • test/hip/add.cpp
  • test/cuda/atomic.cpp
  • test/any/async_streaming.cpp
  • test/any/add.cpp
  • test/any/managed.cpp
  • test/any/struct.cpp
  • test/any/array.cpp
  • test/cpu/namespace.cpp
  • test/any/class1.cpp
  • test/any/class2.cpp
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that the `idx` function is thoroughly tested.

# Test: Search for test cases that cover the `idx` function. Expect: Comprehensive coverage.
rg --type cpp $'idx()'

Length of output: 2763



Script:

#!/bin/bash
# Description: Verify comprehensive testing of the `idx` function for various grid and block configurations.

# Search for grid and block configuration patterns in the test files that call `idx()`.
rg --type cpp -e 'dim3' -e '<<<' -e 'idx()'

Length of output: 5507

inc/mkn/gpu/cuda.hpp (2)

Line range hint 42-52: LGTM!


56-66: Ensure comprehensive testing for the idx function.

The idx function determines the thread index in CUDA kernels. Ensure that this function is thoroughly tested for various grid and block configurations to guarantee its correctness and robustness.

inc/mkn/gpu/rocm.hpp (4)

Line range hint 40-50: LGTM!


54-65: Ensure comprehensive testing for the idx function.

The idx function determines the thread index in HIP kernels. Ensure that this function is thoroughly tested for various grid and block configurations to guarantee its correctness and robustness.


95-111: Enable the destructor code for StreamEvent.

The commented-out destructor code should be enabled to ensure proper resource management and avoid resource leaks.


226-241: Document the implications of conditional synchronization in the launch function.

The update to the launch function to include conditional synchronization based on the _sync template parameter is a significant improvement. Ensure that the implications of this change are well-documented, especially regarding how it affects kernel execution and synchronization.

Comment on lines 1 to 67
#ifndef _MKN_GPU_CLI_HPP_
#define _MKN_GPU_CLI_HPP_

#include <optional>
#include <type_traits>

#include "mkn/kul/env.hpp"

namespace mkn::gpu {

template <typename Device>
struct Cli {
//

auto bx_threads() const {
char const* ENV = "MKN_GPU_BX_THREADS";
if (mkn::kul::env::EXISTS(ENV)) {
return as<std::int32_t>(mkn::kul::env::GET(ENV));
}
return dev.maxThreadsPerBlock;
}

template <typename T>
auto static as(std::string const& from) {
T t;
std::stringstream ss(from);
ss >> t;
return t;
}

Device const& dev;
};

} /* namespace mkn::gpu */

#endif /*_MKN_GPU_CLI_HPP_*/

Ensure proper error handling for environment variable retrieval.

The function bx_threads retrieves an environment variable and converts it to an integer via as<>, which extracts with std::stringstream. Stream extraction does not throw on malformed input; it sets the failbit and writes zero to the target, so wrapping the call in try/catch for std::invalid_argument would never fire. Check the extraction result instead and fall back to the device default:

auto bx_threads() const {
  char const* ENV = "MKN_GPU_BX_THREADS";
  if (mkn::kul::env::EXISTS(ENV)) {
    std::int32_t t{};
    std::stringstream ss(mkn::kul::env::GET(ENV));
    if (ss >> t && t > 0) return t;
    // fall through: env var is not a valid positive integer
  }
  return dev.maxThreadsPerBlock;
}

inc/mkn/gpu/cuda.hpp Outdated (resolved)
inc/mkn/gpu/cuda.hpp Outdated (resolved)

coderabbitai bot commented Jul 7, 2024

Caution

Review failed

The pull request is closed.

Walkthrough

The recent updates significantly enhance GPU management capabilities within the codebase. Key improvements include new device functions for indexing and memory management, streamlined event handling with the StreamEvent struct, and robust stream management features in multi_launch.hpp. These changes collectively aim to improve performance and flexibility in concurrent GPU operations.

Changes

File | Change Summary
inc/mkn/gpu/cuda.hpp, inc/mkn/gpu/rocm.hpp | Added new functions for device management, memory operations, and enhanced event handling with StreamEvent struct.
inc/mkn/gpu/multi_launch.hpp | Defined components for managing concurrent execution of GPU functions, including StreamFunction hierarchy and ThreadedStreamLauncher.
inc/mkn/gpu/cpu.hpp | Introduced new functions for memory management and event handling; improved launch functionality.
test/any/async_streaming.cpp | Implemented parallel processing functions utilizing new stream management features for performance validation.

Sequence Diagrams

sequenceDiagram
    participant User
    participant StreamLauncher
    participant StreamEvent
    participant Callable

    User->>StreamLauncher: Initiate parallel execution
    StreamLauncher->>StreamEvent: Create and manage events
    StreamLauncher->>Callable: Execute callable functions
    Callable-->>StreamLauncher: Return results
    StreamLauncher-->>User: Provide execution status/results

Assessment against linked issues

Objective | Addressed | Explanation
Guess if CUDA or HIP headers are available (#18) | No | No code related to header checking was added.

🐇 In the realm of code where GPUs gleam,
Functions race, like a swift, bright beam.
Events now managed, streams in their flow,
With performance boosted, watch the data grow!
New features sprout, like carrots in spring,
In this code garden, the future we bring.
🌱✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (3)
test/any/async_streaming.cpp (3)

2-6: Organize includes by line length first, and alphabetically second.

Consider organizing the standard library includes by line length first, and alphabetically second for better readability and maintainability.

-#include <cassert>
-#include <chrono>
-#include <iostream>
-#include <thread>
-#include <algorithm>
+#include <chrono>
+#include <thread>
+#include <cassert>
+#include <iostream>
+#include <algorithm>

27-28: Declare destructor as virtual and use override for overridden methods.

It's good practice to declare the destructor of a base class with virtual methods as virtual, so that deleting a derived object through a base pointer is well-defined. Since C++11, prefer = default for trivial destructors. Note that the override specifier belongs on the overriding declarations in derived classes, not on the base class's own virtual methods, where it would not compile.

-  virtual ~StreamFunction() {}
+  virtual ~StreamFunction() = default;
-  virtual void run(std::uint32_t const){};
+  virtual void run(std::uint32_t const) {}

136-150: Add assertions to verify the correctness of operations.

To ensure the test validates the expected behavior, consider adding assertions after the streaming operations to check the state of vecs.

+  for (std::size_t i = 0; i < vecs.size(); ++i) {
+    assert(vecs[i][0] == expected_value); // Replace `expected_value` with the actual expected value after operations
+  }
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8b44c8c and 814f51e.

Files selected for processing (16)
  • .github/workflows/build.yml (1 hunks)
  • inc/mkn/gpu.hpp (1 hunks)
  • inc/mkn/gpu/alloc.hpp (1 hunks)
  • inc/mkn/gpu/cli.hpp (1 hunks)
  • inc/mkn/gpu/cpu.hpp (6 hunks)
  • inc/mkn/gpu/cuda.hpp (9 hunks)
  • inc/mkn/gpu/def.hpp (1 hunks)
  • inc/mkn/gpu/defines.hpp (1 hunks)
  • inc/mkn/gpu/device.hpp (1 hunks)
  • inc/mkn/gpu/launchers.hpp (2 hunks)
  • inc/mkn/gpu/rocm.hpp (8 hunks)
  • mkn.yaml (2 hunks)
  • res/mkn/hipcc.yaml (1 hunks)
  • test/any/add.cpp (1 hunks)
  • test/any/async_streaming.cpp (1 hunks)
  • test/any/managed.cpp (1 hunks)
Files skipped from review due to trivial changes (4)
  • .github/workflows/build.yml
  • inc/mkn/gpu/def.hpp
  • res/mkn/hipcc.yaml
  • test/any/managed.cpp
Additional context used
Learnings (1)
test/any/async_streaming.cpp (1)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-05-18T14:17:54.861Z
Learning: @PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.

PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.
Additional comments not posted (29)
mkn.yaml (2)

13-13: Verify the intention behind commenting out the arg for the rocm profile.

Commenting out the arg may affect the build configuration. Ensure this change is intentional.


19-19: Verify the intention behind commenting out the arg for the cuda profile.

Commenting out the arg may affect the build configuration. Ensure this change is intentional.

inc/mkn/gpu/defines.hpp (1)

12-24: LGTM! The conditional compilation directives and includes are correctly defined.

The changes ensure that the appropriate headers are included based on the presence of ROCm and CUDA.

inc/mkn/gpu.hpp (1)

39-50: LGTM! The idx() function is correctly defined based on the GPU configuration.

The changes ensure that the correct idx() function is called based on the defined GPU configuration. The error directive for unknown configurations is a good addition.

test/any/add.cpp (2)

34-34: LGTM! The assertion for the host pointer is a good addition.

The assertion ensures that the data pointer is a host pointer before proceeding with the test.


38-38: LGTM! The assertion for the device pointer is a good addition.

The assertion ensures that the data pointer is a device pointer before proceeding with the test.

inc/mkn/gpu/cli.hpp (2)

46-51: Ensure proper error handling for environment variable retrieval.

The function bx_threads retrieves an environment variable and converts it to an integer via as<>, which extracts with std::stringstream. Stream extraction does not throw on malformed input; it sets the failbit and writes zero to the target, so wrapping the call in try/catch for std::invalid_argument would never fire. Check the extraction result instead and fall back to the device default:

auto bx_threads() const {
  char const* ENV = "MKN_GPU_BX_THREADS";
  if (mkn::kul::env::EXISTS(ENV)) {
    std::int32_t t{};
    std::stringstream ss(mkn::kul::env::GET(ENV));
    if (ss >> t && t > 0) return t;
    // fall through: env var is not a valid positive integer
  }
  return dev.maxThreadsPerBlock;
}

54-59: LGTM!

The as function correctly converts a string to the specified type.

inc/mkn/gpu/launchers.hpp (2)

45-47: Ensure the Stream type is well-defined and add documentation.

The addition of the stream function in the GDLauncher struct is a significant enhancement, allowing kernels to be launched on specific GPU streams. This change improves the flexibility and control over GPU resource management. Ensure the Stream type is well-defined and compatible with the intended GPU backends (CUDA, ROCm, etc.). Additionally, consider adding documentation for this function to clarify its usage and the expected type of the Stream parameter.

Verification successful

Ensure the Stream type is well-defined and add documentation.

The Stream type is well-defined and compatible with various GPU backends, including CUDA, ROCm, and CPU, as evidenced by its definitions in inc/mkn/gpu/cuda.hpp, inc/mkn/gpu/cpu.hpp, and inc/mkn/gpu/rocm.hpp. Consider adding documentation for the stream function to clarify its usage and the expected type of the Stream parameter.

  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/rocm.hpp
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the definition of the `Stream` type.

# Test: Search for the `Stream` type definition. Expect: A well-defined type compatible with GPU backends.
rg --type cpp $'Stream'

Length of output: 6525


59-61: Ensure effective handling of the stream parameter and add error handling.

The modification to the _launch function to accept a stream parameter (S& _s) is a necessary change to support the new stream function. This adjustment allows the kernel launch operation to target a specific stream, enhancing concurrency and resource utilization on the GPU. Ensure that the MKN_GPU_NS::launch function is updated accordingly to handle the stream parameter effectively. It might also be beneficial to add error handling or assertions to verify the validity of the stream before launching the kernel.

inc/mkn/gpu/alloc.hpp (1)

75-86: LGTM!

The assertions for dst and src ensure that the pointers are valid. The switching of arguments in the take function call is correct and aligns with the logic for sending and taking data.

inc/mkn/gpu/cpu.hpp (5)

41-41: Inclusion of cli.hpp is appropriate.

The inclusion of cli.hpp aligns with the new functionalities introduced in the file.


86-88: Addition of setLimitMallocHeapSize function is appropriate.

This no-op function maintains interface consistency with the GPU counterpart.


101-112: Addition of StreamEvent struct is appropriate, but document its purpose.

The StreamEvent struct provides a uniform interface for event handling across CPU and GPU contexts. Ensure to add documentation explaining its purpose and the rationale behind the no-op implementations.


120-121: Clarify behavior of Pointer struct functions.

Modifying the Pointer struct to always return true for is_device_ptr and is_managed_ptr in the CPU context might be misleading. Add documentation to clarify the behavior of these functions.


207-207: Document the implications of conditional synchronization in the launch function.

The update to the launch function to include conditional synchronization based on the _sync template parameter is a significant improvement. Ensure that the implications of this change are well-documented, especially regarding how it affects kernel execution and synchronization.

inc/mkn/gpu/cuda.hpp (6)

42-42: Inclusion of cli.hpp is appropriate.

The inclusion of cli.hpp aligns with the new functionalities introduced in the file.


57-67: Addition of idx function is appropriate, ensure thorough testing.

The idx function simplifies thread index calculation in CUDA kernels. Ensure that the function is thoroughly tested for various grid and block configurations.


81-83: Addition of setLimitMallocHeapSize function is appropriate.

This function sets the heap size limit for CUDA memory allocation.


97-113: Enable the destructor code for StreamEvent.

The StreamEvent struct addition for CUDA streams is crucial for event management. Ensure the destructor properly manages the destruction of CUDA events to prevent resource leaks.


215-230: Document the implications of conditional synchronization in the launch function.

The update to the launch function to include conditional synchronization based on the _sync template parameter is a significant improvement. Ensure that the implications of this change are well-documented, especially regarding how it affects kernel execution and synchronization.


255-262: Addition of GLauncher class is appropriate, ensure thorough testing.

The GLauncher class initializes the block size using cli.bx_threads(). Ensure that this functionality is thoroughly tested.

inc/mkn/gpu/rocm.hpp (6)

40-40: Inclusion of cli.hpp is appropriate.

The inclusion of cli.hpp aligns with the new functionalities introduced in the file.


55-66: Addition of idx function is appropriate, ensure thorough testing.

The idx function simplifies thread index calculation in HIP kernels. Ensure that the function is thoroughly tested for various grid and block configurations.


80-82: Addition of setLimitMallocHeapSize function is appropriate.

This function sets the heap size limit for HIP memory allocation.


96-112: Enable the destructor code for StreamEvent.

The StreamEvent struct addition for HIP streams is crucial for event management. Ensure the destructor properly manages the destruction of HIP events to prevent resource leaks.


117-138: Addition of Pointer struct is appropriate, ensure documentation.

The Pointer struct provides functions to determine the type of pointer (host, device, managed). Ensure that these functions are documented properly to clarify their behavior.


227-242: Document the implications of conditional synchronization in the launch function.

The update to the launch function to include conditional synchronization based on the _sync template parameter is a significant improvement. Ensure that the implications of this change are well-documented, especially regarding how it affects kernel execution and synchronization.

inc/mkn/gpu/device.hpp (1)

103-105: Addition of data member functions to DeviceMem struct is appropriate, ensure documentation.

The data member functions provide access to the data pointer in DeviceMem. Ensure that these functions are documented properly to clarify their behavior.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 814f51e and 40db90a.

Files selected for processing (3)
  • inc/mkn/gpu/cuda.hpp (10 hunks)
  • inc/mkn/gpu/rocm.hpp (8 hunks)
  • res/mkn/clang_cuda.yaml (1 hunks)
Files skipped from review due to trivial changes (1)
  • res/mkn/clang_cuda.yaml
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (1)
test/any/coop.cpp (1)

16-34: Add error checks for GPU launcher setup.

The function does not include explicit error checks for the GPU launcher setup. Consider adding error handling to ensure robustness.

// Example error check for GPU launcher setup
if (!launcher) {
  KOUT(NON) << "Failed to set up GPU launcher";
  return 1;
}
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 40db90a and 8669dd3.

Files selected for processing (5)
  • inc/mkn/gpu/cpu.hpp (7 hunks)
  • inc/mkn/gpu/cuda.hpp (10 hunks)
  • inc/mkn/gpu/launchers.hpp (2 hunks)
  • inc/mkn/gpu/rocm.hpp (8 hunks)
  • test/any/coop.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (4)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/launchers.hpp
  • inc/mkn/gpu/rocm.hpp
Additional comments not posted (3)
test/any/coop.cpp (3)

2-6: LGTM!

The includes and constants look appropriate for the context.


36-46: LGTM!

The main function is straightforward and correctly handles the cooperative launch support check.


16-34: Verify compatibility with ROCm.

The lambda function uses __device__ which is specific to CUDA. Ensure compatibility with ROCm.

Verification successful

Verified compatibility with ROCm.

The codebase includes ROCm-specific preprocessor directives and documentation, indicating that it is designed to be compatible with ROCm.

  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/defines.hpp
  • test/hip/add.cpp
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify compatibility with ROCm.

# Test: Search for usage of `__device__` in ROCm-related files.
rg --type cpp $'__device__'

Length of output: 3024


Script:

#!/bin/bash
# Description: Check for ROCm-specific conditional compilation directives and documentation.

# Search for ROCm-specific preprocessor directives.
rg -i 'ROCm' --type cpp

# Search for comments or documentation related to ROCm.
rg -i 'ROCm' --type md --type hpp --type cpp

Length of output: 1391

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8669dd3 and e2cca60.

Files selected for processing (3)
  • inc/mkn/gpu/cuda.hpp (10 hunks)
  • inc/mkn/gpu/rocm.hpp (8 hunks)
  • test/any/coop.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp
  • test/any/coop.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e2cca60 and 3d880b1.

Files selected for processing (2)
  • inc/mkn/gpu/cuda.hpp (10 hunks)
  • inc/mkn/gpu/rocm.hpp (8 hunks)
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3d880b1 and df47426.

Files selected for processing (1)
  • inc/mkn/gpu/cpu.hpp (7 hunks)
Files skipped from review as they are similar to previous changes (1)
  • inc/mkn/gpu/cpu.hpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 72aa34b and 35bbbe8.

Files selected for processing (2)
  • inc/mkn/gpu/alloc.hpp (4 hunks)
  • test/any/construct.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • inc/mkn/gpu/alloc.hpp
Additional comments not posted (5)
test/any/construct.cpp (5)

2-4: LGTM!

The include statement and constant definition are appropriate.


6-10: LGTM!

The template definitions for ManagedVector and ManagedMemory are correctly defined using mkn::gpu allocators.


12-15: LGTM!

The global variable and the struct S with a constructor that modifies alloced are correctly defined.


17-29: LGTM!

The test functions test_does_construct_on_resize and test_does_not_construct_on_resize correctly test the construction behavior on resize operations.


31-34: LGTM!

The main function correctly runs the tests and returns their combined result.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 35bbbe8 and c5e6bf0.

Files selected for processing (2)
  • inc/mkn/gpu/alloc.hpp (4 hunks)
  • test/any/construct.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/alloc.hpp
  • test/any/construct.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c5e6bf0 and d1c3d36.

Files selected for processing (2)
  • inc/mkn/gpu/alloc.hpp (4 hunks)
  • test/any/construct.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/alloc.hpp
  • test/any/construct.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d1c3d36 and e6006cb.

Files selected for processing (7)
  • inc/mkn/gpu/alloc.hpp (4 hunks)
  • inc/mkn/gpu/cpu.hpp (9 hunks)
  • inc/mkn/gpu/cuda.hpp (12 hunks)
  • inc/mkn/gpu/defines.hpp (1 hunks)
  • inc/mkn/gpu/rocm.hpp (10 hunks)
  • test/any/construct.cpp (1 hunks)
  • test/any/managed.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (6)
  • inc/mkn/gpu/alloc.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/defines.hpp
  • inc/mkn/gpu/rocm.hpp
  • test/any/construct.cpp
  • test/any/managed.cpp
Additional comments not posted (6)
inc/mkn/gpu/cpu.hpp (6)

41-41: Include review: Approved

The new include for mkn/gpu/cli.hpp likely brings in CLI functionalities.


86-88: No-op function: Approved

The setLimitMallocHeapSize function is a no-op, likely for API consistency across platforms.


90-93: No support for cooperative launch: Approved

The supportsCooperativeLaunch function returns 0, indicating no support for cooperative launch. This is likely for API consistency.


155-155: Inlining function: Approved

Inlining the destroy function can improve performance by reducing function call overhead.


172-176: New function: Approved

The copy_on_device function performs a memcpy operation, copying data from src to dst on the CPU.


209-209: Inlining function: Approved

Inlining the sync function can improve performance by reducing function call overhead.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
test/any/add.cpp (1)

45-46: Remove or clarify commented-out assertion.

The commented-out assertion should either be removed or clarified with a comment explaining its purpose.

-  // assert(mkn::gpu::Pointer{a.data()}.is_device_ptr());
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e6006cb and e1fcf56.

Files selected for processing (6)
  • .github/workflows/build.yml (1 hunks)
  • inc/mkn/gpu/alloc.hpp (4 hunks)
  • inc/mkn/gpu/cuda.hpp (12 hunks)
  • inc/mkn/gpu/rocm.hpp (10 hunks)
  • test/any/add.cpp (1 hunks)
  • test/any/construct.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • .github/workflows/build.yml
  • inc/mkn/gpu/alloc.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp
  • test/any/construct.cpp
Additional comments not posted (4)
test/any/add.cpp (4)

33-35: LGTM! Assertion to check host pointer.

The added assertion is a good practice to verify memory types before operations.


38-41: LGTM! Conditional assertion to check device pointer.

The added conditional assertion is a good practice to verify memory types based on compile-time flags.


42-42: LGTM! Kernel function launcher call.

The launcher call correctly executes the vectoradd1 kernel function with appropriate arguments.


46-48: LGTM! Result verification loop.

The added for-loop correctly verifies the results of the kernel execution.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e1fcf56 and c2fda82.

Files selected for processing (1)
  • inc/mkn/gpu/alloc.hpp (4 hunks)
Files skipped from review as they are similar to previous changes (1)
  • inc/mkn/gpu/alloc.hpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c2fda82 and f3d3566.

Files selected for processing (24)
  • .github/workflows/build.yml (1 hunks)
  • .sublime-project (1 hunks)
  • LICENSE.md (1 hunks)
  • inc/mkn/gpu.hpp (2 hunks)
  • inc/mkn/gpu/alloc.hpp (4 hunks)
  • inc/mkn/gpu/asio.hpp (1 hunks)
  • inc/mkn/gpu/cli.hpp (1 hunks)
  • inc/mkn/gpu/cpu.hpp (9 hunks)
  • inc/mkn/gpu/cuda.hpp (12 hunks)
  • inc/mkn/gpu/def.hpp (1 hunks)
  • inc/mkn/gpu/defines.hpp (1 hunks)
  • inc/mkn/gpu/device.hpp (2 hunks)
  • inc/mkn/gpu/launchers.hpp (3 hunks)
  • inc/mkn/gpu/multi_launch.hpp (1 hunks)
  • inc/mkn/gpu/rocm.hpp (10 hunks)
  • inc/mkn/gpu/tuple.hpp (1 hunks)
  • mkn.yaml (2 hunks)
  • res/mkn/clang_cuda.yaml (1 hunks)
  • res/mkn/hipcc.yaml (1 hunks)
  • test/any/add.cpp (1 hunks)
  • test/any/async_streaming.cpp (1 hunks)
  • test/any/construct.cpp (1 hunks)
  • test/any/coop.cpp (1 hunks)
  • test/any/managed.cpp (1 hunks)
Files skipped from review due to trivial changes (1)
  • res/mkn/clang_cuda.yaml
Files skipped from review as they are similar to previous changes (23)
  • .github/workflows/build.yml
  • .sublime-project
  • LICENSE.md
  • inc/mkn/gpu.hpp
  • inc/mkn/gpu/alloc.hpp
  • inc/mkn/gpu/asio.hpp
  • inc/mkn/gpu/cli.hpp
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/def.hpp
  • inc/mkn/gpu/defines.hpp
  • inc/mkn/gpu/device.hpp
  • inc/mkn/gpu/launchers.hpp
  • inc/mkn/gpu/multi_launch.hpp
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/tuple.hpp
  • mkn.yaml
  • res/mkn/hipcc.yaml
  • test/any/add.cpp
  • test/any/async_streaming.cpp
  • test/any/construct.cpp
  • test/any/coop.cpp
  • test/any/managed.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f3d3566 and c0fb261.

Files selected for processing (2)
  • inc/mkn/gpu/launchers.hpp (3 hunks)
  • test/any/managed.cpp (2 hunks)
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/launchers.hpp
  • test/any/managed.cpp

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c0fb261 and 923b6e5.

Files selected for processing (2)
  • inc/mkn/gpu/launchers.hpp (3 hunks)
  • inc/mkn/gpu/multi_launch.hpp (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/launchers.hpp
  • inc/mkn/gpu/multi_launch.hpp

@PhilipDeegan changed the title from "updates for rocm 6.0.2" to "next" on Jul 25, 2024
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (3)
test/any/async_streaming.cpp (2)

45-45: Consider using a logging framework or conditional compilation for debugging output.

Printing the first element of each vector in vecs using std::cout is useful for debugging but might clutter the output in production or automated testing environments. Consider using a logging framework or conditional compilation to control the verbosity of output.

-  for (auto const& vec : vecs) std::cout << __LINE__ << " " << vec[0] << std::endl;
+  // For debugging purposes, consider using a logging framework or conditional compilation

81-81: Consider using a logging framework or conditional compilation for debugging output.

Printing the first element of each vector in vecs using std::cout is useful for debugging but might clutter the output in production or automated testing environments. Consider using a logging framework or conditional compilation to control the verbosity of output.

-  for (auto const& vec : vecs) std::cout << __LINE__ << " " << vec[0] << std::endl;
+  // For debugging purposes, consider using a logging framework or conditional compilation
inc/mkn/gpu/multi_launch.hpp (1)

50-58: Declare the destructor as default.

It's good practice to declare the destructor of a base class with virtual methods as default to ensure proper cleanup in derived classes.

-  virtual ~StreamFunction() {}
+  virtual ~StreamFunction() = default;
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 923b6e5 and 5831051.

Files selected for processing (2)
  • inc/mkn/gpu/multi_launch.hpp (1 hunks)
  • test/any/async_streaming.cpp (1 hunks)
Additional context used
Learnings (1)
test/any/async_streaming.cpp (1)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#19
File: test/any/async_streaming.cpp:2-6
Timestamp: 2024-05-18T14:17:54.861Z
Learning: @PhilipDeegan, thank you for sharing your preferred style for organizing includes. I'll remember this for future reviews in your codebase.


PhilipDeegan prefers to sort `#include` statements by line length first, and alphabetically second.
Additional comments not posted (5)
test/any/async_streaming.cpp (2)

2-6: Includes are well-organized.

The includes are sorted by line length first, and alphabetically second, as per your preference.


93-97: LGTM!

The function main is straightforward and does not contain any issues.

inc/mkn/gpu/multi_launch.hpp (3)

60-75: LGTM!

The structure StreamDeviceFunction is well-implemented and does not contain any issues.


77-84: LGTM!

The structure StreamHostFunction is well-implemented and does not contain any issues.


183-301: LGTM!

The structure ThreadedStreamLauncher is well-implemented and does not contain any issues.

Comment on lines 23 to 55
std::uint32_t test() {
  using namespace mkn::gpu;
  using T = double;

  KUL_DBG_FUNC_ENTER;

  std::vector<ManagedVector<T>> vecs(C, ManagedVector<T>(NUM, 0));
  for (std::size_t i = 0; i < vecs.size(); ++i) std::fill_n(vecs[i].data(), NUM, i);

  ManagedVector<T*> datas(C);
  for (std::size_t i = 0; i < vecs.size(); ++i) datas[i] = vecs[i].data();
  auto views = datas.data();

  StreamLauncher{vecs}
      .dev([=] __device__(auto i) { views[i][mkn::gpu::idx()] += 1; })
      .host([&](auto i) mutable {
        std::this_thread::sleep_for(200ms);
        for (auto& e : vecs[i]) e += 1;
      })
      .dev([=] __device__(auto i) { views[i][mkn::gpu::idx()] += 3; })  //
      ();

  for (auto const& vec : vecs) std::cout << __LINE__ << " " << vec[0] << std::endl;

  std::size_t val = 5;
  for (auto const& vec : vecs) {
    for (auto const& e : vec)
      if (e != val) return 1;
    ++val;
  };

  return 0;
}

Add assertions to verify correctness.

To ensure the test validates the expected behavior, consider adding assertions after the streaming operations to check the state of vecs.

+  for (std::size_t i = 0; i < vecs.size(); ++i) {
+    assert(vecs[i][0] == expected_value); // Replace `expected_value` with the actual expected value after operations
+  }

Committable suggestion was skipped due to low confidence.

Comment on lines 57 to 91
std::uint32_t test_threaded() {
  using namespace mkn::gpu;
  using T = double;

  KUL_DBG_FUNC_ENTER;

  std::vector<ManagedVector<T>> vecs(C, ManagedVector<T>(NUM, 0));
  for (std::size_t i = 0; i < vecs.size(); ++i) std::fill_n(vecs[i].data(), NUM, i);

  ManagedVector<T*> datas(C);
  for (std::size_t i = 0; i < vecs.size(); ++i) datas[i] = vecs[i].data();
  auto views = datas.data();

  using namespace std::chrono_literals;

  ThreadedStreamLauncher{vecs, 6}
      .dev([=] __device__(auto i) { views[i][mkn::gpu::idx()] += 1; })
      .host([&](auto i) mutable {
        std::this_thread::sleep_for(200ms);
        for (auto& e : vecs[i]) e += 1;
      })
      .dev([=] __device__(auto i) { views[i][mkn::gpu::idx()] += 3; })  //
      ();

  for (auto const& vec : vecs) std::cout << __LINE__ << " " << vec[0] << std::endl;

  std::size_t val = 5;
  for (auto const& vec : vecs) {
    for (auto const& e : vec)
      if (e != val) return 1;
    ++val;
  };

  return 0;
}

Add assertions to verify correctness.

To ensure the test validates the expected behavior, consider adding assertions after the streaming operations to check the state of vecs.

+  for (std::size_t i = 0; i < vecs.size(); ++i) {
+    assert(vecs[i][0] == expected_value); // Replace `expected_value` with the actual expected value after operations
+  }
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

   std::size_t val = 5;
   for (auto const& vec : vecs) {
     for (auto const& e : vec)
       if (e != val) return 1;
     ++val;
   };
+  for (std::size_t i = 0; i < vecs.size(); ++i) {
+    assert(vecs[i][0] == expected_value); // Replace `expected_value` with the actual expected value after operations
+  }
   return 0;
 }

Comment on lines 86 to 165
template <typename Datas, typename Self_ = void>
struct StreamLauncher {
  using This = StreamLauncher<Datas, Self_>;
  using Self = std::conditional_t<std::is_same_v<Self_, void>, This, Self_>;

  StreamLauncher(Datas& datas_) : datas{datas_}, streams(datas.size()), data_step(datas.size(), 0) {
    for (auto& s : streams) events.emplace_back(s);
  }

  ~StreamLauncher() { sync(); }

  void sync() noexcept {
    for (auto& s : streams) s.sync();
  }

  template <typename Fn>
  Self& dev(Fn&& fn) {
    fns.emplace_back(std::make_shared<StreamDeviceFunction<Self, Fn>>(self, std::forward<Fn>(fn)));
    return self;
  }
  template <typename Fn>
  Self& host(Fn&& fn) {
    fns.emplace_back(std::make_shared<StreamHostFunction<Self, Fn>>(self, std::forward<Fn>(fn)));
    return self;
  }

  void operator()() {
    using namespace std::chrono_literals;

    if (fns.size() == 0) return;

    for (std::size_t i = 0; i < datas.size(); ++i) self(i);

    do {
      for (std::size_t i = 0; i < datas.size(); ++i) {
        if (is_finished(i)) continue;
        if (is_fn_finished(i)) {
          data_step[i] += 1;
          if (not is_finished(i)) self(i);
        }
      }
      std::this_thread::sleep_for(1ms);  // make sleep time configurable
    } while (!is_finished());
  }

  void operator()(std::uint32_t const i) {
    auto const& step = data_step[i];
    assert(step < fns.size());
    fns[step]->run(i);
    if (fns[step]->mode == StreamFunctionMode::DEVICE_WAIT) events[i].record();
  }

  bool is_finished() const {
    std::uint32_t finished = 0;
    for (std::size_t i = 0; i < datas.size(); ++i)
      if (is_finished(i)) ++finished;
    return finished == datas.size();
  }

  bool is_finished(std::uint32_t idx) const { return data_step[idx] == fns.size(); }

  bool is_fn_finished(std::uint32_t i) {
    auto const b = [&]() {
      auto const& step = data_step[i];
      if (fns[step]->mode == StreamFunctionMode::HOST_WAIT) return true;
      return events[i].finished();
    }();
    if (b) {
      events[i].reset();
    }
    return b;
  }

  Datas& datas;
  std::vector<std::shared_ptr<StreamFunction<Self>>> fns;
  std::vector<mkn::gpu::Stream> streams;
  std::vector<mkn::gpu::StreamEvent> events;
  std::vector<std::uint16_t> data_step;
  Self& self = *reinterpret_cast<Self*>(this);
};

Make sleep duration configurable.

The operator() method has a hardcoded sleep duration. Consider making it configurable to allow flexibility.

-      std::this_thread::sleep_for(1ms);  // make sleep time configurable
+      std::this_thread::sleep_for(sleep_duration);  // make sleep time configurable
std::chrono::milliseconds sleep_duration{1};

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5831051 and f1871c4.

Files selected for processing (4)
  • inc/mkn/gpu/cuda.hpp (12 hunks)
  • inc/mkn/gpu/multi_launch.hpp (1 hunks)
  • res/mkn/clang_cuda.yaml (2 hunks)
  • test/any/async_streaming.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/multi_launch.hpp
  • test/any/async_streaming.cpp
Additional comments not posted (13)
res/mkn/clang_cuda.yaml (2)

8-12: Addition of troubleshooting information is helpful.

The added troubleshooting information for the "GPUassert: unknown error" is relevant and useful for users.


26-27: Modification of env property is appropriate.

Including the CUDA binaries directory in the PATH ensures that CUDA tools are accessible, facilitating smoother compilation and execution of CUDA applications.

inc/mkn/gpu/cuda.hpp (11)

42-44: Inclusion of necessary headers is appropriate.

The included headers (cuda_runtime.h and cooperative_groups.h) are necessary for the functionalities implemented in the file.


Line range hint 48-54: Implementation of gpuAssert function is correct.

The gpuAssert function is correctly implemented to handle CUDA errors and is defined as inline for potential performance benefits.


60-68: Implementation of idx device function is correct.

The idx function is correctly implemented to calculate a unique index for each thread in a CUDA kernel.


70-73: Implementation of block_idx_x device function is correct.

The block_idx_x function is correctly implemented to return the block index in the x-dimension in CUDA kernels.


87-89: Implementation of setLimitMallocHeapSize function is correct.

The setLimitMallocHeapSize function is correctly implemented to set the CUDA device's heap size limit.


91-95: Implementation of supportsCooperativeLaunch function is correct.

The supportsCooperativeLaunch function is correctly implemented to check if the device supports cooperative kernel launches.


109-151: Implementation of StreamEvent struct is correct.

The StreamEvent struct is correctly implemented to manage CUDA events, ensuring proper resource management and providing methods for recording events and checking their completion status.


191-193: Implementation of destroy function is correct.

The destroy function is correctly implemented to free CUDA memory and is defined as inline for potential performance benefits.


208-212: Implementation of copy_on_device function is correct.

The copy_on_device function is correctly implemented to copy memory from one device location to another.


253-254: Implementation of sync function is correct.

The sync function is correctly implemented to synchronize the CUDA device and is defined as inline for potential performance benefits.


259-280: Implementation of launch function is correct.

The launch function is correctly implemented to support cooperative kernel launches and conditional synchronization, providing more flexible execution configurations for CUDA kernels.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (13)
inc/mkn/gpu/cpu.hpp (5)

86-87: Consider adding documentation for setLimitMallocHeapSize.

The function is a no-op in the CPU context. Adding a comment or documentation explaining this behavior would help future maintainers.

// No-op for CPU context
void setLimitMallocHeapSize(std::size_t const& /*bytes*/) {}

88-89: Consider adding documentation for setDevice.

The function is a no-op in the CPU context. Adding a comment or documentation explaining this behavior would help future maintainers.

// No-op for CPU context
void setDevice(std::size_t const& /*dev*/) {}

90-93: Consider adding documentation for supportsCooperativeLaunch.

The function always returns 0 in the CPU context. Adding a comment or documentation explaining this behavior would help future maintainers.

// Always returns 0 for CPU context
auto supportsCooperativeLaunch(int const /*dev*/ = 0) {
  int supportsCoopLaunch = 0;
  return supportsCoopLaunch;
}

106-117: Clarify the purpose of StreamEvent.

The StreamEvent struct is introduced in the CPU context with no-op implementations. Adding documentation to explain its intended use and the rationale behind the no-op implementations would be beneficial.

struct StreamEvent {
  // Constructor
  StreamEvent(Stream&) {}
  // Destructor
  ~StreamEvent() {}

  // Operator overload
  auto& operator()() { return event; }
  // Record event (no-op)
  void record() { ; }
  // Check if event is finished (always true)
  bool finished() const { return true; }
  // Reset event (no-op)
  void reset() {}

  // Member variables
  Stream stream;
  std::size_t event = 0;
};

Line range hint 218-265: Document the implications of _sync and _coop in launch function.

The launch function now includes _sync and _coop template parameters. Ensure that the implications of these changes are well-documented, especially regarding how they affect kernel execution and synchronization.

template <bool _sync = true, bool _coop = false, typename F, typename... Args>
void launch(F f, dim3 g, dim3 b, std::size_t /*ds*/, std::size_t /*stream*/, Args&&... args) {
  std::size_t N = (g.x * g.y * g.z) * (b.x * b.y * b.z);
  KLOG(TRC) << N;
  std::apply(
      [&](auto&&... params) {
        for (std::size_t i = 0; i < N; ++i) {
          f(params...);
          detail::idx++;
        }
      },
      devmem_replace(std::forward_as_tuple(args...), std::make_index_sequence<sizeof...(Args)>()));

  detail::idx = 0;
}
inc/mkn/gpu/rocm.hpp (3)

244-265: Document the implications of _sync and _coop in launch function.

The launch function now includes _sync and _coop template parameters. Ensure that the implications of these changes are well-documented, especially regarding how they affect kernel execution and synchronization.

template <bool _sync = true, bool _coop = false, typename F, typename... Args>
void launch(F&& f, dim3 g, dim3 b, std::size_t ds, hipStream_t& s, Args&&... args) {
  std::size_t N = (g.x * g.y * g.z) * (b.x * b.y * b.z);
  KLOG(TRC) << N;
  std::apply(
      [&](auto&&... params) {
        if constexpr (_coop) {
          auto address_of = [](auto& a) { return (void*)&a; };
          void* kernelArgs[] = {(address_of(params), ...)};
          hipLaunchCooperativeKernel<F>(f, g, b, kernelArgs, ds, s);
        } else {
          hipLaunchKernelGGL(f, g, b, ds, s, params...);
        }
        MKN_GPU_ASSERT(hipGetLastError());
      },
      devmem_replace(std::forward_as_tuple(args...), std::make_index_sequence<sizeof...(Args)>()));
  if constexpr (_sync) {
    if (s)
      sync(s);
    else
      sync();
  }
}

88-93: Ensure supportsCooperativeLaunch function is tested.

The supportsCooperativeLaunch function is not currently tested in the codebase. The only instance found is commented out in test/any/coop.cpp. Please add tests to ensure this function is thoroughly tested.

  • test/any/coop.cpp: Uncomment and implement tests for supportsCooperativeLaunch.
Analysis chain

Ensure supportsCooperativeLaunch function is tested.

The function checks if the selected device supports cooperative launches. Ensure that this function is thoroughly tested.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `supportsCooperativeLaunch` function to ensure it is tested.

# Test: Search for the `supportsCooperativeLaunch` function usage. Expect: Verify if the function is tested.
rg --type cpp -A 5 $'supportsCooperativeLaunch'

Length of output: 1226


Script:

#!/bin/bash
# Description: Verify if the `supportsCooperativeLaunch` function is tested in the codebase.

# Search for test files that might contain tests for `supportsCooperativeLaunch`
rg --type cpp -l 'supportsCooperativeLaunch' | grep -E 'test|spec|unittest|check'

# Search for test functions that might be testing `supportsCooperativeLaunch`
rg --type cpp -A 5 'TEST|TEST_F|TEST_CASE|TEST_SUITE' | grep -A 5 'supportsCooperativeLaunch'

Length of output: 193


86-86: Ensure setDevice function is tested.

The setDevice function is not currently tested. Please add tests to ensure this function works as expected.

  • The function is defined in the following files but is not referenced in any test files:
    • inc/mkn/gpu/rocm.hpp
    • inc/mkn/gpu/cuda.hpp
    • inc/mkn/gpu/cpu.hpp
Analysis chain

Ensure setDevice function is tested.

The function sets the active GPU device. Ensure that this function is thoroughly tested.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `setDevice` function to ensure it is tested.

# Test: Search for the `setDevice` function usage. Expect: Verify if the function is tested.
rg --type cpp -A 5 $'setDevice'

Length of output: 1142


Script:

#!/bin/bash
# Description: Verify if the `setDevice` function is tested by searching for its usage in common test directories.

# Search for the `setDevice` function usage in test files.
rg 'setDevice' -g '*test*' -g '*Test*' -g '*tests*' -g '*Tests*'

Length of output: 213


Script:

#!/bin/bash
# Description: Diagnose why ripgrep did not search any files by using the --debug flag.

# Search for the `setDevice` function usage in test files with debug information.
rg 'setDevice' -g '*test*' -g '*Test*' -g '*tests*' -g '*Tests*' --debug

Length of output: 11076


Script:

#!/bin/bash
# Description: Search for the `setDevice` function usage in all files, including hidden ones, and explicitly include common test file extensions and directories.

# Search for the `setDevice` function usage in all files, including hidden ones.
rg 'setDevice' --hidden --glob '*test*' --glob '*Test*' --glob '*tests*' --glob '*Tests*' --glob '*.cpp' --glob '*.hpp'

Length of output: 393

inc/mkn/gpu/cuda.hpp (5)

127-128: Delete copy constructor and assignment operator for StreamEvent.

The copy constructor and assignment operator are deleted to prevent copying of the StreamEvent object. Ensure that this behavior is well-documented.

StreamEvent(StreamEvent const&) = delete;
StreamEvent& operator=(StreamEvent const&) = delete;

261-282: Document the implications of _sync and _coop in launch function.

The launch function now includes _sync and _coop template parameters. Ensure that the implications of these changes are well-documented, especially regarding how they affect kernel execution and synchronization.

template <bool _sync = true, bool _coop = false, typename F, typename... Args>
void launch(F&& f, dim3 g, dim3 b, std::size_t ds, cudaStream_t& s, Args&&... args) {
  std::size_t N = (g.x * g.y * g.z) * (b.x * b.y * b.z);
  KLOG(TRC) << N;
  std::apply(
      [&](auto&&... params) {
        if constexpr (_coop) {
          auto address_of = [](auto& a) { return (void*)&a; };
          void* kernelArgs[] = {(address_of(params), ...)};
          cudaLaunchCooperativeKernel((void*)f, g, b, kernelArgs, ds);
        } else {
          f<<<g, b, ds, s>>>(params...);
        }
        MKN_GPU_ASSERT(cudaGetLastError());
      },
      devmem_replace(std::forward_as_tuple(args...), std::make_index_sequence<sizeof...(Args)>()));
  if constexpr (_sync) {
    if (s)
      sync(s);
    else
      sync();
  }
}

93-97: Ensure supportsCooperativeLaunch function is tested.

The supportsCooperativeLaunch function is not currently tested, as the only reference found in the test files is commented out.

  • test/any/coop.cpp: Commented-out reference to mkn::gpu::supportsCooperativeLaunch()

Please add tests to ensure this function is thoroughly tested.

Analysis chain

Ensure supportsCooperativeLaunch function is tested.

The function checks if the selected device supports cooperative launches. Ensure that this function is thoroughly tested.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `supportsCooperativeLaunch` function to ensure it is tested.

# Test: Search for the `supportsCooperativeLaunch` function usage. Expect: Verify if the function is tested.
rg --type cpp -A 5 $'supportsCooperativeLaunch'

Length of output: 1226


Script:

#!/bin/bash
# Description: Verify if there are any active tests for the `supportsCooperativeLaunch` function.

# Search for the `supportsCooperativeLaunch` function references in test files to see if there are any active tests.
rg --type cpp --glob 'test/**' 'supportsCooperativeLaunch'

Length of output: 165


91-91: Ensure setDevice function is tested.

The setDevice function is not currently tested. Please add test cases to ensure that this function works correctly.

  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/rocm.hpp
  • inc/mkn/gpu/cpu.hpp
Analysis chain

Ensure setDevice function is tested.

The function sets the active GPU device. Ensure that this function is thoroughly tested.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `setDevice` function to ensure it is tested.

# Test: Search for the `setDevice` function usage. Expect: Verify if the function is tested.
rg --type cpp -A 5 $'setDevice'

Length of output: 1142


Script:

#!/bin/bash
# Description: Verify if the `setDevice` function is tested by searching for its usage in potential test directories.

# Search for the `setDevice` function usage in potential test directories.
rg --type cpp -A 5 'setDevice' test tests

Length of output: 87


Script:

#!/bin/bash
# Description: Verify if the `setDevice` function is tested by searching for its usage in the entire codebase.

# Search for the `setDevice` function usage in the entire codebase.
rg --type cpp -A 5 'setDevice'

Length of output: 1142


87-89: Ensure setLimitMallocHeapSize function is tested.

The function setLimitMallocHeapSize is not explicitly tested in the current codebase. Please add tests to ensure its functionality.

  • No instances of setLimitMallocHeapSize were found in the test directories.
Analysis chain

Ensure setLimitMallocHeapSize function is tested.

The function sets the limit on the heap size for dynamic memory allocation. Ensure that this function is thoroughly tested.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `setLimitMallocHeapSize` function to ensure it is tested.

# Test: Search for the `setLimitMallocHeapSize` function usage. Expect: Verify if the function is tested.
rg --type cpp -A 5 $'setLimitMallocHeapSize'

Length of output: 1049


Script:

#!/bin/bash
# Description: Verify if the `setLimitMallocHeapSize` function is tested.

# Search for the `setLimitMallocHeapSize` function usage in typical test directories.
rg --type cpp -A 5 'setLimitMallocHeapSize' tests/ test/

Length of output: 103


Script:

#!/bin/bash
# Description: Identify potential test directories and files in the repository.

# List all directories and files in the repository to identify potential test locations.
fd --type d
fd --type f

Length of output: 853


Script:

#!/bin/bash
# Description: Verify if the `setLimitMallocHeapSize` function is tested.

# Search for the `setLimitMallocHeapSize` function usage in identified test directories.
rg --type cpp -A 5 'setLimitMallocHeapSize' test/any/ test/cpu/ test/cuda/ test/hip/

Length of output: 84

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f1871c4 and 329594d.

Files selected for processing (5)
  • inc/mkn/gpu/cpu.hpp (9 hunks)
  • inc/mkn/gpu/cuda.hpp (12 hunks)
  • inc/mkn/gpu/multi_launch.hpp (1 hunks)
  • inc/mkn/gpu/rocm.hpp (10 hunks)
  • test/any/async_streaming.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/multi_launch.hpp
  • test/any/async_streaming.cpp
Additional comments not posted (10)
inc/mkn/gpu/cpu.hpp (3)

155-157: Inline destroy function.

The destroy function is now inline, which may optimize its usage. Ensure that this change is necessary and beneficial for performance.


209-209: Inline sync function.

The sync function is now inline, which may optimize its usage. Ensure that this change is necessary and beneficial for performance.

Verification successful

Inline sync function.

The sync function in inc/mkn/gpu/cpu.hpp is an empty function. Inlining it is unlikely to have a significant impact on performance due to its simplicity. The change is reasonable and should not introduce any unintended side effects.

  • File: inc/mkn/gpu/cpu.hpp
  • Line: 209
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `sync` function to ensure inlining is beneficial.

# Test: Search for the `sync` function usage. Expect: Verify if inlining is beneficial.
rg --type cpp -A 5 $'sync'

Length of output: 21438


Script:

#!/bin/bash
# Description: Find all instances where the `sync` function from `inc/mkn/gpu/cpu.hpp` is called.

# Search for calls to `sync` in the context of `inc/mkn/gpu/cpu.hpp`
rg --type cpp 'sync\(\)' -A 5

Length of output: 4431


172-176: Ensure copy_on_device function is tested.

The copy_on_device function is added to facilitate copying data to the device. Ensure that this function is thoroughly tested.

inc/mkn/gpu/rocm.hpp (5)

238-239: Inline sync function.

The sync function is now inline, which may optimize its usage. Ensure that this change is necessary and beneficial for performance.


134-145: Ensure Pointer struct is tested.

The Pointer struct now includes additional logic to determine the type of memory pointer. Ensure that this struct is thoroughly tested.

Verification successful

The Pointer struct is tested.

The Pointer struct is used in assertions within the test files test/any/managed.cpp and test/any/add.cpp, indicating that it is being tested.

  • test/any/managed.cpp: assert(mkn::gpu::Pointer{mem.data()}.is_managed_ptr());
  • test/any/add.cpp: assert(mkn::gpu::Pointer{b.data()}.is_host_ptr());
  • test/any/add.cpp: assert(mkn::gpu::Pointer{devA.p}.is_device_ptr());
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `Pointer` struct to ensure it is tested.

# Test: Search for the `Pointer` struct usage. Expect: Verify if the struct is tested.
rg --type cpp -A 5 $'Pointer'

Length of output: 3493


176-176: Inline destroy function.

The destroy function is now inline, which may optimize its usage. Ensure that this change is necessary and beneficial for performance.

Verification successful

Inlining of the destroy function may not be necessary.

The destroy function is primarily used in cleanup processes, such as destructors and deallocation functions, which are not typically performance-critical sections. Therefore, inlining this function may not provide significant performance benefits.

  • inc/mkn/gpu/device.hpp: Used in destructors and deallocation functions.
  • inc/mkn/gpu/alloc.hpp: Used in deallocation functions.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `destroy` function to ensure inlining is beneficial.

# Test: Search for the `destroy` function usage. Expect: Verify if inlining is beneficial.
rg --type cpp -A 5 $'destroy'

Length of output: 3666


Script:

#!/bin/bash
# Description: Verify the context of the `destroy` function usage to determine if inlining is beneficial.

# Test: Search for the `destroy` function usage within loops or performance-critical sections.
rg --type cpp -A 10 -B 10 $'destroy('

Length of output: 105


Script:

#!/bin/bash
# Description: Verify the context of the `destroy` function usage to determine if inlining is beneficial.

# Test: Search for the `destroy` function usage within loops or performance-critical sections.
rg --type cpp -A 10 -B 10 'destroy\('

Length of output: 7652


193-197: Ensure copy_on_device function is tested.

The copy_on_device function is added to facilitate copying data on the device. Ensure that this function is thoroughly tested.


82-84: Ensure setLimitMallocHeapSize function is tested.

The function sets the limit on the heap size for dynamic memory allocation. Ensure that this function is thoroughly tested.

inc/mkn/gpu/cuda.hpp (2)

255-256: Inline sync function.

The sync function is now inline, which may optimize its usage. Ensure that this change is necessary and beneficial for performance.


210-214: Ensure copy_on_device function is tested.

The copy_on_device function is added to facilitate copying data on the device. Ensure that this function is thoroughly tested.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 329594d and 8933c64.

Files selected for processing (5)
  • inc/mkn/gpu/cpu.hpp (9 hunks)
  • inc/mkn/gpu/cuda.hpp (12 hunks)
  • inc/mkn/gpu/multi_launch.hpp (1 hunks)
  • inc/mkn/gpu/rocm.hpp (10 hunks)
  • test/any/async_streaming.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/multi_launch.hpp
  • test/any/async_streaming.cpp
Additional comments not posted (20)
inc/mkn/gpu/rocm.hpp (9)

45-53: LGTM!

The gpuAssert function has been made inline to potentially optimize its usage.


82-84: LGTM!

The setLimitMallocHeapSize function correctly sets the heap size limit for dynamic memory allocation using the hipDeviceSetLimit API.


86-86: LGTM!

The setDevice function correctly sets the active GPU device using the hipSetDevice API.


88-93: LGTM!

The supportsCooperativeLaunch function correctly checks if the selected device supports cooperative launches using the hipDeviceGetAttribute API.


Line range hint 94-106:
LGTM!

The Stream structure correctly manages HIP streams, including creation, destruction, and synchronization, ensuring proper resource management.


193-197: LGTM!

The copy_on_device function correctly performs device-to-device memory copying using the hipMemcpy API.


238-239: LGTM!

The sync function correctly synchronizes the device or a specific stream using the hipDeviceSynchronize and hipStreamSynchronize APIs.


244-265: LGTM! But document the implications of the changes.

The launch function has been modified to support cooperative kernel launches and conditional synchronization. Ensure that the implications of these changes are well-documented.


336-341: LGTM!

The grid_sync function correctly synchronizes the entire grid using the cooperative groups API.

inc/mkn/gpu/cuda.hpp (11)

Line range hint 43-54:
LGTM!

The gpuAssert function has been made inline to potentially optimize its usage.


60-68: LGTM!

The idx function correctly calculates the thread index in a CUDA kernel, simplifying the process of determining a unique index for each thread.


70-73: LGTM!

The block_idx_x function correctly returns the block index in the x-dimension, useful for grid and block indexing.


87-89: LGTM!

The setLimitMallocHeapSize function correctly sets the heap size limit for dynamic memory allocation using the cudaDeviceSetLimit API.


91-91: LGTM!

The setDevice function correctly sets the active GPU device using the cudaSetDevice API.


93-97: LGTM!

The supportsCooperativeLaunch function correctly checks if the selected device supports cooperative launches using the cudaDeviceGetAttribute API.


Line range hint 98-110:
LGTM!

The Stream structure correctly manages CUDA streams, including creation, destruction, and synchronization, ensuring proper resource management.


210-214: LGTM!

The copy_on_device function correctly performs device-to-device memory copying using the cudaMemcpy API.


255-256: LGTM!

The sync function correctly synchronizes the device or a specific stream using the cudaDeviceSynchronize and cudaStreamSynchronize APIs.


261-282: LGTM! But document the implications of the changes.

The launch function has been modified to support cooperative kernel launches and conditional synchronization. Ensure that the implications of these changes are well-documented.


359-364: LGTM!

The grid_sync function correctly synchronizes the entire grid using the cooperative groups API.

Comment on lines 107 to 123
struct StreamEvent {
StreamEvent(Stream& stream_) : stream{stream_} { reset(); }
~StreamEvent() { /*MKN_GPU_ASSERT(result = hipEventDestroy(event));*/
}

auto& operator()() { return event; };
void record() { MKN_GPU_ASSERT(result = hipEventRecord(event, stream())); }
bool finished() const { return hipEventQuery(event) == hipSuccess; }
void reset() {
if (event) MKN_GPU_ASSERT(result = hipEventDestroy(event));
MKN_GPU_ASSERT(result = hipEventCreate(&event));
}

Stream& stream;
hipError_t result;
hipEvent_t event = nullptr;
};

Enable the destructor code for StreamEvent.

The commented-out destructor code should be enabled to ensure proper resource management and avoid resource leaks.

-  /*MKN_GPU_ASSERT(result = hipEventDestroy(event));*/
+  MKN_GPU_ASSERT(result = hipEventDestroy(event));
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
struct StreamEvent {
StreamEvent(Stream& stream_) : stream{stream_} { reset(); }
~StreamEvent() { /*MKN_GPU_ASSERT(result = hipEventDestroy(event));*/
}
auto& operator()() { return event; };
void record() { MKN_GPU_ASSERT(result = hipEventRecord(event, stream())); }
bool finished() const { return hipEventQuery(event) == hipSuccess; }
void reset() {
if (event) MKN_GPU_ASSERT(result = hipEventDestroy(event));
MKN_GPU_ASSERT(result = hipEventCreate(&event));
}
Stream& stream;
hipError_t result;
hipEvent_t event = nullptr;
};
struct StreamEvent {
StreamEvent(Stream& stream_) : stream{stream_} { reset(); }
~StreamEvent() { MKN_GPU_ASSERT(result = hipEventDestroy(event));
}
auto& operator()() { return event; };
void record() { MKN_GPU_ASSERT(result = hipEventRecord(event, stream())); }
bool finished() const { return hipEventQuery(event) == hipSuccess; }
void reset() {
if (event) MKN_GPU_ASSERT(result = hipEventDestroy(event));
MKN_GPU_ASSERT(result = hipEventCreate(&event));
}
Stream& stream;
hipError_t result;
hipEvent_t event = nullptr;
};

Comment on lines 111 to 120
struct StreamEvent {
StreamEvent(Stream& stream_) : stream{stream_} { reset(); }
~StreamEvent() {
if (start) {
MKN_GPU_ASSERT(result = cudaEventDestroy(start))
}
if (stop) {
MKN_GPU_ASSERT(result = cudaEventDestroy(stop))
}
}

Fix the destructor code for StreamEvent.

The destructor code has a syntax error and should be corrected to ensure proper resource management.

-    MKN_GPU_ASSERT(result = cudaEventDestroy(start))
+    MKN_GPU_ASSERT(result = cudaEventDestroy(start));
-    MKN_GPU_ASSERT(result = cudaEventDestroy(stop))
+    MKN_GPU_ASSERT(result = cudaEventDestroy(stop));
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
struct StreamEvent {
StreamEvent(Stream& stream_) : stream{stream_} { reset(); }
~StreamEvent() {
if (start) {
MKN_GPU_ASSERT(result = cudaEventDestroy(start))
}
if (stop) {
MKN_GPU_ASSERT(result = cudaEventDestroy(stop))
}
}
struct StreamEvent {
StreamEvent(Stream& stream_) : stream{stream_} { reset(); }
~StreamEvent() {
if (start) {
MKN_GPU_ASSERT(result = cudaEventDestroy(start));
}
if (stop) {
MKN_GPU_ASSERT(result = cudaEventDestroy(stop));
}
}

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (6)
inc/mkn/gpu/rocm.hpp (6)

88-93: Ensure testing of supportsCooperativeLaunch function.

The function supportsCooperativeLaunch appears to lack active test cases. The only reference found is commented out in test/any/coop.cpp. Please ensure that this function is properly tested.

  • test/any/coop.cpp: Commented-out reference to supportsCooperativeLaunch.
Analysis chain

Ensure testing of supportsCooperativeLaunch function.

Ensure that the supportsCooperativeLaunch function is tested to verify its correctness.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the testing of the `supportsCooperativeLaunch` function.

# Test: Search for test cases involving the `supportsCooperativeLaunch` function. Expect: Test cases should exist.
rg --type cpp -A 5 $'supportsCooperativeLaunch'

Length of output: 1226


40-45: Review the necessity of hip/hip_cooperative_groups.h include

The include for hip/hip_runtime.h is necessary as its constructs are used in the file. However, there is no evidence of usage for hip/hip_cooperative_groups.h. Please review and remove the unnecessary include if it is not required.

  • hip/hip_runtime.h: Necessary
  • hip/hip_cooperative_groups.h: Unnecessary
Analysis chain

Verify the necessity of new includes.

Ensure that the new includes for hip/hip_runtime.h and hip/hip_cooperative_groups.h are necessary for the added functionality.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the necessity of the new includes.

# Test: Search for the usage of the included headers. Expect: The headers should be used in the file.
rg --type cpp -A 5 $'#include "hip/hip_runtime.h"'
rg --type cpp -A 5 $'#include "hip/hip_cooperative_groups.h"'

Length of output: 996


Script:

#!/bin/bash
# Description: Verify the usage of specific constructs from the included headers.

# Test: Search for specific constructs from hip/hip_runtime.h and hip/hip_cooperative_groups.h in the file inc/mkn/gpu/rocm.hpp.

# Searching for constructs from hip/hip_runtime.h
rg --type cpp 'hipLaunchKernelGGL|hipMalloc|hipFree|hipMemcpy|hipStreamCreate|hipStreamDestroy' inc/mkn/gpu/rocm.hpp

# Searching for constructs from hip/hip_cooperative_groups.h
rg --type cpp 'cooperative_groups::' inc/mkn/gpu/rocm.hpp

Length of output: 1154


278-299: Ensure testing of cooperative kernel launch enhancements.

The enhancements to the launch function, specifically the support for cooperative kernel launches, are not currently covered by test cases. Please add tests to verify this functionality.

  • File: inc/mkn/gpu/rocm.hpp
  • Lines: 278-299
Analysis chain

Ensure testing of launch function enhancements.

Ensure that the enhancements to the launch function are tested to verify their correctness.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the testing of the `launch` function enhancements.

# Test: Search for test cases involving the `launch` function enhancements. Expect: Test cases should exist.
rg --type cpp -A 5 $'launch'

Length of output: 8042


Script:

#!/bin/bash
# Description: Verify the testing of the cooperative kernel launch enhancements in the `launch` function.

# Test: Search for test cases involving cooperative kernel launches. Expect: Test cases should exist.
rg --type cpp -A 5 'hipLaunchCooperativeKernel'

Length of output: 394


227-231: Ensure testing of copy_on_device function.

No test cases were found for the copy_on_device function. Please ensure that this function is explicitly tested to verify its correctness.

Analysis chain

Ensure testing of copy_on_device function.

Ensure that the copy_on_device function is tested to verify its correctness.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the testing of the `copy_on_device` function.

# Test: Search for test cases involving the `copy_on_device` function. Expect: Test cases should exist.
rg --type cpp -A 5 $'copy_on_device'

Length of output: 1212


Script:

#!/bin/bash
# Description: Verify the testing of the `copy_on_device` function by checking for tests involving `alloc.hpp` or functions that call `copy_on_device`.

# Test: Search for test cases involving the `alloc.hpp` file or functions that use `copy_on_device`.
rg --type cpp -A 5 'alloc.hpp'

Length of output: 1093


Script:

#!/bin/bash
# Description: Verify the testing of the `copy_on_device` function by checking for integration tests involving `alloc.hpp` or `copy_on_device`.

# Test: Search for test files or directories that might contain integration tests involving `alloc.hpp` or `copy_on_device`.
rg --type cpp -l 'alloc.hpp' | xargs -I {} rg --type cpp -l 'copy_on_device' {}

Length of output: 143


82-84: Ensure testing of setLimitMallocHeapSize function.

The setLimitMallocHeapSize function is not currently tested. Please add test cases to verify its correctness.

  • File: inc/mkn/gpu/rocm.hpp
  • Lines: 82-84
Analysis chain

Ensure testing of setLimitMallocHeapSize function.

Ensure that the setLimitMallocHeapSize function is tested to verify its correctness.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the testing of the `setLimitMallocHeapSize` function.

# Test: Search for test cases involving the `setLimitMallocHeapSize` function. Expect: Test cases should exist.
rg --type cpp -A 5 $'setLimitMallocHeapSize'

Length of output: 1049


Script:

#!/bin/bash
# Description: Search for potential test files or directories that might contain tests for the `setLimitMallocHeapSize` function.

# Find all files and directories with names indicating they might contain tests.
fd test

# Search for references to `setLimitMallocHeapSize` in likely test files.
rg --type cpp -A 5 $'setLimitMallocHeapSize' tests/

Length of output: 113


Script:

#!/bin/bash
# Description: Search for C++ test files within the `test/` directory and check for references to `setLimitMallocHeapSize`.

# Find all C++ test files in the `test/` directory.
fd . --type f --extension cpp test/

# Search for references to `setLimitMallocHeapSize` in the found C++ test files.
rg --type cpp -A 5 $'setLimitMallocHeapSize' test/

Length of output: 428


86-86: Ensure testing of setDevice function.

No direct test cases for the setDevice function were found in the repository. It is important to add tests to verify the correctness of this function.

  • File: inc/mkn/gpu/rocm.hpp
  • Function: setDevice
Analysis chain

Ensure testing of setDevice function.

Ensure that the setDevice function is tested to verify its correctness.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the testing of the `setDevice` function.

# Test: Search for test cases involving the `setDevice` function. Expect: Test cases should exist.
rg --type cpp -A 5 $'setDevice'

Length of output: 1481


Script:

#!/bin/bash
# Description: Verify the testing of the `setDevice` function.

# Search for test files or directories
fd test

# Search for test cases involving the `setDevice` function in test files
rg --type cpp -A 5 $'setDevice' $(fd test)

Length of output: 63


Script:

#!/bin/bash
# Description: Verify the testing of the `setDevice` function.

# Search for test cases involving the `setDevice` function in all files
rg --type cpp -A 5 $'setDevice'

Length of output: 1481

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8933c64 and a8f3505.

Files selected for processing (9)
  • inc/mkn/gpu/cli.hpp (1 hunks)
  • inc/mkn/gpu/cpu.hpp (9 hunks)
  • inc/mkn/gpu/cuda.hpp (11 hunks)
  • inc/mkn/gpu/multi_launch.hpp (1 hunks)
  • inc/mkn/gpu/rocm.hpp (10 hunks)
  • mkn.yaml (2 hunks)
  • res/mkn/clang_cuda.yaml (2 hunks)
  • res/mkn/hipcc.yaml (2 hunks)
  • test/any/async_streaming.cpp (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/cuda.hpp
  • inc/mkn/gpu/multi_launch.hpp
  • res/mkn/clang_cuda.yaml
  • test/any/async_streaming.cpp
Additional comments not posted (9)
res/mkn/hipcc.yaml (2)

3-5: Activation of local settings.

The local settings have been activated by removing the comment markers. Ensure that these paths are correct and accessible in the build environment.


21-21: Compiler updated to C++20.

The compiler version has been updated from C++17 to C++20. This change should be verified for compatibility with all dependencies and the build environment.

mkn.yaml (2)

13-15: Deactivation of ROCM arguments and tests.

The arguments and test paths for the rocm profile have been commented out, indicating a potential deactivation. Ensure this change is intentional and verify its impact on the build and testing processes.


19-19: Deactivation of CUDA argument.

The argument for the cuda profile has been commented out, while the test paths remain active. Verify the necessity and impact of this change on the CUDA testing capabilities.

inc/mkn/gpu/rocm.hpp (5)

159-183: Ensure testing of Pointer structure.

Ensure that the Pointer structure is tested to verify its correctness.

Verification successful

The Pointer structure is adequately tested.

The search results show that the Pointer structure is referenced and tested in multiple test files, such as test/any/managed.cpp and test/any/add.cpp. Assertions in these files verify the functionality of the Pointer structure.

  • test/any/managed.cpp: assert(mkn::gpu::Pointer{mem.data()}.is_managed_ptr());
  • test/any/add.cpp: assert(mkn::gpu::Pointer{b.data()}.is_host_ptr());
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the testing of the `Pointer` structure.

# Test: Search for test cases involving the `Pointer` structure. Expect: Test cases should exist.
rg --type cpp -A 5 $'Pointer'

Length of output: 3493


45-53: Verify the necessity and impact of making gpuAssert inline.

Ensure that making the gpuAssert function inline is beneficial and does not introduce any issues.


57-68: Ensure comprehensive testing of idx function.

The idx function is a critical utility for indexing in HIP kernels. Ensure comprehensive testing across various grid and block configurations to validate its accuracy and reliability.


272-273: Ensure testing of sync function overloads.

Ensure that the sync function overloads are tested to verify their correctness.


381-386: Ensure testing of grid_sync function.

Ensure that the grid_sync function is tested to verify its correctness.

Comment on lines +109 to +157
struct StreamEvent {
StreamEvent(Stream& stream_) : stream{stream_} { reset(); }
~StreamEvent() { clear(); }

StreamEvent(StreamEvent&& that) : stream{that.stream}, start{that.start}, stop{that.stop} {
that.start = nullptr;
that.stop = nullptr;
}

StreamEvent(StreamEvent const&) = delete;
StreamEvent& operator=(StreamEvent const&) = delete;

auto& operator()() { return stop; };
auto& record() {
if (stage == 0) {
MKN_GPU_ASSERT(result = hipEventRecord(start, stream()));
++stage;
} else {
MKN_GPU_ASSERT(result = hipEventRecord(stop, stream()));
++stage;
}
return *this;
}
auto& wait() {
if (stage == 0) {
MKN_GPU_ASSERT(result = hipStreamWaitEvent(stream(), start));
} else {
MKN_GPU_ASSERT(result = hipStreamWaitEvent(stream(), stop));
}
return *this;
}

void clear() {
if (start) MKN_GPU_ASSERT(result = hipEventDestroy(start));
if (stop) MKN_GPU_ASSERT(result = hipEventDestroy(stop));
}
bool finished() const { return stage == 2 and hipEventQuery(stop) == hipSuccess; }
void reset() {
clear();
MKN_GPU_ASSERT(result = hipEventCreate(&start));
MKN_GPU_ASSERT(result = hipEventCreate(&stop));
stage = 0;
}

Stream& stream;
hipError_t result;
hipEvent_t start = nullptr, stop = nullptr;
std::uint16_t stage = 0;
};

Enable the destructor code for StreamEvent.

The commented-out destructor code should be enabled to ensure proper resource management and avoid resource leaks.

-  /*MKN_GPU_ASSERT(result = hipEventDestroy(event));*/
+  MKN_GPU_ASSERT(result = hipEventDestroy(event));

Ensure testing of StreamEvent structure.

No test cases for the StreamEvent structure were found in the codebase. Ensure that the StreamEvent structure is tested to verify its correctness.

  • Add test cases in the test/ directory to cover the StreamEvent structure.
  • Verify the proper functioning of the destructor and other methods.
Analysis chain

Enable the destructor code for StreamEvent.

The commented-out destructor code should be enabled to ensure proper resource management and avoid resource leaks.

-  /*MKN_GPU_ASSERT(result = hipEventDestroy(event));*/
+  MKN_GPU_ASSERT(result = hipEventDestroy(event));

Ensure testing of StreamEvent structure.

Ensure that the StreamEvent structure is tested to verify its correctness.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the testing of the `StreamEvent` structure.

# Test: Search for test cases involving the `StreamEvent` structure. Expect: Test cases should exist.
rg --type cpp -A 5 $'StreamEvent'

Length of output: 2477


Script:

#!/bin/bash
# Description: Search for test cases involving the `StreamEvent` structure.

# Find all potential test files
test_files=$(fd --type f --extension cpp --exec grep -l 'StreamEvent')

# Print the test files found
echo "Test files potentially using StreamEvent:"
echo "$test_files"

# Search within these files for instantiation and usage of StreamEvent
for file in $test_files; do
  echo "Searching in $file"
  rg 'StreamEvent' "$file"
done

Length of output: 175


Script:

#!/bin/bash
# Description: Search for test cases involving the `StreamEvent` structure across various file extensions and directories.

# Define common test file extensions
extensions=("cpp" "hpp" "h" "cxx" "cc")

# Search for files containing 'StreamEvent' across the defined extensions
for ext in "${extensions[@]}"; do
  fd --type f --extension "$ext" --exec grep -l 'StreamEvent'
done

# Search for common test directories
test_dirs=$(fd --type d --glob '*test*')

# Print the test directories found
echo "Test directories potentially containing tests:"
echo "$test_dirs"

# Search within these directories for files containing 'StreamEvent'
for dir in $test_dirs; do
  echo "Searching in directory $dir"
  rg 'StreamEvent' "$dir"
done

Length of output: 867

@mkn mkn deleted a comment from coderabbitai bot Aug 3, 2024
@PhilipDeegan PhilipDeegan merged commit a77be09 into master Aug 3, 2024
1 check passed
@PhilipDeegan PhilipDeegan deleted the doh branch August 3, 2024 13:49

Successfully merging this pull request may close these issues.

try guess cuda or hip
1 participant