From 573490233850e0eac5cec781b4be361bb99e8eeb Mon Sep 17 00:00:00 2001 From: randyh62 Date: Wed, 15 Jan 2025 15:14:36 -0800 Subject: [PATCH 01/52] start Intro effort --- docs/understand/programming_model.rst | 46 ++++++++++++++++++++++++++- 1 file changed, 45 insertions(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 6c7015996f..ac65868f81 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -7,7 +7,7 @@ .. _programming_model: ******************************************************************************* -HIP programming model +Introduction to HIP programming model ******************************************************************************* The HIP programming model makes it easy to map data-parallel C/C++ algorithms to @@ -21,6 +21,50 @@ A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) programming in general. +Getting into Hardware: CPU vs GPU +================================= + +developed to fit the capabilities of GPU hardware + +In theory, could be developed for the same as CPUs, but to get the best performance out of the hardware, it differs from (even multi-threaded) CPU applications + +Different design goals in CPU and GPU implementations: +CPUs have been designed to quickly execute a single thread, i.e. increase the amount of serial instructions that can be executed (this includes fetching data, reducing pipeline stalls where the ALU has to wait for previous instructions to finish, etc.) + --> low throughput, but also low latency (goal is to finish a single series of instructions (i.e. a thread) as quickly as possible) +GPUs have been designed to execute many independent (but "similar") threads as possible in parallel ("similar" in this case means, in the ideal case, the same instruction, but on different data) + --> high throughput, but also high latency (goal is to make progress on many threads in parallel, not finish a single one fast) + +Hardware differences CPU vs GPU +CPU: + one register file per thread (on "classical"/modern/normal CPUs you have at most 2 register files per core, that is called hyper-/multi-threading, depending on vendor) + one ALU executing the thread + - designed to quickly execute instructions of the same thread + - highly pipelined (might not be exclusive to CPUs) + - complex branch prediction + Comparably huge L1/L2 cache per core, shared by fewer threads (as already pointed out, maximum of 2 when hyperthreading is available) + Disadvantage: context switch (that is switching execution from one thread to another) takes a considerable amount of time (ALU pipeline needs to be emptied, register file has to be written to memory to free it for another thread) + +GPU: + register file is shared among threads (amount of threads that can be executed in parallel depends, among other factors, on the amount of registers needed per thread. Here would be a good reference to a better explanation of occupancy) + Many ALUs (technically as many as threads in a warp) to execute a whole warp at once. however, they can't execute arbitrary threads: the threads have to execute the same instruction. This is called SIMT (you execute a single instruction, but for many different threads) + - ALU is shared between warps! Not only threads of the same warp. 
If a warp can not issue its next instruction for some reason (waiting for data, branching, long-latency instruction), the core/compute unit issues an instruction from another warp. This improves utilization of the ALU without needing branch prediction or any other fancy features (but a huge register file) + - this collection of ALUs is called SIMD. There exists an equivalent on CPUs: SIMDs are an extension to the architecture, that allows a *S*ingle CPU *I*nstruction to operate on *M*ultiple *D*ata. Difference is, that CPU SIMDs are waay smaller than GPU SIMDs. (I think the newest SIMD extension for CPUs is AVX-512, which can operate on 512 bits at the same time. When considering 32-bit floating point, this is 512/32 = 16 elements at once. GPU "cores" (Compute Units) can operate on at least 64 32-bit floating point elements at once. Depending on architecture maybe even more [Don't have a source at hand for backing that up]) + - obviously designed to execute many threads at once + - not sure about pipeline-length + - no/bad branch prediction + - if the threads don't follow the same branch, the ALU is still occupied for a full warp, but the result for those threads is masked out -> wasted ALU cycles + L1 cache is shared between all threads residing on a "core"/compute unit. Differs between architectures: L2 Cache is shared between all cores/compute units. + - cache on GPUs (used to be, changed slightly with newer architectures) is there to coalesce accesses, so that if a "neighbouring" thread accesses the same data, it can be fetched from L1, not necessarily for holding values to reuse the data later. + + Context switching is easy! All threads that run on a core/compute unit have their registers on the compute unit all the time, so they don't need to be stored to global memory, and each cycle one instruction from any warp that resides on the compute unit can be issued + --> All of this should hopefully explain, why GPU threads are tightly coupled, why many of them are needed to get peak performance, and why they are considered "light-weight" (can be easily switched between) + --> This should also explain, why we have warps in the HIP programming model. While in theory warps with completelly different threads, that don't follow the same execution path are possible, this highlights why it's not a good idea. + +Points I didn't yet fully flesh out: On GPUs the threads in a warp can easily cooperate (warp-level intrinsics, run on the same "core"/compute unit). On CPUs communication between threads is a bit more costly (but not sure if it's worth mentioning that) + +Note: All of this up until now was only explaining everything on a warp (GPU)/thread (CPU) level. This does not yet explain, why we need thread-blocks. +Notes I haven't yet fleshed out for that: Blocks are assigned to a specific compute unit. Threads in a block usually work on a similar task. 
Being executed on the same compute unit gives more opportunities for cooperation (sharing cache to reduce accesses to global memory, using shared memory/LDS to share intermediate results with low latency, and other ways to cooperate [LDS-atomics, warp-level intrinsics, ...]) + RDNA & CDNA architecture summary ================================ From 2ac0f3a14815063b007e8537ffe72d042b63fbfe Mon Sep 17 00:00:00 2001 From: randyh62 Date: Tue, 28 Jan 2025 13:57:10 -0800 Subject: [PATCH 02/52] programming model update --- docs/understand/programming_model.rst | 178 ++++++++++++++++---------- 1 file changed, 110 insertions(+), 68 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index ac65868f81..e1b82ebfa8 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -12,73 +12,81 @@ Introduction to HIP programming model The HIP programming model makes it easy to map data-parallel C/C++ algorithms to massively parallel, wide single instruction, multiple data (SIMD) architectures, -such as GPUs. +such as GPUs. HIP supports many imperative languages, such as Python via PyHIP, +but this document focuses on the original C/C++ API of HIP. -While the model may be expressed in most imperative languages, (for example -Python via PyHIP) this document will focus on the original C/C++ API of HIP. +.. RJH>> If HIP programming uses SIMT for thread instructions, but the hardware implementation is SIMD for the execution of threads, then maybe we focus on SIMT as the top-level concept? -A basic understanding of the underlying device architecture helps you +While GPUs may be capable of running applications written for CPUs if properly ported +and compiled, it would not be an efficient use of GPU resources. GPUs are different +from CPUs in fundamental ways, and should be used accordingly to achieve optimum +performance. A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) -programming in general. +programming in general. The topics that follow introduce you to the key concepts of +GPU-based programming, and the HIP programming model. Getting into Hardware: CPU vs GPU ================================= -developed to fit the capabilities of GPU hardware - -In theory, could be developed for the same as CPUs, but to get the best performance out of the hardware, it differs from (even multi-threaded) CPU applications - -Different design goals in CPU and GPU implementations: -CPUs have been designed to quickly execute a single thread, i.e. increase the amount of serial instructions that can be executed (this includes fetching data, reducing pipeline stalls where the ALU has to wait for previous instructions to finish, etc.) - --> low throughput, but also low latency (goal is to finish a single series of instructions (i.e. 
a thread) as quickly as possible) -GPUs have been designed to execute many independent (but "similar") threads as possible in parallel ("similar" in this case means, in the ideal case, the same instruction, but on different data) - --> high throughput, but also high latency (goal is to make progress on many threads in parallel, not finish a single one fast) - -Hardware differences CPU vs GPU -CPU: - one register file per thread (on "classical"/modern/normal CPUs you have at most 2 register files per core, that is called hyper-/multi-threading, depending on vendor) - one ALU executing the thread - - designed to quickly execute instructions of the same thread - - highly pipelined (might not be exclusive to CPUs) - - complex branch prediction - Comparably huge L1/L2 cache per core, shared by fewer threads (as already pointed out, maximum of 2 when hyperthreading is available) - Disadvantage: context switch (that is switching execution from one thread to another) takes a considerable amount of time (ALU pipeline needs to be emptied, register file has to be written to memory to free it for another thread) - -GPU: - register file is shared among threads (amount of threads that can be executed in parallel depends, among other factors, on the amount of registers needed per thread. Here would be a good reference to a better explanation of occupancy) - Many ALUs (technically as many as threads in a warp) to execute a whole warp at once. however, they can't execute arbitrary threads: the threads have to execute the same instruction. This is called SIMT (you execute a single instruction, but for many different threads) - - ALU is shared between warps! Not only threads of the same warp. If a warp can not issue its next instruction for some reason (waiting for data, branching, long-latency instruction), the core/compute unit issues an instruction from another warp. This improves utilization of the ALU without needing branch prediction or any other fancy features (but a huge register file) - - this collection of ALUs is called SIMD. There exists an equivalent on CPUs: SIMDs are an extension to the architecture, that allows a *S*ingle CPU *I*nstruction to operate on *M*ultiple *D*ata. Difference is, that CPU SIMDs are waay smaller than GPU SIMDs. (I think the newest SIMD extension for CPUs is AVX-512, which can operate on 512 bits at the same time. When considering 32-bit floating point, this is 512/32 = 16 elements at once. GPU "cores" (Compute Units) can operate on at least 64 32-bit floating point elements at once. Depending on architecture maybe even more [Don't have a source at hand for backing that up]) - - obviously designed to execute many threads at once - - not sure about pipeline-length - - no/bad branch prediction - - if the threads don't follow the same branch, the ALU is still occupied for a full warp, but the result for those threads is masked out -> wasted ALU cycles - L1 cache is shared between all threads residing on a "core"/compute unit. Differs between architectures: L2 Cache is shared between all cores/compute units. - - cache on GPUs (used to be, changed slightly with newer architectures) is there to coalesce accesses, so that if a "neighbouring" thread accesses the same data, it can be fetched from L1, not necessarily for holding values to reuse the data later. +CPUs and GPUs have been designed for different purposes. 
CPUs have been designed +to quickly execute a single thread, decreasing the time it takes for a single +operation, increasing the amount of serial instructions that can be executed. +This includes fetching data, and reducing pipeline stalls where the ALU has to +wait for previous instructions to finish. CPUs provide low latency processing for +serial instructions, but also lower throughput overall. Latency is the speed of +an operation, while throughput is the number of operations completed in a unit of +time. On CPUs the goal is to quickly process operations. + +On the other hand, GPUs have been designed to execute many similar commands, or threads, in parallel, +achieving high throughput, but also higher latency. For the GPU, the objective is +to process as many operations in parallel, rather than to finish a single instruction +quickly. GPUs in general are made up of basic building blocks called compute units (CUs), +that execute the threads of a kernel. These CUs provide the necessary resources +for the threads: the Arithmetic Logical Units (ALUs), register files, caches and +shared memory for efficient communication between the threads. + +The following defines a few hardware differences between CPUs and GPUs: + +.. RJH>> I think the following section does a good job of highlighting the differences in hardware that result in programming changes needed for GPUs. I think this should be our focus for the Programming model content. + +* CPU: + + - One register file per thread. On modern CPUs you have at most 2 register files + per core, called hyperthreading or multithreading. + + .. RJH>> Are these the same? I found a link discussing this: https://www.baeldung.com/cs/multithreading-vs-hyperthreading#:~:text=Hyperthreading%20breaks%20a%20single%20physical,distinction%20between%20the%20two%20techniques. + + - One ALU executing the thread. + + - Designed to quickly execute instructions of the same thread. + - Highly pipelined. + - Complex branch prediction. + + - Large L1/L2 cache per core, shared by fewer threads (maximum of 2 when hyperthreading is available). + - A disadvantage is switching execution from one thread to another (or context switching) takes a considerable amount of time: the ALU pipeline needs to be emptied, the register file has to be written to memory to free the register for another thread. - Context switching is easy! All threads that run on a core/compute unit have their registers on the compute unit all the time, so they don't need to be stored to global memory, and each cycle one instruction from any warp that resides on the compute unit can be issued - --> All of this should hopefully explain, why GPU threads are tightly coupled, why many of them are needed to get peak performance, and why they are considered "light-weight" (can be easily switched between) - --> This should also explain, why we have warps in the HIP programming model. While in theory warps with completelly different threads, that don't follow the same execution path are possible, this highlights why it's not a good idea. +* GPU: + + - Register files are shared among threads. The number of threads that can be run in parallel depends on the registers needed per thread as described in :ref:`hardware_implementation`. + - Multiple ALUs execute a collection of threads having the same operations, also known as a wavefront or warp. This is called single-instruction, multiple threads (SIMT) operation as described in :ref:`programming_model_simt`. 
+ + - ALUs are shared between the threads of a wavefront, and when the thread is idle due to data transfer or instruction branching, the ALU is shared with other wavefronts for better resource utilization. + - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture, that allows a `single instruction` to concurrently operate on `multiple data` inputs. CPU SIMDs are smaller than GPU SIMDs, which enables greater throughput on the GPU. + - For branching threads where conditional instructions lead to thread divergence, ALUs still processes the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles, and should be a consideration in your programming. Keep instructions consistent, and leave conditionals out of threads. + + .. RJH>> It feels like the first of these sub-bullets, and the last of them above, have different messages: ALUs is shared outside of wavefronts, or the ALU processes the thread in any case, but the results are masked out? -Points I didn't yet fully flesh out: On GPUs the threads in a warp can easily cooperate (warp-level intrinsics, run on the same "core"/compute unit). On CPUs communication between threads is a bit more costly (but not sure if it's worth mentioning that) + - The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit all the time, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued. -Note: All of this up until now was only explaining everything on a warp (GPU)/thread (CPU) level. This does not yet explain, why we need thread-blocks. -Notes I haven't yet fleshed out for that: Blocks are assigned to a specific compute unit. Threads in a block usually work on a similar task. Being executed on the same compute unit gives more opportunities for cooperation (sharing cache to reduce accesses to global memory, using shared memory/LDS to share intermediate results with low latency, and other ways to cooperate [LDS-atomics, warp-level intrinsics, ...]) - RDNA & CDNA architecture summary -================================ - -GPUs in general are made up of basic building blocks called compute units (CUs), -that execute the threads of a kernel. These CUs provide the necessary resources -for the threads: the Arithmetic Logical Units (ALUs), register files, caches and -shared memory for efficient communication between the threads. +-------------------------------- -This design allows for efficient execution of kernels while also being able to -scale from small GPUs embedded in APUs with few CUs up to GPUs designed for data +AMD GPU designs enable efficient execution of kernels while scaling from small +GPUs with a few CUs, embedded in APUs, to large GPUs designed for data centers with hundreds of CUs. Figure :ref:`rdna3_cu` and :ref:`cdna3_cu` show -examples of such compute units. +examples of such compute units. For additional architecture details, see :ref:`hardware_implementation`. -For architecture details, check :ref:`hardware_implementation`. +.. RJH>> I believe RDNA is on Radeon Graphic cards, and CDNA is on instinct data center accelerators. Do we want to add this distinction here? .. _rdna3_cu: @@ -105,8 +113,8 @@ For architecture details, check :ref:`hardware_implementation`. Heterogeneous Programming ========================= -The HIP programming model assumes two execution contexts. 
One is referred to as -*host* while compute kernels execute on a *device*. These contexts have +The HIP programming model assumes two execution contexts. The application starts on the CPU +*host* while compute kernels are launched on the GPU *device*. These contexts have different capabilities, therefor slightly different rules apply. The *host* execution is defined by the C++ abstract machine, while *device* execution follows the :ref:`SIMT model` of HIP. These execution contexts in @@ -125,23 +133,57 @@ a few key differences between the two: memory, the performance benefits of the segmented memory subsystem are supported by the inability of asynchronous access from the host. -* Not all C++ language features map cleanly to typical device architectures, - some are very expensive (meaning slow) to implement on GPU devices, therefor - they are forbidden in device contexts to avoid users tapping into features - that unexpectedly decimate their program's performance. Offload devices targeted +.. RJH>> The prior sentence is not clear to me. The performance benefits of the shared memory on the GPU are based on the CPUs inability to access it? + +* Not all C++ language features map cleanly to typical GPU device architectures. + Some C++ features, such as XXX, are very expensive (meaning slow) to implement on GPU devices, therefor + they are forbidden in device contexts to avoid using features + that unexpectedly decimate the program's performance. Offload devices targeted by HIP aren't general purpose devices, at least not in the sense that a CPU is. HIP focuses on data parallel computations and as such caters to throughput optimized architectures, such as GPUs or accelerators derived from GPU architectures. +.. RJH>> I think the above could list some example features that are too expensive for GPUs, and clarify whether it is HIP or the GPU hardware that is forbidding these features? + * Asynchrony is at the forefront of the HIP API. Computations launched on the device execute asynchronously with respect to the host, and it is the user's responsibility to synchronize their data dispatch/fetch with computations on the device. .. note:: - HIP does perform implicit synchronization on occasions, more advanced than other - APIs such as OpenCL or SYCL, in which the responsibility of synchronization mostly - depends on the user. + HIP does perform implicit synchronization on occasions, unlike other + APIs such as OpenCL or SYCL, where the responsibility of synchronization depends mostly on the user. + +Host programming +---------------- + +In heterogeneous programming, the CPU is available for processing operations but the host application has the additional task of managing data and computation exchanges between the CPU (host) and GPU (device). Here is a typical sequence of operations: + +1. Initialize the HIP runtime and select the GPU: As described in :ref:`initialization`, refers to identifying and selecting a target GPU, setting up a context to let the CPU interact with the GPU. +2. Memory Management: As discussed in :ref:`memory_management`, this includes allocating the required memory on the host and device, and the transfer of input data from the host to the device. Note that the data is transferred to the device, and passed as an input parameter for the kernel. +3. 
Configure and launch the kernel on the GPU: As described in :ref:`device_program`, define and load the kernel or kernels to be run, launch kernels using the triple chevron syntax or appropriate API call (e.g., hipLaunchKernelGGL), and pass parameters as needed. +4. Synchronization: As described in Asynchronous execution use streams and events to manage task dependencies, overlap computation with data transfers, and manage asynchronous processes to ensure proper sequencing of operations, waiting for events or streams to finish execution and transfer results from the GPU back to the host. +5. Error handling: As described in :ref:`error_handling`, you should catch and handle potential errors from API calls, kernel launches, or memory operations (e.g., using hipGetErrorString to retrieve error messages). +6. Cleanup and resource management: Validate results, clean up GPU contexts and resources, and free allocated memory on the host and devices. + +This structure allows for efficient use of GPU resources and facilitates the acceleration of compute-intensive tasks while keeping the host CPU available for other tasks. + +.. _device_program: + +Device programming +------------------ + +Launching the kernel in the host application starts a kernel program running on the GPU to perform parallel computations. Understanding how the kernel works and the processes involved is essential to writing efficient GPU applications. The general flow of the kernel program looks like this: + +1. Thread Grouping: As described in :ref:`SIMT model`, threads are organized into blocks, and blocks are organized into grids. +2. Indexing: The kernel computes the unique index for each thread to access the relevant data to be processed by the thread. +3. Data Fetch: Threads fetch input data from memory previously transferred from the host to the device. +4. Computation: Threads perform the required computations on the input data, and generate any needed output. +5. Synchronization: When needed, threads synchronize within their block to ensure correct results when working with shared memory. + +Kernel programs can be simple with single instructions deployed across multiple threads in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can become quite complex, managing hundreds or thousands of threads with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage asynchronous operations, using rich libraries of functions defined for operation on GPUs as described in `Kernel program <./kernel_program>`. + +.. RJH>> This "Kernel program" topic does not currently exist, though I think we should discuss whether it could be included here or as a separate topic. .. _programming_model_simt: @@ -196,7 +238,7 @@ usually isn't exploited from the width of the built-in vector types, but across .. _inherent_thread_model: Inherent thread model -===================== +--------------------- The SIMT nature of HIP is captured by the ability to execute user-provided device programs, expressed as single-source C/C++ functions or sources compiled @@ -219,10 +261,10 @@ following figure. Hierarchy of thread groups. -Warp (or Wavefront) - The innermost grouping of threads is called a warp, or a wavefront in ISA terms. A warp +Wavefront (or Warp) + The innermost grouping of threads is called a warp, or a wavefront in ISA terms. 
A wavefront is the most tightly coupled groups of threads, both physically and logically. Threads - inside a warp are also called lanes, and the integral value identifying them is the lane ID. + inside a wavefront are also called lanes, and the integral value identifying them is the lane ID. .. tip:: @@ -230,8 +272,8 @@ Warp (or Wavefront) consequence, they are only as multidimensional as the user interprets the calculated values to be. - The size of a warp is architecture dependent and always fixed. For AMD GPUs - the wavefront is typically 64 threads, though sometimes 32 threads. Warps are + The size of a wavefront is architecture dependent and always fixed. For AMD GPUs + the wavefront is typically 64 threads, though sometimes 32 threads. Wavefronts are signified by the set of communication primitives at their disposal, as discussed in :ref:`warp-cross-lane`. From f72727541113fe5f21909a6db48235b144954f20 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Tue, 4 Feb 2025 21:23:17 -0800 Subject: [PATCH 03/52] update programming model --- docs/understand/programming_model.rst | 121 +++++++++++--------------- 1 file changed, 53 insertions(+), 68 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index e1b82ebfa8..4857ea63c8 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -15,8 +15,6 @@ massively parallel, wide single instruction, multiple data (SIMD) architectures, such as GPUs. HIP supports many imperative languages, such as Python via PyHIP, but this document focuses on the original C/C++ API of HIP. -.. RJH>> If HIP programming uses SIMT for thread instructions, but the hardware implementation is SIMD for the execution of threads, then maybe we focus on SIMT as the top-level concept? - While GPUs may be capable of running applications written for CPUs if properly ported and compiled, it would not be an efficient use of GPU resources. GPUs are different from CPUs in fundamental ways, and should be used accordingly to achieve optimum @@ -47,19 +45,13 @@ shared memory for efficient communication between the threads. The following defines a few hardware differences between CPUs and GPUs: -.. RJH>> I think the following section does a good job of highlighting the differences in hardware that result in programming changes needed for GPUs. I think this should be our focus for the Programming model content. - * CPU: - One register file per thread. On modern CPUs you have at most 2 register files - per core, called hyperthreading or multithreading. - - .. RJH>> Are these the same? I found a link discussing this: https://www.baeldung.com/cs/multithreading-vs-hyperthreading#:~:text=Hyperthreading%20breaks%20a%20single%20physical,distinction%20between%20the%20two%20techniques. - + per core, called hyperthreading. - One ALU executing the thread. - Designed to quickly execute instructions of the same thread. - - Highly pipelined. - Complex branch prediction. - Large L1/L2 cache per core, shared by fewer threads (maximum of 2 when hyperthreading is available). @@ -70,12 +62,9 @@ The following defines a few hardware differences between CPUs and GPUs: - Register files are shared among threads. The number of threads that can be run in parallel depends on the registers needed per thread as described in :ref:`hardware_implementation`. - Multiple ALUs execute a collection of threads having the same operations, also known as a wavefront or warp. 
This is called single-instruction, multiple threads (SIMT) operation as described in :ref:`programming_model_simt`. - - ALUs are shared between the threads of a wavefront, and when the thread is idle due to data transfer or instruction branching, the ALU is shared with other wavefronts for better resource utilization. - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture, that allows a `single instruction` to concurrently operate on `multiple data` inputs. CPU SIMDs are smaller than GPU SIMDs, which enables greater throughput on the GPU. - For branching threads where conditional instructions lead to thread divergence, ALUs still processes the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles, and should be a consideration in your programming. Keep instructions consistent, and leave conditionals out of threads. - .. RJH>> It feels like the first of these sub-bullets, and the last of them above, have different messages: ALUs is shared outside of wavefronts, or the ALU processes the thread in any case, but the results are masked out? - - The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit all the time, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued. RDNA & CDNA architecture summary @@ -86,8 +75,6 @@ GPUs with a few CUs, embedded in APUs, to large GPUs designed for data centers with hundreds of CUs. Figure :ref:`rdna3_cu` and :ref:`cdna3_cu` show examples of such compute units. For additional architecture details, see :ref:`hardware_implementation`. -.. RJH>> I believe RDNA is on Radeon Graphic cards, and CDNA is on instinct data center accelerators. Do we want to add this distinction here? - .. _rdna3_cu: .. figure:: ../data/understand/programming_model/rdna3_cu.png @@ -113,10 +100,8 @@ examples of such compute units. For additional architecture details, see :ref:`h Heterogeneous Programming ========================= -The HIP programming model assumes two execution contexts. The application starts on the CPU -*host* while compute kernels are launched on the GPU *device*. These contexts have -different capabilities, therefor slightly different rules apply. The *host* -execution is defined by the C++ abstract machine, while *device* execution +The HIP programming model has two execution contexts. The main application starts on the CPU +*host*, and compute kernels are launched on the *device* side such as Instinct accelerators or GPUs. The *host* execution is defined by the C++ abstract machine, while *device* execution follows the :ref:`SIMT model` of HIP. These execution contexts in code are signified by the ``__host__`` and ``__device__`` decorators. There are a few key differences between the two: @@ -136,8 +121,7 @@ a few key differences between the two: .. RJH>> The prior sentence is not clear to me. The performance benefits of the shared memory on the GPU are based on the CPUs inability to access it? * Not all C++ language features map cleanly to typical GPU device architectures. 
- Some C++ features, such as XXX, are very expensive (meaning slow) to implement on GPU devices, therefor - they are forbidden in device contexts to avoid using features + Some C++ features, such as XXX, are very expensive (meaning slow) to implement on GPU devices, therefore they are forbidden in device contexts to avoid using features that unexpectedly decimate the program's performance. Offload devices targeted by HIP aren't general purpose devices, at least not in the sense that a CPU is. HIP focuses on data parallel computations and as such caters to throughput @@ -151,8 +135,8 @@ a few key differences between the two: synchronize their data dispatch/fetch with computations on the device. .. note:: - HIP does perform implicit synchronization on occasions, unlike other - APIs such as OpenCL or SYCL, where the responsibility of synchronization depends mostly on the user. + HIP performs implicit synchronization on occasions, unlike some + APIs where the responsibility for synchronization is left to the user. Host programming ---------------- @@ -161,9 +145,9 @@ In heterogeneous programming, the CPU is available for processing operations but 1. Initialize the HIP runtime and select the GPU: As described in :ref:`initialization`, refers to identifying and selecting a target GPU, setting up a context to let the CPU interact with the GPU. 2. Memory Management: As discussed in :ref:`memory_management`, this includes allocating the required memory on the host and device, and the transfer of input data from the host to the device. Note that the data is transferred to the device, and passed as an input parameter for the kernel. -3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, define and load the kernel or kernels to be run, launch kernels using the triple chevron syntax or appropriate API call (e.g., hipLaunchKernelGGL), and pass parameters as needed. -4. Synchronization: As described in Asynchronous execution use streams and events to manage task dependencies, overlap computation with data transfers, and manage asynchronous processes to ensure proper sequencing of operations, waiting for events or streams to finish execution and transfer results from the GPU back to the host. -5. Error handling: As described in :ref:`error_handling`, you should catch and handle potential errors from API calls, kernel launches, or memory operations (e.g., using hipGetErrorString to retrieve error messages). +3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, define and load the kernel or kernels to be run, launch kernels using the triple chevron syntax or appropriate API call (for example ``hipLaunchKernelGGL``), and pass parameters as needed. +4. Synchronization: As described in :ref:`asynchronous_how-to`, kernel execution occurs in the context of device streams, specifically the default (`0`) stream. You can use streams and events to manage task dependencies, overlap computation with data transfers, and manage asynchronous processes to ensure proper sequencing of operations. Wait for events or streams to finish execution and transfer results from the GPU back to the host. +5. Error handling: As described in :ref:`error_handling`, you should catch and handle potential errors from API calls, kernel launches, or memory operations. For example, use ``hipGetErrorString`` to retrieve error messages. 6. Cleanup and resource management: Validate results, clean up GPU contexts and resources, and free allocated memory on the host and devices. 
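To make the host-side sequence above concrete, the following is a minimal sketch of a complete host program. The ``AddKernel`` kernel matches the example used later in this document; the ``HIP_CHECK`` macro, the device index ``0``, and the block and grid sizes are illustrative assumptions rather than required names or values.

.. code-block:: cpp

    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    // Illustrative error-checking helper, not part of the HIP API.
    #define HIP_CHECK(expr)                                                     \
        do {                                                                    \
            hipError_t status = (expr);                                         \
            if (status != hipSuccess) {                                         \
                std::printf("HIP error: %s\n", hipGetErrorString(status));      \
                std::exit(EXIT_FAILURE);                                        \
            }                                                                   \
        } while (0)

    // Element-wise addition of b into a, one element per thread.
    __global__ void AddKernel(float* a, const float* b)
    {
        int global_id = threadIdx.x + blockIdx.x * blockDim.x;
        a[global_id] += b[global_id];
    }

    int main()
    {
        constexpr int threads_per_block = 256;
        constexpr int number_of_blocks  = 1024;
        constexpr size_t element_count  = threads_per_block * number_of_blocks;
        constexpr size_t size_in_bytes  = element_count * sizeof(float);

        // 1. Select the GPU to use (device 0 here for simplicity).
        HIP_CHECK(hipSetDevice(0));

        // 2. Allocate host and device memory, and copy input data to the device.
        std::vector<float> host_a(element_count, 1.0f);
        std::vector<float> host_b(element_count, 2.0f);
        float* device_a = nullptr;
        float* device_b = nullptr;
        HIP_CHECK(hipMalloc(&device_a, size_in_bytes));
        HIP_CHECK(hipMalloc(&device_b, size_in_bytes));
        HIP_CHECK(hipMemcpy(device_a, host_a.data(), size_in_bytes, hipMemcpyHostToDevice));
        HIP_CHECK(hipMemcpy(device_b, host_b.data(), size_in_bytes, hipMemcpyHostToDevice));

        // 3. Launch the kernel on the default stream.
        AddKernel<<<number_of_blocks, threads_per_block>>>(device_a, device_b);
        HIP_CHECK(hipGetLastError());

        // 4. Wait for the kernel to finish and copy the result back to the host.
        HIP_CHECK(hipDeviceSynchronize());
        HIP_CHECK(hipMemcpy(host_a.data(), device_a, size_in_bytes, hipMemcpyDeviceToHost));

        // 5./6. Validation of the results would go here; finally release device resources.
        HIP_CHECK(hipFree(device_a));
        HIP_CHECK(hipFree(device_b));
        return 0;
    }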
This structure allows for efficient use of GPU resources and facilitates the acceleration of compute-intensive tasks while keeping the host CPU available for other tasks. @@ -175,51 +159,59 @@ Device programming Launching the kernel in the host application starts a kernel program running on the GPU to perform parallel computations. Understanding how the kernel works and the processes involved is essential to writing efficient GPU applications. The general flow of the kernel program looks like this: -1. Thread Grouping: As described in :ref:`SIMT model`, threads are organized into blocks, and blocks are organized into grids. +1. Thread Grouping: As described in :ref:`SIMT model`, threads are organized into blocks, and blocks are organized into grids. 2. Indexing: The kernel computes the unique index for each thread to access the relevant data to be processed by the thread. 3. Data Fetch: Threads fetch input data from memory previously transferred from the host to the device. 4. Computation: Threads perform the required computations on the input data, and generate any needed output. 5. Synchronization: When needed, threads synchronize within their block to ensure correct results when working with shared memory. -Kernel programs can be simple with single instructions deployed across multiple threads in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can become quite complex, managing hundreds or thousands of threads with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage asynchronous operations, using rich libraries of functions defined for operation on GPUs as described in `Kernel program <./kernel_program>`. - -.. RJH>> This "Kernel program" topic does not currently exist, though I think we should discuss whether it could be included here or as a separate topic. +Kernels can be simple single instruction programs deployed across multiple threads in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can also become quite complex, managing hundreds or thousands of threads with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage concurrent asynchronous operations, using rich libraries of functions optimized for GPU hardware as described in the `ROCm documentation `_. .. _programming_model_simt: Single instruction multiple threads (SIMT) ========================================== -The SIMT programming model behind the HIP device-side execution is a middle-ground -between SMT (Simultaneous Multi-Threading) programming known from multicore CPUs, -and SIMD (Single Instruction, Multiple Data) programming mostly known from exploiting -relevant instruction sets on CPUs (for example SSE/AVX/Neon). +The HIP kernel code, which is written as a series of scalar instructions for multiple threads with different thread indices, gets mapped to the SIMD units of the GPUs. +Every single instruction, which is executed for every participating thread of a +kernel, gets mapped to the SIMD as often as there are threads. -A HIP device compiler maps SIMT code written in HIP C++ to an inherently SIMD -architecture (like GPUs). 
This is done by scalarizing the entire kernel and issuing the scalar -instructions of multiple kernel instances (called threads) to each of the SIMD engine lanes, rather -than exploiting data parallelism within a single instance of a kernel and spreading -identical instructions over the available SIMD engines. +This is done by grouping threads into warps, which contain as many threads as there +are physical lanes in a SIMD, and issuing that instruction to the SIMD for every +warp of a kernel. Ideally the SIMD is always fully utilized, however if the number of threads +can't be evenly divided by the warpsize, then the unused lanes are masked out +from the corresponding SIMD execution. -Consider the following kernel: +A kernel follows the same C++ rules as the functions on the host, but it has a special __global__ label to mark it for execution on the device, as shown in the following example: .. code-block:: cpp - __global__ void k(float4* a, const float4* b) + __global__ void AddKernel(float* a, const float* b) { - int tid = threadIdx.x; - int bid = blockIdx.x; - int dim = blockDim.x; + int global_id = threadIdx.x + blockIdx.x * blockDim.x; - a[tid] += (tid + bid - dim) * b[tid]; + a[global_id] += b[global_id]; } -The incoming four-vector of floating-point values ``b`` is multiplied by a -scalar and then added element-wise to the four-vector floating-point values of -``a``. On modern SIMD-capable architectures, the four-vector ops are expected to -compile to a single SIMD instruction. However, GPU execution of this kernel will -typically break down the vector elements into 4 separate threads for parallel execution, -as seen in the following figure: +One of the first differences to note, is the usage of the special ``threadIdx``, ``blockIdx`` and ``blockDim`` variables. +Unlike normal C++ host functions, a kernel is not launched once, but as often as specified by the user. Each of these instances is a separate thread, with its own values for ``threadIdx``, ``blockIdx`` and ``blockDim``. +This is called SIMT, meaning that a *S*ingle *I*nstruction is executed in *M*ultiple *T*hreads. + +Kernels are launched using the "triple chevron" syntax, for example: + +.. code-block:: cpp + + AddKernel<<>>(a, b); + +Here the total number of threads launched for the ``AddKernel`` program is defined by ``number_of_blocks * threads_per_block``. These values are defined by the programmer to address the problem to be solved and the available resources within the system. In other words, the thread configuration is customized to the needs of the operations. + +For comparison, the ``AddKernel`` program could be written in plain C++ as a ``FOR`` loop: + +.. code-block:: cpp + + for(int i = 0; i < (number_of_blocks * threads_per_block); ++i){ + a[i] += b[i]; + } .. _simt: @@ -240,18 +232,12 @@ usually isn't exploited from the width of the built-in vector types, but across Inherent thread model --------------------- -The SIMT nature of HIP is captured by the ability to execute user-provided -device programs, expressed as single-source C/C++ functions or sources compiled -online/offline to binaries, in bulk. - All threads of a kernel are uniquely identified by a set of integral values, called thread IDs. The set of integers identifying a thread relate to the hierarchy in which the threads execute. -The thread hierarchy inherent to how AMD GPUs operate is depicted in the +The thread hierarchy is integral to how AMD GPUs operate, and is depicted in the following figure. -.. _inherent_thread_hierarchy: - .. 
figure:: ../data/understand/programming_model/thread_hierarchy.svg :alt: Diagram depicting nested rectangles of varying color. The outermost one titled "Grid", inside sets of uniform rectangles layered on one another @@ -261,6 +247,8 @@ following figure. Hierarchy of thread groups. +.. _wavefront: + Wavefront (or Warp) The innermost grouping of threads is called a warp, or a wavefront in ISA terms. A wavefront is the most tightly coupled groups of threads, both physically and logically. Threads @@ -280,23 +268,20 @@ Wavefront (or Warp) .. _inherent_thread_hierarchy_block: Block - The middle grouping is called a block or thread block. The defining feature - of a block is that all threads in a block will share an instance of memory - which they may use to share data or synchronize with one another. + The next level of the thread hierarchy is called a thread block, or block. The + defining feature of a block is that all threads in a block will share an instance + of memory which they may use to share data or synchronize with one another, + as described in :ref:`memory_hierarchy`. The size of a block is user-configurable but is limited by the queryable capabilities of the executing hardware. The unique ID of the thread within a - block is 3-dimensional as provided by the API. When linearizing thread IDs - within a block, assume the "fast index" being dimension ``x``, followed by + block can be 1, 2, or 3-dimensional as provided by the HIP API. You can configure the thread block to best represent the data associated with the instruction set. When linearizing thread IDs within a block, assume the "fast index" being dimension ``x``, followed by the ``y`` and ``z`` dimensions. .. _inherent_thread_hierarchy_grid: Grid - The outermost grouping is called a grid. A grid manifests as a single - dispatch of kernels for execution. The unique ID of each block within a grid - is 3-dimensional, as provided by the API and is queryable by every thread - within the block. + The top-most level of the thread hierarchy is a grid. A grid is the collection of blocks, which are collections of threads, defined for the kernel. A grid manifests as a single launch of the kernel to run. The unique ID of each block within a grid can be 1, 2, or 3-dimensional, as provided by the API and is queryable by every thread within the block. Cooperative groups thread model ------------------------------- @@ -315,6 +300,8 @@ better than the defaults defined by the hardware. For further information, see :doc:`Cooperative groups `. +.. _memory_hierarchy: + Memory model ============ @@ -322,7 +309,6 @@ The hierarchy of threads introduced by the :ref:`inherent_thread_model` is induc by the memory subsystem of GPUs. The following figure summarizes the memory namespaces and how they relate to the various levels of the threading model. -.. _memory_hierarchy: .. figure:: ../data/understand/programming_model/memory_hierarchy.svg :alt: Diagram depicting nested rectangles of varying color. The outermost one @@ -448,4 +434,3 @@ intended use-cases. compiler itself and not intended towards end-user code. Should you be writing a tool having to launch device code using HIP, consider using these over the alternatives. 
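To illustrate how the block level of the thread hierarchy interacts with the memory model described above, the following kernel sketch sums the values assigned to one block: the threads of a block stage their data in ``__shared__`` memory, synchronize with ``__syncthreads()``, and then cooperate on a tree reduction. The kernel name, the fixed block size of 256 threads, and the choice to combine per-block results elsewhere are assumptions made for this illustration.

.. code-block:: cpp

    // Each block reduces its slice of 'input' into one entry of 'block_sums'.
    __global__ void BlockSumKernel(const float* input, float* block_sums)
    {
        // One shared-memory slot per thread; assumes the kernel is launched
        // with 256 threads per block.
        __shared__ float partial[256];

        int tid       = threadIdx.x;
        int global_id = blockIdx.x * blockDim.x + threadIdx.x;

        partial[tid] = input[global_id];
        __syncthreads(); // make all loads visible to the whole block

        // Tree reduction: halve the number of active threads each step.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
        {
            if (tid < stride)
            {
                partial[tid] += partial[tid + stride];
            }
            __syncthreads();
        }

        // Thread 0 writes the per-block result.
        if (tid == 0)
        {
            block_sums[blockIdx.x] = partial[0];
        }
    }

Because blocks cannot synchronize with each other within a single kernel launch, combining the per-block results is typically done on the host or in a follow-up kernel, which mirrors the grid, block, and wavefront split described above.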
- From 8a23c7597c54a05e0b12f34d33e45e3b68bb48db Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sat, 8 Feb 2025 22:14:10 -0800 Subject: [PATCH 04/52] update content base don Adel feedback, minus Execution Model --- .../hip_runtime_api/cooperative_groups.rst | 2 +- docs/understand/hardware_implementation.rst | 9 +- docs/understand/programming_model.rst | 327 +++++++++++------- 3 files changed, 209 insertions(+), 129 deletions(-) diff --git a/docs/how-to/hip_runtime_api/cooperative_groups.rst b/docs/how-to/hip_runtime_api/cooperative_groups.rst index 3170e197ef..a3e32cd294 100644 --- a/docs/how-to/hip_runtime_api/cooperative_groups.rst +++ b/docs/how-to/hip_runtime_api/cooperative_groups.rst @@ -164,7 +164,7 @@ The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, ` Coalesced groups ------------------ -Threads (64 threads on CDNA and 32 threads on RDNA) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled, if they do not meet the condition to execute that branch. The active threads referred as coalesced, and coalesced group represents an active thread group within a warp. +Threads (64 threads on CDNA and 32 threads on RDNA) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads are referred to as coalesced, and coalesced group represents an active thread group within a warp. .. note:: diff --git a/docs/understand/hardware_implementation.rst b/docs/understand/hardware_implementation.rst index 7038262812..e57f7d4505 100644 --- a/docs/understand/hardware_implementation.rst +++ b/docs/understand/hardware_implementation.rst @@ -45,12 +45,13 @@ The amount of warps that can reside concurrently on a CU, known as occupancy, is determined by the warp's resource usage of registers and shared memory. +.. _gcn_cu: + .. figure:: ../data/understand/hardware_implementation/compute_unit.svg :alt: Diagram depicting the general structure of a compute unit of an AMD GPU. - An AMD Graphics Core Next (GCN) CU. The CDNA and RDNA CUs are based on - variations of the GCN CU. + AMD Graphics Core Next (GCN) CU On AMD GCN GPUs the basic structure of a CU is: @@ -102,6 +103,8 @@ The scalar unit performs instructions that are uniform within a warp. It thereby improves efficiency and reduces the pressure on the vector ALUs and the vector register file. +.. _cdna3_cu: + CDNA architecture ================= @@ -121,6 +124,8 @@ multiply-accumulate operations for Block Diagram of a CDNA3 Compute Unit. +.. _rdna3_cu: + RDNA architecture ================= diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 4857ea63c8..7c8f7b5d2c 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -20,7 +20,7 @@ and compiled, it would not be an efficient use of GPU resources. GPUs are differ from CPUs in fundamental ways, and should be used accordingly to achieve optimum performance. A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) -programming in general. 
The topics that follow introduce you to the key concepts of +programming in general. The following topics introduce you to the key concepts of GPU-based programming, and the HIP programming model. Getting into Hardware: CPU vs GPU @@ -28,27 +28,30 @@ Getting into Hardware: CPU vs GPU CPUs and GPUs have been designed for different purposes. CPUs have been designed to quickly execute a single thread, decreasing the time it takes for a single -operation, increasing the amount of serial instructions that can be executed. +operation, increasing the amount of sequential instructions that can be executed. This includes fetching data, and reducing pipeline stalls where the ALU has to -wait for previous instructions to finish. CPUs provide low latency processing for -serial instructions, but also lower throughput overall. Latency is the speed of -an operation, while throughput is the number of operations completed in a unit of -time. On CPUs the goal is to quickly process operations. - -On the other hand, GPUs have been designed to execute many similar commands, or threads, in parallel, -achieving high throughput, but also higher latency. For the GPU, the objective is -to process as many operations in parallel, rather than to finish a single instruction -quickly. GPUs in general are made up of basic building blocks called compute units (CUs), -that execute the threads of a kernel. These CUs provide the necessary resources -for the threads: the Arithmetic Logical Units (ALUs), register files, caches and -shared memory for efficient communication between the threads. +wait for previous instructions to finish. + +On CPUs the goal is to quickly process operations. CPUs provide low latency processing for +serial instructions. On the other hand, GPUs have been designed to execute many similar commands, or threads, +in parallel, achieving higher throughput. Latency is the delay from when an operation +is started to when it returns, such as 2 ns, while throughput is the number of operations completed +in a period of time, such as ten thousand threads completed. + +For the GPU, the objective is to process as many operations in parallel, rather +than to finish a single instruction quickly. GPUs in general are made up of basic +building blocks called compute units (CUs), that execute the threads of a kernel. +As described in :ref:`hardware_implementation`, these CUs provide the necessary +resources for the threads: the Arithmetic Logical Units (ALUs), register files, +caches and shared memory for efficient communication between the threads. The following defines a few hardware differences between CPUs and GPUs: * CPU: - - One register file per thread. On modern CPUs you have at most 2 register files - per core, called hyperthreading. + - Optimized for sequential processing with a few powerful cores (4-64 typically) + - High clock speeds (3-5 GHz) + - One register file per thread. On modern CPUs you have at most 2 register files per core, called hyperthreading. - One ALU executing the thread. - Designed to quickly execute instructions of the same thread. @@ -59,52 +62,36 @@ The following defines a few hardware differences between CPUs and GPUs: * GPU: - - Register files are shared among threads. The number of threads that can be run in parallel depends on the registers needed per thread as described in :ref:`hardware_implementation`. 
+ - Designed for parallel processing with many simpler cores (hundreds/thousands) + - Lower clock speeds (1-2 GHz) + - Streamlined control logic + - Small caches, more registers + - Register files are shared among threads. The number of threads that can be run in parallel depends on the registers needed per thread. - Multiple ALUs execute a collection of threads having the same operations, also known as a wavefront or warp. This is called single-instruction, multiple threads (SIMT) operation as described in :ref:`programming_model_simt`. - - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture, that allows a `single instruction` to concurrently operate on `multiple data` inputs. CPU SIMDs are smaller than GPU SIMDs, which enables greater throughput on the GPU. + - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture, that allows a `single instruction` to concurrently operate on `multiple data` inputs. - For branching threads where conditional instructions lead to thread divergence, ALUs still processes the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles, and should be a consideration in your programming. Keep instructions consistent, and leave conditionals out of threads. - - The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit all the time, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued. - -RDNA & CDNA architecture summary --------------------------------- - -AMD GPU designs enable efficient execution of kernels while scaling from small -GPUs with a few CUs, embedded in APUs, to large GPUs designed for data -centers with hundreds of CUs. Figure :ref:`rdna3_cu` and :ref:`cdna3_cu` show -examples of such compute units. For additional architecture details, see :ref:`hardware_implementation`. - -.. _rdna3_cu: - -.. figure:: ../data/understand/programming_model/rdna3_cu.png - :alt: Block diagram showing the structure of an RDNA3 Compute Unit. It - consists of four SIMD units, each including a vector and scalar register - file, with the corresponding scalar and vector ALUs. All four SIMDs - share a scalar and instruction cache, as well as the shared memory. Two - of the SIMD units each share an L0 cache. + - The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued. - Block Diagram of an RDNA3 Compute Unit. +When programming for a heterogeneous system, which incorporates CPUs and GPUs, you must +write your program to take advantage of the strengths of the available hardware. +Use the CPU for tasks that require complex logic with conditional branching, to reduce the +time to reach a decision. Use the GPU for parallel operations of the same instruction +across large datasets, with little branching, where the volume of operations is the key. -.. _cdna3_cu: - -.. figure:: ../data/understand/programming_model/cdna3_cu.png - :alt: Block diagram showing the structure of a CDNA3 compute unit. It includes - Shader Cores, the Matrix Core Unit, a Local Data Share used for sharing - memory between threads in a block, an L1 Cache and a Scheduler. 
The - Shader Cores represent the vector ALUs and the Matrix Core Unit the - matrix ALUs. The Local Data Share is used as the shared memory. - - Block Diagram of a CDNA3 Compute Unit. +.. _heterogeneous_programming: Heterogeneous Programming ========================= -The HIP programming model has two execution contexts. The main application starts on the CPU -*host*, and compute kernels are launched on the *device* side such as Instinct accelerators or GPUs. The *host* execution is defined by the C++ abstract machine, while *device* execution -follows the :ref:`SIMT model` of HIP. These execution contexts in -code are signified by the ``__host__`` and ``__device__`` decorators. There are -a few key differences between the two: +The HIP programming model has two execution contexts. The main application starts on the CPU, or +the *host* processor, and compute kernels are launched on the *device* such as `Instinct +accelerators `_ or AMD GPUs. +The host execution is defined by the C++ abstract machine, while device execution +follows the :ref:`SIMT model` of HIP. These two execution contexts +are signified by the ``__host__`` and ``__global__`` (or ``__device__``) decorators +in HIP program code. There are a few key differences between the two contexts: * The C++ abstract machine assumes a unified memory address space, meaning that one can always access any given address in memory (assuming the absence of @@ -112,25 +99,20 @@ a few key differences between the two: from one means nothing in another. Moreover, not all address spaces are accessible from all contexts. - Looking at :ref:`rdna3_cu` and :ref:`cdna3_cu`, you can see that - every CU has an instance of storage backing the namespace ``__shared__``. - Even if the host were to have access to these regions of - memory, the performance benefits of the segmented memory subsystem are + Looking at the :ref:`gcn_cu` figure, you can see that every CU has an instance of storage + backing the namespace ``__shared__``. Even if the host were to have access to these + regions of memory, the performance benefits of the segmented memory subsystem are supported by the inability of asynchronous access from the host. -.. RJH>> The prior sentence is not clear to me. The performance benefits of the shared memory on the GPU are based on the CPUs inability to access it? - * Not all C++ language features map cleanly to typical GPU device architectures. - Some C++ features, such as XXX, are very expensive (meaning slow) to implement on GPU devices, therefore they are forbidden in device contexts to avoid using features - that unexpectedly decimate the program's performance. Offload devices targeted - by HIP aren't general purpose devices, at least not in the sense that a CPU is. - HIP focuses on data parallel computations and as such caters to throughput - optimized architectures, such as GPUs or accelerators derived from GPU - architectures. - -.. RJH>> I think the above could list some example features that are too expensive for GPUs, and clarify whether it is HIP or the GPU hardware that is forbidding these features? - -* Asynchrony is at the forefront of the HIP API. Computations launched on the device + Some C++ features have poor latency when implemented on GPU devices, therefore + they are forbidden in device contexts to avoid using features that unexpectedly + decimate the program's performance. Offload devices targeted by HIP aren't general + purpose devices, at least not in the sense that a CPU is. 
HIP focuses on data + parallel computations and as such caters to throughput optimized architectures, + such as GPUs or accelerators derived from GPU architectures. + +* Asynchronicity is at the forefront of the HIP API. Computations launched on the device execute asynchronously with respect to the host, and it is the user's responsibility to synchronize their data dispatch/fetch with computations on the device. @@ -141,11 +123,11 @@ a few key differences between the two: Host programming ---------------- -In heterogeneous programming, the CPU is available for processing operations but the host application has the additional task of managing data and computation exchanges between the CPU (host) and GPU (device). Here is a typical sequence of operations: +In heterogeneous programming, the CPU is available for processing operations but the host application has the additional task of managing data and computation exchanges between the CPU (host) and GPU (device). The host acts as the application manager, coordinating the overall workflow and directing operations to the appropriate context, handles data preparation and data transfers, and manages GPU tasks and synchronization. Here is a typical sequence of operations: 1. Initialize the HIP runtime and select the GPU: As described in :ref:`initialization`, refers to identifying and selecting a target GPU, setting up a context to let the CPU interact with the GPU. -2. Memory Management: As discussed in :ref:`memory_management`, this includes allocating the required memory on the host and device, and the transfer of input data from the host to the device. Note that the data is transferred to the device, and passed as an input parameter for the kernel. -3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, define and load the kernel or kernels to be run, launch kernels using the triple chevron syntax or appropriate API call (for example ``hipLaunchKernelGGL``), and pass parameters as needed. +2. Data preparation: As discussed in :ref:`memory_management`, this includes allocating the required memory on the host and device, preparing input data and transferring it from the host to the device. The data is both transferred to the device, and passed as an input parameter when launching the kernel. +3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, define and load the kernel or kernels to be run, launch kernels using the triple chevron syntax or appropriate API call (for example ``hipLaunchKernelGGL``), and pass parameters as needed. On the GPU, kernels are run on streams, or a queue of operations. Within the same stream operations run in the order they were issued, but different streams are independent and can execute concurrently. In the HIP runtime, kernels run on the default stream when one is not specified, but specifying a stream for the kernel lets you increase concurrency in task scheduling and resource utilization, and launch and manage multiple kernels from the host program. 4. Synchronization: As described in :ref:`asynchronous_how-to`, kernel execution occurs in the context of device streams, specifically the default (`0`) stream. You can use streams and events to manage task dependencies, overlap computation with data transfers, and manage asynchronous processes to ensure proper sequencing of operations. Wait for events or streams to finish execution and transfer results from the GPU back to the host. 5. 
Error handling: As described in :ref:`error_handling`, you should catch and handle potential errors from API calls, kernel launches, or memory operations. For example, use ``hipGetErrorString`` to retrieve error messages. 6. Cleanup and resource management: Validate results, clean up GPU contexts and resources, and free allocated memory on the host and devices. @@ -157,24 +139,25 @@ This structure allows for efficient use of GPU resources and facilitates the acc Device programming ------------------ -Launching the kernel in the host application starts a kernel program running on the GPU to perform parallel computations. Understanding how the kernel works and the processes involved is essential to writing efficient GPU applications. The general flow of the kernel program looks like this: +The device or kernel program acts as workers on the GPU application, distributing operations to be handled quickly and efficiently. Launching a kernel in the host application starts the kernel program running on the GPU, defining the parallel operations to repeat the same instructions across many datasets. Understanding how the kernel works and the processes involved is essential to writing efficient GPU applications. Threads, blocks, and grids provide a hierarchical approach to parallel operations. Understanding the thread hierarchy is critical to distributing work across the available CUs, managing parallel operations, and optimizing memory access. The general flow of the kernel program looks like this: -1. Thread Grouping: As described in :ref:`SIMT model`, threads are organized into blocks, and blocks are organized into grids. -2. Indexing: The kernel computes the unique index for each thread to access the relevant data to be processed by the thread. -3. Data Fetch: Threads fetch input data from memory previously transferred from the host to the device. -4. Computation: Threads perform the required computations on the input data, and generate any needed output. +1. Thread Grouping: As described in :ref:`inherent_thread_model`, threads are organized into a hierarchy consisting of threads which are individual instances of parallel operations, blocks that group the threads together, and grids that group blocks into the kernel. Each thread runs an instance of the kernel in parallel with other threads in the block. +2. Indexing: The kernel computes the unique index for each thread to access the relevant data to be processed by the thread. +3. Data Fetch: Threads fetch input data from memory previously transferred from the host to the device. As described in :ref:`memory_hierarchy`, the hierarchy of threads is influenced by the memory subsystem of GPUs. The memory hierarchy includes local memory per-thread with very fast access, shared memory for the block of threads which also supports quick access, and larger amounts of global memory visible to the whole kernel,but accesses are expensive due to high latency. Understanding the memory model is a key concept for kernel programming. +4. Computation: Threads perform the required computations on the input data, and generate any needed output. Each thread of the kernel runs the same instruction simultaneously on the different datasets. This sometimes require multiple iterations when the number of operations exceeds the resources of the CU. 5. Synchronization: When needed, threads synchronize within their block to ensure correct results when working with shared memory. 
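+
+For example, the five steps above might map onto a kernel like the following minimal
+sketch. The kernel, parameter, and buffer names are illustrative only, the block size
+is assumed to be 256 threads, and the usual ``#include <hip/hip_runtime.h>`` is
+omitted, as in the other examples in this topic:
+
+.. code-block:: cpp
+
+  // Each thread computes its unique index (step 2), fetches and processes one
+  // element (steps 3 and 4), and the block then cooperates through shared memory
+  // after a synchronization barrier (step 5).
+  __global__ void ScaleAndSumKernel(const float* input, float* block_sums,
+                                    float scale, int n)
+  {
+    __shared__ float partial[256]; // assumes blocks of 256 threads
+
+    int global_idx = threadIdx.x + blockIdx.x * blockDim.x;
+    partial[threadIdx.x] = (global_idx < n) ? input[global_idx] * scale : 0.0f;
+
+    __syncthreads(); // every thread in the block reaches this point before continuing
+
+    if (threadIdx.x == 0) {
+      float sum = 0.0f;
+      for (unsigned int i = 0; i < blockDim.x; ++i) {
+        sum += partial[i];
+      }
+      block_sums[blockIdx.x] = sum;
+    }
+  }
+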
-Kernels can be simple single instruction programs deployed across multiple threads in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can also become quite complex, managing hundreds or thousands of threads with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage concurrent asynchronous operations, using rich libraries of functions optimized for GPU hardware as described in the `ROCm documentation `_. +Kernels can be simple single instruction programs deployed across multiple threads in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can also become quite complex, managing hundreds, thousands, or hundreds of thousands of operations with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage concurrent asynchronous operations, using rich libraries of functions optimized for GPU hardware as described in the `ROCm documentation `_. .. _programming_model_simt: Single instruction multiple threads (SIMT) ========================================== -The HIP kernel code, which is written as a series of scalar instructions for multiple threads with different thread indices, gets mapped to the SIMD units of the GPUs. +The HIP kernel code, which is written as a series of scalar instructions for multiple +threads with different thread indices, gets mapped to the SIMD units of the GPUs. Every single instruction, which is executed for every participating thread of a -kernel, gets mapped to the SIMD as often as there are threads. +kernel, gets mapped to the SIMD. This is done by grouping threads into warps, which contain as many threads as there are physical lanes in a SIMD, and issuing that instruction to the SIMD for every @@ -182,28 +165,44 @@ warp of a kernel. Ideally the SIMD is always fully utilized, however if the numb can't be evenly divided by the warpsize, then the unused lanes are masked out from the corresponding SIMD execution. -A kernel follows the same C++ rules as the functions on the host, but it has a special __global__ label to mark it for execution on the device, as shown in the following example: +A kernel follows the same C++ rules as the functions on the host, but it has a special ``__global__`` label to mark it for execution on the device, as shown in the following example: .. code-block:: cpp __global__ void AddKernel(float* a, const float* b) { - int global_id = threadIdx.x + blockIdx.x * blockDim.x; + int global_idx = threadIdx.x + blockIdx.x * blockDim.x; - a[global_id] += b[global_id]; + a[global_idx] += b[global_idx]; } -One of the first differences to note, is the usage of the special ``threadIdx``, ``blockIdx`` and ``blockDim`` variables. -Unlike normal C++ host functions, a kernel is not launched once, but as often as specified by the user. Each of these instances is a separate thread, with its own values for ``threadIdx``, ``blockIdx`` and ``blockDim``. -This is called SIMT, meaning that a *S*ingle *I*nstruction is executed in *M*ultiple *T*hreads. +One of the first things you might notice is the usage of the special ``threadIdx``, +``blockIdx`` and ``blockDim`` variables. Unlike normal C++ host functions, a kernel +is not launched once, but as often as specified by the user. 
Each of these instances +is a separate thread, with its own values for ``threadIdx``, ``blockIdx`` and ``blockDim``. -Kernels are launched using the "triple chevron" syntax, for example: +The kernel program is launched from the host application using a language extension +called the triple chevron syntax, which looks like the following: .. code-block:: cpp AddKernel<<>>(a, b); -Here the total number of threads launched for the ``AddKernel`` program is defined by ``number_of_blocks * threads_per_block``. These values are defined by the programmer to address the problem to be solved and the available resources within the system. In other words, the thread configuration is customized to the needs of the operations. +Inside the angle brackets you provide the following: + +* The number of blocks to launch, which defines the grid size (relating to blockDim). +* The number of threads in a block, which defines the block size (relating to blockIdx). +* The amount of shared memory to allocate by the host, not specified above. +* The device stream to enqueue the operation on, not specified above so the default stream is used. + +.. note:: + The kernel can also be launched through other methods, such as the ``hipLaunchKernel()`` function. + +Here the total number of threads launched for the ``AddKernel`` program is defined by +``number_of_blocks * threads_per_block``. You define these values when launching the +kernel program to address the problem to be solved with the available resources within +the system. In other words, the thread configuration is customized to the needs of the +operations and the available hardware. For comparison, the ``AddKernel`` program could be written in plain C++ as a ``FOR`` loop: @@ -213,6 +212,11 @@ For comparison, the ``AddKernel`` program could be written in plain C++ as a ``F a[i] += b[i]; } +In HIP, lanes of the SIMD architecture are fed by mapping threads of a SIMT +execution, one thread down each lane of an SIMD engine. Execution parallelism +usually isn't exploited from the width of the built-in vector types, but across +multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, etc. + .. _simt: .. figure:: ../data/understand/programming_model/simt.svg @@ -221,22 +225,26 @@ For comparison, the ``AddKernel`` program could be written in plain C++ as a ``F inside and ellipsis between the arrows. The instructions represented in the arrows are, from top to bottom: ADD, DIV, FMA, FMA, FMA and FMA. - Instruction flow of the sample SIMT program. - -In HIP, lanes of the SIMD architecture are fed by mapping threads of a SIMT -execution, one thread down each lane of an SIMD engine. Execution parallelism -usually isn't exploited from the width of the built-in vector types, but across multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, etc. + Instruction flow of a sample SIMT program. .. _inherent_thread_model: -Inherent thread model +Hierarchical thread model --------------------- -All threads of a kernel are uniquely identified by a set of integral values, called thread IDs. -The set of integers identifying a thread relate to the hierarchy in which the threads execute. +As previously discussed, all threads of a kernel are uniquely identified by a set +of integral values called thread IDs. The hierarchy consists of three levels: thread, +blocks, and grids. 
+ +* Threads are single instances of kernel operations, running concurrently across warps +* Blocks group threads together and enable cooperation and shared memory +* Grids define the number of thread blocks for a single kernel launch +* Blocks, and grids can be defined in 3 dimensions (``x``, ``y``, ``z``) +* By default, the Y and Z dimensions are set to 1 -The thread hierarchy is integral to how AMD GPUs operate, and is depicted in the -following figure. +The combined values represent the thread index, and relate to the sequence that the +threads execute. The thread hierarchy is integral to how AMD GPUs operate, and is +depicted in the following figure. .. figure:: ../data/understand/programming_model/thread_hierarchy.svg :alt: Diagram depicting nested rectangles of varying color. The outermost one @@ -249,10 +257,11 @@ following figure. .. _wavefront: -Wavefront (or Warp) - The innermost grouping of threads is called a warp, or a wavefront in ISA terms. A wavefront - is the most tightly coupled groups of threads, both physically and logically. Threads - inside a wavefront are also called lanes, and the integral value identifying them is the lane ID. +Warp (or Wavefront) + The innermost grouping of threads is called a warp. A warp is the most tightly + coupled groups of threads, both physically and logically. Threads inside a warp + are executed in lockstep, with each thread executing the same instruction. Threads + in a warp are also called lanes, and the value identifying them is the lane ID. .. tip:: @@ -260,8 +269,8 @@ Wavefront (or Warp) consequence, they are only as multidimensional as the user interprets the calculated values to be. - The size of a wavefront is architecture dependent and always fixed. For AMD GPUs - the wavefront is typically 64 threads, though sometimes 32 threads. Wavefronts are + The size of a warp is architecture dependent and always fixed. For AMD GPUs + the warp is typically 64 threads, though sometimes 32 threads. Warps are signified by the set of communication primitives at their disposal, as discussed in :ref:`warp-cross-lane`. @@ -269,30 +278,45 @@ Wavefront (or Warp) Block The next level of the thread hierarchy is called a thread block, or block. The - defining feature of a block is that all threads in a block will share an instance - of memory which they may use to share data or synchronize with one another, - as described in :ref:`memory_hierarchy`. - - The size of a block is user-configurable but is limited by the queryable - capabilities of the executing hardware. The unique ID of the thread within a - block can be 1, 2, or 3-dimensional as provided by the HIP API. You can configure the thread block to best represent the data associated with the instruction set. When linearizing thread IDs within a block, assume the "fast index" being dimension ``x``, followed by - the ``y`` and ``z`` dimensions. + defining feature of a block is that all threads in the block have shared memory + that they can use to share data or synchronize with one another, as described in + :ref:`memory_hierarchy`. + + The size of a block, or the block dimension, is the user-configurable number of + threads per block, but is limited by the queryable capabilities of the executing + hardware. The unique ID of the thread within a block can be 1, 2, or 3-dimensional + as provided by the HIP API. You can configure the thread block to best represent + the data associated with the kernel instruction set. + + .. 
note:: + When linearizing thread IDs within a block, assume the *fast index* is the ``x`` + dimension, followed by the ``y`` and ``z`` dimensions. .. _inherent_thread_hierarchy_grid: Grid - The top-most level of the thread hierarchy is a grid. A grid is the collection of blocks, which are collections of threads, defined for the kernel. A grid manifests as a single launch of the kernel to run. The unique ID of each block within a grid can be 1, 2, or 3-dimensional, as provided by the API and is queryable by every thread within the block. + The top-most level of the thread hierarchy is a grid. A grid is the number of blocks + needed for a single launch of the kernel. The unique ID of each block within + a grid can be 1, 2, or 3-dimensional, as provided by the API and is queryable + by every thread within the block. + +The three-dimensional thread hierarchy available to a kernel program lends itself to solutions +that align closely to the computational problem. The following are some examples: + +* 1 dimensional: array processing, linear data structures, or sequential data transformation +* 2 dimensional: Image processing, matrix operations, 2 dimensional simulations +* 3 dimensions: Volume rendering, 3D scientific simulations, spatial algorithms Cooperative groups thread model ------------------------------- -The Cooperative groups API introduces new APIs to launch, group, subdivide, +The Cooperative groups API introduces new functions to launch, group, subdivide, synchronize and identify threads, as well as some predefined group-collective -algorithms, but most importantly a matching threading model to think in terms of. -It relaxes some restrictions of the :ref:`inherent_thread_model` imposed by the -strict 1:1 mapping of architectural details to the programming model. Cooperative -groups let you define your own set of thread groups which may fit your user-cases -better than the defaults defined by the hardware. +algorithms. Most importantly it offers a matching thread model to think of the +cooperative groups in terms of. It relaxes some restrictions of the :ref:`inherent_thread_model` +imposed by the strict 1:1 mapping of architectural details to the programming model. +Cooperative groups let you define your own set of thread groups which may better +fit your use-case than the defaults defined by the hardware. .. note:: The implicit groups defined by kernel launch parameters are still available @@ -305,9 +329,9 @@ For further information, see :doc:`Cooperative groups `, + an important concept in resource usage and performance optimization. Shared memory Read-write storage visible to all the threads in a given block. @@ -346,10 +371,60 @@ Global Surface A read-write version of texture memory. 
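+
+The following sketch shows where several of these memory spaces appear in device
+code. The names and sizes are illustrative only, bounds checks are omitted for
+brevity, and texture and surface objects are not shown because they are created
+and bound through separate runtime APIs:
+
+.. code-block:: cpp
+
+  __constant__ float coefficients[16];  // constant memory, written from the host
+
+  __global__ void MemorySpacesKernel(const float* global_in, float* global_out)
+  {
+    __shared__ float tile[64];           // shared memory, one copy per block (assumes 64 threads per block)
+
+    int idx = threadIdx.x + blockIdx.x * blockDim.x;
+    float local_value = global_in[idx];  // per-thread local variable, usually held in a register
+
+    tile[threadIdx.x] = local_value * coefficients[0];
+    __syncthreads();
+
+    global_out[idx] = tile[threadIdx.x]; // global memory, visible to the whole grid
+  }
+
+On the host side, a ``__constant__`` symbol such as ``coefficients`` is typically
+filled with ``hipMemcpyToSymbol`` before the kernel is launched.
+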
+Using different memory types +---------------------------- + +* Use global memory when: + + - You are transferring data from the host to the device + - You have large data sets, and latency isn't an issue + - You are sharing data between thread blocks + +* Use shared memory when: + + - The data is reused within a thread block + - Cross-thread communication is needed + - To reduce global memory bandwidth + +* Use local memory when: + + - The data is specific to a thread + - To store automatic variables for the thread + - To provide register pressure relief for the thread + +* Use constant memory when: + + - The data is read-only + - The same value is used across threads + - The data size is small + +Memory access patterns and best practices +----------------------------------------- + +While you should refer to the :ref:`memory_management`, the following are a few memory +access patterns and best practices: + +* Global memory: Coalescing reduces memory transactions. +* Shared memory: Avoiding bank conflicts is crucial. +* Texture memory: Spatial locality improves caching. +* Unified memory: Structured access minimizes page migration overhead. + +When a kernel accesses global memory, the memory transactions typically occur in chunks of 32, 64, or 128 bytes. If threads access memory in a coalesced manner, meaning consecutive threads read or write consecutive memory locations, the memory controller can merge these accesses into a single transaction. Coalesced access primarily applies to global memory, which is the largest but slowest type of memory on a GPU and coalesced access significantly improves performance by reducing memory latency and increasing bandwidth efficiency. + +To achieve coalesced memory access in HIP, ensure that memory addresses accessed by consecutive threads are aligned. Structure data for coalesced access by storing it in a contiguous manner so that thread[i] can access array[i], and not some random location. Avoid strided access patterns, for example array[i * stride] can lead to memory bank conflicts and inefficient access. If all the threads in a warp can access consecutive memory locations, memory access is fully coalesced. + +Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts. + +Texture memory is read-only memory optimized for spatial locality and caching rather than coalescing. Texture memory is cached, unlike standard global memory, and it provides optimized access patterns for 2D and spatially local data. Accessing neighboring values results in cache hits, improving performance. Therefore, instead of worrying about coalescing, optimal memory access patterns involve ensuring that threads access spatially adjacent texture elements, and the memory layout aligns well with the 2D caching mechanism. + +Unified memory allows the CPU and GPU to share memory seamlessly, but performance depends on access patterns. Unified memory enables automatic page migration between CPU and GPU memory. 
However, if different threads access different pages, it can lead to expensive page migrations and slow throughput performance. Accessing unified memory in a structured, warp-friendly manner reduces unnecessary page transfers. Ensure threads access memory in a structured, consecutive manner, minimizing page faults. Prefetch data to the GPU before computation by using ``hipMemPrefetchAsync()``. In addition, using small batch transfers as described below, can reduce unexpected page migrations when using unified memory. + +Memory transfers between the host and the device can become a major bottleneck if not optimized. One method is to use small batch memory transfers where data is transferred in smaller chunks instead of a dealing with large datasets to avoid long blocking operations. Small batch transfers offer better PCIe bandwidth utilization over large data transfers. Small batch transfers offer performance improvement by offering reduced latency with small batches that run asynchronously using ``hipMemcpyAsync()`` as described in :ref:`asynchronous_how-to`, pipelining data transfers and kernel execution using separate streams. Finally, using pinned memory with small batch transfers enables faster DMA transfers without CPU involvement, greatly improving memory transfer performance. + Execution model =============== -HIP programs consist of two distinct scopes: +As previously discussed in :ref:`heterogeneous_programming`, HIP programs consist of two distinct scopes: * The host-side API running on the host processor. There are two APIs available: From 058193715486f247f756ce3f12de2a56022024b5 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sat, 8 Feb 2025 22:19:08 -0800 Subject: [PATCH 05/52] Update wordlist --- .wordlist.txt | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/.wordlist.txt b/.wordlist.txt index b3b8686678..b0434da0db 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -7,6 +7,7 @@ APUs AQL AXPY asm +Asynchronicity Asynchrony backtrace Bitcode @@ -67,6 +68,7 @@ hipModule hipModuleLaunchKernel hipother HIPRTC +hyperthreading icc IILE iGPU @@ -110,6 +112,7 @@ NDRange nonnegative NOP Numa +ns Nsight ocp omnitrace @@ -118,6 +121,7 @@ overindexing oversubscription overutilized parallelizable +pipelining pixelated pragmas preallocated @@ -167,5 +171,6 @@ unregister upscaled variadic vulkan +warpsize WinGDB zc From 7233ab6d1d38f9d81de44ba8f2deb989a057f5c0 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sat, 8 Feb 2025 22:21:31 -0800 Subject: [PATCH 06/52] warpSize --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 7c8f7b5d2c..64a92df470 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -162,7 +162,7 @@ kernel, gets mapped to the SIMD. This is done by grouping threads into warps, which contain as many threads as there are physical lanes in a SIMD, and issuing that instruction to the SIMD for every warp of a kernel. Ideally the SIMD is always fully utilized, however if the number of threads -can't be evenly divided by the warpsize, then the unused lanes are masked out +can't be evenly divided by the warpSize, then the unused lanes are masked out from the corresponding SIMD execution. 
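+
+One practical consequence is to pick launch sizes with the warp size in mind. The
+following host-side sketch is illustrative only (device ``0`` and the sizes shown are
+assumptions): it queries ``warpSize`` from the device properties so that a block is
+always a whole number of warps, and rounds the grid up so that every element of the
+problem is covered:
+
+.. code-block:: cpp
+
+  hipDeviceProp_t props;
+  hipGetDeviceProperties(&props, 0);           // query device 0
+
+  int problem_size = 1 << 20;
+  int threads_per_block = 4 * props.warpSize;  // a whole number of warps, e.g. 256
+  int number_of_blocks =
+      (problem_size + threads_per_block - 1) / threads_per_block;
+
+Threads in the final block that fall past the end of the data can then be skipped
+with a bounds check inside the kernel.
+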
A kernel follows the same C++ rules as the functions on the host, but it has a special ``__global__`` label to mark it for execution on the device, as shown in the following example: From 2d5663ab4c312bb016df116222926d58a70a43b7 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sat, 8 Feb 2025 22:23:57 -0800 Subject: [PATCH 07/52] wordlist --- .wordlist.txt | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/.wordlist.txt b/.wordlist.txt index 4afa5503ba..93742b8bb3 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -13,14 +13,11 @@ backtrace Bitcode bitcode bitcodes -blockDim -blockIdx builtins Builtins CAS clr compilable -constexpr coroutines Ctx cuBLASLt @@ -55,7 +52,6 @@ FNUZ fp gedit GPGPU -gridDim GROMACS GWS hardcoded @@ -93,7 +89,6 @@ iteratively Lapack latencies libc -libhipcxx libstdc lifecycle linearizing @@ -104,7 +99,6 @@ makefile Malloc malloc MALU -maxregcount MiB memset multicore @@ -127,7 +121,6 @@ overindexing oversubscription overutilized parallelizable -parallelized pipelining pixelated pragmas @@ -136,7 +129,6 @@ preconditioners predefining prefetched preprocessor -printf profilers PTX PyHIP @@ -161,12 +153,10 @@ sinewave SOMA SPMV structs -struct's SYCL syntaxes texel texels -threadIdx tradeoffs templated toolkits From de23034db0489a0972286f3c9f453251c7afbcd1 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sat, 8 Feb 2025 22:25:45 -0800 Subject: [PATCH 08/52] wordlist again --- .wordlist.txt | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.wordlist.txt b/.wordlist.txt index 93742b8bb3..513c63fd57 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -13,6 +13,8 @@ backtrace Bitcode bitcode bitcodes +blockDim +blockIdx builtins Builtins CAS From 6f2e2dcb8205e0bd33d123bdbaaedc23db1bae03 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sat, 8 Feb 2025 22:28:01 -0800 Subject: [PATCH 09/52] wordlist redux --- .wordlist.txt | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.wordlist.txt b/.wordlist.txt index 513c63fd57..09673755bd 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -54,6 +54,7 @@ FNUZ fp gedit GPGPU +gridDim GROMACS GWS hardcoded @@ -101,6 +102,7 @@ makefile Malloc malloc MALU +maxregcount MiB memset multicore @@ -159,6 +161,7 @@ SYCL syntaxes texel texels +threadIdx tradeoffs templated toolkits From 8dfc87c7d694f0897e5f49a6f3dae92306540f08 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sat, 8 Feb 2025 22:30:03 -0800 Subject: [PATCH 10/52] wordlist etc. --- .wordlist.txt | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.wordlist.txt b/.wordlist.txt index 09673755bd..1b1ba81a4e 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -20,6 +20,7 @@ Builtins CAS clr compilable +constexpr coroutines Ctx cuBLASLt @@ -133,6 +134,7 @@ preconditioners predefining prefetched preprocessor +printf profilers PTX PyHIP From a72c5fdb95bb7fdf31e56fe5e3b16e89b214d653 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sun, 9 Feb 2025 08:16:05 -0800 Subject: [PATCH 11/52] complete memory_model update --- docs/understand/programming_model.rst | 132 +++++++++----------------- 1 file changed, 45 insertions(+), 87 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 64a92df470..434b3173a3 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -23,8 +23,8 @@ make efficient use of HIP and general purpose graphics processing unit (GPGPU) programming in general. The following topics introduce you to the key concepts of GPU-based programming, and the HIP programming model. 
-Getting into Hardware: CPU vs GPU -================================= +Hardware differences: CPU vs GPU +================================ CPUs and GPUs have been designed for different purposes. CPUs have been designed to quickly execute a single thread, decreasing the time it takes for a single @@ -352,17 +352,27 @@ Local or per-thread memory CU as described in :doc:`Compute Units <./hardware_implementation>`, an important concept in resource usage and performance optimization. + Use local memory when the data is specific to a thread, to store variables generated + by the thread, or to provide register pressure relief for the thread. + Shared memory - Read-write storage visible to all the threads in a given block. + Read-write storage visible to all the threads in a given block. Use shared memory + when the data is reused within a thread block, when cross-thread communication + is needed, or to minimize global memory transactions by using device memory + whenever possible. Global Read-write storage visible to all threads in a given grid. There are specialized versions of global memory with different usage semantics which - are typically backed by the same hardware storing global. + are typically backed by the same hardware storing global. + + Use global memory when you have large datasets, are transferring memory between + the host and the device, and when you are sharing data between thread blocks. Constant Read-only storage visible to all threads in a given grid. It is a limited - segment of global with queryable size. + segment of global with queryable size. Use constant memory for read-only data + that is shared across multiple threads, and that has a small data size. Texture Read-only storage visible to all threads in a given grid and accessible @@ -371,92 +381,57 @@ Global Surface A read-write version of texture memory. -Using different memory types ----------------------------- - -* Use global memory when: - - - You are transferring data from the host to the device - - You have large data sets, and latency isn't an issue - - You are sharing data between thread blocks +Memory optimizations and best practices +--------------------------------------- -* Use shared memory when: +The following are a few memory access patterns and best practices to improve performance. See :ref:`memory_management` for additional details. - - The data is reused within a thread block - - Cross-thread communication is needed - - To reduce global memory bandwidth +* **Global memory**: Coalescing reduces the number of memory transactions. -* Use local memory when: + Coalesced memory access in HIP refers to the optimization of memory transactions to maximize throughput when accessing global memory. When a kernel accesses global memory, the memory transactions typically occur in chunks of 32, 64, or 128 bytes, which must be naturally aligned. Coalescing memory accesses means aligning and organizing these accesses so that multiple threads in a warp can combine their memory requests into the fewest possible transactions. If threads access memory in a coalesced manner, meaning consecutive threads read or write consecutive memory locations, the memory controller can merge these accesses into a single transaction. This is crucial because global memory bandwidth is relatively low compared to on-chip bandwidths, and non-optimal memory accesses can significantly impact performance. If all the threads in a warp can access consecutive memory locations, memory access is fully coalesced. 
- - The data is specific to a thread - - To store automatic variables for the thread - - To provide register pressure relief for the thread + To achieve coalesced memory access in HIP, you should: -* Use constant memory when: + 1. *Align Data*: Use data types that are naturally aligned and ensure that structures and arrays are aligned properly. + 2. *Optimize Access Patterns*: Arrange memory accesses so that consecutive threads in a warp access consecutive memory locations. For example, if threads access a 2D array, the array and thread block widths should be multiples of the warp size. + 3. *Avoid strided access*: For example array[i * stride] can lead to memory bank conflicts and inefficient access. + 4. *Pad Data*: If necessary, pad data structures to ensure alignment and coalescing. - - The data is read-only - - The same value is used across threads - - The data size is small +* **Shared memory**: Avoiding bank conflicts reduces serialization of memory transactions. -Memory access patterns and best practices ------------------------------------------ + Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts. -While you should refer to the :ref:`memory_management`, the following are a few memory -access patterns and best practices: +* **Texture memory**: Spatial locality improves caching performance. -* Global memory: Coalescing reduces memory transactions. -* Shared memory: Avoiding bank conflicts is crucial. -* Texture memory: Spatial locality improves caching. -* Unified memory: Structured access minimizes page migration overhead. + Texture memory is read-only memory optimized for spatial locality and caching rather than coalescing. Texture memory is cached, unlike standard global memory, and it provides optimized access patterns for 2D and spatially local data. Accessing neighboring values results in cache hits, improving performance. Therefore, instead of worrying about coalescing, optimal memory access patterns involve ensuring that threads access spatially adjacent texture elements, and the memory layout aligns well with the 2D caching mechanism. -When a kernel accesses global memory, the memory transactions typically occur in chunks of 32, 64, or 128 bytes. If threads access memory in a coalesced manner, meaning consecutive threads read or write consecutive memory locations, the memory controller can merge these accesses into a single transaction. Coalesced access primarily applies to global memory, which is the largest but slowest type of memory on a GPU and coalesced access significantly improves performance by reducing memory latency and increasing bandwidth efficiency. +* **Unified memory**: Structured access reduces the overhead of page migrations. -To achieve coalesced memory access in HIP, ensure that memory addresses accessed by consecutive threads are aligned. Structure data for coalesced access by storing it in a contiguous manner so that thread[i] can access array[i], and not some random location. 
Avoid strided access patterns, for example array[i * stride] can lead to memory bank conflicts and inefficient access. If all the threads in a warp can access consecutive memory locations, memory access is fully coalesced. + Unified memory allows the CPU and GPU to share memory seamlessly, but performance depends on access patterns. Unified memory enables automatic page migration between CPU and GPU memory. However, if different threads access different pages, it can lead to expensive page migrations and slow throughput performance. Accessing unified memory in a structured, warp-friendly manner reduces unnecessary page transfers. Ensure threads access memory in a structured, consecutive manner, minimizing page faults. Prefetch data to the GPU before computation by using ``hipMemPrefetchAsync()``. In addition, using small batch transfers as described below, can reduce unexpected page migrations when using unified memory. -Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts. +* **Small batch transfers**: Enable pipelining and improve PCIe bandwidth use. -Texture memory is read-only memory optimized for spatial locality and caching rather than coalescing. Texture memory is cached, unlike standard global memory, and it provides optimized access patterns for 2D and spatially local data. Accessing neighboring values results in cache hits, improving performance. Therefore, instead of worrying about coalescing, optimal memory access patterns involve ensuring that threads access spatially adjacent texture elements, and the memory layout aligns well with the 2D caching mechanism. - -Unified memory allows the CPU and GPU to share memory seamlessly, but performance depends on access patterns. Unified memory enables automatic page migration between CPU and GPU memory. However, if different threads access different pages, it can lead to expensive page migrations and slow throughput performance. Accessing unified memory in a structured, warp-friendly manner reduces unnecessary page transfers. Ensure threads access memory in a structured, consecutive manner, minimizing page faults. Prefetch data to the GPU before computation by using ``hipMemPrefetchAsync()``. In addition, using small batch transfers as described below, can reduce unexpected page migrations when using unified memory. - -Memory transfers between the host and the device can become a major bottleneck if not optimized. One method is to use small batch memory transfers where data is transferred in smaller chunks instead of a dealing with large datasets to avoid long blocking operations. Small batch transfers offer better PCIe bandwidth utilization over large data transfers. Small batch transfers offer performance improvement by offering reduced latency with small batches that run asynchronously using ``hipMemcpyAsync()`` as described in :ref:`asynchronous_how-to`, pipelining data transfers and kernel execution using separate streams. 
Finally, using pinned memory with small batch transfers enables faster DMA transfers without CPU involvement, greatly improving memory transfer performance. + Memory transfers between the host and the device can become a major bottleneck if not optimized. One method is to use small batch memory transfers where data is transferred in smaller chunks instead of a dealing with large datasets to avoid long blocking operations. Small batch transfers offer better PCIe bandwidth utilization over large data transfers. Small batch transfers offer performance improvement by offering reduced latency with small batches that run asynchronously using ``hipMemcpyAsync()`` as described in :ref:`asynchronous_how-to`, pipelining data transfers and kernel execution using separate streams. Finally, using pinned memory with small batch transfers enables faster DMA transfers without CPU involvement, greatly improving memory transfer performance. Execution model =============== As previously discussed in :ref:`heterogeneous_programming`, HIP programs consist of two distinct scopes: -* The host-side API running on the host processor. There are two APIs available: - - * The HIP runtime API which enables use of the single-source programming - model. - - * The HIP driver API which sits at a lower level and most importantly differs - by removing some facilities provided by the runtime API, most - importantly around kernel launching and argument setting. It is geared - towards implementing abstractions atop, such as the runtime API itself. - Offers two additional pieces of functionality not provided by the Runtime - API: ``hipModule`` and ``hipCtx`` APIs. For further details, check - :doc:`HIP driver API `. +* The host-side API running on the host processor. +* The device-side kernels running on GPUs. -* The device-side kernels running on GPUs. Both the host and the device-side - APIs have synchronous and asynchronous functions in them. - -.. note:: - - The HIP does not present two *separate* APIs link NVIDIA CUDA. HIP only extends - the HIP runtime API with new APIs for ``hipModule`` and ``hipCtx``. +Both the host and the device-side APIs have synchronous and asynchronous functions. Host-side execution ------------------- -The part of the host-side API which deals with device management and their -queries are synchronous. All asynchronous APIs, such as kernel execution, data -movement and potentially data allocation/freeing all happen in the context of -device streams. +The host-side API dealing with device management and their queries are synchronous. +All asynchronous APIs, such as kernel execution, data movement and potentially data +allocation/freeing all happen in the context of device streams. Streams are FIFO buffers of commands to execute relating to a given device. -Commands which enqueue tasks on a stream all return promptly and the command is +Operations which enqueue tasks on a stream all return promptly and the command is executed asynchronously. All side effects of a command on a stream are visible to all subsequent commands on the same stream. Multiple streams may point to the same device and those streams may be fed from multiple concurrent host-side @@ -465,10 +440,10 @@ be. Asynchronous APIs involving a stream all return a stream event which may be used to synchronize the execution of multiple streams. A user may enqueue a -barrier onto a stream referencing an event. 
The barrier will block until -the command related to the event does not complete, at which point all -side effects of the command shall be visible to commands following the barrier, -even if those side effects manifest on different devices. +barrier onto a stream referencing an event. The barrier will block activity on the +stream until the operation related to the event completes. After the event completes, all +side effects of the operation will be visible to subsequent commands even if those +side effects manifest on different devices. Streams also support executing user-defined functions as callbacks on the host. The stream will not launch subsequent commands until the callback completes. @@ -476,16 +451,7 @@ The stream will not launch subsequent commands until the callback completes. Device-side execution --------------------- -The SIMT programming model behind the HIP device-side execution is a -middle-ground between SMT (Simultaneous Multi-Threading) programming known from -multicore CPUs, and SIMD (Single Instruction, Multiple Data) programming -mostly known from exploiting relevant instruction sets on CPUs (for example -SSE/AVX/Neon). - -Kernel launch -------------- - -Kernels may be launched in multiple ways all with different syntaxes and +Kernels may be launched in multiple ways, all with different syntaxes and intended use-cases. * Using the triple-chevron ``<<<...>>>`` operator on a ``__global__`` annotated @@ -501,11 +467,3 @@ intended use-cases. ``HIP_TEMPLATE_KERNEL_LAUNCH`` preprocessor macro before including the HIP headers to turn it into a templated function. -* Using the launch APIs supporting the triple-chevron syntax directly. - - .. caution:: - - These APIs are intended to be used/generated by tools such as the HIP - compiler itself and not intended towards end-user code. Should you be - writing a tool having to launch device code using HIP, consider using these - over the alternatives. From d565c12dcb85d31417268596135b677ccfa359f5 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Sun, 9 Feb 2025 13:21:53 -0800 Subject: [PATCH 12/52] started Execution model --- docs/understand/programming_model.rst | 34 +++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 434b3173a3..65eab248cf 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -467,3 +467,37 @@ intended use-cases. ``HIP_TEMPLATE_KERNEL_LAUNCH`` preprocessor macro before including the HIP headers to turn it into a templated function. +Asynchronous execution +---------------------- + +Asynchronous operations between the host and the kernel provide a variety of opportunities, or challenges, for managing synchronization. For instance, a basic model would be to launch an asynchronous operation on a kernel in a stream, create an event to track the operation, continue operations in the host program, and when the asynchronous operation completes synchronize the kernel to return the results. This basic example might look something like the following: + +.. 
code_block:: cpp + + // Create a HIP stream + hipStream_t stream; + hipStreamCreate(&stream); + + // Launch the kernel asynchronously + myKernel<<>>(d_data); + + // Perform continued host processing here + // This could be any CPU-bound work that doesn't depend on the kernel's result + doHostProcessing(); + + // Synchronize the stream to ensure kernel execution is complete + hipStreamSynchronize(stream); + + // Any host processing that depends on the kernel's result should occur after synchronization + processKernelResults(); + + // Copy the result back to the host + hipMemcpy(...); + +However, one of the opportunities of asynchronous operation is the pipelining of operations between launching kernels and transferring memory. In this case you would be working with multiple streams running concurrently, or at least overlapping in some regard, and managing any dependencies between the streams in the host application. + +There is also the producer-consumer paradigm that can be used to convert a sequential program into parallel operations to improve performance. This process can employ multiple streams to kick off asynchronous kernels, provide data to the kernels, perform operations, and return the results for further processing in the host application. + +These asynchronous activities call for stream management strategies. In the case of the single stream, the only management would be the synchronization of the stream when the work was complete. However, with multiple streams, you have overlapping execution of operations, and synchronization becomes more complex. You need to manage the activities of each stream, evaluating the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order. + +All of this could probably use some examples. Let us know where you find them. \ No newline at end of file From 5fcab493e4199fd12386a684c50039cc31bd5cbe Mon Sep 17 00:00:00 2001 From: randyh62 Date: Tue, 11 Feb 2025 13:11:06 -0800 Subject: [PATCH 13/52] remove Introduxtory content --- docs/programming_guide.rst | 83 -------------------------------------- 1 file changed, 83 deletions(-) delete mode 100644 docs/programming_guide.rst diff --git a/docs/programming_guide.rst b/docs/programming_guide.rst deleted file mode 100644 index 7444408866..0000000000 --- a/docs/programming_guide.rst +++ /dev/null @@ -1,83 +0,0 @@ -.. meta:: - :description: HIP programming guide introduction - :keywords: HIP programming guide introduction, HIP programming guide - -.. _hip-programming-guide: - -******************************************************************************** -HIP programming guide introduction -******************************************************************************** - -This topic provides key HIP programming concepts and links to more detailed -information. - -Write GPU Kernels for Parallel Execution -================================================================================ - -To make the most of the parallelism inherent to GPUs, a thorough understanding -of the :ref:`programming model ` is helpful. The HIP -programming model is designed to make it easy to map data-parallel algorithms to -architecture of the GPUs. HIP employs the SIMT-model (Single Instruction -Multiple Threads) with a multi-layered thread hierarchy for efficient execution. 
- -Understand the Target Architecture (CPU and GPU) -================================================================================ - -The :ref:`hardware implementation ` topic outlines the -GPUs supported by HIP. In general, GPUs are made up of Compute Units that excel -at executing parallelizable, computationally intensive workloads without complex -control-flow. - -Increase parallelism on multiple level -================================================================================ - -To maximize performance and keep all system components fully utilized, the -application should expose and efficiently manage as much parallelism as possible. -:ref:`Parallel execution ` can be achieved at the -application, device, and multiprocessor levels. - -The application’s host and device operations can achieve parallel execution -through asynchronous calls, streams, or HIP graphs. On the device level, -multiple kernels can execute concurrently when resources are available, and at -the multiprocessor level, developers can overlap data transfers with -computations to further optimize performance. - -Memory management -================================================================================ - -GPUs generally have their own distinct memory, also called :ref:`device -memory `, separate from the :ref:`host memory `. -Device memory needs to be managed separately from the host memory. This includes -allocating the memory and transfering it between the host and the device. These -operations can be performance critical, so it's important to know how to use -them effectively. For more information, see :ref:`Memory management `. - -Synchronize CPU and GPU Workloads -================================================================================ - -Tasks on the host and devices run asynchronously, so proper synchronization is -needed when dependencies between those tasks exist. The asynchronous execution -of tasks is useful for fully utilizing the available resources. Even when only a -single device is available, memory transfers and the execution of tasks can be -overlapped with asynchronous execution. - -Error Handling -================================================================================ - -All functions in the HIP runtime API return an error value of type -:cpp:enum:`hipError_t` that can be used to verify whether the function was -successfully executed. It's important to confirm these returned values, in order -to catch and handle those errors, if possible. An exception is kernel launches, -which don't return any value. These errors can be caught with specific functions -like :cpp:func:`hipGetLastError()`. - -For more information, see :ref:`error_handling` . - -Multi-GPU and Load Balancing -================================================================================ - -Large-scale applications that need more compute power can use multiple GPUs in -the system. This requires distributing workloads across multiple GPUs to balance -the load to prevent GPUs from being overutilized while others are idle. - -For more information, see :ref:`multi-device` . 
\ No newline at end of file From e597a95171520be2386804fb91afd04b63a5b776 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Tue, 11 Feb 2025 14:04:20 -0800 Subject: [PATCH 14/52] Add references from Introduction --- docs/understand/programming_model.rst | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 65eab248cf..81e5c36cc9 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -384,7 +384,7 @@ Global Memory optimizations and best practices --------------------------------------- -The following are a few memory access patterns and best practices to improve performance. See :ref:`memory_management` for additional details. +The following are a few memory access patterns and best practices to improve performance. You can find additional information in :ref:`memory_management` and :doc:`../how-to/performance_guidelines`. * **Global memory**: Coalescing reduces the number of memory transactions. @@ -500,4 +500,13 @@ There is also the producer-consumer paradigm that can be used to convert a seque These asynchronous activities call for stream management strategies. In the case of the single stream, the only management would be the synchronization of the stream when the work was complete. However, with multiple streams, you have overlapping execution of operations, and synchronization becomes more complex. You need to manage the activities of each stream, evaluating the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order. -All of this could probably use some examples. Let us know where you find them. \ No newline at end of file +All of this could probably use some examples. Let us know where you find them. + +Multi-GPU and Load Balancing +---------------------------- + +Large-scale applications that need more compute power can use multiple GPUs in +the system. This requires distributing workloads across multiple GPUs to balance +the load to prevent GPUs from being overutilized while others are idle. + +For more information, see :ref:`multi-device` . 
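As a rough sketch of the multi-GPU idea, assuming the workload can be split into independent per-device chunks (the chunking and per-device kernels are omitted here and are not part of the original text), the host might enumerate and select devices like this:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   void distribute_work()
   {
       int device_count = 0;
       hipGetDeviceCount(&device_count);

       // Round-robin independent chunks of work over the available GPUs.
       // Error checking is omitted for brevity.
       for (int dev = 0; dev < device_count; ++dev) {
           hipSetDevice(dev);
           // ... allocate device memory, copy this device's chunk,
           //     and launch its kernel here ...
       }
   }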
\ No newline at end of file From b136d515554b08c5a8c6f18a5d632d0e195ab693 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Tue, 11 Feb 2025 14:11:20 -0800 Subject: [PATCH 15/52] edit index and TOC --- docs/index.md | 1 - docs/sphinx/_toc.yml.in | 2 -- 2 files changed, 3 deletions(-) diff --git a/docs/index.md b/docs/index.md index 7678aaae79..d36f683dcb 100644 --- a/docs/index.md +++ b/docs/index.md @@ -22,7 +22,6 @@ The HIP documentation is organized into the following categories: :::{grid-item-card} Programming guide -* [Introduction](./programming_guide) * {doc}`./understand/programming_model` * {doc}`./understand/hardware_implementation` * {doc}`./understand/compilers` diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index ed0d7f914d..dcb19c2825 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -24,8 +24,6 @@ subtrees: - caption: Programming guide entries: - - file: programming_guide - title: Introduction - file: understand/programming_model - file: understand/hardware_implementation - file: understand/compilers From a378368d5074f6d6974b2b3c379990948b4cbdaf Mon Sep 17 00:00:00 2001 From: randyh62 Date: Wed, 12 Feb 2025 10:44:36 -0800 Subject: [PATCH 16/52] Add code examples and images --- .../cpu-gpu-comparison-adjusted.svg | 70 +++++ .../programming_model/host-device-flow.svg | 39 +++ .../programming_model/memory-access.svg | 69 +++++ .../simt-execution-revised.svg | 62 +++++ .../programming_model/stream-workflow.svg | 45 ++++ docs/understand/programming_model.rst | 252 +++++++++++++++++- 6 files changed, 525 insertions(+), 12 deletions(-) create mode 100644 docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg create mode 100644 docs/data/understand/programming_model/host-device-flow.svg create mode 100644 docs/data/understand/programming_model/memory-access.svg create mode 100644 docs/data/understand/programming_model/simt-execution-revised.svg create mode 100644 docs/data/understand/programming_model/stream-workflow.svg diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg b/docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg new file mode 100644 index 0000000000..354b0107d0 --- /dev/null +++ b/docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg @@ -0,0 +1,70 @@ + + + + + + CPU vs GPU Architecture + + + + + CPU + + + + + + + + + + + + Large Complex Cores + High Clock Speed (3-5 GHz) + + + + + + GPU + + + + + + + + + + + + + + + + + + + + + + + + Many Simple Cores + Lower Clock Speed (1-2 GHz) + + + + + + + Large Cache per Core + + + + + + Shared Memory across Cores + + diff --git a/docs/data/understand/programming_model/host-device-flow.svg b/docs/data/understand/programming_model/host-device-flow.svg new file mode 100644 index 0000000000..0a965a978d --- /dev/null +++ b/docs/data/understand/programming_model/host-device-flow.svg @@ -0,0 +1,39 @@ + + + + + + Host-Device Data Flow + + + + Host (CPU) + + + + Device (GPU) + + + + + 1. Initialize + + + + 2. Transfer Data + + + + 3. Execute Kernel + + + + 4. 
Return Results + + + + + + + + diff --git a/docs/data/understand/programming_model/memory-access.svg b/docs/data/understand/programming_model/memory-access.svg new file mode 100644 index 0000000000..09a620587b --- /dev/null +++ b/docs/data/understand/programming_model/memory-access.svg @@ -0,0 +1,69 @@ + + + + + + Memory Access Patterns + + + Uncoalesced Access + + + + + Threads in Warp + + + + + + + + + Memory + + + + + + + + + + + + + Coalesced Access + + + + + Threads in Warp + + + + + + + + + Memory + + + + + + + + + + + + + + + + + + + diff --git a/docs/data/understand/programming_model/simt-execution-revised.svg b/docs/data/understand/programming_model/simt-execution-revised.svg new file mode 100644 index 0000000000..806d565425 --- /dev/null +++ b/docs/data/understand/programming_model/simt-execution-revised.svg @@ -0,0 +1,62 @@ + + + + + + SIMT Execution Model + + + + a[i] = b[i] + c[i] + + + + + + + + + + + Thread 0 + + b[0] = 5 + c[0] = 3 + a[0] = 8 + + + + + + Thread 1 + + b[1] = 2 + c[1] = 4 + a[1] = 6 + + + + + Thread 2 + + b[2] = 7 + c[2] = 1 + a[2] = 8 + + + + + Thread 3 + + b[3] = 3 + c[3] = 5 + a[3] = 8 + + + + + + + + + diff --git a/docs/data/understand/programming_model/stream-workflow.svg b/docs/data/understand/programming_model/stream-workflow.svg new file mode 100644 index 0000000000..96da9cedcd --- /dev/null +++ b/docs/data/understand/programming_model/stream-workflow.svg @@ -0,0 +1,45 @@ + + + + + + Stream and Event Workflow + + + + + + + + Stream 1 + Stream 2 + Stream 3 + + + + + + + + + + + + + + + + + + + + + + + + + + Operation + + Event + diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 81e5c36cc9..b3b598f587 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -32,6 +32,15 @@ operation, increasing the amount of sequential instructions that can be executed This includes fetching data, and reducing pipeline stalls where the ALU has to wait for previous instructions to finish. +.. figure:: ../data/understand/programming_model/cpu-gpu-comparison-adjusted.svg + :alt: Diagram depicting the differences between CPU and GPU hardware. + The CPU block shows four large processing cores, lists Large Cache per + Core, and High Clock Speed of 3 to 5 gigahertz. The GPU block shows 42 + smaller processing cores, lists Shared Memory across Cores, and Lower + Clock SPeeds of 1 to 2 gigahertz. + + Differences in CPUs and GPUs + On CPUs the goal is to quickly process operations. CPUs provide low latency processing for serial instructions. On the other hand, GPUs have been designed to execute many similar commands, or threads, in parallel, achieving higher throughput. Latency is the delay from when an operation @@ -134,6 +143,14 @@ In heterogeneous programming, the CPU is available for processing operations but This structure allows for efficient use of GPU resources and facilitates the acceleration of compute-intensive tasks while keeping the host CPU available for other tasks. +.. figure:: ../data/understand/programming_model/host-device-flow.svg + :alt: Diagram depicting a host CPU and device GPU rectangles of varying color. + There are arrows pointing between the rectangles showing from the Host + to the Device the initialization, data transfer, and Kernel execution + steps, and from the Device back to the Host the returning results. + + Interaction of Host and Device in a GPU application + .. _device_program: Device programming @@ -165,6 +182,16 @@ warp of a kernel. 
Ideally the SIMD is always fully utilized, however if the numb can't be evenly divided by the warpSize, then the unused lanes are masked out from the corresponding SIMD execution. +.. _simt: + +.. figure:: ../data/understand/programming_model/simt-execution-revised.svg + :alt: Diagram depicting the SIMT execution model. There is a red rectangle + which contains the expression a[i] = b[i] + c[i], and below that four + arrows that point to Thread 0,1,2, and 3. Each thread contains different + values for b, c, and a, showing the parallel operations of this equation. + + Instruction flow of a sample SIMT program + A kernel follows the same C++ rules as the functions on the host, but it has a special ``__global__`` label to mark it for execution on the device, as shown in the following example: .. code-block:: cpp @@ -217,16 +244,6 @@ execution, one thread down each lane of an SIMD engine. Execution parallelism usually isn't exploited from the width of the built-in vector types, but across multiple threads via the thread ID constants ``threadIdx.x``, ``blockIdx.x``, etc. -.. _simt: - -.. figure:: ../data/understand/programming_model/simt.svg - :alt: Image representing the instruction flow of a SIMT program. Two identical - arrows pointing downward with blocks representing the instructions - inside and ellipsis between the arrows. The instructions represented in - the arrows are, from top to bottom: ADD, DIV, FMA, FMA, FMA and FMA. - - Instruction flow of a sample SIMT program. - .. _inherent_thread_model: Hierarchical thread model @@ -384,6 +401,14 @@ Global Memory optimizations and best practices --------------------------------------- +.. figure:: ../data/understand/programming_model/memory-access.svg + :alt: Diagram depicting an example memory access pattern for coalesced memory. + The diagram has uncoalesced access on the left side, with consecutive + threads accessing memory in a random pattern. With coalesced access on the + right showing consecutive threads accessing consecutive memory addresses. + + Coalesced memory accesses + The following are a few memory access patterns and best practices to improve performance. You can find additional information in :ref:`memory_management` and :doc:`../how-to/performance_guidelines`. * **Global memory**: Coalescing reduces the number of memory transactions. @@ -445,6 +470,15 @@ stream until the operation related to the event completes. After the event compl side effects of the operation will be visible to subsequent commands even if those side effects manifest on different devices. +.. figure:: ../data/understand/programming_model/stream-workflow.svg + :alt: Diagram depicting the stream and event workflow, with an example of + multiple streams working together. The diagram shows operations as red + rectangles, and events as white dots. There are three streams labelled + Stream 1, 2, and 3. The streams each have multiple operations and events + that require synchronization between the streams. + + Multiple stream workflow + Streams also support executing user-defined functions as callbacks on the host. The stream will not launch subsequent commands until the callback completes. @@ -496,11 +530,205 @@ Asynchronous operations between the host and the kernel provide a variety of opp However, one of the opportunities of asynchronous operation is the pipelining of operations between launching kernels and transferring memory. 
In this case you would be working with multiple streams running concurrently, or at least overlapping in some regard, and managing any dependencies between the streams in the host application. +.. code_block:: cpp + + #include + #include + #include + + #define hip_check(hip_call) \ + { \ + auto hip_res = hip_call; \ + if (hip_res != hipSuccess) { \ + std::cerr << "Failed in hip call: " << #hip_call \ + << " with error: " << hipGetErrorName(hip_res) << std::endl; \ + std::abort(); \ + } \ + } + + + __global__ void vector_add(const float* a, const float* b, float* c, int n_elements) { + int idx = blockDim.x * blockIdx.x + threadIdx.x; + if (idx < n_elements) { + c[idx] = a[idx] + b[idx]; + } + } + + int main() { + const int n_elements = 1 << 20; // 1M elements + const int n_streams = 4; + const int elements_per_stream = n_elements / n_streams; + + // Create streams + std::vector streams(n_streams); + for (int i = 0; i < n_streams; i++) { + hip_check(hipStreamCreate(&streams[i])); + } + + // Allocate and initialize memory + std::vector h_a(n_elements), h_b(n_elements), h_c(n_elements); + std::vector h_ref(n_elements); // For validation + for (int i = 0; i < n_elements; i++) { + h_a[i] = static_cast(i); + h_b[i] = static_cast(i * 2); + h_ref[i] = h_a[i] + h_b[i]; + } + + float *d_a, *d_b, *d_c; + hip_check(hipMalloc(&d_a, n_elements * sizeof(float))); + hip_check(hipMalloc(&d_b, n_elements * sizeof(float))); + hip_check(hipMalloc(&d_c, n_elements * sizeof(float))); + + // Pipeline operations across streams + for (int i = 0; i < n_streams; i++) { + int offset = i * elements_per_stream; + int bytes = elements_per_stream * sizeof(float); + + // Stage 1: Copy input data + hip_check(hipMemcpyAsync(d_a + offset, h_a.data() + offset, bytes, + hipMemcpyHostToDevice, streams[i])); + hip_check(hipMemcpyAsync(d_b + offset, h_b.data() + offset, bytes, + hipMemcpyHostToDevice, streams[i])); + + // Stage 2: Launch kernel + const int n_threads = 256; + const int n_blocks = (elements_per_stream + n_threads - 1) / n_threads; + vector_add<<>>( + d_a + offset, d_b + offset, d_c + offset, elements_per_stream); + hip_check(hipGetLastError()); + + // Stage 3: Copy result back + hip_check(hipMemcpyAsync(h_c.data() + offset, d_c + offset, bytes, + hipMemcpyDeviceToHost, streams[i])); + } + + // Wait for completion and validate + for (auto stream : streams) { + hip_check(hipStreamSynchronize(stream)); + } + + bool passed = true; + for (int i = 0; i < n_elements; i++) { + if (std::abs(h_c[i] - h_ref[i]) > 1e-5f) { + std::cerr << "Validation failed at " << i << ": " + << h_c[i] << " != " << h_ref[i] << std::endl; + passed = false; + break; + } + } + + // Cleanup + for (auto stream : streams) { + hip_check(hipStreamDestroy(stream)); + } + hip_check(hipFree(d_a)); + hip_check(hipFree(d_b)); + hip_check(hipFree(d_c)); + + std::cout << "Pipeline example " << (passed ? "PASSED" : "FAILED") << std::endl; + return passed ? 0 : 1; + } + There is also the producer-consumer paradigm that can be used to convert a sequential program into parallel operations to improve performance. This process can employ multiple streams to kick off asynchronous kernels, provide data to the kernels, perform operations, and return the results for further processing in the host application. -These asynchronous activities call for stream management strategies. In the case of the single stream, the only management would be the synchronization of the stream when the work was complete. 
However, with multiple streams, you have overlapping execution of operations, and synchronization becomes more complex. You need to manage the activities of each stream, evaluating the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order. +.. code_block:: cpp -All of this could probably use some examples. Let us know where you find them. + #include + #include + #include + #include + + #define hip_check(hip_call) \ + { \ + auto hip_res = hip_call; \ + if (hip_res != hipSuccess) { \ + std::cerr << "Failed in hip call: " << #hip_call \ + << " with error: " << hipGetErrorName(hip_res) << std::endl; \ + std::abort(); \ + } \ + } + + __global__ void producer_kernel(int* data, int n_elements) { + int idx = blockDim.x * blockIdx.x + threadIdx.x; + if (idx < n_elements) { + data[idx] = idx * idx; // Generate data + } + } + + __global__ void consumer_kernel(const int* input, float* output, int n_elements) { + int idx = blockDim.x * blockIdx.x + threadIdx.x; + if (idx < n_elements) { + output[idx] = sqrt(static_cast(input[idx])); // Process data + } + } + + int main() { + const int n_elements = 1 << 20; // 1M elements + + // Allocate device memory + int* d_intermediate; + float* d_output; + hip_check(hipMalloc(&d_intermediate, n_elements * sizeof(int))); + hip_check(hipMalloc(&d_output, n_elements * sizeof(float))); + + // Create streams and event + hipStream_t producer_stream, consumer_stream; + hipEvent_t data_ready; + hip_check(hipStreamCreate(&producer_stream)); + hip_check(hipStreamCreate(&consumer_stream)); + hip_check(hipEventCreate(&data_ready)); + + // Launch configuration + const int n_threads = 256; + const int n_blocks = (n_elements + n_threads - 1) / n_threads; + + // Stage 1: Producer generates data + producer_kernel<<>>( + d_intermediate, n_elements); + hip_check(hipGetLastError()); + hip_check(hipEventRecord(data_ready, producer_stream)); + + // Stage 2: Consumer waits for data and processes it + hip_check(hipStreamWaitEvent(consumer_stream, data_ready)); + consumer_kernel<<>>( + d_intermediate, d_output, n_elements); + hip_check(hipGetLastError()); + + // Get result and validate + std::vector result(n_elements); + std::vector reference(n_elements); + hip_check(hipMemcpyAsync(result.data(), d_output, n_elements * sizeof(float), + hipMemcpyDeviceToHost, consumer_stream)); + hip_check(hipStreamSynchronize(consumer_stream)); + + // Compute reference results + for (int i = 0; i < n_elements; i++) { + reference[i] = std::sqrt(static_cast(i * i)); + } + + // Validate + bool passed = true; + for (int i = 0; i < n_elements; i++) { + if (std::abs(result[i] - reference[i]) > 1e-5f) { + std::cerr << "Validation failed at " << i << ": " + << result[i] << " != " << reference[i] << std::endl; + passed = false; + break; + } + } + + // Cleanup + hip_check(hipFree(d_intermediate)); + hip_check(hipFree(d_output)); + hip_check(hipEventDestroy(data_ready)); + hip_check(hipStreamDestroy(producer_stream)); + hip_check(hipStreamDestroy(consumer_stream)); + + std::cout << "Producer-consumer example " << (passed ? "PASSED" : "FAILED") << std::endl; + return passed ? 0 : 1; + } + +These asynchronous activities call for stream management strategies. In the case of the single stream, the only management would be the synchronization of the stream when the work was complete. However, with multiple streams, you have overlapping execution of operations, and synchronization becomes more complex. 
You need to manage the activities of each stream, evaluating the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order. Multi-GPU and Load Balancing ---------------------------- From e9afd84c5cd8d6c2bbd0743888173985b4f68071 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Wed, 12 Feb 2025 11:01:45 -0800 Subject: [PATCH 17/52] Address Adel comments --- docs/understand/programming_model.rst | 34 +++++++++++++++------------ 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index b3b598f587..88daa9536a 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -10,10 +10,10 @@ Introduction to HIP programming model ******************************************************************************* -The HIP programming model makes it easy to map data-parallel C/C++ algorithms to -massively parallel, wide single instruction, multiple data (SIMD) architectures, -such as GPUs. HIP supports many imperative languages, such as Python via PyHIP, -but this document focuses on the original C/C++ API of HIP. +The HIP programming model enables mapping data-parallel C/C++ algorithms to massively +parallel SIMD (Single Instruction, Multiple Data) architectures like GPUs. HIP +supports many imperative languages, such as Python via PyHIP, but this document +focuses on the original C/C++ API of HIP. While GPUs may be capable of running applications written for CPUs if properly ported and compiled, it would not be an efficient use of GPU resources. GPUs are different @@ -43,9 +43,9 @@ wait for previous instructions to finish. On CPUs the goal is to quickly process operations. CPUs provide low latency processing for serial instructions. On the other hand, GPUs have been designed to execute many similar commands, or threads, -in parallel, achieving higher throughput. Latency is the delay from when an operation -is started to when it returns, such as 2 ns, while throughput is the number of operations completed -in a period of time, such as ten thousand threads completed. +in parallel, achieving higher throughput. Latency is the time between starting an +operation and receiving its result, such as 2 ns, while throughput is the rate of +completed operations, for example operations per second. For the GPU, the objective is to process as many operations in parallel, rather than to finish a single instruction quickly. GPUs in general are made up of basic @@ -164,7 +164,7 @@ The device or kernel program acts as workers on the GPU application, distributin 4. Computation: Threads perform the required computations on the input data, and generate any needed output. Each thread of the kernel runs the same instruction simultaneously on the different datasets. This sometimes require multiple iterations when the number of operations exceeds the resources of the CU. 5. Synchronization: When needed, threads synchronize within their block to ensure correct results when working with shared memory. -Kernels can be simple single instruction programs deployed across multiple threads in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. 
However, heterogeneous GPU applications can also become quite complex, managing hundreds, thousands, or hundreds of thousands of operations with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage concurrent asynchronous operations, using rich libraries of functions optimized for GPU hardware as described in the `ROCm documentation `_. +Kernels are parallel programs that execute the same instruction set across multiple threads, organized in wavefronts, as described below and as demonstrated in the `Hello World tutorial `_ or :doc:`../tutorial/saxpy`. However, heterogeneous GPU applications can also become quite complex, managing hundreds, thousands, or hundreds of thousands of operations with repeated data transfers between host and device to support massive parallelization, using multiple streams to manage concurrent asynchronous operations, using rich libraries of functions optimized for GPU hardware as described in the `ROCm documentation `_. .. _programming_model_simt: @@ -346,10 +346,12 @@ For further information, see :doc:`Cooperative groups Date: Thu, 13 Feb 2025 16:16:51 -0800 Subject: [PATCH 18/52] Add Istvan links --- docs/understand/programming_model.rst | 257 +++----------------------- 1 file changed, 28 insertions(+), 229 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 88daa9536a..94d823aaaf 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -91,7 +91,7 @@ across large datasets, with little branching, where the volume of operations is .. _heterogeneous_programming: -Heterogeneous Programming +Heterogeneous programming ========================= The HIP programming model has two execution contexts. The main application starts on the CPU, or @@ -455,7 +455,7 @@ Host-side execution The host-side API dealing with device management and their queries are synchronous. All asynchronous APIs, such as kernel execution, data movement and potentially data -allocation/freeing all happen in the context of device streams. +allocation/freeing all happen in the context of device streams, as described in `Managing streams <../how-to/hip_runtime_api/asynchronous.html#managing-streams>`_. Streams are FIFO buffers of commands to execute relating to a given device. Operations which enqueue tasks on a stream all return promptly and the command is @@ -506,233 +506,32 @@ intended use-cases. Asynchronous execution ---------------------- -Asynchronous operations between the host and the kernel provide a variety of opportunities, or challenges, for managing synchronization. For instance, a basic model would be to launch an asynchronous operation on a kernel in a stream, create an event to track the operation, continue operations in the host program, and when the asynchronous operation completes synchronize the kernel to return the results. This basic example might look something like the following: - -.. 
code_block:: cpp - - // Create a HIP stream - hipStream_t stream; - hipStreamCreate(&stream); - - // Launch the kernel asynchronously - myKernel<<>>(d_data); - - // Perform continued host processing here - // This could be any CPU-bound work that doesn't depend on the kernel's result - doHostProcessing(); - - // Synchronize the stream to ensure kernel execution is complete - hipStreamSynchronize(stream); - - // Any host processing that depends on the kernel's result should occur after synchronization - processKernelResults(); - - // Copy the result back to the host - hipMemcpy(...); - -However, one of the opportunities of asynchronous operation is the pipelining of operations between launching kernels and transferring memory. In this case you would be working with multiple streams running concurrently, or at least overlapping in some regard, and managing any dependencies between the streams in the host application. - -.. code_block:: cpp - - #include - #include - #include - - #define hip_check(hip_call) \ - { \ - auto hip_res = hip_call; \ - if (hip_res != hipSuccess) { \ - std::cerr << "Failed in hip call: " << #hip_call \ - << " with error: " << hipGetErrorName(hip_res) << std::endl; \ - std::abort(); \ - } \ - } - - - __global__ void vector_add(const float* a, const float* b, float* c, int n_elements) { - int idx = blockDim.x * blockIdx.x + threadIdx.x; - if (idx < n_elements) { - c[idx] = a[idx] + b[idx]; - } - } - - int main() { - const int n_elements = 1 << 20; // 1M elements - const int n_streams = 4; - const int elements_per_stream = n_elements / n_streams; - - // Create streams - std::vector streams(n_streams); - for (int i = 0; i < n_streams; i++) { - hip_check(hipStreamCreate(&streams[i])); - } - - // Allocate and initialize memory - std::vector h_a(n_elements), h_b(n_elements), h_c(n_elements); - std::vector h_ref(n_elements); // For validation - for (int i = 0; i < n_elements; i++) { - h_a[i] = static_cast(i); - h_b[i] = static_cast(i * 2); - h_ref[i] = h_a[i] + h_b[i]; - } - - float *d_a, *d_b, *d_c; - hip_check(hipMalloc(&d_a, n_elements * sizeof(float))); - hip_check(hipMalloc(&d_b, n_elements * sizeof(float))); - hip_check(hipMalloc(&d_c, n_elements * sizeof(float))); - - // Pipeline operations across streams - for (int i = 0; i < n_streams; i++) { - int offset = i * elements_per_stream; - int bytes = elements_per_stream * sizeof(float); - - // Stage 1: Copy input data - hip_check(hipMemcpyAsync(d_a + offset, h_a.data() + offset, bytes, - hipMemcpyHostToDevice, streams[i])); - hip_check(hipMemcpyAsync(d_b + offset, h_b.data() + offset, bytes, - hipMemcpyHostToDevice, streams[i])); - - // Stage 2: Launch kernel - const int n_threads = 256; - const int n_blocks = (elements_per_stream + n_threads - 1) / n_threads; - vector_add<<>>( - d_a + offset, d_b + offset, d_c + offset, elements_per_stream); - hip_check(hipGetLastError()); - - // Stage 3: Copy result back - hip_check(hipMemcpyAsync(h_c.data() + offset, d_c + offset, bytes, - hipMemcpyDeviceToHost, streams[i])); - } - - // Wait for completion and validate - for (auto stream : streams) { - hip_check(hipStreamSynchronize(stream)); - } - - bool passed = true; - for (int i = 0; i < n_elements; i++) { - if (std::abs(h_c[i] - h_ref[i]) > 1e-5f) { - std::cerr << "Validation failed at " << i << ": " - << h_c[i] << " != " << h_ref[i] << std::endl; - passed = false; - break; - } - } - - // Cleanup - for (auto stream : streams) { - hip_check(hipStreamDestroy(stream)); - } - hip_check(hipFree(d_a)); - hip_check(hipFree(d_b)); - 
hip_check(hipFree(d_c)); - - std::cout << "Pipeline example " << (passed ? "PASSED" : "FAILED") << std::endl; - return passed ? 0 : 1; - } - -There is also the producer-consumer paradigm that can be used to convert a sequential program into parallel operations to improve performance. This process can employ multiple streams to kick off asynchronous kernels, provide data to the kernels, perform operations, and return the results for further processing in the host application. - -.. code_block:: cpp - - #include - #include - #include - #include - - #define hip_check(hip_call) \ - { \ - auto hip_res = hip_call; \ - if (hip_res != hipSuccess) { \ - std::cerr << "Failed in hip call: " << #hip_call \ - << " with error: " << hipGetErrorName(hip_res) << std::endl; \ - std::abort(); \ - } \ - } - - __global__ void producer_kernel(int* data, int n_elements) { - int idx = blockDim.x * blockIdx.x + threadIdx.x; - if (idx < n_elements) { - data[idx] = idx * idx; // Generate data - } - } - - __global__ void consumer_kernel(const int* input, float* output, int n_elements) { - int idx = blockDim.x * blockIdx.x + threadIdx.x; - if (idx < n_elements) { - output[idx] = sqrt(static_cast(input[idx])); // Process data - } - } - - int main() { - const int n_elements = 1 << 20; // 1M elements - - // Allocate device memory - int* d_intermediate; - float* d_output; - hip_check(hipMalloc(&d_intermediate, n_elements * sizeof(int))); - hip_check(hipMalloc(&d_output, n_elements * sizeof(float))); - - // Create streams and event - hipStream_t producer_stream, consumer_stream; - hipEvent_t data_ready; - hip_check(hipStreamCreate(&producer_stream)); - hip_check(hipStreamCreate(&consumer_stream)); - hip_check(hipEventCreate(&data_ready)); - - // Launch configuration - const int n_threads = 256; - const int n_blocks = (n_elements + n_threads - 1) / n_threads; - - // Stage 1: Producer generates data - producer_kernel<<>>( - d_intermediate, n_elements); - hip_check(hipGetLastError()); - hip_check(hipEventRecord(data_ready, producer_stream)); - - // Stage 2: Consumer waits for data and processes it - hip_check(hipStreamWaitEvent(consumer_stream, data_ready)); - consumer_kernel<<>>( - d_intermediate, d_output, n_elements); - hip_check(hipGetLastError()); - - // Get result and validate - std::vector result(n_elements); - std::vector reference(n_elements); - hip_check(hipMemcpyAsync(result.data(), d_output, n_elements * sizeof(float), - hipMemcpyDeviceToHost, consumer_stream)); - hip_check(hipStreamSynchronize(consumer_stream)); - - // Compute reference results - for (int i = 0; i < n_elements; i++) { - reference[i] = std::sqrt(static_cast(i * i)); - } - - // Validate - bool passed = true; - for (int i = 0; i < n_elements; i++) { - if (std::abs(result[i] - reference[i]) > 1e-5f) { - std::cerr << "Validation failed at " << i << ": " - << result[i] << " != " << reference[i] << std::endl; - passed = false; - break; - } - } - - // Cleanup - hip_check(hipFree(d_intermediate)); - hip_check(hipFree(d_output)); - hip_check(hipEventDestroy(data_ready)); - hip_check(hipStreamDestroy(producer_stream)); - hip_check(hipStreamDestroy(consumer_stream)); - - std::cout << "Producer-consumer example " << (passed ? "PASSED" : "FAILED") << std::endl; - return passed ? 0 : 1; - } - -These asynchronous activities call for stream management strategies. In the case of the single stream, the only management would be the synchronization of the stream when the work was complete. 
However, with multiple streams, you have overlapping execution of operations, and synchronization becomes more complex. You need to manage the activities of each stream, evaluating the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order. - -Multi-GPU and Load Balancing +Asynchronous operations between the host and the kernel provide a variety of opportunities, +or challenges, for managing synchronization, as described in :ref:`asynchronous_how-to`. +For instance, a basic model would be to launch an asynchronous operation on a kernel +in a stream, create an event to track the operation, continue operations in the host +program, and when the asynchronous operation completes synchronize the kernel to return +the results. + +However, one of the opportunities of asynchronous operation is the pipelining of operations +between launching kernels and transferring memory. In this case you would be working +with multiple streams running concurrently, or at least overlapping in some regard, +and managing any dependencies between the streams in the host application. There is +also the producer-consumer paradigm that can be used to convert a sequential program +into parallel operations to improve performance. This process can employ multiple +streams to kick off asynchronous kernels, provide data to the kernels, perform operations, +and return the results for further processing in the host application. + +These asynchronous activities call for stream management strategies. In the case +of the single stream, the only management would be the synchronization of the +stream when the work was complete. However, with multiple streams you have +overlapping execution of operations and synchronization becomes more complex, as shown +in the variations of the example in `Programmatic dependent launch and synchronization <../how-to/hip_runtime_api/asynchronous.html#programmatic-dependent-launch-and-synchronization>`_. +You need to manage the activities of each stream, evaluating the availability of +results, evaluate the critical path of the tasks, allocate resources on the hardware, +and manage the execution order. + +Multi-GPU and load balancing ---------------------------- For applications requiring additional computational power beyond a single device, From bb24b332d9b534fe81107a126c7a696f06105697 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 08:49:18 -0800 Subject: [PATCH 19/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 94d823aaaf..3712e2ffe3 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -136,7 +136,7 @@ In heterogeneous programming, the CPU is available for processing operations but 1. Initialize the HIP runtime and select the GPU: As described in :ref:`initialization`, refers to identifying and selecting a target GPU, setting up a context to let the CPU interact with the GPU. 2. Data preparation: As discussed in :ref:`memory_management`, this includes allocating the required memory on the host and device, preparing input data and transferring it from the host to the device. 
The data is both transferred to the device, and passed as an input parameter when launching the kernel. -3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, define and load the kernel or kernels to be run, launch kernels using the triple chevron syntax or appropriate API call (for example ``hipLaunchKernelGGL``), and pass parameters as needed. On the GPU, kernels are run on streams, or a queue of operations. Within the same stream operations run in the order they were issued, but different streams are independent and can execute concurrently. In the HIP runtime, kernels run on the default stream when one is not specified, but specifying a stream for the kernel lets you increase concurrency in task scheduling and resource utilization, and launch and manage multiple kernels from the host program. +3. Configure and launch the kernel on the GPU: As described in :ref:`device_program`, define and load the kernel or kernels to be run, launch kernels using the triple chevron syntax or appropriate API call (for example ``hipLaunchKernelGGL``), and pass parameters as needed. On the GPU, kernels run on streams, or a queue of operations. Within the same stream, operations run in the order they were issued, but different streams are independent and can execute concurrently. In the HIP runtime, kernels run on the default stream when one is not specified, but specifying a stream for the kernel lets you increase concurrency in task scheduling and resource utilization, and launch and manage multiple kernels from the host program. 4. Synchronization: As described in :ref:`asynchronous_how-to`, kernel execution occurs in the context of device streams, specifically the default (`0`) stream. You can use streams and events to manage task dependencies, overlap computation with data transfers, and manage asynchronous processes to ensure proper sequencing of operations. Wait for events or streams to finish execution and transfer results from the GPU back to the host. 5. Error handling: As described in :ref:`error_handling`, you should catch and handle potential errors from API calls, kernel launches, or memory operations. For example, use ``hipGetErrorString`` to retrieve error messages. 6. Cleanup and resource management: Validate results, clean up GPU contexts and resources, and free allocated memory on the host and devices. From 3a414cce4c6c276e08bc64d622b64a065ab1d0fb Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 08:50:15 -0800 Subject: [PATCH 20/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 3712e2ffe3..fdc3400ecf 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -158,7 +158,7 @@ Device programming The device or kernel program acts as workers on the GPU application, distributing operations to be handled quickly and efficiently. Launching a kernel in the host application starts the kernel program running on the GPU, defining the parallel operations to repeat the same instructions across many datasets. Understanding how the kernel works and the processes involved is essential to writing efficient GPU applications. Threads, blocks, and grids provide a hierarchical approach to parallel operations. 
Understanding the thread hierarchy is critical to distributing work across the available CUs, managing parallel operations, and optimizing memory access. The general flow of the kernel program looks like this: -1. Thread Grouping: As described in :ref:`inherent_thread_model`, threads are organized into a hierarchy consisting of threads which are individual instances of parallel operations, blocks that group the threads together, and grids that group blocks into the kernel. Each thread runs an instance of the kernel in parallel with other threads in the block. +1. Thread Grouping: As described in :ref:`inherent_thread_model`, threads are organized into a hierarchy consisting of threads, which are individual instances of parallel operations, blocks that group the threads, and grids that group blocks into the kernel. Each thread runs an instance of the kernel in parallel with other threads in the block. 2. Indexing: The kernel computes the unique index for each thread to access the relevant data to be processed by the thread. 3. Data Fetch: Threads fetch input data from memory previously transferred from the host to the device. As described in :ref:`memory_hierarchy`, the hierarchy of threads is influenced by the memory subsystem of GPUs. The memory hierarchy includes local memory per-thread with very fast access, shared memory for the block of threads which also supports quick access, and larger amounts of global memory visible to the whole kernel,but accesses are expensive due to high latency. Understanding the memory model is a key concept for kernel programming. 4. Computation: Threads perform the required computations on the input data, and generate any needed output. Each thread of the kernel runs the same instruction simultaneously on the different datasets. This sometimes require multiple iterations when the number of operations exceeds the resources of the CU. From 34f4530d63e6b60585babfe4aa6a48af35d90292 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 08:54:51 -0800 Subject: [PATCH 21/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index fdc3400ecf..1abe731508 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -215,7 +215,7 @@ called the triple chevron syntax, which looks like the following: AddKernel<<>>(a, b); -Inside the angle brackets you provide the following: +Inside the angle brackets, provide the following: * The number of blocks to launch, which defines the grid size (relating to blockDim). * The number of threads in a block, which defines the block size (relating to blockIdx). 
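Putting the two values together, a hedged sketch of a launch might look like the following, reusing the ``AddKernel``, ``a``, and ``b`` names from the surrounding example and assuming a problem size ``n``:

.. code-block:: cpp

   const int n = 1 << 20;                 // problem size (assumed)
   const int threads_per_block = 256;     // block size
   const int number_of_blocks = (n + threads_per_block - 1) / threads_per_block;

   // Grid size and block size go inside the triple chevron, in this order.
   AddKernel<<<number_of_blocks, threads_per_block>>>(a, b);

Rounding the block count up covers problem sizes that are not a multiple of the block size; the surplus threads should simply return without doing any work.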
From a13a20ea9f1eda6526d118ac0ad996827d2866cc Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 08:55:11 -0800 Subject: [PATCH 22/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 1abe731508..da1da0125e 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -178,7 +178,7 @@ kernel, gets mapped to the SIMD. This is done by grouping threads into warps, which contain as many threads as there are physical lanes in a SIMD, and issuing that instruction to the SIMD for every -warp of a kernel. Ideally the SIMD is always fully utilized, however if the number of threads +warp of a kernel. Ideally, the SIMD is always fully utilized. However, if the number of threads can't be evenly divided by the warpSize, then the unused lanes are masked out from the corresponding SIMD execution. From 7288be69055dd769780e0f32aef1cd3a47e3afc4 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 08:57:44 -0800 Subject: [PATCH 23/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index da1da0125e..84884d040e 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -171,7 +171,7 @@ Kernels are parallel programs that execute the same instruction set across multi Single instruction multiple threads (SIMT) ========================================== -The HIP kernel code, which is written as a series of scalar instructions for multiple +The HIP kernel code, written as a series of scalar instructions for multiple threads with different thread indices, gets mapped to the SIMD units of the GPUs. Every single instruction, which is executed for every participating thread of a kernel, gets mapped to the SIMD. From 763f3853ac616038a22ef931218454f563cdc28d Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 08:58:01 -0800 Subject: [PATCH 24/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 84884d040e..d87ee960d8 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -225,7 +225,7 @@ Inside the angle brackets, provide the following: .. note:: The kernel can also be launched through other methods, such as the ``hipLaunchKernel()`` function. -Here the total number of threads launched for the ``AddKernel`` program is defined by +Here, the total number of threads launched for the ``AddKernel`` program is defined by ``number_of_blocks * threads_per_block``. You define these values when launching the kernel program to address the problem to be solved with the available resources within the system. 
In other words, the thread configuration is customized to the needs of the From 0dee5608b8cc3329d31317930a1a312bfda91a9b Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 08:58:34 -0800 Subject: [PATCH 25/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index d87ee960d8..f5d2030d7e 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -256,7 +256,7 @@ blocks, and grids. * Threads are single instances of kernel operations, running concurrently across warps * Blocks group threads together and enable cooperation and shared memory * Grids define the number of thread blocks for a single kernel launch -* Blocks, and grids can be defined in 3 dimensions (``x``, ``y``, ``z``) +* Blocks and grids can be defined in 3 dimensions (``x``, ``y``, ``z``) * By default, the Y and Z dimensions are set to 1 The combined values represent the thread index, and relate to the sequence that the From 272a363bf826d3cd6aff3ef2e891f8a311c08f81 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:07:26 -0800 Subject: [PATCH 26/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index f5d2030d7e..cc87838dbc 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -16,8 +16,8 @@ supports many imperative languages, such as Python via PyHIP, but this document focuses on the original C/C++ API of HIP. While GPUs may be capable of running applications written for CPUs if properly ported -and compiled, it would not be an efficient use of GPU resources. GPUs are different -from CPUs in fundamental ways, and should be used accordingly to achieve optimum +and compiled, it would not be an efficient use of GPU resources. GPUs fundamentally differ +from CPUs and should be used accordingly to achieve optimum performance. A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) programming in general. The following topics introduce you to the key concepts of From b255dd32f793927f8c419711da8ef71701816b99 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:07:47 -0800 Subject: [PATCH 27/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index cc87838dbc..74179fc5cd 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -7,7 +7,7 @@ .. 
_programming_model: ******************************************************************************* -Introduction to HIP programming model +Introduction to the HIP programming model ******************************************************************************* The HIP programming model enables mapping data-parallel C/C++ algorithms to massively From f4a8ae2d27e02519498abf5f69afcc4c6367259f Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:08:23 -0800 Subject: [PATCH 28/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 74179fc5cd..b684402a7b 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -21,7 +21,7 @@ from CPUs and should be used accordingly to achieve optimum performance. A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) programming in general. The following topics introduce you to the key concepts of -GPU-based programming, and the HIP programming model. +GPU-based programming and the HIP programming model. Hardware differences: CPU vs GPU ================================ From 56f096b714ad2b5a5f26a81f4bf8a1c0d87dda25 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:09:14 -0800 Subject: [PATCH 29/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index b684402a7b..3e71ff6695 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -26,11 +26,7 @@ GPU-based programming and the HIP programming model. Hardware differences: CPU vs GPU ================================ -CPUs and GPUs have been designed for different purposes. CPUs have been designed -to quickly execute a single thread, decreasing the time it takes for a single -operation, increasing the amount of sequential instructions that can be executed. -This includes fetching data, and reducing pipeline stalls where the ALU has to -wait for previous instructions to finish. +CPUs and GPUs have been designed for different purposes. CPUs quickly execute a single thread, decreasing the time for a single operation while increasing the number of sequential instructions that can be executed. This includes fetching data and reducing pipeline stalls where the ALU has to wait for previous instructions to finish. .. figure:: ../data/understand/programming_model/cpu-gpu-comparison-adjusted.svg :alt: Diagram depicting the differences between CPU and GPU hardware. 
From 6020bdc46a34206695f7ec8c19cbc5c759f68078 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:10:03 -0800 Subject: [PATCH 30/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 3e71ff6695..ff1debefae 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -37,7 +37,7 @@ CPUs and GPUs have been designed for different purposes. CPUs quickly execute a Differences in CPUs and GPUs -On CPUs the goal is to quickly process operations. CPUs provide low latency processing for +With CPUs, the goal is to quickly process operations. CPUs provide low-latency processing for serial instructions. On the other hand, GPUs have been designed to execute many similar commands, or threads, in parallel, achieving higher throughput. Latency is the time between starting an operation and receiving its result, such as 2 ns, while throughput is the rate of From 55c567007a2b6b144b8db120464abbb403a1792b Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:11:02 -0800 Subject: [PATCH 31/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index ff1debefae..0a116e9292 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -41,7 +41,7 @@ With CPUs, the goal is to quickly process operations. CPUs provide low-latency p serial instructions. On the other hand, GPUs have been designed to execute many similar commands, or threads, in parallel, achieving higher throughput. Latency is the time between starting an operation and receiving its result, such as 2 ns, while throughput is the rate of -completed operations, for example operations per second. +completed operations, for example, operations per second. For the GPU, the objective is to process as many operations in parallel, rather than to finish a single instruction quickly. GPUs in general are made up of basic From 756f68cf76116360877d9b524aab46deb2f4acf3 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:12:20 -0800 Subject: [PATCH 32/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 0a116e9292..4af2d89f45 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -74,7 +74,7 @@ The following defines a few hardware differences between CPUs and GPUs: - Register files are shared among threads. The number of threads that can be run in parallel depends on the registers needed per thread. - Multiple ALUs execute a collection of threads having the same operations, also known as a wavefront or warp. 
This is called single-instruction, multiple threads (SIMT) operation as described in :ref:`programming_model_simt`. - - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture, that allows a `single instruction` to concurrently operate on `multiple data` inputs. + - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture that allows a `single instruction` to concurrently operate on `multiple data` inputs. - For branching threads where conditional instructions lead to thread divergence, ALUs still processes the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles, and should be a consideration in your programming. Keep instructions consistent, and leave conditionals out of threads. - The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued. From 9880b202d204fdd0da33f8800101e3ae62e6883a Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:12:51 -0800 Subject: [PATCH 33/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 4af2d89f45..a274a10041 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -75,7 +75,7 @@ The following defines a few hardware differences between CPUs and GPUs: - Multiple ALUs execute a collection of threads having the same operations, also known as a wavefront or warp. This is called single-instruction, multiple threads (SIMT) operation as described in :ref:`programming_model_simt`. - The collection of ALUs is called SIMD. SIMDs are an extension to the hardware architecture that allows a `single instruction` to concurrently operate on `multiple data` inputs. - - For branching threads where conditional instructions lead to thread divergence, ALUs still processes the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles, and should be a consideration in your programming. Keep instructions consistent, and leave conditionals out of threads. + - For branching threads where conditional instructions lead to thread divergence, ALUs still process the full wavefront, but the result for divergent threads is masked out. This leads to wasted ALU cycles and should be a consideration in your programming. Keep instructions consistent and leave conditionals out of threads. - The advantage for GPUs is that context switching is easy. All threads that run on a core/compute unit have their registers on the compute unit, so they don't need to be stored to global memory, and each cycle one instruction from any wavefront that resides on the compute unit can be issued. 
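To illustrate the divergence point, the following sketch (not from the original text) contrasts a branch that splits threads within a wavefront with one that is uniform across each wavefront:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Divergent: threads of the same wavefront disagree on the branch,
   // so both paths are executed with the inactive lanes masked out.
   __global__ void divergent(float* out, const float* in)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i % 2 == 0) {
           out[i] = in[i] * 2.0f;
       } else {
           out[i] = in[i] + 1.0f;
       }
   }

   // Uniform: the branch depends on the block index, so every thread of a
   // wavefront takes the same path and no ALU lanes are wasted.
   __global__ void uniform(float* out, const float* in)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (blockIdx.x % 2 == 0) {
           out[i] = in[i] * 2.0f;
       } else {
           out[i] = in[i] + 1.0f;
       }
   }

Uniform control flow is not always achievable, but keeping divergence outside the wavefront, as in the second kernel, avoids the masked-out ALU cycles described above.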
From d77834d524d0cfeafa553a7ca1f1d48eefaa5e4f Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:15:51 -0800 Subject: [PATCH 34/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index a274a10041..5c517a85ba 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -422,7 +422,7 @@ The following are a few memory access patterns and best practices to improve per * **Shared memory**: Avoiding bank conflicts reduces serialization of memory transactions. - Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts. + Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage, ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts. * **Texture memory**: Spatial locality improves caching performance. From 1447c9798a7c145b4f899b4f12e2c94a0322f75d Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:22:31 -0800 Subject: [PATCH 35/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 5c517a85ba..1f132029b1 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -420,7 +420,7 @@ The following are a few memory access patterns and best practices to improve per 3. *Avoid strided access*: For example array[i * stride] can lead to memory bank conflicts and inefficient access. 4. *Pad Data*: If necessary, pad data structures to ensure alignment and coalescing. -* **Shared memory**: Avoiding bank conflicts reduces serialization of memory transactions. +* **Shared memory**: Avoiding bank conflicts reduces the serialization of memory transactions. Shared memory is a small, fast memory region inside the CU. Unlike global memory, shared memory accesses do not require coalescing, but they can suffer from bank conflicts, which are another form of inefficient memory access. Shared memory is divided into multiple memory banks (usually 32 banks on modern GPUs). 
If multiple threads within a warp try to access different addresses that map to the same memory bank, accesses get serialized, leading to poor performance. To optimize shared memory usage, ensure that consecutive threads access different memory banks. Use padding if necessary to avoid conflicts. From e18f29f6f4f4a99726edb15a731218591dc17d18 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:23:46 -0800 Subject: [PATCH 36/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 1f132029b1..c00bf1aaa5 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -50,7 +50,7 @@ As described in :ref:`hardware_implementation`, these CUs provide the necessary resources for the threads: the Arithmetic Logical Units (ALUs), register files, caches and shared memory for efficient communication between the threads. -The following defines a few hardware differences between CPUs and GPUs: +The following describes a few hardware differences between CPUs and GPUs: * CPU: From 69a85f9a2b524572354518a2b4c8294fbd86d1c7 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:25:03 -0800 Subject: [PATCH 37/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index c00bf1aaa5..b951da5197 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -316,9 +316,9 @@ Grid The three-dimensional thread hierarchy available to a kernel program lends itself to solutions that align closely to the computational problem. The following are some examples: -* 1 dimensional: array processing, linear data structures, or sequential data transformation -* 2 dimensional: Image processing, matrix operations, 2 dimensional simulations -* 3 dimensions: Volume rendering, 3D scientific simulations, spatial algorithms +* 1-dimensional: array processing, linear data structures, or sequential data transformation +* 2-dimensional: Image processing, matrix operations, 2 dimensional simulations +* 3-dimensional: Volume rendering, 3D scientific simulations, spatial algorithms Cooperative groups thread model ------------------------------- From d8dd607da52e5db99321e894c632b80fc922922c Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:46:55 -0800 Subject: [PATCH 38/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index b951da5197..af0f1b76ee 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -434,7 +434,7 @@ The following are a few memory access patterns and best practices to improve per * **Small batch transfers**: Enable pipelining and improve PCIe bandwidth use. - Memory transfers between the host and the device can become a major bottleneck if not optimized. 
One method is to use small batch memory transfers where data is transferred in smaller chunks instead of a dealing with large datasets to avoid long blocking operations. Small batch transfers offer better PCIe bandwidth utilization over large data transfers. Small batch transfers offer performance improvement by offering reduced latency with small batches that run asynchronously using ``hipMemcpyAsync()`` as described in :ref:`asynchronous_how-to`, pipelining data transfers and kernel execution using separate streams. Finally, using pinned memory with small batch transfers enables faster DMA transfers without CPU involvement, greatly improving memory transfer performance. + Memory transfers between the host and the device can become a major bottleneck if not optimized. One method is to use small batch memory transfers where data is transferred in smaller chunks instead of dealing with large datasets to avoid long blocking operations. Small batch transfers offer better PCIe bandwidth utilization over large data transfers. Small batch transfers offer performance improvement by offering reduced latency with small batches that run asynchronously using ``hipMemcpyAsync()`` as described in :ref:`asynchronous_how-to`, pipelining data transfers and kernel execution using separate streams. Finally, using pinned memory with small batch transfers enables faster DMA transfers without CPU involvement, greatly improving memory transfer performance. Execution model =============== From 3a5a255f1201f0c92e1b384a4153b4dc7bab3104 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 09:55:46 -0800 Subject: [PATCH 39/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index af0f1b76ee..cd6a7ca7e5 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -454,7 +454,7 @@ All asynchronous APIs, such as kernel execution, data movement and potentially d allocation/freeing all happen in the context of device streams, as described in `Managing streams <../how-to/hip_runtime_api/asynchronous.html#managing-streams>`_. Streams are FIFO buffers of commands to execute relating to a given device. -Operations which enqueue tasks on a stream all return promptly and the command is +Operations that enqueue tasks on a stream all return promptly, and the command is executed asynchronously. All side effects of a command on a stream are visible to all subsequent commands on the same stream. 
Multiple streams may point to the same device and those streams may be fed from multiple concurrent host-side From d473e8ab328de99b14de85cd298068c852979b2c Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 10:05:09 -0800 Subject: [PATCH 40/52] Apply suggestions from code review Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index cd6a7ca7e5..cbbe9f854d 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -461,7 +461,7 @@ the same device and those streams may be fed from multiple concurrent host-side threads. Execution on multiple streams may be concurrent but isn't required to be. -Asynchronous APIs involving a stream all return a stream event which may be +Asynchronous APIs involving a stream all return a stream event, which can be used to synchronize the execution of multiple streams. A user may enqueue a barrier onto a stream referencing an event. The barrier will block activity on the stream until the operation related to the event completes. After the event completes, all @@ -484,7 +484,7 @@ Device-side execution --------------------- Kernels may be launched in multiple ways, all with different syntaxes and -intended use-cases. +intended use cases. * Using the triple-chevron ``<<<...>>>`` operator on a ``__global__`` annotated function. @@ -506,8 +506,7 @@ Asynchronous operations between the host and the kernel provide a variety of opp or challenges, for managing synchronization, as described in :ref:`asynchronous_how-to`. For instance, a basic model would be to launch an asynchronous operation on a kernel in a stream, create an event to track the operation, continue operations in the host -program, and when the asynchronous operation completes synchronize the kernel to return -the results. +program, and when the event shows that the asynchronous operation is complete, synchronize the kernel to return the results. However, one of the opportunities of asynchronous operation is the pipelining of operations between launching kernels and transferring memory. In this case you would be working @@ -519,13 +518,11 @@ streams to kick off asynchronous kernels, provide data to the kernels, perform o and return the results for further processing in the host application. These asynchronous activities call for stream management strategies. In the case -of the single stream, the only management would be the synchronization of the -stream when the work was complete. However, with multiple streams you have +of the single stream, the only management would be the stream synchronization +when the work was complete. However, with multiple streams you have overlapping execution of operations and synchronization becomes more complex, as shown in the variations of the example in `Programmatic dependent launch and synchronization <../how-to/hip_runtime_api/asynchronous.html#programmatic-dependent-launch-and-synchronization>`_. -You need to manage the activities of each stream, evaluating the availability of -results, evaluate the critical path of the tasks, allocate resources on the hardware, -and manage the execution order. 
+You need to manage each stream's activities, evaluate the availability of results, evaluate the critical path of the tasks, allocate resources on the hardware, and manage the execution order. Multi-GPU and load balancing ---------------------------- From 5f05d6ce3d7e00d08345e16bdd6946dd8b9b86db Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 10:10:08 -0800 Subject: [PATCH 41/52] Update docs/understand/programming_model.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --- docs/understand/programming_model.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index cbbe9f854d..6b7462e745 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -509,10 +509,10 @@ in a stream, create an event to track the operation, continue operations in the program, and when the event shows that the asynchronous operation is complete, synchronize the kernel to return the results. However, one of the opportunities of asynchronous operation is the pipelining of operations -between launching kernels and transferring memory. In this case you would be working +between launching kernels and transferring memory. In this case, you would be working with multiple streams running concurrently, or at least overlapping in some regard, -and managing any dependencies between the streams in the host application. There is -also the producer-consumer paradigm that can be used to convert a sequential program +and managing any dependencies between the streams in the host application. +The producer-consumer paradigm can be used to convert a sequential program into parallel operations to improve performance. This process can employ multiple streams to kick off asynchronous kernels, provide data to the kernels, perform operations, and return the results for further processing in the host application. From 6639d0832a2b4c75e305a9acde6a1923b2ba6360 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 10:43:32 -0800 Subject: [PATCH 42/52] Apply suggestions from code review --- docs/understand/programming_model.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 6b7462e745..076c01710e 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -28,7 +28,7 @@ Hardware differences: CPU vs GPU CPUs and GPUs have been designed for different purposes. CPUs quickly execute a single thread, decreasing the time for a single operation while increasing the number of sequential instructions that can be executed. This includes fetching data and reducing pipeline stalls where the ALU has to wait for previous instructions to finish. -.. figure:: ../data/understand/programming_model/cpu-gpu-comparison-adjusted.svg +.. figure:: ../data/understand/programming_model/cpu-gpu-comparison.svg :alt: Diagram depicting the differences between CPU and GPU hardware. The CPU block shows four large processing cores, lists Large Cache per Core, and High Clock Speed of 3 to 5 gigahertz. The GPU block shows 42 @@ -180,7 +180,7 @@ from the corresponding SIMD execution. .. _simt: -.. figure:: ../data/understand/programming_model/simt-execution-revised.svg +.. 
figure:: ../data/understand/programming_model/simt-execution.svg :alt: Diagram depicting the SIMT execution model. There is a red rectangle which contains the expression a[i] = b[i] + c[i], and below that four arrows that point to Thread 0,1,2, and 3. Each thread contains different From 11077d70b72c2923552620e34871257982395332 Mon Sep 17 00:00:00 2001 From: randyh62 Date: Fri, 14 Feb 2025 10:49:28 -0800 Subject: [PATCH 43/52] update images --- .../cpu-gpu-comparison-adjusted.svg | 70 - .../programming_model/cpu-gpu-comparison.svg | 1 + .../programming_model/host-device-flow.svg | 40 +- .../programming_model/memory-access.svg | 70 +- .../programming_model/memory_hierarchy.drawio | 400 - .../programming_model/multi-gpu.svg | 1 + .../simt-execution-revised.svg | 62 - .../programming_model/simt-execution.svg | 1 + .../understand/programming_model/simt.drawio | 148 - .../programming_model/stream-workflow.svg | 46 +- .../programming_model/thread_hierarchy.drawio | 8332 ----------------- 11 files changed, 6 insertions(+), 9165 deletions(-) delete mode 100644 docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg create mode 100644 docs/data/understand/programming_model/cpu-gpu-comparison.svg delete mode 100644 docs/data/understand/programming_model/memory_hierarchy.drawio create mode 100644 docs/data/understand/programming_model/multi-gpu.svg delete mode 100644 docs/data/understand/programming_model/simt-execution-revised.svg create mode 100644 docs/data/understand/programming_model/simt-execution.svg delete mode 100644 docs/data/understand/programming_model/simt.drawio delete mode 100644 docs/data/understand/programming_model/thread_hierarchy.drawio diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg b/docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg deleted file mode 100644 index 354b0107d0..0000000000 --- a/docs/data/understand/programming_model/cpu-gpu-comparison-adjusted.svg +++ /dev/null @@ -1,70 +0,0 @@ - - - - - - CPU vs GPU Architecture - - - - - CPU - - - - - - - - - - - - Large Complex Cores - High Clock Speed (3-5 GHz) - - - - - - GPU - - - - - - - - - - - - - - - - - - - - - - - - Many Simple Cores - Lower Clock Speed (1-2 GHz) - - - - - - - Large Cache per Core - - - - - - Shared Memory across Cores - - diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison.svg b/docs/data/understand/programming_model/cpu-gpu-comparison.svg new file mode 100644 index 0000000000..76f7a3d5c3 --- /dev/null +++ b/docs/data/understand/programming_model/cpu-gpu-comparison.svg @@ -0,0 +1 @@ +
[single-line draw.io SVG; markup not recoverable. "CPU vs GPU Architecture" diagram with labels: CPU, GPU, Large Complex Cores, High Clock Speed (3-5 GHz), Many Simple Cores, Lower Clock Speed (1-2 GHz), Large Cache per Core, Shared Memory across Cores]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/host-device-flow.svg b/docs/data/understand/programming_model/host-device-flow.svg index 0a965a978d..c5f57c255b 100644 --- a/docs/data/understand/programming_model/host-device-flow.svg +++ b/docs/data/understand/programming_model/host-device-flow.svg @@ -1,39 +1 @@ - - - - - - Host-Device Data Flow - - - - Host (CPU) - - - - Device (GPU) - - - - - 1. Initialize - - - - 2. Transfer Data - - - - 3. Execute Kernel - - - - 4. Return Results - - - - - - - - +
[single-line draw.io SVG; markup not recoverable. "Host-Device Data Flow" diagram with labels: Host (CPU), Device (GPU), 1. Initialize, 2. Transfer Data, 3. Execute Kernel, 4. Return Results]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/memory-access.svg b/docs/data/understand/programming_model/memory-access.svg index 09a620587b..80f7908636 100644 --- a/docs/data/understand/programming_model/memory-access.svg +++ b/docs/data/understand/programming_model/memory-access.svg @@ -1,69 +1 @@ - - - - - - Memory Access Patterns - - - Uncoalesced Access - - - - - Threads in Warp - - - - - - - - - Memory - - - - - - - - - - - - - Coalesced Access - - - - - Threads in Warp - - - - - - - - - Memory - - - - - - - - - - - - - - - - - - - +
[single-line draw.io SVG; markup not recoverable. "Memory Access Patterns" diagram with labels: Uncoalesced Access, Coalesced Access, Threads, Memory]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/memory_hierarchy.drawio b/docs/data/understand/programming_model/memory_hierarchy.drawio deleted file mode 100644 index 21c801a62d..0000000000 --- a/docs/data/understand/programming_model/memory_hierarchy.drawio +++ /dev/null @@ -1,400 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/docs/data/understand/programming_model/multi-gpu.svg b/docs/data/understand/programming_model/multi-gpu.svg new file mode 100644 index 0000000000..83e50f9fed --- /dev/null +++ b/docs/data/understand/programming_model/multi-gpu.svg @@ -0,0 +1 @@ +
[single-line draw.io SVG; markup not recoverable. "Multi-GPU Workload Distribution" diagram with labels: Host CPU, GPU 0, GPU 1, GPU 2, GPU 3, each at 25%]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/simt-execution-revised.svg b/docs/data/understand/programming_model/simt-execution-revised.svg deleted file mode 100644 index 806d565425..0000000000 --- a/docs/data/understand/programming_model/simt-execution-revised.svg +++ /dev/null @@ -1,62 +0,0 @@ - - - - - - SIMT Execution Model - - - - a[i] = b[i] + c[i] - - - - - - - - - - - Thread 0 - - b[0] = 5 - c[0] = 3 - a[0] = 8 - - - - - - Thread 1 - - b[1] = 2 - c[1] = 4 - a[1] = 6 - - - - - Thread 2 - - b[2] = 7 - c[2] = 1 - a[2] = 8 - - - - - Thread 3 - - b[3] = 3 - c[3] = 5 - a[3] = 8 - - - - - - - - - diff --git a/docs/data/understand/programming_model/simt-execution.svg b/docs/data/understand/programming_model/simt-execution.svg new file mode 100644 index 0000000000..0eabd748ea --- /dev/null +++ b/docs/data/understand/programming_model/simt-execution.svg @@ -0,0 +1 @@ +
[single-line draw.io SVG; markup not recoverable. "SIMT Execution Model" diagram showing a[i] = b[i] + c[i] across Thread 0 (b[0]=5, c[0]=3, a[0]=8), Thread 1 (b[1]=2, c[1]=4, a[1]=6), Thread 2 (b[2]=7, c[2]=1, a[2]=8), Thread 3 (b[3]=3, c[3]=5, a[3]=8)]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/simt.drawio b/docs/data/understand/programming_model/simt.drawio deleted file mode 100644 index 4c5c5a3f26..0000000000 --- a/docs/data/understand/programming_model/simt.drawio +++ /dev/null @@ -1,148 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/docs/data/understand/programming_model/stream-workflow.svg b/docs/data/understand/programming_model/stream-workflow.svg index 96da9cedcd..d42d9a40b5 100644 --- a/docs/data/understand/programming_model/stream-workflow.svg +++ b/docs/data/understand/programming_model/stream-workflow.svg @@ -1,45 +1 @@ - - - - - - Stream and Event Workflow - - - - - - - - Stream 1 - Stream 2 - Stream 3 - - - - - - - - - - - - - - - - - - - - - - - - - - Operation - - Event - +
[single-line draw.io SVG; markup not recoverable. "Stream and Event Workflow" diagram with labels: Stream 1, Stream 2, Stream 3, Operation, Event]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/thread_hierarchy.drawio b/docs/data/understand/programming_model/thread_hierarchy.drawio deleted file mode 100644 index 61ac9aa59c..0000000000 --- a/docs/data/understand/programming_model/thread_hierarchy.drawio +++ /dev/null @@ -1,8332 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file From 2208c0829a98c615a2c43bce7df5a517601a3116 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 12:07:12 -0800 Subject: [PATCH 44/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 076c01710e..4489268034 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -468,7 +468,7 @@ stream until the operation related to the event completes. 
After the event compl side effects of the operation will be visible to subsequent commands even if those side effects manifest on different devices. -.. figure:: ../data/understand/programming_model/stream-workflow.svg +.. figure:: ../data/understand/programming_model/stream-workflow.sv :alt: Diagram depicting the stream and event workflow, with an example of multiple streams working together. The diagram shows operations as red rectangles, and events as white dots. There are three streams labelled From a746a94a0db7a71adadcc2f73fb80caeeef449d3 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 12:08:55 -0800 Subject: [PATCH 45/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 4489268034..076c01710e 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -468,7 +468,7 @@ stream until the operation related to the event completes. After the event compl side effects of the operation will be visible to subsequent commands even if those side effects manifest on different devices. -.. figure:: ../data/understand/programming_model/stream-workflow.sv +.. figure:: ../data/understand/programming_model/stream-workflow.svg :alt: Diagram depicting the stream and event workflow, with an example of multiple streams working together. The diagram shows operations as red rectangles, and events as white dots. There are three streams labelled From a490c576591e4195576bf78de08c4e877bb3660e Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Fri, 14 Feb 2025 12:27:09 -0800 Subject: [PATCH 46/52] Update programming_model.rst fix cooperative groups --- docs/understand/programming_model.rst | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 076c01710e..78e69c40c1 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -325,11 +325,10 @@ Cooperative groups thread model The Cooperative groups API introduces new functions to launch, group, subdivide, synchronize and identify threads, as well as some predefined group-collective -algorithms. Most importantly it offers a matching thread model to think of the -cooperative groups in terms of. It relaxes some restrictions of the :ref:`inherent_thread_model` -imposed by the strict 1:1 mapping of architectural details to the programming model. -Cooperative groups let you define your own set of thread groups which may better -fit your use-case than the defaults defined by the hardware. +algorithms. Cooperative groups let you define your own set of thread groups which +may fit your use-cases better than those defined by the hardware. It relaxes some +restrictions of the :ref:`inherent_thread_model` imposed by the strict 1:1 mapping +of architectural details to the programming model. .. note:: The implicit groups defined by kernel launch parameters are still available @@ -533,4 +532,4 @@ that need more compute power can use multiple GPUs in the system. This requires distributing workloads across multiple GPUs to balance the load to prevent GPUs from being over-utilized while others are idle. -For more information, see :ref:`multi-device` . 
\ No newline at end of file +For more information, see :ref:`multi-device` . From cf324af75e0fea944198e34de9134efa191bf8ee Mon Sep 17 00:00:00 2001 From: randyh62 Date: Mon, 17 Feb 2025 14:57:35 -0800 Subject: [PATCH 47/52] add borders to images --- docs/data/understand/programming_model/cpu-gpu-comparison.svg | 2 +- docs/data/understand/programming_model/host-device-flow.svg | 2 +- docs/data/understand/programming_model/memory-access.svg | 2 +- docs/data/understand/programming_model/multi-gpu.svg | 2 +- docs/data/understand/programming_model/simt-execution.svg | 2 +- docs/data/understand/programming_model/stream-workflow.svg | 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison.svg b/docs/data/understand/programming_model/cpu-gpu-comparison.svg index 76f7a3d5c3..14dcb6f948 100644 --- a/docs/data/understand/programming_model/cpu-gpu-comparison.svg +++ b/docs/data/understand/programming_model/cpu-gpu-comparison.svg @@ -1 +1 @@ -
[single-line draw.io SVG; markup not recoverable. Old and new versions of the "CPU vs GPU Architecture" diagram; the new version adds a border and is retitled "CPU versus GPU Architecture", with the same labels otherwise]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/host-device-flow.svg b/docs/data/understand/programming_model/host-device-flow.svg index c5f57c255b..02bce96c5d 100644 --- a/docs/data/understand/programming_model/host-device-flow.svg +++ b/docs/data/understand/programming_model/host-device-flow.svg @@ -1 +1 @@ -
[single-line draw.io SVG; markup not recoverable. Old and new versions of the "Host-Device Data Flow" diagram; the new version adds a border, labels unchanged]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/memory-access.svg b/docs/data/understand/programming_model/memory-access.svg index 80f7908636..5f0dbd8aae 100644 --- a/docs/data/understand/programming_model/memory-access.svg +++ b/docs/data/understand/programming_model/memory-access.svg @@ -1 +1 @@ -
[single-line draw.io SVG; markup not recoverable. Old and new versions of the "Memory Access Patterns" diagram; the new version adds a border and labels the thread and memory elements 0 through 63]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/multi-gpu.svg b/docs/data/understand/programming_model/multi-gpu.svg index 83e50f9fed..190f2593d2 100644 --- a/docs/data/understand/programming_model/multi-gpu.svg +++ b/docs/data/understand/programming_model/multi-gpu.svg @@ -1 +1 @@ -
[single-line draw.io SVG; markup not recoverable. Old and new versions of the "Multi-GPU Workload Distribution" diagram; the new version adds a border, labels unchanged]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/simt-execution.svg b/docs/data/understand/programming_model/simt-execution.svg index 0eabd748ea..412b9265e7 100644 --- a/docs/data/understand/programming_model/simt-execution.svg +++ b/docs/data/understand/programming_model/simt-execution.svg @@ -1 +1 @@ -
[single-line draw.io SVG; markup not recoverable. Old and new versions of the "SIMT Execution Model" diagram; the new version adds a border, labels and values unchanged]
\ No newline at end of file diff --git a/docs/data/understand/programming_model/stream-workflow.svg b/docs/data/understand/programming_model/stream-workflow.svg index d42d9a40b5..9648351cad 100644 --- a/docs/data/understand/programming_model/stream-workflow.svg +++ b/docs/data/understand/programming_model/stream-workflow.svg @@ -1 +1 @@ -
[single-line draw.io SVG; markup not recoverable. Old and new versions of the "Stream and Event Workflow" diagram; the new version adds a border, labels unchanged]
\ No newline at end of file From 33b5e259715076f8ab02069c2700c86a45ffd5f9 Mon Sep 17 00:00:00 2001 From: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> Date: Tue, 18 Feb 2025 13:48:47 -0500 Subject: [PATCH 48/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 78e69c40c1..6282a26180 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -532,4 +532,4 @@ that need more compute power can use multiple GPUs in the system. This requires distributing workloads across multiple GPUs to balance the load to prevent GPUs from being over-utilized while others are idle. -For more information, see :ref:`multi-device` . +For more information, see :ref:`multi-device`. From fa06a7e435ebfb405c53bee0d925c850c05f0122 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Tue, 18 Feb 2025 15:58:33 -0800 Subject: [PATCH 49/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 6282a26180..0e6ad3ed6d 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -33,7 +33,7 @@ CPUs and GPUs have been designed for different purposes. CPUs quickly execute a The CPU block shows four large processing cores, lists Large Cache per Core, and High Clock Speed of 3 to 5 gigahertz. The GPU block shows 42 smaller processing cores, lists Shared Memory across Cores, and Lower - Clock SPeeds of 1 to 2 gigahertz. + Clock Speeds of 1 to 2 gigahertz. Differences in CPUs and GPUs From 397ecd1984bcdd366787cc77f883d00374f296e9 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Tue, 18 Feb 2025 15:59:10 -0800 Subject: [PATCH 50/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 0e6ad3ed6d..8be6576061 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -400,7 +400,7 @@ Memory optimizations and best practices .. figure:: ../data/understand/programming_model/memory-access.svg :alt: Diagram depicting an example memory access pattern for coalesced memory. - The diagram has uncoalesced access on the left side, with consecutive + The diagram has un-coalesced access on the left side, with consecutive threads accessing memory in a random pattern. With coalesced access on the right showing consecutive threads accessing consecutive memory addresses. 
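The coalesced and uncoalesced patterns described in the alt text above can be sketched in HIP as follows. This is illustrative only and not part of the patched documentation; the kernel names and the ``stride`` parameter are invented for the example.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Coalesced: consecutive threads read consecutive addresses, so the loads
   // of a wavefront combine into a small number of wide memory transactions.
   __global__ void copy_coalesced(float* dst, const float* src, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           dst[i] = src[i];
       }
   }

   // Uncoalesced: consecutive threads read addresses that are `stride`
   // elements apart, so one wavefront touches many separate cache lines.
   __global__ void copy_strided(float* dst, const float* src, int n, int stride)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i * stride < n) {
           dst[i] = src[i * stride];
       }
   }

Both kernels are correct; the difference shows up only in how many memory transactions the hardware needs per wavefront.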
From 935bb9a0b9849ca4e2dda73c7127b80959ec8bc9 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Tue, 18 Feb 2025 16:03:03 -0800 Subject: [PATCH 51/52] Update .wordlist.txt --- .wordlist.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/.wordlist.txt b/.wordlist.txt index 1b1ba81a4e..2341867d91 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -171,6 +171,7 @@ transfering typedefs unintuitive UMM +uncoalesced unmap unmapped unmapping From 6947acd4363086714fbd09ca3b1bd54fd69c80f9 Mon Sep 17 00:00:00 2001 From: randyh62 <42045079+randyh62@users.noreply.github.com> Date: Tue, 18 Feb 2025 16:03:57 -0800 Subject: [PATCH 52/52] Update docs/understand/programming_model.rst --- docs/understand/programming_model.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/understand/programming_model.rst b/docs/understand/programming_model.rst index 8be6576061..0e6ad3ed6d 100644 --- a/docs/understand/programming_model.rst +++ b/docs/understand/programming_model.rst @@ -400,7 +400,7 @@ Memory optimizations and best practices .. figure:: ../data/understand/programming_model/memory-access.svg :alt: Diagram depicting an example memory access pattern for coalesced memory. - The diagram has un-coalesced access on the left side, with consecutive + The diagram has uncoalesced access on the left side, with consecutive threads accessing memory in a random pattern. With coalesced access on the right showing consecutive threads accessing consecutive memory addresses.
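The small-batch transfer and stream/event workflow covered by the asynchronous-execution hunks earlier in this series can be sketched on the host side roughly as follows. This is a minimal sketch, not part of the patched documentation: the ``process`` kernel, the batch sizes, and the omission of error checking are simplifications made for the example.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void process(float* data, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           data[i] *= 2.0f;
       }
   }

   int main()
   {
       constexpr int    batches    = 4;
       constexpr int    batchSize  = 1 << 20;
       constexpr size_t batchBytes = batchSize * sizeof(float);

       float* hostBuf = nullptr;
       float* devBuf  = nullptr;
       // Pinned host memory lets the asynchronous copies below use DMA.
       hipHostMalloc(reinterpret_cast<void**>(&hostBuf), batches * batchBytes,
                     hipHostMallocDefault);
       hipMalloc(reinterpret_cast<void**>(&devBuf), batches * batchBytes);

       hipStream_t streams[batches];
       hipEvent_t  events[batches];
       for (int b = 0; b < batches; ++b) {
           hipStreamCreate(&streams[b]);
           hipEventCreate(&events[b]);
       }

       for (int b = 0; b < batches; ++b) {
           float* h = hostBuf + b * batchSize;
           float* d = devBuf + b * batchSize;
           // Copy one batch in, process it, and copy it back, all on the same
           // stream so the three operations stay ordered for that batch while
           // other batches are free to overlap on their own streams.
           hipMemcpyAsync(d, h, batchBytes, hipMemcpyHostToDevice, streams[b]);
           process<<<batchSize / 256, 256, 0, streams[b]>>>(d, batchSize);
           hipMemcpyAsync(h, d, batchBytes, hipMemcpyDeviceToHost, streams[b]);
           hipEventRecord(events[b], streams[b]);
       }

       for (int b = 0; b < batches; ++b) {
           // Block the host only until the work recorded for this batch is done.
           hipEventSynchronize(events[b]);
           hipEventDestroy(events[b]);
           hipStreamDestroy(streams[b]);
       }

       hipHostFree(hostBuf);
       hipFree(devBuf);
       return 0;
   }

Whether the copies and kernels actually overlap depends on the device and the scheduler; the pattern only expresses the opportunity for overlap that the patched text describes.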