[GPU/OpenCL] Initial version of FC Layer with OpenCL ops #2570
Conversation
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2570. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.
api/ccapi/include/layer.h (outdated)
/**
 * @brief Helper function to create fully connected layer for GPU
 */
inline std::unique_ptr<Layer> FullyConnectedCl(
Not this PR, but how about creating the CL layer with the same interface, with a device option like CPU/GPU?
createLayer(const LayerType &type,
const std::vector<std::string> &properties = {},
const LayerComputeEngine &compute_engine = LayerComputeEngine::CPU);
This will set lnode->setComputeEngine(compute_engine), which can be used inside layer classes through the RunLayerContext reference. We can use that to have conditions within a layer class instead of having separate classes for CPU/GPU.
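For illustration, a minimal sketch of such an in-layer condition, assuming the engine chosen at createLayer() is exposed through the RunLayerContext reference (the accessor name getComputeEngine() here is an assumption, not an existing API):

void FullyConnectedLayer::forwarding(RunLayerContext &context, bool training) {
  // Hypothetical accessor: read back the compute engine selected at createLayer() time.
  if (context.getComputeEngine() == LayerComputeEngine::GPU) {
    forwarding_cl(context, training);  // OpenCL kernels introduced in this PR
  } else {
    forwarding_cpu(context, training); // existing CPU path
  }
}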
@@ -0,0 +1,616 @@
/**
It might be better to use SPDX-License-Identifier: Apache-2.0
Updated commit.
} while (false);
}

void FullyConnectedLayerCl::incremental_forwarding(RunLayerContext &context,
It seems like incremental_forwarding does not use the GPU. I'm a little worried: currently we are using incremental_forwarding for the LLaMA application due to the auto-regressive nature of LLMs. Could you check once again?
Added a GPU kernel call for incremental_forwarding as well in the updated commit.
return cl_ret;
}

void FullyConnectedLayerCl::fc_sgemm_cl(const float *A, const float *B,
I wonder if we can use these blas cl operations in other layers?
Added in latest commit
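As a rough sketch, the shared entry points could live in a common header along these lines, pieced together from the signatures quoted in this thread (the exact sgemm parameter list and namespace placement are assumptions):

// blas_kernels.h (sketch) -- shared OpenCL BLAS helpers reusable from any layer
#include <layer_context.h>

namespace nntrainer {

/// y = A * x for an A of dim1 x dim2 (signature as quoted later in this thread)
void sgemv_cl(const float *matAdata, const float *vecXdata, float *vecYdata,
              unsigned int dim1, unsigned int dim2, unsigned int lda,
              RunLayerContext &context);

/// dot product of two vectors of length dim1
float dot_cl(const float *vecAdata, const float *vecXdata, unsigned int dim1,
             RunLayerContext &context);

/// C = A * B with the usual M/N/K sizes and leading dimensions (parameter list assumed)
void sgemm_cl(const float *A, const float *B, float *C, unsigned int M,
              unsigned int N, unsigned int K, unsigned int lda,
              unsigned int ldb, unsigned int ldc, RunLayerContext &context);

} // namespace nntrainer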
#include <node_exporter.h>
#include <util_func.h>

std::string fc_sgemv_cl_kernel_ =
Can we make a binary CL kernel to reduce online compilation time?
This implementation is present in the current code. The program.CreateCLProgram call from RunLayerContext::clCreateKernel will create kernel binaries on the first run and reuse them on subsequent runs.
We might need to set the proper directory to save the CL Kernel Binary then.
Right. As of now it is being set by opencl-kernel-path in meson_options.
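For reference, the standard OpenCL mechanism behind such caching is to query the compiled program binary and write it to that path, so later runs can load it with clCreateProgramWithBinary instead of recompiling; a generic single-device sketch (not the nntrainer wrapper itself):

#include <CL/cl.h>
#include <fstream>
#include <vector>

// Dump a compiled single-device cl_program to disk for reuse on later runs.
bool save_program_binary(cl_program program, const char *path) {
  size_t bin_size = 0;
  if (clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size),
                       &bin_size, nullptr) != CL_SUCCESS || bin_size == 0)
    return false;
  std::vector<unsigned char> binary(bin_size);
  unsigned char *bin_ptr = binary.data();
  if (clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bin_ptr), &bin_ptr,
                       nullptr) != CL_SUCCESS)
    return false;
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char *>(binary.data()),
            static_cast<std::streamsize>(bin_size));
  return out.good();
}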
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Added naive version of OpenCL implementation for FC Layer. Incorporated separate kernels for ops used. Added unit test for fc_layer_cl. Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Added incremental forwarding as an option for unit testing layers Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Added blas_kernels to enhance reusability of the common BLAS kernels. Used FullyConnected interface for both CPU and GPU calls. Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
LGTM!
Renamed global variables in unittest_layers_fully_connected_cl.cpp to fix duplicate declaration error Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
__global float* Y, unsigned int M, unsigned int N) {
unsigned int i, j;
i = get_global_id(0);
float y0 = Y[i] * 0.0f;
float y0 = 0.0f;
?
Updated in latest commit. Thanks.
})"; | ||
|
||
std::string dot_cl_kernel_ = | ||
R"(__kernel void dot_cl(const __global float* A, const __global float* X, unsigned int K, float res) { |
float &res
?
Thanks for noticing; the result was not being fetched properly. Also, the suggested syntax won't work since OpenCL buffers cannot reference host memory. Used a global memory pointer to fix the issue.
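A sketch of the fixed kernel shape, with the result written through a __global pointer into a one-element buffer that the host reads back (the single-work-item reduction here is a simplification and may differ from the actual commit):

std::string dot_cl_kernel_ =
  R"(__kernel void dot_cl(const __global float* A, const __global float* X,
                          unsigned int K, __global float* res) {
      // Accumulate the dot product and store it where the host can read it back.
      float sum = 0.0f;
      for (unsigned int k = 0; k < K; ++k)
        sum += A[k] * X[k];
      res[0] = sum;
    })";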
unsigned int m = get_global_id(0);
for (unsigned int n = 0; n < N; ++n) {
float c = 0.0;
float c_old = C[m * ldc + n];
Seems unused:
float c_old = C[m * ldc + n];
Removed. Thanks.
* @param[in] M number of op(A)'s and C's rows
* @param[in] N number of op(B)'s and C's columns
* @param[in] K number of op(A)'s columns and op(B)'s rows
* @param[in] context RunLayerContext reference
Could you add descriptions for lda, ldb, and ldc as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in the latest commit.
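For readers of this thread: the leading dimensions describe the stride (in elements) between consecutive rows of each matrix as laid out in memory, so the added lines presumably read roughly like this (wording illustrative, not the exact text of the commit):

 * @param[in] lda leading dimension of A (row stride of A in memory)
 * @param[in] ldb leading dimension of B (row stride of B in memory)
 * @param[in] ldc leading dimension of C (row stride of C in memory)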
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Fixed kernel argument bug for dot_cl kernel Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
* @param[in] dim1 number of A's row
* @param[in] dim2 number of X's columns
Following your code, it seems A's dim = (dim2 x dim1), X's = (dim1,), and Y's = (dim2,) if the vectors are column vectors. Please check.
- * @param[in] dim1 number of A's row
- * @param[in] dim2 number of X's columns
+ * @param[in] dim1 number of A's columns
+ * @param[in] dim2 number of A's rows
Corrected. Thanks.
* @param[in] dim1 number of elements in both input vectors
* @param[in] context RunLayerContext reference
*/
float dot_cl(const float *matAdata, const float *vecXdata, unsigned int dim1,
It seems better to change the variable name as follows:
- float dot_cl(const float *matAdata, const float *vecXdata, unsigned int dim1,
+ float dot_cl(const float *vecAdata, const float *vecXdata, unsigned int dim1,
__global float* C, unsigned int M, unsigned int N, unsigned int K, unsigned int lda, unsigned int ldb, unsigned int ldc) {

unsigned int m = get_global_id(0);
for (unsigned int n = 0; n < N; ++n) {
Quick question: is there room for optimization as follows, or would it make no difference?
- for (unsigned int n = 0; n < N; ++n) {
+ unsigned int n = get_global_id(1);
Yes, it can be optimized with a 2D global work size; I have added this change. However, using a 2D work size may sometimes introduce unnecessary overhead and reduce kernel performance, depending on the memory access pattern. Going forward we can experiment with large models and use the optimal work size based on the measured performance.
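For illustration, the 2D-indexed variant looks roughly like this inside the kernel, with one work-item per output element of C; on the host side it corresponds to launching with a two-dimensional global work size of {M, N} (work_dim = 2 in clEnqueueNDRangeKernel terms):

// 2D indexing sketch: each work-item computes one element C[m][n]
unsigned int m = get_global_id(0);
unsigned int n = get_global_id(1);
float c = 0.0f;
for (unsigned int k = 0; k < K; ++k)
  c += A[m * lda + k] * B[k * ldb + n];
C[m * ldc + n] = c;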
size_t dim1_size = sizeof(float) * dim1;
size_t dim2_size = sizeof(float) * dim2;
opencl::Buffer inputA(context.context_inst_, dim1_size * dim2_size, true,
inputA size would be dim1 * dim2 * sizeof(float) in general. What makes it different for OpenCL to require the size as dim1 * dim2 * sizeof(float) * sizeof(float)?
Fixed this bug. Thanks for noticing.
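In other words, the buffer should receive the element count multiplied by sizeof(float) once; with the names from the snippet above:

// dim1 x dim2 floats occupy dim1 * dim2 * sizeof(float) bytes;
// dim1_size * dim2_size would multiply by sizeof(float) twice.
size_t a_bytes = sizeof(float) * dim1 * dim2;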
Used proper size while creating OpenCL buffers. Optimized SGEMM kernel with 2D global work size. Modified function docs. Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
LGTM
LGTM
LGTM!
std::string sgemv_cl_kernel_ =
R"(__kernel void sgemv_cl(const __global float* A, const __global float* X,
__global float* Y, unsigned int M, unsigned int N) {
unsigned int i;
Just curious and trivial, but why don't we use lda here while sgemm_cl_kernel uses lda and ldb? I can even observe lda in:
// @ same file
...
void sgemv_cl(const float *matAdata, const float *vecXdata, float *vecYdata,
              unsigned int dim1, unsigned int dim2, unsigned int lda,
              RunLayerContext &context);
...
Adding params like lda and ldb would be much more helpful for future development, as far as I am concerned.
Sure, this can be added in the future to make it more generic.
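For future reference, an lda-aware version of the kernel could look roughly like this, assuming a row-major M x N matrix A (illustrative only, not part of this PR):

__kernel void sgemv_cl(const __global float* A, const __global float* X,
                       __global float* Y, unsigned int M, unsigned int N,
                       unsigned int lda) {
  // One work-item per output row; lda lets A be a padded matrix or a sub-view.
  unsigned int i = get_global_id(0);
  float y0 = 0.0f;
  for (unsigned int j = 0; j < N; ++j)
    y0 += A[i * lda + j] * X[j];
  Y[i] = y0;
}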
Added initial version of FC Layer for GPU. This is a basic implementation using naive kernels.
Changes added with this PR:
- Added fc_layer_cl.cpp containing the new FullyConnectedLayerCl class for the OpenCL implementation.
- registerFactory in cl_context to add FullyConnectedLayerCl.
- FullyConnected API in layer.h for CPU/GPU execution depending on the compute engine option (a usage sketch follows this list).
- blas_kernels to enhance re-usability.
- LayerKernel enum inside layer_context.h.
- unittest_layers_fully_connected_cl.cpp to test FC Layer on GPU.
- layers_golden_tests.cpp
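A minimal usage sketch of that compute-engine option, assuming the FullyConnected helper gained an engine parameter shaped like the createLayer overload discussed earlier in this thread (the property string and parameter order are placeholders):

#include <layer.h>

// Hypothetical usage: request the OpenCL path for the FC layer via the
// compute-engine option; omitting it would fall back to the default CPU engine.
auto fc_gpu = ml::train::layer::FullyConnected(
  {"unit=3"}, ml::train::LayerComputeEngine::GPU);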
Self evaluation:
Signed-off-by: Debadri Samaddar [email protected]