[GPU/OpenCL] Initial version of FC Layer with OpenCL ops #2570
Conversation
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2570. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and the wiki page. To monitor the progress status of your PR in more detail, visit http://ci.nnstreamer.ai/.
api/ccapi/include/layer.h (outdated)
/**
 * @brief Helper function to create fully connected layer for GPU
 */
inline std::unique_ptr<Layer> FullyConnectedCl(
Not this PR, but how about creating the CL layer with the same interface, with a device option like CPU/GPU?
createLayer(const LayerType &type,
const std::vector<std::string> &properties = {},
const LayerComputeEngine &compute_engine = LayerComputeEngine::CPU);
This will set lnode->setComputeEngine(compute_engine), which can be used inside layer classes through the RunLayerContext reference. We can use that to have conditions within a layer class instead of having separate classes for CPU/GPU.
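For illustration, a minimal sketch of such an in-layer condition, assuming the engine chosen at createLayer() is exposed through the RunLayerContext reference (the accessor name getComputeEngine() here is an assumption, not an existing API):

void FullyConnectedLayer::forwarding(RunLayerContext &context, bool training) {
  // Hypothetical accessor: read back the compute engine selected at createLayer() time.
  if (context.getComputeEngine() == LayerComputeEngine::GPU) {
    forwarding_cl(context, training);  // OpenCL kernels introduced in this PR
  } else {
    forwarding_cpu(context, training); // existing CPU path
  }
}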
@@ -0,0 +1,616 @@
/**
It might be better to use SPDX-License-Identifier: Apache-2.0
Updated commit.
} while (false);
}

void FullyConnectedLayerCl::incremental_forwarding(RunLayerContext &context,
It seems like incremental_forwarding does not use the GPU. I'm a little worried: currently we are using incremental_forwarding for the LLaMA application due to the auto-regressive nature of LLMs. Could you check once again?
Added a GPU kernel call for incremental_forwarding as well in the updated commit.
return cl_ret;
}

void FullyConnectedLayerCl::fc_sgemm_cl(const float *A, const float *B,
I wonder if we can use these blas cl operations in other layers?
Added in latest commit
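As a rough sketch, the shared entry points could live in a common header along these lines, pieced together from the signatures quoted in this thread (the exact sgemm parameter list and namespace placement are assumptions):

// blas_kernels.h (sketch) -- shared OpenCL BLAS helpers reusable from any layer
#include <layer_context.h>

namespace nntrainer {

/// y = A * x for an A of dim1 x dim2 (signature as quoted later in this thread)
void sgemv_cl(const float *matAdata, const float *vecXdata, float *vecYdata,
              unsigned int dim1, unsigned int dim2, unsigned int lda,
              RunLayerContext &context);

/// dot product of two vectors of length dim1
float dot_cl(const float *vecAdata, const float *vecXdata, unsigned int dim1,
             RunLayerContext &context);

/// C = A * B with the usual M/N/K sizes and leading dimensions (parameter list assumed)
void sgemm_cl(const float *A, const float *B, float *C, unsigned int M,
              unsigned int N, unsigned int K, unsigned int lda,
              unsigned int ldb, unsigned int ldc, RunLayerContext &context);

} // namespace nntrainer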
#include <node_exporter.h>
#include <util_func.h>

std::string fc_sgemv_cl_kernel_ =
Can we make a binary CL kernel to reduce online compilation time?
This implementation is present in the current code. The program.CreateCLProgram call from RunLayerContext::clCreateKernel will create kernel binaries on the first run and reuse them on subsequent runs.
We might need to set the proper directory to save the CL Kernel Binary then.
Right. As of now it is being set by opencl-kernel-path in meson_options.
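For reference, the standard OpenCL mechanism behind such caching is to query the compiled program binary and write it to that path, so later runs can load it with clCreateProgramWithBinary instead of recompiling; a generic single-device sketch (not the nntrainer wrapper itself):

#include <CL/cl.h>
#include <fstream>
#include <vector>

// Dump a compiled single-device cl_program to disk for reuse on later runs.
bool save_program_binary(cl_program program, const char *path) {
  size_t bin_size = 0;
  if (clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size),
                       &bin_size, nullptr) != CL_SUCCESS || bin_size == 0)
    return false;
  std::vector<unsigned char> binary(bin_size);
  unsigned char *bin_ptr = binary.data();
  if (clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bin_ptr), &bin_ptr,
                       nullptr) != CL_SUCCESS)
    return false;
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char *>(binary.data()),
            static_cast<std::streamsize>(bin_size));
  return out.good();
}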
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Added naive version of OpenCL implementation for FC Layer. Incorporated separate kernels for ops used. Added unit test for fc_layer_cl. Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Added incremental forwarding as an option for unit testing layers Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Added blas_kernels to enhance reusability of the common BLAS kernels. Used FullyConnected interface for both CPU and GPU calls. Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
LGTM!
Renamed global variables in unittest_layers_fully_connected_cl.cpp to fix duplicate declaration error Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
__global float* Y, unsigned int M, unsigned int N) {
unsigned int i, j;
i = get_global_id(0);
float y0 = Y[i] * 0.0f;
float y0 = 0.0f;
?
Updated in latest commit. Thanks.
})"; | ||
|
||
std::string dot_cl_kernel_ = | ||
R"(__kernel void dot_cl(const __global float* A, const __global float* X, unsigned int K, float res) { |
float &res
?
Thanks for noticing; the result was not being fetched properly. Also, the suggested syntax won't work since OpenCL buffers cannot reference host memory. Used a global memory pointer to fix the issue.
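A sketch of the fixed kernel shape, with the result written through a __global pointer into a one-element buffer that the host reads back (the single-work-item reduction here is a simplification and may differ from the actual commit):

std::string dot_cl_kernel_ =
  R"(__kernel void dot_cl(const __global float* A, const __global float* X,
                          unsigned int K, __global float* res) {
      // Accumulate the dot product and store it where the host can read it back.
      float sum = 0.0f;
      for (unsigned int k = 0; k < K; ++k)
        sum += A[k] * X[k];
      res[0] = sum;
    })";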
unsigned int m = get_global_id(0);
for (unsigned int n = 0; n < N; ++n) {
float c = 0.0;
float c_old = C[m * ldc + n];
Seems unused:
float c_old = C[m * ldc + n];
Removed. Thanks.
* @param[in] M number of op(A)'s and C's rows
* @param[in] N number of op(B)'s and C's columns
* @param[in] K number of op(A)'s columns and op(B)'s rows
* @param[in] context RunLayerContext reference
Could you add descriptions for lda, ldb, and ldc as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in the latest commit.
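For readers of this thread: the leading dimensions describe the stride (in elements) between consecutive rows of each matrix as laid out in memory, so the added lines presumably read roughly like this (wording illustrative, not the exact text of the commit):

 * @param[in] lda leading dimension of A (row stride of A in memory)
 * @param[in] ldb leading dimension of B (row stride of B in memory)
 * @param[in] ldc leading dimension of C (row stride of C in memory)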
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
Fixed kernel argument bug for dot_cl kernel Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
* @param[in] dim1 number of A's row
* @param[in] dim2 number of X's columns
Following your code, it seems A's dim = (dim2 x dim1), X's = (dim1,), and Y's = (dim2,) if the vectors are column vectors. Please check.
- * @param[in] dim1 number of A's row
- * @param[in] dim2 number of X's columns
+ * @param[in] dim1 number of A's columns
+ * @param[in] dim2 number of A's rows
Corrected. Thanks.
* @param[in] dim1 number of elements in both input vectors
* @param[in] context RunLayerContext reference
*/
float dot_cl(const float *matAdata, const float *vecXdata, unsigned int dim1,
It seems better to change the variable name as follows:
- float dot_cl(const float *matAdata, const float *vecXdata, unsigned int dim1,
+ float dot_cl(const float *vecAdata, const float *vecXdata, unsigned int dim1,
__global float* C, unsigned int M, unsigned int N, unsigned int K, unsigned int lda, unsigned int ldb, unsigned int ldc) {

unsigned int m = get_global_id(0);
for (unsigned int n = 0; n < N; ++n) {
Quick question: is there room for optimization as follows, or would it make no difference?
- for (unsigned int n = 0; n < N; ++n) {
+ unsigned int n = get_global_id(1);
Yes, it can be optimized with a 2D global work size; I have added this change. However, using a 2D work size may sometimes introduce unnecessary overhead and reduce kernel performance, depending on the memory access pattern. Going forward we can experiment with large models and use the optimal work size based on the measured performance.
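For illustration, the 2D-indexed variant looks roughly like this inside the kernel, with one work-item per output element of C; on the host side it corresponds to launching with a two-dimensional global work size of {M, N} (work_dim = 2 in clEnqueueNDRangeKernel terms):

// 2D indexing sketch: each work-item computes one element C[m][n]
unsigned int m = get_global_id(0);
unsigned int n = get_global_id(1);
float c = 0.0f;
for (unsigned int k = 0; k < K; ++k)
  c += A[m * lda + k] * B[k * ldb + n];
C[m * ldc + n] = c;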
size_t dim1_size = sizeof(float) * dim1;
size_t dim2_size = sizeof(float) * dim2;
opencl::Buffer inputA(context.context_inst_, dim1_size * dim2_size, true,
inputA size would be dim1 * dim2 * sizeof(float) in general. What makes it different for OpenCL to require the size as dim1 * dim2 * sizeof(float) * sizeof(float)?
Fixed this bug. Thanks for noticing.
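In other words, the buffer should receive the element count multiplied by sizeof(float) once; with the names from the snippet above:

// dim1 x dim2 floats occupy dim1 * dim2 * sizeof(float) bytes;
// dim1_size * dim2_size would multiply by sizeof(float) twice.
size_t a_bytes = sizeof(float) * dim1 * dim2;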
Used proper size while creating OpenCL buffers. Optimized SGEMM kernel with 2D global work size. Modified function docs. Signed-off-by: Debadri Samaddar <[email protected]>
@s-debadri, 💯 All CI checkers are successfully verified. Thanks.
LGTM
LGTM
LGTM!
std::string sgemv_cl_kernel_ =
R"(__kernel void sgemv_cl(const __global float* A, const __global float* X,
__global float* Y, unsigned int M, unsigned int N) {
unsigned int i;
Just curious and trivial, but why don't we use lda here while sgemm_cl_kernel uses lda and ldb? I can even observe lda in:
// @ same file
...
void sgemv_cl(const float *matAdata, const float *vecXdata, float *vecYdata,
              unsigned int dim1, unsigned int dim2, unsigned int lda,
              RunLayerContext &context);
...
Adding params like lda and ldb would be much more helpful for future development, as far as I am concerned.
Sure, this can be added in the future to make it more generic.
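For future reference, an lda-aware version of the kernel could look roughly like this, assuming a row-major M x N matrix A (illustrative only, not part of this PR):

__kernel void sgemv_cl(const __global float* A, const __global float* X,
                       __global float* Y, unsigned int M, unsigned int N,
                       unsigned int lda) {
  // One work-item per output row; lda lets A be a padded matrix or a sub-view.
  unsigned int i = get_global_id(0);
  float y0 = 0.0f;
  for (unsigned int j = 0; j < N; ++j)
    y0 += A[i * lda + j] * X[j];
  Y[i] = y0;
}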
Added initial version of FC Layer for GPU. This is a basic implementation using naive kernels.
Changes added with this PR:
- Added fc_layer_cl.cpp containing the new FullyConnectedLayerCl class for the OpenCL implementation.
- registerFactory in cl_context to add FullyConnectedLayerCl.
- FullyConnected API in layer.h for CPU/GPU execution depending on the compute engine option (a usage sketch follows this list).
- blas_kernels to enhance re-usability.
- LayerKernel enum inside layer_context.h.
- unittest_layers_fully_connected_cl.cpp to test FC Layer on GPU.
- layers_golden_tests.cpp
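A minimal usage sketch of that compute-engine option, assuming the FullyConnected helper gained an engine parameter shaped like the createLayer overload discussed earlier in this thread (the property string and parameter order are placeholders):

#include <layer.h>

// Hypothetical usage: request the OpenCL path for the FC layer via the
// compute-engine option; omitting it would fall back to the default CPU engine.
auto fc_gpu = ml::train::layer::FullyConnected(
  {"unit=3"}, ml::train::LayerComputeEngine::GPU);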
Self evaluation:
Signed-off-by: Debadri Samaddar [email protected]