[ BLAS ] Implement elementwise operations #2474
Conversation
📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2474. Please follow the 1commit/1PR (one commit per PR) policy to get comments from reviewers quickly. Your PR must pass all verification processes of cibot before a review by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.
@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.
Overall, looks good to me. There are a few things to point out.
- We should validate this code by replacing the current std::transform implementation in the Tensor class with this, especially divide().
- After 1, we should check the performance difference for the non-NEON cases; in other words, check the effect of replacing std::transform() with a for loop.
- Please resolve the conflict!
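The comparison suggested above can be sketched as follows (a minimal sketch; `ele_mul_loop` and `ele_mul_transform` are illustrative names, not the PR's actual symbols — both compute Z[i] = X[i] * Y[i], and the open question is their relative performance on non-NEON builds):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Raw-loop variant, in the style of the new ele_* kernels.
void ele_mul_loop(std::size_t N, const float *X, const float *Y, float *Z) {
  for (std::size_t i = 0; i < N; ++i)
    Z[i] = X[i] * Y[i];
}

// std::transform variant, in the style currently used inside the Tensor class.
void ele_mul_transform(std::size_t N, const float *X, const float *Y,
                       float *Z) {
  std::transform(X, X + N, Y, Z, std::multiplies<float>());
}
```

Both should produce identical results; any difference would be purely in generated code and timing.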
@@ -400,12 +381,64 @@ void scopy_int8_to_float16(const unsigned int N, const uint8_t *X,
   copy_int8_to_fp16(N, X, incX, Y, incY);
 }

-void ewvm(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z) {
-  ewvm_FP16(N, X, Y, Z);
+void ele_mul(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,
Suggested change:
- void ele_mul(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,
+ void elementwise_multiply(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,

One suggestion: how about renaming it to something clearer?
I asked many contributors for their opinions offline, and the shorter function name is generally preferred.
ele_* sounds clear enough to me.
if (std::abs(beta) > __FLT_MIN__)
  Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] +
         static_cast<_FP16>(beta) * Z[i];
else
  Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i];
any reason to differentiate beta == 0 case?
Suggested change:
- if (std::abs(beta) > __FLT_MIN__)
-   Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] +
-          static_cast<_FP16>(beta) * Z[i];
- else
-   Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i];
+ Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] + static_cast<_FP16>(beta) * Z[i];
what if Z[i] != Z[i] ? (NaN in Z[i], or uninitialized Z)
regardless of beta, wouldn't it cause an error anyway?
If beta is zero, Z = X * Y + beta * Z reduces to Z = X * Y.
If not, Z = X * Y + beta * Z.
For the case where Z[i] is NaN or Z is uninitialized, it would cause an error.
isn't NaN * 0 = NaN ?
Yeah, what I mean is that using a beta != 0 condition to avoid NaN or uninitialized-value errors seems offbeat.
maybe we could come up with a different way to handle these cases (e.g., check if tensor is initialized when using beta).
Sounds more reasonable.
I will try to fix it while resolving the conflicts.
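The NaN concern above can be reproduced with an fp32 analogue of the kernel (a sketch; `ele_mul_f32` is an illustrative name, not the PR's actual symbol). Skipping the beta term when beta == 0 keeps a NaN or uninitialized Z[i] out of the result, because 0.0f * NaN is still NaN under IEEE 754:

```cpp
#include <cfloat>
#include <cmath>

// fp32 sketch of the discussed kernel: Z = alpha * X * Y + beta * Z.
// The beta branch is exactly the point under review: without it, a NaN
// already sitting in Z[i] would propagate even when beta == 0.
void ele_mul_f32(unsigned int N, const float *X, const float *Y, float *Z,
                 float alpha = 1.f, float beta = 0.f) {
  for (unsigned int i = 0; i < N; ++i) {
    if (std::fabs(beta) > FLT_MIN)
      Z[i] = alpha * X[i] * Y[i] + beta * Z[i];
    else
      Z[i] = alpha * X[i] * Y[i]; // beta term dropped: stale Z[i] ignored
  }
}
```

With beta == 0 the branch ignores whatever Z held; with beta != 0 a NaN in Z propagates into the output, which is the "offbeat" behavior the reviewer points at.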
 * @param[in] X _FP16 * for Vector X
 * @param[in] Y _FP16 * for Vector Y
 * @param[in] Z _FP16 * for Vector Z
 * @param[in] alpha scalar multiplier for input
quick question. it seems like scalars are added only for Y and Z. wouldn't there be cases where X also needs a scalar?
also, why is scalar added in the first place?
- Good point. Usually, strict BLAS functions only take X and Y, not Z, so it would go like: $X = X * \alpha \ (op)\ Y * \beta$. But the current function usage in NNTrainer requires all of $X, Y, Z$. I thought of adding a new scalar multiplier $\gamma$, but there is no such case for now; maybe we can add it whenever we need it.
- From the tensor operation side, we differentiate whether to use std::transform or ele_* by stride and alpha / beta. For broader use of this function, the scalar multipliers are necessary.
Thanks for the clarification :)
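As a side note on the question about a scalar for X: for elementwise multiplication, a separate multiplier on X would fold into alpha anyway, since scaling one input scales the product by the same factor. A trivial numeric sketch (illustrative helper names, not project code):

```cpp
// Scaling X by gamma before the multiply is indistinguishable from scaling
// the product, so a single alpha covers it for ele_mul-style kernels.
float mul_scaled_x(float gamma, float x, float y) { return (gamma * x) * y; }
float mul_scaled_prod(float gamma, float x, float y) { return gamma * (x * y); }
```

For addition this folding does not hold, which is presumably why the question of a third multiplier only matters there.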
@@ -1076,7 +1076,7 @@ Tensor &Tensor::add(Tensor const &m, Tensor &output, float const alpha) const {
   auto f = [&](const BroadcastInfo &e, const float *buf, const float *m_buf,
                float *out_buf) {
     if (e.strides[3] == 1 && strides[3] == 1 && strides[3] == 1 &&
-        alpha == 0) {
+        alpha == 1.f) {
could you explain why this condition has changed?
We have been using this code wrongly. alpha is an input scalar multiplier, so considering our total formula, alpha == 0 is nonsense, isn't it?
Makes sense. Then, so far, ele_add (previously ewva) hasn't been used?
As far as I am concerned, that is correct
Nice catch!
I already conducted this with tensor unittests, and discovered that:
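The corrected dispatch can be sketched as follows (a hypothetical mock of Tensor::add's contiguous fast path; all names are illustrative). With the old `alpha == 0` test the fast path was effectively dead, while `alpha == 1.f` selects it exactly when output = input + alpha * m needs no extra scaling:

```cpp
#include <cstddef>

// Fallback covering arbitrary alpha.
void add_scaled(std::size_t N, const float *a, const float *m, float alpha,
                float *out) {
  for (std::size_t i = 0; i < N; ++i)
    out[i] = a[i] + alpha * m[i];
}

// Fast path, valid only for alpha == 1 (stand-in for ele_add / ewva).
void ele_add_fast(std::size_t N, const float *a, const float *m, float *out) {
  for (std::size_t i = 0; i < N; ++i)
    out[i] = a[i] + m[i];
}

// Corrected dispatch: take the fast path when alpha == 1.f, not alpha == 0.
void add_dispatch(std::size_t N, const float *a, const float *m, float alpha,
                  float *out) {
  if (alpha == 1.f)
    ele_add_fast(N, a, m, out);
  else
    add_scaled(N, a, m, alpha, out);
}
```

Either branch must give the same result for alpha == 1; the fix just routes the common case onto the elementwise kernel.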
Commits:
- …fp32: rename ewvm, ewva to ele_mul and ele_add; implement the fp32 case as well; add scalar multiplier alpha, beta parameters. **Self evaluation:** 1. Build test: [X]Passed 2. Run test: [X]Passed. Signed-off-by: skykongkong8 <[email protected]>
- It is quite common to use a scalar multiplier in elementwise addition and multiplication. However, in the case of multiplier beta, if the output vector Z is set to NaN, it might produce invalid values.
- This commit introduces a basic structure for the elementwise subtraction and division functions; the function implementation will be added in a later commit.
- The latest elementwise functions take alpha and beta as scalar multipliers. With those parameters, the formula in the function brief can be described more precisely.
- Implement the elementwise subtraction and division functions based on the function structures proposed in the previous commit.
- This commit introduces a basic structure for the elementwise subtraction and division functions in blas_interface; the function implementation will be added in a later commit.
- Implement the elementwise subtraction and division functions based on the function structures proposed in the previous commit; with NEON, we can use the SIMD-accelerated functions from blas_neon.
- …lar multiplier: the default value of scalar multiplier alpha should be 1, not 0.
- With the default output scalar multiplier value beta, output Z might contain NaN values.
- According to the discussions in nnstreamer#2473, we found a better way of comparing float scalar multipliers using __FLT_MIN__.
- Like commit#7363546, the alpha option in ewva should be set to 1, not 0. Change function names: ew* -> ele_*.
LGTM!
Implemented elementwise vector operations (ele_add, previously ewva; ele_mul, previously ewvm) with fp32 / fp16 precisions:
- for current BLAS, implemented with a raw version
- for NEON SIMD, implemented with a SIMD version
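The raw (non-NEON) fp32 path for the four operations touched by this PR can be sketched as follows (illustrative minimal signatures; the actual kernels also take alpha/beta multipliers and have _FP16 overloads):

```cpp
#include <cstddef>

// Minimal raw-loop versions of the four elementwise kernels.
void ele_add(std::size_t N, const float *X, const float *Y, float *Z) {
  for (std::size_t i = 0; i < N; ++i) Z[i] = X[i] + Y[i];
}
void ele_sub(std::size_t N, const float *X, const float *Y, float *Z) {
  for (std::size_t i = 0; i < N; ++i) Z[i] = X[i] - Y[i];
}
void ele_mul(std::size_t N, const float *X, const float *Y, float *Z) {
  for (std::size_t i = 0; i < N; ++i) Z[i] = X[i] * Y[i];
}
void ele_div(std::size_t N, const float *X, const float *Y, float *Z) {
  for (std::size_t i = 0; i < N; ++i) Z[i] = X[i] / Y[i];
}
```

On NEON builds these loops are replaced by the SIMD-accelerated versions from blas_neon, but the semantics stay the same.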