-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature(lvgl_port): RGB888 SIMD fill #480
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pborcin thanks for the changes. Nice work with the switch-case implementation. I have never tried that myself (did not have a proper use case) but this looks like a good one.
Sorry for the review comments regarding your style of commenting, but as you might have seen the assembly is very hard be read and to be navigated through and commenting makes it bit easier for us to know what is going on and which buffer are you aligning and where to are you jumping. Thank you.
I will also try the test app and let you know.
#include "lv_draw_sw_blend.h" | ||
#include "lv_math.h" | ||
#include "lv_color.h" | ||
#include "string.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added lv_string.h
in #472 from lvgl, since it contains some general HW optimized memcpy()
and memset()
, (lv_memcpy()
and lv_memset()
).
As already mentioned, once my MR is merged, you should update this file to use the LVGL's lv_memset()
and lv_memcpy()
.
It is important, because right now, the std memcpy()
from string.h
is used is the LV_DRAW_SW_COLOR_BLEND_TO_RGB888
ANSI C implementation, the very implementation you are doing benchmark tests against.
Thus, the benchmark results of the LV_DRAW_SW_COLOR_BLEND_TO_RGB888
ANSI C implementation are not aligned with the LVGL's upstream ANSI C, and you are not getting "the real" benchmark results which you would normally get by running ANSI C blending functions in LVGL.
| RGB888 | 128x128 | 16 byte | 0.608 | 2.247 | | ||
| | 127x127 | 1 byte | 0.826 | 2.413 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this results should be updated after using lv_memcpy()
// TODO: lv_memcpy | ||
memcpy(dest_buf_u8, dest_buf_ori, w); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TODO: lv_memcpy | |
memcpy(dest_buf_u8, dest_buf_ori, w); | |
lv_memcpy(dest_buf_u8, dest_buf_ori, w); |
if (LV_RESULT_INVALID == LV_DRAW_SW_RGB888_BLEND_NORMAL_TO_RGB888(dsc, dest_px_size, src_px_size)) { | ||
if (src_px_size == dest_px_size) { | ||
for (y = 0; y < h; y++) { | ||
memcpy(dest_buf, src_buf, w); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memcpy(dest_buf, src_buf, w); | |
lv_memcpy(dest_buf, src_buf, w); |
|
||
TEST_CASE("LV Fill benchmark RGB888", "[fill][benchmark][RGB888]") | ||
{ | ||
uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 + UNALIGN_BYTES); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 + UNALIGN_BYTES); | |
uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 * sizeof(uint8_t) + UNALIGN_BYTES); |
TEST_CASE("Test fill functionality RGB888", "[fill][functionality][RGB888]") | ||
{ | ||
test_matrix_params_t test_matrix = { | ||
.min_w = 12, // 12 is the lower limit for the esp32s3 asm implementation, otherwise esp32 is executed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, just make sure, that the functionality test passes even with .min_w < 12
for esp32s3 assembly. Just check it locally (going from 1 to 12, or so)
// Copy q3 to q0 | ||
ee.zero.q q0 // clear q0 | ||
ee.orq q0, q0, q3 // copy q3 to q0 | ||
|
||
// Copy q4 to q1 | ||
ee.zero.q q1 // clear q1 | ||
ee.orq q1, q1, q4 // copy q4 to q1 | ||
|
||
// Copy q5 to q2 | ||
ee.zero.q q2 // clear q2 | ||
ee.orq q2, q2, q5 // copy q5 to q2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Copy q3 to q0 | |
ee.zero.q q0 // clear q0 | |
ee.orq q0, q0, q3 // copy q3 to q0 | |
// Copy q4 to q1 | |
ee.zero.q q1 // clear q1 | |
ee.orq q1, q1, q4 // copy q4 to q1 | |
// Copy q5 to q2 | |
ee.zero.q q2 // clear q2 | |
ee.orq q2, q2, q5 // copy q5 to q2 | |
// Clear q0 to q3 | |
ee.zero.q q0 // clear q0 | |
ee.zero.q q1 // clear q1 | |
ee.zero.q q2 // clear q2 | |
// Refresh q0 to q3 | |
ee.orq q0, q0, q3 // copy q3 to q0 | |
ee.orq q1, q1, q4 // copy q4 to q1 | |
ee.orq q2, q2, q5 // copy q5 to q2 |
I would prefer something like that. Due to CPU pipelining.
You are writing to q0
by ee.zero.q q0
and then in the very next step you are reading from it by ee.orq q0, q0, q3
, so you might cause the CPU to wait one cycle, because the value in the q0 is not yet ready to be read from.
The pipelining is also a reason why, for example, this main loop (and majority of the other main loops) looks the way it looks. It uses 2 q registers, instead of one to prevent CPU from waiting one cycle. I could have just easily load from memory to q0 and then subsequently save from q0 to memory. But it would cause significant delay, due to the pipelining.
In this very case, it's not a very big deal (because this is in the outer loop), but why don't we save few cycle if we can.
extui a8, a3, 0, 4 // a8 = a3 AND 0xf | ||
|
||
// if a8 = 0 skip unalignment computation | ||
bnez a8, _unaligned_dest_buff // If already aligned, jump to aligned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, fix the comment.
If (what?) already aligned, jump to (where?) aligned
j _aligned_dest_buff | ||
_unaligned_dest_buff: | ||
|
||
// length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here could be: calculate length of something?
quou a9, a10, a12 // a9 = local_dest_w_bytes (a10) DIV 48 (a12) | ||
remu a10, a10, a12 // a10 = local_dest_w_bytes (a10) remainder div 48 (a12) | ||
|
||
beqz a8, _dest_buff_aligned // If already aligned, skip aligning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If already (what) aligned, skip aligning of (what)?
This MR adds SIMD support for LV_DRAW_SW_COLOR_BLEND_TO_RGB888
ESP-BSP Pull Request checklist
Change description
esp32s3
SIMD assembly implementation of theRGB888
simple fillesp32
assembly implementation of theRGB888
simple fill