feature(lvgl_port): RGB888 SIMD fill #480

pborcin · 2025-01-14T13:43:14Z

This MR adds SIMD support for LV_DRAW_SW_COLOR_BLEND_TO_RGB888

ESP-BSP Pull Request checklist

Note: For new BSPs create a PR with this link.

Version of modified component bumped
CI passing

Change description

esp32s3 SIMD assembly implementation of the RGB888 simple fill
esp32 assembly implementation of the RGB888 simple fill
Functionality and benchmark tests
Updated README with achieved benchmark results

peter-marcisovsky

@pborcin thanks for the changes. Nice work with the switch-case implementation. I have never tried that myself (did not have a proper use case) but this looks like a good one.

Sorry for the review comments regarding your style of commenting, but as you might have seen the assembly is very hard be read and to be navigated through and commenting makes it bit easier for us to know what is going on and which buffer are you aligning and where to are you jumping. Thank you.

I will also try the test app and let you know.

peter-marcisovsky · 2025-01-15T09:56:20Z

components/esp_lvgl_port/test_apps/simd/main/lv_blend/src/lv_draw_sw_blend_to_rgb888.c

+#include "lv_draw_sw_blend.h"
+#include "lv_math.h"
+#include "lv_color.h"
+#include "string.h"


I added lv_string.h in #472 from lvgl, since it contains some general HW optimized memcpy() and memset(), (lv_memcpy() and lv_memset()).

As already mentioned, once my MR is merged, you should update this file to use the LVGL's lv_memset() and lv_memcpy().

It is important, because right now, the std memcpy() from string.h is used is the LV_DRAW_SW_COLOR_BLEND_TO_RGB888 ANSI C implementation, the very implementation you are doing benchmark tests against.

Thus, the benchmark results of the LV_DRAW_SW_COLOR_BLEND_TO_RGB888 ANSI C implementation are not aligned with the LVGL's upstream ANSI C, and you are not getting "the real" benchmark results which you would normally get by running ANSI C blending functions in LVGL.

peter-marcisovsky · 2025-01-15T09:57:22Z

components/esp_lvgl_port/test_apps/simd/README.md

+| RGB888       | 128x128     |     16 byte      |     0.608      |     2.247      |
+|              | 127x127     |      1 byte      |     0.826      |     2.413      |


Also this results should be updated after using lv_memcpy()

peter-marcisovsky · 2025-01-15T09:58:30Z

components/esp_lvgl_port/test_apps/simd/main/lv_blend/src/lv_draw_sw_blend_to_rgb888.c

+                    // TODO: lv_memcpy
+                    memcpy(dest_buf_u8, dest_buf_ori, w);


Suggested change

// TODO: lv_memcpy

memcpy(dest_buf_u8, dest_buf_ori, w);

lv_memcpy(dest_buf_u8, dest_buf_ori, w);

peter-marcisovsky · 2025-01-15T09:59:57Z

components/esp_lvgl_port/test_apps/simd/main/lv_blend/src/lv_draw_sw_blend_to_rgb888.c

+            if (LV_RESULT_INVALID == LV_DRAW_SW_RGB888_BLEND_NORMAL_TO_RGB888(dsc, dest_px_size, src_px_size)) {
+                if (src_px_size == dest_px_size) {
+                    for (y = 0; y < h; y++) {
+                        memcpy(dest_buf, src_buf, w);


Suggested change

memcpy(dest_buf, src_buf, w);

lv_memcpy(dest_buf, src_buf, w);

peter-marcisovsky · 2025-01-15T10:02:20Z

components/esp_lvgl_port/test_apps/simd/main/test_lv_fill_benchmark.c

+
+TEST_CASE("LV Fill benchmark RGB888", "[fill][benchmark][RGB888]")
+{
+    uint8_t *dest_array_align16  = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 + UNALIGN_BYTES);


Suggested change

uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 + UNALIGN_BYTES);

uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 * sizeof(uint8_t) + UNALIGN_BYTES);

peter-marcisovsky · 2025-01-15T10:11:16Z

components/esp_lvgl_port/test_apps/simd/main/test_lv_fill_functionality.c

+TEST_CASE("Test fill functionality RGB888", "[fill][functionality][RGB888]")
+{
+    test_matrix_params_t test_matrix = {
+        .min_w = 12,             // 12 is the lower limit for the esp32s3 asm implementation, otherwise esp32 is executed


Please, just make sure, that the functionality test passes even with .min_w < 12 for esp32s3 assembly. Just check it locally (going from 1 to 12, or so)

peter-marcisovsky · 2025-01-15T10:32:14Z

components/esp_lvgl_port/src/lvgl9/simd/lv_color_blend_to_rgb888_esp32s3.S

+        // Copy q3 to q0
+        ee.zero.q      q0                                   // clear q0
+        ee.orq         q0,   q0,   q3                       // copy q3 to q0
+
+        // Copy q4 to q1
+        ee.zero.q      q1                                   // clear q1
+        ee.orq         q1,   q1,   q4                       // copy q4 to q1
+
+        // Copy q5 to q2
+        ee.zero.q      q2                                   // clear q2
+        ee.orq         q2,   q2,   q5                       // copy q5 to q2


Suggested change

// Copy q3 to q0

ee.zero.q q0 // clear q0

ee.orq q0, q0, q3 // copy q3 to q0

// Copy q4 to q1

ee.zero.q q1 // clear q1

ee.orq q1, q1, q4 // copy q4 to q1

// Copy q5 to q2

ee.zero.q q2 // clear q2

ee.orq q2, q2, q5 // copy q5 to q2

// Clear q0 to q3

ee.zero.q q0 // clear q0

ee.zero.q q1 // clear q1

ee.zero.q q2 // clear q2

// Refresh q0 to q3

ee.orq q0, q0, q3 // copy q3 to q0

ee.orq q1, q1, q4 // copy q4 to q1

ee.orq q2, q2, q5 // copy q5 to q2

I would prefer something like that. Due to CPU pipelining.
You are writing to q0 by ee.zero.q q0 and then in the very next step you are reading from it by ee.orq q0, q0, q3, so you might cause the CPU to wait one cycle, because the value in the q0 is not yet ready to be read from.

The pipelining is also a reason why, for example, this main loop (and majority of the other main loops) looks the way it looks. It uses 2 q registers, instead of one to prevent CPU from waiting one cycle. I could have just easily load from memory to q0 and then subsequently save from q0 to memory. But it would cause significant delay, due to the pipelining.

In this very case, it's not a very big deal (because this is in the outer loop), but why don't we save few cycle if we can.

peter-marcisovsky · 2025-01-15T10:33:30Z

components/esp_lvgl_port/src/lvgl9/simd/lv_color_blend_to_rgb888_esp32s3.S

+        extui   a8,    a3,  0,  4                           // a8 = a3 AND 0xf
+
+        // if a8 = 0 skip unalignment computation
+        bnez    a8,    _unaligned_dest_buff                  // If already aligned, jump to aligned


Please, fix the comment.
If (what?) already aligned, jump to (where?) aligned

peter-marcisovsky · 2025-01-15T10:34:34Z

components/esp_lvgl_port/src/lvgl9/simd/lv_color_blend_to_rgb888_esp32s3.S

+            j    _aligned_dest_buff
+        _unaligned_dest_buff:
+
+        // length


Here could be: calculate length of something?

peter-marcisovsky · 2025-01-15T10:35:14Z

components/esp_lvgl_port/src/lvgl9/simd/lv_color_blend_to_rgb888_esp32s3.S

+        quou    a9,    a10,   a12                   // a9 = local_dest_w_bytes (a10) DIV 48 (a12)
+        remu    a10,   a10,   a12                   // a10 = local_dest_w_bytes (a10) remainder div 48 (a12)
+
+        beqz    a8,    _dest_buff_aligned           // If already aligned, skip aligning


If already (what) aligned, skip aligning of (what)?

feature(lvgl_port): RGB888 SIMD fill

4e49325

pborcin requested a review from peter-marcisovsky January 14, 2025 13:43

peter-marcisovsky reviewed Jan 15, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature(lvgl_port): RGB888 SIMD fill #480

feature(lvgl_port): RGB888 SIMD fill #480

pborcin commented Jan 14, 2025

peter-marcisovsky left a comment

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

peter-marcisovsky Jan 15, 2025

		\| RGB888 \| 128x128 \| 16 byte \| 0.608 \| 2.247 \|
		\| \| 127x127 \| 1 byte \| 0.826 \| 2.413 \|

	// TODO: lv_memcpy
	memcpy(dest_buf_u8, dest_buf_ori, w);
	lv_memcpy(dest_buf_u8, dest_buf_ori, w);

	memcpy(dest_buf, src_buf, w);
	lv_memcpy(dest_buf, src_buf, w);

	uint8_t dest_array_align16 = (uint8_t )memalign(16, STRIDE * HEIGHT * 3 + UNALIGN_BYTES);
	uint8_t dest_array_align16 = (uint8_t )memalign(16, STRIDE * HEIGHT * 3 * sizeof(uint8_t) + UNALIGN_BYTES);

feature(lvgl_port): RGB888 SIMD fill #480

Are you sure you want to change the base?

feature(lvgl_port): RGB888 SIMD fill #480

Conversation

pborcin commented Jan 14, 2025

ESP-BSP Pull Request checklist

Change description

peter-marcisovsky left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment