Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(lvgl_port): RGB888 SIMD fill #480

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pborcin
Copy link
Collaborator

@pborcin pborcin commented Jan 14, 2025

This MR adds SIMD support for LV_DRAW_SW_COLOR_BLEND_TO_RGB888

ESP-BSP Pull Request checklist

Note: For new BSPs create a PR with this link.

  • Version of modified component bumped
  • CI passing

Change description

  • esp32s3 SIMD assembly implementation of the RGB888 simple fill
  • esp32 assembly implementation of the RGB888 simple fill
  • Functionality and benchmark tests
  • Updated README with achieved benchmark results

Copy link
Collaborator

@peter-marcisovsky peter-marcisovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pborcin thanks for the changes. Nice work with the switch-case implementation. I have never tried that myself (did not have a proper use case) but this looks like a good one.

Sorry for the review comments regarding your style of commenting, but as you might have seen the assembly is very hard be read and to be navigated through and commenting makes it bit easier for us to know what is going on and which buffer are you aligning and where to are you jumping. Thank you.

I will also try the test app and let you know.

#include "lv_draw_sw_blend.h"
#include "lv_math.h"
#include "lv_color.h"
#include "string.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added lv_string.h in #472 from lvgl, since it contains some general HW optimized memcpy() and memset(), (lv_memcpy() and lv_memset()).

As already mentioned, once my MR is merged, you should update this file to use the LVGL's lv_memset() and lv_memcpy().

It is important, because right now, the std memcpy() from string.h is used is the LV_DRAW_SW_COLOR_BLEND_TO_RGB888 ANSI C implementation, the very implementation you are doing benchmark tests against.

Thus, the benchmark results of the LV_DRAW_SW_COLOR_BLEND_TO_RGB888 ANSI C implementation are not aligned with the LVGL's upstream ANSI C, and you are not getting "the real" benchmark results which you would normally get by running ANSI C blending functions in LVGL.

Comment on lines +15 to +16
| RGB888 | 128x128 | 16 byte | 0.608 | 2.247 |
| | 127x127 | 1 byte | 0.826 | 2.413 |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this results should be updated after using lv_memcpy()

Comment on lines +217 to +218
// TODO: lv_memcpy
memcpy(dest_buf_u8, dest_buf_ori, w);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: lv_memcpy
memcpy(dest_buf_u8, dest_buf_ori, w);
lv_memcpy(dest_buf_u8, dest_buf_ori, w);

if (LV_RESULT_INVALID == LV_DRAW_SW_RGB888_BLEND_NORMAL_TO_RGB888(dsc, dest_px_size, src_px_size)) {
if (src_px_size == dest_px_size) {
for (y = 0; y < h; y++) {
memcpy(dest_buf, src_buf, w);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
memcpy(dest_buf, src_buf, w);
lv_memcpy(dest_buf, src_buf, w);


TEST_CASE("LV Fill benchmark RGB888", "[fill][benchmark][RGB888]")
{
uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 + UNALIGN_BYTES);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 + UNALIGN_BYTES);
uint8_t *dest_array_align16 = (uint8_t *)memalign(16, STRIDE * HEIGHT * 3 * sizeof(uint8_t) + UNALIGN_BYTES);

TEST_CASE("Test fill functionality RGB888", "[fill][functionality][RGB888]")
{
test_matrix_params_t test_matrix = {
.min_w = 12, // 12 is the lower limit for the esp32s3 asm implementation, otherwise esp32 is executed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, just make sure, that the functionality test passes even with .min_w < 12 for esp32s3 assembly. Just check it locally (going from 1 to 12, or so)

Comment on lines +88 to +98
// Copy q3 to q0
ee.zero.q q0 // clear q0
ee.orq q0, q0, q3 // copy q3 to q0

// Copy q4 to q1
ee.zero.q q1 // clear q1
ee.orq q1, q1, q4 // copy q4 to q1

// Copy q5 to q2
ee.zero.q q2 // clear q2
ee.orq q2, q2, q5 // copy q5 to q2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Copy q3 to q0
ee.zero.q q0 // clear q0
ee.orq q0, q0, q3 // copy q3 to q0
// Copy q4 to q1
ee.zero.q q1 // clear q1
ee.orq q1, q1, q4 // copy q4 to q1
// Copy q5 to q2
ee.zero.q q2 // clear q2
ee.orq q2, q2, q5 // copy q5 to q2
// Clear q0 to q3
ee.zero.q q0 // clear q0
ee.zero.q q1 // clear q1
ee.zero.q q2 // clear q2
// Refresh q0 to q3
ee.orq q0, q0, q3 // copy q3 to q0
ee.orq q1, q1, q4 // copy q4 to q1
ee.orq q2, q2, q5 // copy q5 to q2

I would prefer something like that. Due to CPU pipelining.
You are writing to q0 by ee.zero.q q0 and then in the very next step you are reading from it by ee.orq q0, q0, q3, so you might cause the CPU to wait one cycle, because the value in the q0 is not yet ready to be read from.

The pipelining is also a reason why, for example, this main loop (and majority of the other main loops) looks the way it looks. It uses 2 q registers, instead of one to prevent CPU from waiting one cycle. I could have just easily load from memory to q0 and then subsequently save from q0 to memory. But it would cause significant delay, due to the pipelining.

In this very case, it's not a very big deal (because this is in the outer loop), but why don't we save few cycle if we can.

extui a8, a3, 0, 4 // a8 = a3 AND 0xf

// if a8 = 0 skip unalignment computation
bnez a8, _unaligned_dest_buff // If already aligned, jump to aligned
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, fix the comment.
If (what?) already aligned, jump to (where?) aligned

j _aligned_dest_buff
_unaligned_dest_buff:

// length
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here could be: calculate length of something?

quou a9, a10, a12 // a9 = local_dest_w_bytes (a10) DIV 48 (a12)
remu a10, a10, a12 // a10 = local_dest_w_bytes (a10) remainder div 48 (a12)

beqz a8, _dest_buff_aligned // If already aligned, skip aligning
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If already (what) aligned, skip aligning of (what)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants