Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize JSON string encoding #17734

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nielsdos
Copy link
Member

@nielsdos nielsdos commented Feb 8, 2025

  • Split out a specialized function to decode multibyte UTF-8 sequences

    Decoding purely multibyte UTF-8 is common for example in the case of
    JSON. Furthermore, we want to avoid the switch on the character set in
    such hot code. Finally, we also add UNEXPECTED markers to move code to
    the cold section which reduces pressure on the µop and instruction
    caches.

  • Optimize JSON string encoding

    There are a couple of optimizations that work together:

    • We now use the specialized php_next_utf8_char_mb() helper function to
      avoid pressure on the µop and instruction cache.
    • It no longer emits UTF-8 bytes under PHP_JSON_UNESCAPED_UNICODE until
      it actually has to. By emitting in bulk, this improves performance.
    • Code layout tweaks
      • Use a specialized php_json_append() and assertions to avoid
        allocating the initial buffer, as this is already done upfront.
      • Factor out the call to smart_str_extend() to above the UTF-16 check
        to avoid code bloat.
    • Use SIMD, either with SSE2 or SSE4.2. A resolver is used when SSE4.2
      is not configured at compile time.

Here's a graph where I test different input strings.
The tests take an input string and repeat it using str_repeat, and then perform 200000 calls to json_encode with no flags. If (no escape) is in the label, then we pass JSON_UNESCAPED_UNICODE as a flag.
The first four bars show the very short strings where performance improvements won't be able to do much and no SIMD is possible. The performance difference and overhead of this approach is IMO negligible in those cases.
Next we test some extreme cases. The best case happens when a string doesn't need escaping or anything special. We can see when we need to do escaping or UTF-8 validation/escaping that the performance win is still there, but not as pronounced.
In some extreme UTF-8 (éa*200) the performance is slightly lower but when compared to the total execution time this shouldn't be noticeable. The last two bars show that the performance when using JSON_UNESCAPED_UNICODE is improved as well, which is hopefully the common way people work with UTF-8 in JSON.

Benchmarks done on an i7-4790 on Linux with the SSE4.2 resolver (so not the theoretical optimum, but a likely default).

out

The benchmark gist is here: https://gist.github.com/nielsdos/51f32ca9d4610b802d0ea65e7af6ae16

Decoding purely multibyte UTF-8 is common for example in the case of
JSON. Furthermore, we want to avoid the switch on the character set in
such hot code. Finally, we also add UNEXPECTED markers to move code to
the cold section which reduces pressure on the µop and instruction
caches.
There are a couple of optimizations that work together:
- We now use the specialized php_next_utf8_char_mb() helper function to
  avoid pressure on the µop and instruction cache.
- It no longer emits UTF-8 bytes under PHP_JSON_UNESCAPED_UNICODE until
  it actually has to. By emitting in bulk, this improves performance.
- Code layout tweaks
  * Use a specialized php_json_append() and assertions to avoid
    allocating the initial buffer, as this is already done upfront.
  * Factor out the call to smart_str_extend() to above the UTF-16 check
    to avoid code bloat.
- Use SIMD, either with SSE2 or SSE4.2. A resolver is used when SSE4.2
  is not configured at compile time.
@bukka
Copy link
Member

bukka commented Feb 8, 2025

Nice! The benchmark is quite promising. I would prefer a bit clean up of the code so the ifdefs are not mixed up. It means creating some abstraction and having ifdefs just at the top or in different file.

@bukka
Copy link
Member

bukka commented Feb 8, 2025

But it's not a big deal if you can't be bothered to do the abstraction - certainly not a blocker - just thought that it might be nicer. If you are too busy and all tests fine, feel free to merge it and I might refactore it later maybe.

@bukka
Copy link
Member

bukka commented Feb 8, 2025

btw I think people more likely use default (including frameworks like Laravel) so I'm not sure if JSON_UNESCAPED_UNICODE is used that much...

@nielsdos
Copy link
Member Author

nielsdos commented Feb 8, 2025

I'll try to split it off / move some code more together soon-ish.

btw I think people more likely use default (including frameworks like Laravel) so I'm not sure if JSON_UNESCAPED_UNICODE is used that much...

Should be easy to check, I can check this.

@nielsdos
Copy link
Member Author

nielsdos commented Feb 8, 2025

so I'm not sure if JSON_UNESCAPED_UNICODE is used that much...

Indeed most uses within Laravel and Symfony don't use it, or let the user configure it, and only a handful of places use it. A bit unfortunate as they would miss out on a huge performance win (but necessary ofc if they want interop with different character encodings)

@nielsdos nielsdos marked this pull request as ready for review February 9, 2025 17:25
Comment on lines +683 to +686
pos = 0;
us = php_next_utf8_char_mb((unsigned char *)cur, us, len, &pos);
len -= pos;
pos += pos_old;
Copy link
Contributor

@mvorisek mvorisek Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pos = 0;
us = php_next_utf8_char_mb((unsigned char *)cur, us, len, &pos);
len -= pos;
pos += pos_old;
us = php_next_utf8_char_mb((unsigned char *)cur, us, len, &pos);
len -= pos - pos_old;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need pos to be 0 when the function call happens because len is the remaining length instead of the total length.

Comment on lines +250 to +252
/* It's more important to keep this loop tight than to optimize this with
* a trailing zero count. */
for (; mask; mask >>= 1, *s += 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abcd*400 takes 0.06 s (1600 bytes)
"longenclosedstring"*50 takes 0.22 s (1000 bytes)

thus when " is present, the encoding is 6x slower. This is actually not an uncommon case, like HTML/XML/JSON (or basically any code) encoded in JSON.

The worst unoptimized case would be mask 10..0..01. Is loop with zend_ulong_ntz really slower here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ulong_ntz approach was in my other branch. It was significantly slower for more dense bitset, perhaps there's some heuristic to switch between the 2. I also found that the smaller code size that this has was very important as well for the other inputs (even the ones without any escaping).

int options
)
{
while (*len >= sizeof(__m128i)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encoding one "0123456789abcde*1000" string will have absolutely different performance then array of "0123456789abcde" 1000 strings.

How feasible is something like mb_fast_check_utf8 does? - https://github.com/php/php-src/blob/php-8.4.3/ext/mbstring/mbstring.c#L5447

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blow up in code size may optimize that case but will likely deoptimize other cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

json_encode can use SIMD
3 participants