my normalization implementation #4647
gpawru started this conversation in Show and tell
Replies: 1 comment 2 replies
-
Thanks for the note! Do you have a sense of where the performance difference might come from? Is it just that your data size is larger (100 kB is quite large)? I would be curious to see whether you can improve on ICU4X performance with smaller data files. Alternatively, it would be compelling to see numbers if you can share data between the four normalization routines instead of shipping big data for each one.
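One common source of performance differences that is independent of table size is how aggressive the fast path is. A minimal sketch of the idea (the function name is mine; real normalizers derive much wider quick-check ranges from the UCD quick-check properties, not just ASCII):

```rust
/// Fast path: ASCII text is invariant under all four normalization
/// forms, so it can be passed through without touching any data
/// tables. (Illustrative only; a real implementation would use the
/// UCD NFC_Quick_Check / NFD_Quick_Check properties instead.)
fn is_trivially_normalized(s: &str) -> bool {
    s.bytes().all(|b| b < 0x80)
}

fn main() {
    assert!(is_trivially_normalized("plain ASCII text"));
    assert!(!is_trivially_normalized("café")); // non-ASCII: run the full check
    println!("fast-path checks passed");
}
```

On mostly-Latin benchmark texts, how often this kind of check fires can matter as much as the layout of the data files themselves.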
-
Hello everyone!
First of all, thank you very much for what you do; it's inspiring! 🤟
Some time ago I decided to write a few articles about Unicode, and naturally I wrote several code examples for them. More specifically, I implemented my own normalization (collation is in progress; the articles are not yet published and are still with the editor, but that's all off-topic).
By chance I noticed this section and decided to share my results (maybe they will be useful to someone?). I would also like to ask for advice: is there anything worth correcting? Should I turn these drafts into a crate?
The comments in the code are in Russian (since the main purpose was to supplement the articles), but Google Translate will handle them :)
I decided to sacrifice a bit of data size for performance. Here are the sizes of the compressed data:
Tests used: the UCD tests, plus comparison of results against ICU4X; in the early stages, exhaustive tests over all combinations (not included in the repositories; SLOW, but they helped to discover #4527).
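For reference, a data line in the UCD's NormalizationTest.txt has five semicolon-separated columns (source; NFC; NFD; NFKC; NFKD), each a space-separated sequence of hex code points. A small parser sketch for such lines (the helper name is mine):

```rust
/// Parse one line of UCD NormalizationTest.txt into its five columns
/// (source, NFC, NFD, NFKC, NFKD), each decoded into a String.
/// Returns None for comments, @Part headers, and blank lines.
fn parse_test_line(line: &str) -> Option<Vec<String>> {
    let data = line.split('#').next().unwrap().trim();
    if data.is_empty() || data.starts_with('@') {
        return None;
    }
    data.split(';')
        .take(5)
        .map(|col| {
            col.split_whitespace()
                .map(|hex| u32::from_str_radix(hex, 16).ok().and_then(char::from_u32))
                .collect::<Option<String>>()
        })
        .collect()
}

fn main() {
    // "a + COMBINING ACUTE ACCENT" composes to U+00E1 (á).
    let cols = parse_test_line("0061 0301;00E1;0061 0301;00E1;0061 0301; # example").unwrap();
    assert_eq!(cols[0], "a\u{0301}");
    assert_eq!(cols[1], "\u{00E1}");
    assert!(parse_test_line("@Part0 # Specific cases").is_none());
    println!("parsed ok");
}
```

Running every line of this file through all four routines and comparing against the expected columns is the standard conformance check for a normalizer.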
For benchmarks, I took public-domain texts in the most common languages (for some languages, simply Google-translated), trimmed them to 100 kB, and made two versions of each: regular and predecomposed.
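The predecomposed variants could be produced with an NFD-style pass. A toy sketch with a hardcoded table for a handful of Latin letters (real code would use the full UnicodeData.txt decomposition mappings, apply them recursively, and canonically reorder combining marks):

```rust
/// Toy canonical decomposition for a few precomposed Latin letters.
/// A real NFD pass uses the complete UnicodeData.txt mappings,
/// applies them recursively, and sorts combining marks by their
/// canonical combining class.
fn decompose_char(c: char, out: &mut String) {
    match c {
        'é' => out.push_str("e\u{0301}"), // e + COMBINING ACUTE ACCENT
        'è' => out.push_str("e\u{0300}"), // e + COMBINING GRAVE ACCENT
        'ä' => out.push_str("a\u{0308}"), // a + COMBINING DIAERESIS
        _ => out.push(c),
    }
}

fn predecompose(s: &str) -> String {
    let mut out = String::with_capacity(s.len() * 2);
    for c in s.chars() {
        decompose_char(c, &mut out);
    }
    out
}

fn main() {
    assert_eq!(predecompose("café"), "cafe\u{0301}");
    println!("{}", predecompose("café").chars().count()); // prints 5
}
```

Benchmarking on both variants is useful because a decomposer sees almost no work on predecomposed input, while a composer sees the maximum amount.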
And finally the benchmarks:
Decomposing (µs):
Composing (µs):
Repositories:
Footnotes
dec: predecomposed texts used