MONGOCRYPT-759 Implement CFold #941

marksg07 · 2025-01-25T21:24:06Z

The new unicode/ directory implements folding the same way as mongo::unicode::String::caseFoldAndStripDiacritics in the server. The two _map.c files are generated by gen_[diacritic|casefold]_map.py in the server, with modifications so that they work in C. Note that these maps are not the same as the ones on the server: those use Unicode 8.0, while we use Unicode 13.0.0 (for the simple reason that this is the latest Unicode version supported by the version of Python we use for the server).
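For readers unfamiliar with folding, here is a minimal sketch of the idea, not the code in this PR. It works on already-decoded code points and uses toy stand-ins for the generated maps: `toy_to_lower` only handles ASCII, and `toy_is_diacritic` only recognizes the Combining Diacritical Marks block (U+0300 to U+036F). All three function names are hypothetical. The real maps cover far more, presumably including precomposed characters, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the generated maps (assumptions, not the real tables):
 * lowercasing covers ASCII A-Z only, and a "diacritic" is any code point in
 * the Combining Diacritical Marks block, U+0300 through U+036F. */
static uint32_t toy_to_lower(uint32_t cp) {
    return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

static bool toy_is_diacritic(uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Fold n code points from `in` into `out` (capacity at least n).
 * Returns the number of code points written. */
size_t toy_fold(const uint32_t *in, size_t n, uint32_t *out, bool to_lower, bool strip_diacritics) {
    assert(in && out);
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (strip_diacritics && toy_is_diacritic(cp)) {
            continue; /* drop the combining mark entirely */
        }
        out[written++] = to_lower ? toy_to_lower(cp) : cp;
    }
    return written;
}
```

Feeding it the code points `'C', 'a', 'f', 'E', U+0301` with both options enabled yields the four code points of "cafe".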

Unit tests now use randomly generated Unicode strings rather than the static strings we had before. My hope is that this will cover a wider variety of cases than I can think of myself.
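As a rough illustration of random Unicode string generation (a sketch under my own assumptions, not the generator used in the tests), the code below picks random scalar values, skips U+0000 and the surrogate range, and encodes each one as UTF-8. The names `encode_utf8`, `random_scalar`, and `random_utf8_string` are hypothetical, and `rand()` is used only for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Encode one Unicode scalar value as UTF-8.
 * Returns the number of bytes written (1 to 4). */
static size_t encode_utf8(uint32_t cp, char *out) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Pick a random scalar value, skipping U+0000 and the surrogate range. */
static uint32_t random_scalar(void) {
    for (;;) {
        /* rand() may yield as few as 15 random bits, so combine two calls. */
        uint32_t r = ((uint32_t)rand() << 15) ^ (uint32_t)rand();
        uint32_t cp = r % 0x110000;
        if (cp != 0 && !(cp >= 0xD800 && cp <= 0xDFFF)) {
            return cp;
        }
    }
}

/* Build a null-terminated UTF-8 string of n_codepoints random scalar values.
 * The caller frees the result; *out_len receives the byte length. */
char *random_utf8_string(size_t n_codepoints, size_t *out_len) {
    assert(out_len);
    char *buf = malloc(n_codepoints * 4 + 1); /* worst case: 4 bytes each */
    if (!buf) {
        return NULL;
    }
    char *it = buf;
    for (size_t i = 0; i < n_codepoints; i++) {
        it += encode_utf8(random_scalar(), it);
    }
    *it = '\0';
    *out_len = (size_t)(it - buf);
    return buf;
}
```

Seeding with a logged value (for example via `srand`) keeps a failing random case reproducible.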

erwee · 2025-01-30T17:59:01Z

src/unicode/fold.c

+        CLIENT_ERR("unicode_fold: Either case or diacritic folding must be enabled");
+        return false;
+    }
+    *out_str = bson_malloc(len);


Need to null-terminate the output.

Suggested change

-    *out_str = bson_malloc(len);
+    *out_str = bson_malloc(len + 1);

Now null-terminating.

erwee · 2025-01-30T18:00:53Z

src/unicode/fold.c

+    *out_len = (size_t)(output_it - *out_str);
+    *out_str = realloc(*out_str, *out_len);


I personally wouldn't bother with the realloc here just to shrink it. The folded string is not gonna be around long enough to make it worth the realloc cost.

Need to null-terminate the output string:

Suggested change

-    *out_len = (size_t)(output_it - *out_str);
-    *out_str = realloc(*out_str, *out_len);
+    *output_it = '\0';
+    *out_len = (size_t)(output_it - *out_str);
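The two suggestions in this thread fit together as one pattern: reserve one extra byte, write through an output iterator, terminate, then derive the length from the iterator without shrinking the buffer. A minimal sketch follows, assuming a hypothetical `ascii_fold` helper that lowercases ASCII only and uses plain `malloc` rather than `bson_malloc`. ASCII lowercasing never grows the string, so `len + 1` bytes suffice here; a real folding table may map a code point to a longer UTF-8 sequence, so the bound in fold.c may need to differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: lowercases ASCII letters in a length-delimited string.
 * On success, *out_str is a null-terminated buffer the caller must free, and
 * *out_len is its length excluding the terminator. */
bool ascii_fold(const char *str, size_t len, char **out_str, size_t *out_len) {
    assert(str && out_str && out_len);
    /* One extra byte for the terminator. */
    char *buf = malloc(len + 1);
    if (!buf) {
        return false;
    }
    char *output_it = buf;
    for (size_t i = 0; i < len; i++) {
        char c = str[i];
        if (c >= 'A' && c <= 'Z') {
            c = (char)(c - 'A' + 'a');
        }
        *output_it++ = c;
    }
    /* Terminate first, then compute the length; no realloc to shrink. */
    *output_it = '\0';
    *out_str = buf;
    *out_len = (size_t)(output_it - buf);
    return true;
}
```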

erwee · 2025-01-30T18:16:08Z

test/test-unicode-fold.c

+    const char nfd2[] = {'C', 'a', 'f', 'E', 0xcc, 0x81, 0};
+    const char nfd2_lower[] = {'c', 'a', 'f', 'e', 0xcc, 0x81, 0};
+    TEST_UNICODE_FOLD_ALL_CASES(nfd2, nfd2_lower, "CafE", "cafe");
+


Add a case with an embedded null byte:

Suggested change

+    TEST_UNICODE_FOLD("fo\0bar", 6, "fo\0bar", 6, kUnicodeFoldToLower | kUnicodeFoldRemoveDiacritics);
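The explicit lengths in that test case matter because `strlen` stops at the first null byte. A small sketch of a length-aware comparison, with `bytes_equal` as a hypothetical helper rather than anything from the test suite:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare two length-delimited byte strings.
 * Unlike strcmp, this does not stop at an embedded '\0'. */
int bytes_equal(const char *a, size_t a_len, const char *b, size_t b_len) {
    assert(a && b);
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}
```

For the literal `"fo\0bar"`, `strlen` reports 2 while `sizeof("fo\0bar") - 1` is 6, so a comparison based on `strlen` or `strcmp` would never look at the `bar` suffix.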

erwee

one more nit, and lgtm!

test/test-unicode-fold.c

kevinAlbs

LGTM with minor test suggestion.

test/test-mc-text-search-str-encode.c


marksg07 added 5 commits January 25, 2025 21:23

MONGOCRYPT-762 Implement CFold

f3482f0

conversion

fb060e8

format & drop helper

8f56363

conv

5a38547

fix format

6f5f2d9

marksg07 changed the title ~~MONGOCRYPT-762 Implement CFold~~ MONGOCRYPT-759 Implement CFold Jan 27, 2025

marksg07 added 4 commits January 27, 2025 19:33

better asserts

8fa7f4c

format

5346506

no-fold

47484e4

fix h

4be3d5b

marksg07 requested review from erwee and kevinAlbs January 27, 2025 20:55

marksg07 marked this pull request as ready for review January 27, 2025 20:55

kevinAlbs reviewed Jan 30, 2025

src/unicode/fold.c (outdated)

erwee requested changes Jan 30, 2025

More space allocation

5618b37

marksg07 requested review from erwee and kevinAlbs January 31, 2025 19:09

erwee requested changes Jan 31, 2025

test/test-unicode-fold.c

more cases

71997a7

marksg07 requested a review from erwee January 31, 2025 20:01

erwee approved these changes Jan 31, 2025

test

a2b4b40

kevinAlbs approved these changes Feb 3, 2025

test/test-mc-text-search-str-encode.c (outdated)

test/test-mc-text-search-str-encode.c (outdated)

marksg07 and others added 2 commits February 4, 2025 12:09

Update test/test-mc-text-search-str-encode.c

27fe10d

Co-authored-by: Kevin Albertson

Update test/test-mc-text-search-str-encode.c

4c5704e

Co-authored-by: Kevin Albertson

marksg07 merged commit facf082 into mongodb:master Feb 4, 2025
48 of 53 checks passed
