Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MONGOCRYPT-759 Implement CFold #941

Merged
merged 14 commits into from
Feb 4, 2025

Conversation

marksg07
Copy link
Collaborator

@marksg07 marksg07 commented Jan 25, 2025

The new unicode/ implements folding as implemented in mongo::unicode::String::caseFoldAndStripDiacritics in the server. The two _map.c files are generated by gen_[diacritic|casefold]_map.py in the server, with modifications so that they work in C. Note that these maps are not the same as the ones on the server: Those use unicode 8.0, while we use unicode 13.0.0 (for the simple reason that this is the latest unicode version supported by the version of python we use for the server).

We now do random unicode string generation for unit testing, rather than the static strings we had before. My hope is that this will be able to test a wider variety of cases than I myself can think of.

@marksg07 marksg07 changed the title MONGOCRYPT-762 Implement CFold MONGOCRYPT-759 Implement CFold Jan 27, 2025
@marksg07 marksg07 requested review from erwee and kevinAlbs January 27, 2025 20:55
@marksg07 marksg07 marked this pull request as ready for review January 27, 2025 20:55
src/unicode/fold.c Outdated Show resolved Hide resolved
src/unicode/fold.c Outdated Show resolved Hide resolved
src/unicode/fold.c Outdated Show resolved Hide resolved
CLIENT_ERR("unicode_fold: Either case or diacritic folding must be enabled");
return false;
}
*out_str = bson_malloc(len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to null-terminate output

Suggested change
*out_str = bson_malloc(len);
*out_str = bson_malloc(len + 1);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now null-terminating.

Comment on lines 132 to 133
*out_len = (size_t)(output_it - *out_str);
*out_str = realloc(*out_str, *out_len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally wouldn't bother with the realloc here just to shrink it. The folded string is not gonna be around long enough to make it worth the realloc cost.

Need to null terminate output string:

Suggested change
*out_len = (size_t)(output_it - *out_str);
*out_str = realloc(*out_str, *out_len);
*output_it = '\0';
*out_len = (size_t)(output_it - *out_str);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

const char nfd2[] = {'C', 'a', 'f', 'E', 0xcc, 0x81, 0};
const char nfd2_lower[] = {'c', 'a', 'f', 'e', 0xcc, 0x81, 0};
TEST_UNICODE_FOLD_ALL_CASES(nfd2, nfd2_lower, "CafE", "cafe");

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add:

Suggested change
TEST_UNICODE_FOLD("fo\0bar", 6, "fo\0bar", 6, kUnicodeFoldToLower | kUnicodeFoldRemoveDiacritics);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

test/test-unicode-fold.c Outdated Show resolved Hide resolved
test/test-unicode-fold.c Outdated Show resolved Hide resolved
@marksg07 marksg07 requested review from erwee and kevinAlbs January 31, 2025 19:09
Copy link
Collaborator

@erwee erwee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more nit, and lgtm!

test/test-unicode-fold.c Show resolved Hide resolved
@marksg07 marksg07 requested a review from erwee January 31, 2025 20:01
Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor test suggestion.

test/test-mc-text-search-str-encode.c Outdated Show resolved Hide resolved
test/test-mc-text-search-str-encode.c Outdated Show resolved Hide resolved
@marksg07 marksg07 merged commit facf082 into mongodb:master Feb 4, 2025
48 of 53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants