ICU-22785 uprops.icu: coalesce scx+sc bits #3025

Merged
merged 2 commits into from
Jun 5, 2024

Conversation

markusicu
Member

@markusicu markusicu commented Jun 4, 2024

More uprops.icu property vectors cleanup.

  • Taking advantage of the uprops.icu major format version change to move the Script_Extensions and Script bit fields so that they are contiguous. No more functions to split & merge the bits into and from discontiguous fields.
  • With that, moving the East_Asian_Width bit field to avoid awkward gaps.
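To illustrate why contiguous fields remove the need for split and merge helpers, here is a minimal sketch. The bit positions, masks, and function names are hypothetical and chosen only for illustration; they are not the actual uprops.icu layout.

```cpp
#include <cstdint>

// HYPOTHETICAL layouts for illustration only; NOT the real uprops.icu masks.

// Discontiguous layout: the low 8 bits of a 10-bit code live in bits 7..0,
// and the high 2 bits live in bits 21..20 of the same 32-bit word.
constexpr uint32_t kSplitLowMask = 0x000000ff;
constexpr uint32_t kSplitHighMask = 0x00300000;
constexpr int kSplitHighShift = 12;  // moves bits 21..20 down to bits 9..8

// Reading the field requires merging the two pieces.
constexpr uint32_t mergeSplitField(uint32_t word) {
    return ((word & kSplitHighMask) >> kSplitHighShift) | (word & kSplitLowMask);
}

// Writing the field requires splitting the code into the two pieces.
constexpr uint32_t splitField(uint32_t code) {
    return ((code & 0x300) << kSplitHighShift) | (code & 0xff);
}

// Contiguous layout: the same 10-bit code lives in bits 9..0.
constexpr uint32_t kContiguousMask = 0x000003ff;

// Reading the field is a single mask; no helper pair is needed.
constexpr uint32_t getContiguousField(uint32_t word) {
    return word & kContiguousMask;
}
```

With the discontiguous layout, every reader and every builder has to go through the merge/split pair; with the contiguous layout, a single mask (plus a shift if the field does not start at bit 0) is enough.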

The second commit moves the CodePointTrie bit setter functions from emojipropsbuilder.cpp into the toolutil library and adds a getCPTrieSize() function. This should help with future experimentation. (Copied from the experimental #2926.)


I also experimented with moving the Age bits into yet another new trie (small, 8-bit CodePointTrie). (Very easy with that second commit.)

Good: Frees up another 8 bits in properties vector word 0.
Bad: Slightly increases the uprops.icu data size by some 4.5% compared to the current state, or 3.9% compared with when the Block bits were still in properties vector word 0.

With the current Unicode 16 alpha data, we would go from

trie size in bytes:                    47608
size in bytes of additional props trie:64624
number of additional props vectors:     2095
number of 32-bit words per vector:         3
number of 16-bit scriptExtensions:       314
size in bytes of Block trie:            7752
data size:                            145816

to

trie size in bytes:                    47608
size in bytes of additional props trie:60792
number of additional props vectors:     1502
number of 32-bit words per vector:         3
number of 16-bit scriptExtensions:       314
size in bytes of Block trie:            7752
size in bytes of Age trie:             17488
data size:                            152356
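As a sanity check on the two listings above, the difference in total data size can be reconstructed from the components that changed. This is only a sketch: it assumes each properties vector word occupies 4 bytes and that nothing besides the listed lines changed. The struct and function names are made up for this example.

```cpp
#include <cstdint>

// Numbers copied from the two listings above; names are hypothetical.
struct UpropsSizes {
    int32_t propsTrieBytes;  // "size in bytes of additional props trie"
    int32_t vectorCount;     // "number of additional props vectors"
    int32_t wordsPerVector;  // "number of 32-bit words per vector"
    int32_t ageTrieBytes;    // "size in bytes of Age trie" (0 if there is none)
    int32_t dataSize;        // "data size"
};

// Bytes used by the parts that differ between the two layouts.
// Assumption: each vector word is a 32-bit (4-byte) integer.
constexpr int32_t variableBytes(const UpropsSizes &s) {
    return s.propsTrieBytes + s.vectorCount * s.wordsPerVector * 4 + s.ageTrieBytes;
}

constexpr double percentIncrease(const UpropsSizes &before, const UpropsSizes &after) {
    return 100.0 * (after.dataSize - before.dataSize) / before.dataSize;
}

constexpr UpropsSizes kBefore = {64624, 2095, 3, 0, 145816};
constexpr UpropsSizes kAfter = {60792, 1502, 3, 17488, 152356};

// The props trie and vectors shrink, but the new Age trie outweighs the savings.
static_assert(variableBytes(kAfter) - variableBytes(kBefore) ==
                  kAfter.dataSize - kBefore.dataSize,
              "component deltas account for the total delta");
```

Under these assumptions the props trie shrinks by 3832 bytes and the vectors by 7116 bytes, while the Age trie adds 17488 bytes, for a net increase of 6540 bytes, which is about 4.5% of 145816.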

Since we don't need the additional bits yet, there is no immediate benefit to counterbalance the slight size increase, so I just added a comment.

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22785
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

@markusicu markusicu force-pushed the 22785-coalesce-scx-sc branch from 354a732 to 6473366 Compare June 4, 2024 22:41
@markusicu markusicu marked this pull request as ready for review June 4, 2024 22:42
@markusicu
Member Author

Dear reviewers: Fairly simple change that blocks the next round of Unicode 16 integration...

Contributor

@richgillam richgillam left a comment

LOKTM.

@markusicu
Member Author

Thanks @richgillam !
@echeran if you have additional feedback, I will be happy to follow up with another PR.

@markusicu markusicu merged commit 47e9389 into unicode-org:main Jun 5, 2024
103 checks passed
@markusicu markusicu deleted the 22785-coalesce-scx-sc branch June 5, 2024 01:51
@echeran
Contributor

echeran commented Jun 5, 2024

Yep, this LGTM.

In case we later revisit the idea of moving the Age property out of the uprops trie: the general principle we observe across all of the CodePointTries is that if we group properties that tend to correlate with each other, their tries will be smaller, because there are fewer ranges to deal with when building each trie. Of course, what produces the optimal size at any given time and for any given set of supported properties depends on multiple factors, e.g. the overhead of creating new tries vs. reusing existing ones.
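The correlation principle can be illustrated with a toy model that counts runs of equal adjacent values as a rough proxy for the number of ranges a trie builder has to handle. Everything here is hypothetical: the toy properties, the 8-bit packing, and the function names. A real CodePointTrie compacts blocks of values, so its actual size depends on more than the run count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts maximal runs of equal adjacent values (a proxy for "ranges").
std::size_t countRuns(const std::vector<uint32_t> &values) {
    if (values.empty()) { return 0; }
    std::size_t runs = 1;
    for (std::size_t i = 1; i < values.size(); ++i) {
        if (values[i] != values[i - 1]) { ++runs; }
    }
    return runs;
}

// Packs two per-code-point property values into one combined value,
// as if both properties shared one trie. Assumes b's values fit in 8 bits.
std::vector<uint32_t> combine(const std::vector<uint32_t> &a,
                              const std::vector<uint32_t> &b) {
    assert(a.size() == b.size());
    std::vector<uint32_t> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = (a[i] << 8) | b[i];
    }
    return out;
}

// Toy property: the value changes every `period` code points,
// with the boundaries shifted by `offset`.
std::vector<uint32_t> makeToyProperty(std::size_t length, std::size_t period,
                                      std::size_t offset) {
    std::vector<uint32_t> out(length);
    for (std::size_t i = 0; i < length; ++i) {
        out[i] = static_cast<uint32_t>(((i + offset) / period) & 0xff);
    }
    return out;
}
```

For example, over 64 code points, combining `makeToyProperty(64, 16, 0)` with a second property that changes at the same boundaries still yields 4 runs, while combining it with `makeToyProperty(64, 16, 8)`, whose boundaries fall halfway between, yields 8 runs.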
