Tracking the correspondence between input text and output glyphs #31
Comments
Although the character->glyph relationship starts out roughly one-to-one, substitutions can break that in the following ways:
An application may want to make high level queries such as:
The shaping process needs to maintain enough correspondence between the input text and output glyphs that these questions can be answered, even if the answer is not always useful in degenerate cases, such as when the font has erased all of the glyphs or converted the entire text into a single ligature.
Idea for future investigation: associate an (index, length) pair with every character and every glyph, representing the first and last glyph corresponding to that character, and vice versa. This is only an approximation but potentially a useful one. Or perhaps a better starting point would be to consider the text buffer and the glyph buffer each split into contiguous subranges that map to each other, one character to one glyph in the simple case and potentially the entire input to the entire output in the case of complex script reordering, one giant ligature, or a pathological font like Addition.
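The contiguous-subrange idea above could be sketched roughly as follows. This is only an illustration: `Segment`, `glyphs_for_char`, `chars_for_glyph`, and `example_segments` are hypothetical names that do not exist in the crate, and indices are assumed to count characters and glyphs rather than bytes.

```rust
use std::ops::Range;

/// Hypothetical table entry: a contiguous run of input characters that
/// corresponds to a contiguous run of output glyphs.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub chars: Range<usize>,
    pub glyphs: Range<usize>,
}

/// Range of glyphs that the character at `char_index` contributed to, if any.
pub fn glyphs_for_char(segments: &[Segment], char_index: usize) -> Option<Range<usize>> {
    segments
        .iter()
        .find(|s| s.chars.contains(&char_index))
        .map(|s| s.glyphs.clone())
}

/// Range of characters that the glyph at `glyph_index` was derived from, if any.
pub fn chars_for_glyph(segments: &[Segment], glyph_index: usize) -> Option<Range<usize>> {
    segments
        .iter()
        .find(|s| s.glyphs.contains(&glyph_index))
        .map(|s| s.chars.clone())
}

/// Illustrative data only: four characters where the first three form a
/// single ligature glyph and the fourth maps one-to-one.
pub fn example_segments() -> Vec<Segment> {
    vec![
        Segment { chars: 0..3, glyphs: 0..1 }, // three characters -> one ligature
        Segment { chars: 3..4, glyphs: 1..2 }, // one character -> one glyph
    ]
}
```

With this shape, one giant ligature or a fully reordered run degrades to a single segment covering the whole input and the whole output, so queries still return an answer, just a coarse one.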
This is what the HarfBuzz Here's the section in our docs: See also the following which makes it closer to what you propose:
Thanks for the pointers @behdad, much appreciated.
Are there any unresolved questions here or is it just waiting on an implementation?
I think the approach that the implementation would take is still undecided.
What if you provided a "userdata" field with a trait so the user can decide how to handle splitting and combining? RawGlyph currently has a generic but it's never propagated into the You could add some sort of
Shaping proceeds roughly as follows:
Characters are mapped to glyphs via the cmap table on a one-to-one basis, with some exceptions and special cases such as variation selector characters and zero-width (non-)joiners, and possibly after Unicode normalisation has taken place.
The glyph array is rewritten by the substitution lookups found in the GSUB table, and other script-specific reordering may be applied.
The glyphs are finally positioned relative to each other by the positioning lookups found in the GPOS table and their intrinsic metrics such as advance width.
During this process we attempt to track the connection between the original text input and the glyphs by remembering which characters each glyph came from and updating that appropriately in response to ligature substitutions. This is enough to support the ToUnicode mapping needed by PDF files so that copy and paste works, but not adequate for interactive applications that need to handle caret positioning, text selection, or efficient line-breaking via shaping boundaries as described in #29 (which Prince could also benefit from).
As a contrived example, consider shaping the text "aba" and getting back glyphs [17-'b', 18-'a', 18-'a']. From this alone you can't tell which 'a' ended up where.
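To make the "aba" example concrete, here is a minimal sketch of glyphs that remember which input characters they came from. Everything in it is an assumption for illustration: `TrackedGlyph`, `map_chars`, `ligate`, and `toy_cmap` are hypothetical and are not the crate's `RawGlyph` or its API, the glyph ids 17 and 18 are taken from the example above, and the `swap` in the usage note merely stands in for whatever reordering the font performed.

```rust
/// Hypothetical glyph record: a glyph id plus the indices of the input
/// characters it was derived from.
#[derive(Debug, Clone, PartialEq)]
pub struct TrackedGlyph {
    pub glyph_id: u16,
    pub sources: Vec<usize>,
}

/// Initial one-to-one mapping; `lookup` stands in for a cmap lookup.
pub fn map_chars(text: &str, lookup: impl Fn(char) -> u16) -> Vec<TrackedGlyph> {
    text.chars()
        .enumerate()
        .map(|(i, c)| TrackedGlyph { glyph_id: lookup(c), sources: vec![i] })
        .collect()
}

/// Replace `glyphs[start..start + len]` with one ligature glyph whose
/// sources are the concatenation of the sources of the replaced glyphs.
pub fn ligate(glyphs: &mut Vec<TrackedGlyph>, start: usize, len: usize, ligature_id: u16) {
    let sources: Vec<usize> = glyphs
        .drain(start..start + len)
        .flat_map(|g| g.sources)
        .collect();
    glyphs.insert(start, TrackedGlyph { glyph_id: ligature_id, sources });
}

/// Toy character-to-glyph table using the ids from the example above.
pub fn toy_cmap(c: char) -> u16 {
    match c {
        'a' => 18,
        'b' => 17,
        _ => 0, // treated as "missing glyph" in this sketch
    }
}
```

Calling `map_chars("aba", toy_cmap)` and then `swap(0, 1)` on the result yields glyph ids 17, 18, 18 with sources `[1]`, `[0]`, `[2]`, so the two 'a' glyphs can be told apart even though their ids are identical. Whether a list of source indices per glyph is the right representation is exactly the open design question in this issue.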