Tracking the correspondence between input text and output glyphs #31

mikeday · 2020-09-08T01:52:35Z

Shaping proceeds roughly as follows:

Characters are mapped to glyphs via the cmap table, roughly one-to-one. Exceptions and special cases include variation selector characters and zero width (non-)joiners, and the mapping may take place after Unicode normalisation.
The glyph array is rewritten by the substitution lookups found in the GSUB table, and other script-specific reordering may be applied.
The glyphs are finally positioned relative to each other by the positioning lookups found in the GPOS table and their intrinsic metrics such as advance width.

During this process we attempt to track the connection between the original text input and the glyphs by remembering which characters each glyph came from and updating that appropriately in response to ligature substitutions. This is enough to support the ToUnicode mapping needed by PDF files so that copy and paste works, but not adequate for interactive applications that need to handle caret positioning, text selection, or efficient line-breaking via shaping boundaries as described in #29 (which Prince could also benefit from).

As a contrived example, consider shaping the text "aba" and getting back the glyphs [17-'b', 18-'a', 18-'a']. From this alone you can't tell which 'a' ended up where.
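To make the ambiguity concrete, here is a minimal sketch in which every glyph carries the indices of the characters it came from. `TrackedGlyph`, `map_chars`, and `toy_cmap` are hypothetical names for illustration only, not part of any existing API, and the reordering is simulated with a plain swap.

```rust
/// A glyph that remembers which input characters produced it.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct TrackedGlyph {
    glyph_id: u16,
    /// Indices into the original text, counted in chars rather than bytes.
    sources: Vec<usize>,
}

/// One-to-one cmap step: each character becomes one glyph tagged with
/// its own index in the text.
fn map_chars(text: &str, cmap: impl Fn(char) -> u16) -> Vec<TrackedGlyph> {
    text.chars()
        .enumerate()
        .map(|(i, ch)| TrackedGlyph {
            glyph_id: cmap(ch),
            sources: vec![i],
        })
        .collect()
}

/// Toy cmap using the glyph numbers from the example above.
fn toy_cmap(ch: char) -> u16 {
    match ch {
        'a' => 18,
        'b' => 17,
        _ => 0,
    }
}
```

Calling `map_chars("aba", toy_cmap)` and then swapping the first two glyphs yields glyph IDs [17, 18, 18], the same as in the example, but the `sources` fields read [1], [0], [2], so each 'a' can still be told apart.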

mikeday · 2020-09-08T02:51:08Z

Although the character->glyph relationship starts out roughly one-to-one, substitutions can break that in the following ways:

Ligatures can replace multiple glyphs with one glyph. Worse, these don't need to be consecutive glyphs either! The text "abmc" where 'm' is an accent mark could be turned into [abc-ligature][m accent applying to second component].
One glyph can be replaced with multiple glyphs. This raises the question of which of these, if any, is now associated with the original character, even more so if some of the replacement glyphs then form ligatures with other glyphs.
Glyphs can be deleted entirely, by a multiple substitution that replaces one glyph with an empty sequence. This relies on a technically unspecified behaviour of the multiple substitution, but all of the shaping engines support it.
Some mark glyphs can be split and/or reordered, for example "am" can become "ma", or even "[m1]a[m2]".
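The cases above can be sketched as operations on a glyph buffer in which each glyph carries a set of source character indices. This is only one possible bookkeeping policy: the names `Glyph`, `ligate`, and `expand` are hypothetical, and handing a deleted glyph's characters to a neighbouring glyph is an assumption made here so that no character becomes unmapped.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Glyph {
    id: u16,
    /// Sorted indices of the source characters (hypothetical bookkeeping).
    chars: Vec<usize>,
}

/// Ligature: replace the glyphs at `positions` with a single glyph.
/// `positions` must be non-empty and ascending; it need not be consecutive.
fn ligate(buf: &mut Vec<Glyph>, positions: &[usize], ligature_id: u16) {
    let mut chars = Vec::new();
    for &p in positions {
        chars.extend_from_slice(&buf[p].chars);
    }
    chars.sort_unstable();
    chars.dedup();
    // Remove from the back so that the earlier positions stay valid.
    for &p in positions[1..].iter().rev() {
        buf.remove(p);
    }
    buf[positions[0]] = Glyph { id: ligature_id, chars };
}

/// Multiple substitution: one glyph becomes several, or none (deletion).
fn expand(buf: &mut Vec<Glyph>, position: usize, replacement_ids: &[u16]) {
    let removed = buf.remove(position);
    if replacement_ids.is_empty() {
        // Assumed policy: give the orphaned characters to the previous
        // glyph, or to the next one if the deleted glyph was first.
        let neighbour = if position > 0 {
            Some(position - 1)
        } else if !buf.is_empty() {
            Some(0)
        } else {
            None
        };
        if let Some(n) = neighbour {
            buf[n].chars.extend(removed.chars);
            buf[n].chars.sort_unstable();
            buf[n].chars.dedup();
        }
        return;
    }
    // Every replacement glyph inherits the full source set of the original.
    for (offset, &id) in replacement_ids.iter().enumerate() {
        buf.insert(
            position + offset,
            Glyph { id, chars: removed.chars.clone() },
        );
    }
}
```

Reordering needs no extra code in this model, since `buf.swap(i, j)` moves the source indices along with the glyph. For the "abmc" case, `ligate(&mut buf, &[0, 1, 3], id)` leaves a ligature with sources [0, 1, 3] followed by the mark with source [2].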

mikeday · 2020-09-08T04:35:58Z

An application may want to make high level queries such as:

Which glyph position best corresponds to a given insertion point in the text buffer?
Which range of glyphs best corresponds to a selection range in the text buffer?
Conversely, which insertion point or range in the text buffer best corresponds to a given glyph?

The shaping process needs to maintain sufficient correspondence between the input text and output glyphs that these questions can be answered, even if the answer may not always be particularly useful in the general case, such as if the font has erased all of the glyphs or converted the entire text into a single ligature.
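As a rough illustration of those queries, the sketch below assumes the shaper has already recorded, for each output glyph, the range of input characters it came from. That representation and the function names are hypothetical. Glyph results are returned as index lists rather than ranges because reordering can make the matching glyphs non-contiguous.

```rust
use std::ops::Range;

/// Which glyphs correspond to a selection in the text buffer?
/// `sources[g]` is the character range that glyph `g` came from.
fn glyphs_for_selection(sources: &[Range<usize>], selection: Range<usize>) -> Vec<usize> {
    sources
        .iter()
        .enumerate()
        // Keep glyphs whose source range overlaps the selection.
        .filter(|(_, r)| r.start < selection.end && selection.start < r.end)
        .map(|(g, _)| g)
        .collect()
}

/// Which glyph corresponds to an insertion point before character `caret`?
/// Returns None when no glyph covers that character, for example at the
/// end of the text.
fn glyph_for_insertion_point(sources: &[Range<usize>], caret: usize) -> Option<usize> {
    sources.iter().position(|r| r.start <= caret && caret < r.end)
}

/// Conversely, which characters correspond to a given glyph?
fn chars_for_glyph(sources: &[Range<usize>], glyph: usize) -> Option<Range<usize>> {
    sources.get(glyph).cloned()
}
```

When the caret falls inside a ligature, `glyph_for_insertion_point` can only name the ligature glyph; placing the caret within it would need extra information, such as per-component caret positions from the font.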

mikeday · 2020-09-08T11:10:39Z

Idea for future investigation: associate an (index, length) pair with every character and every glyph. For a character, the pair gives the range of glyphs corresponding to it; for a glyph, the range of characters. This is only an approximation, since the glyphs or characters involved need not be contiguous, but potentially a useful one.
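A minimal sketch of the (index, length) idea, assuming each glyph's exact set of source characters is known at the end of shaping. `Span`, `glyph_spans`, and `char_spans` are hypothetical names. The example also shows where the approximation loses information.

```rust
/// An (index, length) pair covering a contiguous run.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    index: usize,
    length: usize,
}

/// Smallest span covering every index produced by the iterator.
fn span_of(indices: impl Iterator<Item = usize>) -> Option<Span> {
    let mut bounds: Option<(usize, usize)> = None;
    for i in indices {
        bounds = Some(match bounds {
            None => (i, i),
            Some((lo, hi)) => (lo.min(i), hi.max(i)),
        });
    }
    bounds.map(|(lo, hi)| Span { index: lo, length: hi - lo + 1 })
}

/// For each glyph, the span of characters it corresponds to.
fn glyph_spans(glyph_chars: &[Vec<usize>]) -> Vec<Option<Span>> {
    glyph_chars
        .iter()
        .map(|chars| span_of(chars.iter().copied()))
        .collect()
}

/// For each character, the span of glyphs that correspond to it.
fn char_spans(glyph_chars: &[Vec<usize>], char_count: usize) -> Vec<Option<Span>> {
    (0..char_count)
        .map(|c| {
            span_of(
                glyph_chars
                    .iter()
                    .enumerate()
                    .filter(|(_, chars)| chars.contains(&c))
                    .map(|(g, _)| g),
            )
        })
        .collect()
}
```

For the "abmc" ligature case, with glyph sources `[[0, 1, 3], [2]]`, the ligature's span is (0, 4). That span includes character 2 even though the mark is a separate glyph, which is the kind of imprecision the approximation accepts.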

Or perhaps a better starting point would be to consider the text buffer and the glyph buffer each split into contiguous subranges that map to each other, one character to one glyph in the simple case and potentially the entire input to the entire output in the case of complex script reordering, one giant ligature, or a pathological font like Addition.
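The subrange idea could be computed along these lines. This is a sketch under two assumptions: the text is left-to-right, so glyph order broadly follows character order, and each glyph's source characters are known. `Segment` and `segments` are hypothetical names. A cut is placed after a glyph only when every glyph to its right comes from strictly later characters.

```rust
use std::ops::Range;

/// A run of characters and the run of glyphs it maps to.
#[derive(Debug, PartialEq)]
struct Segment {
    chars: Range<usize>,
    glyphs: Range<usize>,
}

fn segments(glyph_chars: &[Vec<usize>], char_count: usize) -> Vec<Segment> {
    let n = glyph_chars.len();
    // suffix_min[g]: smallest character index used by glyphs g.. (or char_count).
    let mut suffix_min = vec![char_count; n + 1];
    for g in (0..n).rev() {
        let own = glyph_chars[g].iter().copied().min().unwrap_or(char_count);
        suffix_min[g] = own.min(suffix_min[g + 1]);
    }

    let mut out = Vec::new();
    let (mut char_start, mut glyph_start) = (0, 0);
    // One past the largest character index used by the glyphs seen so far.
    let mut prefix_end = 0;
    for g in 0..n {
        if let Some(&max) = glyph_chars[g].iter().max() {
            prefix_end = prefix_end.max(max + 1);
        }
        // Cut here if the remaining glyphs only use later characters and
        // the segment being closed covers at least one character.
        if g + 1 < n && prefix_end <= suffix_min[g + 1] && prefix_end > char_start {
            out.push(Segment {
                chars: char_start..prefix_end,
                glyphs: glyph_start..g + 1,
            });
            char_start = prefix_end;
            glyph_start = g + 1;
        }
    }
    // Whatever remains forms the final segment.
    out.push(Segment {
        chars: char_start..char_count,
        glyphs: glyph_start..n,
    });
    out
}
```

With one-to-one sources such as `[[0], [1], [2]]` this gives three single-character segments. With the "abmc" ligature sources `[[0, 1, 3], [2]]`, or a reordering such as `[[1], [0]]`, it collapses to a single segment covering the whole input and the whole output.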

behdad · 2021-02-22T19:52:14Z

This is what the HarfBuzz hb_glyph_info_t::cluster is about. I suggest you study what we do there.

Here's the section in our docs:
https://harfbuzz.github.io/clusters.html

See also the following which makes it closer to what you propose:
harfbuzz/harfbuzz#1392
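For readers unfamiliar with clusters, here is a toy model of the idea in Rust. It is not HarfBuzz's API: `GlyphInfo`, `merge_clusters`, and `cluster_char_ranges` are hypothetical names. It models only the documented rule that merged glyphs share the minimum of their cluster values, and it assumes cluster values start out as character indices and stay in ascending order.

```rust
use std::ops::Range;

/// Toy stand-in for a shaped glyph with a cluster value.
#[derive(Debug, Clone, PartialEq)]
struct GlyphInfo {
    glyph_id: u32,
    cluster: u32,
}

/// Merge the glyphs in `range` into one cluster by giving them all the
/// smallest cluster value found in that range.
fn merge_clusters(infos: &mut [GlyphInfo], range: Range<usize>) {
    if let Some(min) = infos[range.clone()].iter().map(|g| g.cluster).min() {
        for info in &mut infos[range] {
            info.cluster = min;
        }
    }
}

/// Recover the character range of each cluster, assuming ascending
/// cluster values. Each cluster runs up to the start of the next one,
/// and the last runs to the end of the text.
fn cluster_char_ranges(infos: &[GlyphInfo], text_len: u32) -> Vec<(u32, u32)> {
    let mut starts: Vec<u32> = infos.iter().map(|g| g.cluster).collect();
    starts.dedup();
    starts
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let end = starts.get(i + 1).copied().unwrap_or(text_len);
            (start, end)
        })
        .collect()
}
```

A single integer per glyph is compact, but it records only where a cluster starts. The end of each cluster has to be inferred from its neighbours, which is part of why this differs from the explicit range proposals earlier in the thread.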

adrianwong · 2021-02-22T20:05:52Z

Thanks for the pointers @behdad, much appreciated.

LoganDark · 2022-06-12T06:57:18Z

Are there any unresolved questions here or is it just waiting on an implementation?

wezm · 2022-06-21T06:46:37Z

> Are there any unresolved questions here or is it just waiting on an implementation?

I think the approach that the implementation would take is still undecided.

LoganDark · 2022-06-21T11:15:13Z

> > Are there any unresolved questions here or is it just waiting on an implementation?
>
> I think the approach that the implementation would take is still undecided.

What if you provided a "userdata" field with a trait so the user can decide how to handle splitting and combining?

RawGlyph currently has a generic but it's never propagated into the Infos. Maybe you could make Info generic as well and allow the user to implement some trait to handle how things are propagated from the RawGlyphs?

You could add some sort of Userdata trait bound to RawGlyph, that would be another breaking change of course. But it would be a good one.

() could have an implementation that does nothing in all cases.
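One way the proposal might look, purely as a sketch. The `Userdata` trait, its method names, and the simplified `Glyph` struct are all hypothetical and do not reflect the crate's current types. Only the idea of a no-op implementation for `()` comes from the comment above.

```rust
/// Hypothetical trait letting the caller decide how per-glyph data
/// propagates through substitutions.
trait Userdata: Sized {
    /// Several glyphs were replaced by one (ligature).
    fn combine(parts: &[Self]) -> Self;
    /// One glyph was replaced by `count` glyphs.
    fn split(&self, count: usize) -> Vec<Self>;
}

/// The unit type does nothing in all cases.
impl Userdata for () {
    fn combine(_parts: &[Self]) -> Self {}
    fn split(&self, count: usize) -> Vec<Self> {
        vec![(); count]
    }
}

/// Example user data: the range of source characters for a glyph.
#[derive(Debug, Clone, PartialEq)]
struct CharRange {
    start: usize,
    end: usize,
}

impl Userdata for CharRange {
    fn combine(parts: &[Self]) -> Self {
        // Cover everything the component glyphs covered.
        let start = parts.iter().map(|p| p.start).min().unwrap_or(0);
        let end = parts.iter().map(|p| p.end).max().unwrap_or(0);
        CharRange { start, end }
    }
    fn split(&self, count: usize) -> Vec<Self> {
        // Every replacement glyph keeps the original range.
        vec![self.clone(); count]
    }
}

/// Simplified stand-in for a glyph that carries user data.
#[derive(Debug, Clone, PartialEq)]
struct Glyph<U: Userdata> {
    glyph_index: u16,
    userdata: U,
}

/// What a ligature substitution would do with the user data.
fn ligate<U: Userdata + Clone>(glyphs: &[Glyph<U>], ligature_index: u16) -> Glyph<U> {
    let parts: Vec<U> = glyphs.iter().map(|g| g.userdata.clone()).collect();
    Glyph {
        glyph_index: ligature_index,
        userdata: U::combine(&parts),
    }
}
```

A trait like this keeps the shaper agnostic about what is being tracked. The cost is that every substitution path in the shaper has to call the right hook, including deletion and reordering, which this sketch does not cover.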

wezm mentioned this issue Dec 8, 2020

Make Adjust::apply, Placement::combine_distance and Placement::combine_anchor public #36

Open

wezm mentioned this issue Jan 3, 2023

Lifetimes are extremely difficult (impossible?) to navigate #52

Open

wezm mentioned this issue Oct 28, 2024

view: Cycle through multiple fg colours if supplied yeslogic/allsorts-tools#42

Open
