Tracking the correspondence between input text and output glyphs #31

Open
mikeday opened this issue Sep 8, 2020 · 8 comments

Comments

@mikeday
Contributor

mikeday commented Sep 8, 2020

Shaping proceeds roughly as follows:

  1. Characters are mapped to glyphs via the cmap table on a one-to-one basis, possibly after Unicode normalisation has taken place, with some exceptions and special cases such as variation selector characters and zero-width (non-)joiners.

  2. The glyph array is transformed by the substitution lookups found in the GSUB table, and other script-specific reordering may be applied.

  3. The glyphs are finally positioned relative to each other by the positioning lookups found in the GPOS table and their intrinsic metrics such as advance width.

During this process we attempt to track the connection between the original text input and the glyphs by remembering which characters each glyph came from and updating that appropriately in response to ligature substitutions. This is enough to support the ToUnicode mapping needed by PDF files so that copy and paste works, but not adequate for interactive applications that need to handle caret positioning, text selection, or efficient line-breaking via shaping boundaries as described in #29 (which Prince could also benefit from).

As a contrived example, consider shaping the text "aba" and getting back the glyphs [17-'b', 18-'a', 18-'a']. From this alone you can't tell which 'a' ended up where.
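To make the "aba" example concrete, here is a minimal sketch of what the glyph array looks like when each glyph carries the index of the character it came from. Everything in it is an assumption for illustration: `TrackedGlyph`, `toy_cmap`, `map_chars` and `move_to_front` are hypothetical names, the glyph ids are the ones from the example above, and the reordering step merely stands in for whatever a real shaper might do.

```rust
/// A glyph paired with the index of the character it came from.
/// Hypothetical type for illustration, not part of any existing API.
#[derive(Debug, Clone, PartialEq)]
struct TrackedGlyph {
    glyph_id: u16,
    /// Index (counted in chars) into the original text.
    source: usize,
}

/// Stand-in for a cmap lookup: 'a' -> 18, 'b' -> 17, anything else -> 0.
fn toy_cmap(ch: char) -> u16 {
    match ch {
        'a' => 18,
        'b' => 17,
        _ => 0,
    }
}

/// Step 1 of shaping: one glyph per character, remembering where it came from.
fn map_chars(text: &str) -> Vec<TrackedGlyph> {
    text.chars()
        .enumerate()
        .map(|(i, ch)| TrackedGlyph { glyph_id: toy_cmap(ch), source: i })
        .collect()
}

/// Toy reordering: move the glyph at `from` to the front of the array.
fn move_to_front(glyphs: &mut Vec<TrackedGlyph>, from: usize) {
    let glyph = glyphs.remove(from);
    glyphs.insert(0, glyph);
}
```

After `map_chars("aba")` and `move_to_front(&mut glyphs, 1)` the glyph ids read 17, 18, 18 as in the example, while the `source` fields read 1, 0, 2, which is exactly the information the bare glyph ids do not carry.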

@mikeday
Contributor Author

mikeday commented Sep 8, 2020

Although the character->glyph relationship starts out roughly one-to-one, substitutions can break that in the following ways:

  • Ligatures can replace multiple glyphs with one glyph. Worse, these don't need to be consecutive glyphs either! The text "abmc" where 'm' is an accent mark could be turned into [abc-ligature][m accent applying to second component].

  • One glyph can be replaced with multiple glyphs. This raises the question of which of these, if any, is now associated with the original character, even more so if some of the replacement glyphs then form ligatures with other glyphs.

  • Glyphs can be deleted entirely. This relies on technically unspecified behaviour of multiple substitution lookups, but all of the shaping engines support it.

  • Some mark glyphs can be split and/or reordered, for example "am" can become "ma", or even "[m1]a[m2]".
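The cases above can be sketched with a glyph type that carries a list of source character indices. This is only an illustration under assumptions: `Glyph`, `ligate` and `expand` are hypothetical names, and giving every replacement glyph a copy of the original's sources is just one possible answer to the "which of these" question raised above.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Glyph {
    id: u16,
    /// Character indices this glyph was derived from, kept sorted.
    sources: Vec<usize>,
}

/// Ligature substitution: replace the glyphs at `components` (ascending,
/// possibly non-adjacent positions) with a single glyph placed at the
/// position of the first component.
fn ligate(glyphs: &mut Vec<Glyph>, components: &[usize], ligature_id: u16) {
    if components.is_empty() {
        return;
    }
    let mut sources = Vec::new();
    // Remove from the back so that earlier positions stay valid.
    for &pos in components.iter().rev() {
        sources.extend(glyphs.remove(pos).sources);
    }
    sources.sort_unstable();
    glyphs.insert(components[0], Glyph { id: ligature_id, sources });
}

/// Multiple substitution: replace one glyph with several. Each replacement
/// inherits all of the original's sources. An empty `replacement_ids`
/// deletes the glyph, and its link to the text is lost.
fn expand(glyphs: &mut Vec<Glyph>, pos: usize, replacement_ids: &[u16]) {
    let original = glyphs.remove(pos);
    for (offset, &id) in replacement_ids.iter().enumerate() {
        glyphs.insert(pos + offset, Glyph { id, sources: original.sources.clone() });
    }
}
```

For "abmc", calling `ligate` with components `[0, 1, 3]` leaves two glyphs: the ligature with sources 0, 1, 3 and the mark with source 2. The ligature's sources are no longer a contiguous range of the text, which is part of what makes caret positioning hard.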

@mikeday
Contributor Author

mikeday commented Sep 8, 2020

An application may want to make high level queries such as:

  • Which glyph position best corresponds to a given insertion point in the text buffer?

  • Which range of glyphs best corresponds to a selection range in the text buffer?

  • Conversely, which insertion point or range in the text buffer best corresponds to a given glyph?

The shaping process needs to maintain sufficient correspondence between the input text and output glyphs that these questions can be answered, even if the answer may not always be particularly useful in the general case, such as if the font has erased all of the glyphs or converted the entire text into a single ligature.
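As a rough sketch of how such queries might look, assume the shaper returned a list of segments, each pairing a contiguous range of characters with a contiguous range of glyphs. `Segment` and the three functions are hypothetical names, and snapping a caret inside a ligature to the segment's leading edge is an arbitrary choice made for the example.

```rust
use std::ops::Range;

/// A range of characters and the range of glyphs it was shaped into.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    chars: Range<usize>,
    glyphs: Range<usize>,
}

/// Glyph position for an insertion point in the text buffer.
fn glyph_position_for_caret(segments: &[Segment], caret: usize) -> Option<usize> {
    for seg in segments {
        if seg.chars.contains(&caret) {
            // A caret inside a segment snaps to its leading edge.
            return Some(seg.glyphs.start);
        }
    }
    // A caret after the last character sits after the last glyph.
    match segments.last() {
        Some(seg) if seg.chars.end == caret => Some(seg.glyphs.end),
        _ => None,
    }
}

/// Glyph range covering every segment that overlaps a text selection.
fn glyphs_for_selection(segments: &[Segment], selection: Range<usize>) -> Option<Range<usize>> {
    let mut hits = segments
        .iter()
        .filter(|s| s.chars.start < selection.end && selection.start < s.chars.end);
    let first = hits.next()?;
    let last = hits.last().unwrap_or(first);
    Some(first.glyphs.start..last.glyphs.end)
}

/// Text range that a given glyph belongs to.
fn chars_for_glyph(segments: &[Segment], glyph: usize) -> Option<Range<usize>> {
    segments
        .iter()
        .find(|s| s.glyphs.contains(&glyph))
        .map(|s| s.chars.clone())
}
```

With one plain character, a two-character ligature and a character that expands to two glyphs, a caret between the ligature's two characters resolves to the ligature's glyph position, and a selection touching any part of the ligature selects the whole glyph. In the degenerate case of one giant ligature, every query simply returns the single segment.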

@mikeday
Contributor Author

mikeday commented Sep 8, 2020

Idea for future investigation: associate an (index, length) pair with every character and every glyph, representing the range of glyphs corresponding to that character, and vice versa. This is only an approximation but potentially a useful one.

Or perhaps a better starting point would be to consider the text buffer and the glyph buffer each split into contiguous subranges that map to each other, one character to one glyph in the simple case and potentially the entire input to the entire output in the case of complex script reordering, one giant ligature, or a pathological font like Addition.
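One way the second idea might be computed is sketched below, assuming each output glyph already carries a list of the character indices it came from and that the text is left-to-right, so that segments appear in the same order in both buffers. `Segment` and `split_into_segments` are hypothetical names. The sketch cuts after a glyph only when every later glyph comes from strictly later characters; characters that produced no glyph are absorbed into the following segment, and glyphs without any source join a neighbouring segment.

```rust
use std::ops::Range;

#[derive(Debug, Clone, PartialEq)]
struct Segment {
    chars: Range<usize>,
    glyphs: Range<usize>,
}

/// `sources[g]` lists the character indices that glyph `g` came from.
/// All indices are assumed to be smaller than `char_count`.
fn split_into_segments(char_count: usize, sources: &[Vec<usize>]) -> Vec<Segment> {
    let glyph_count = sources.len();
    // suffix_min[g] = smallest source index among glyphs g.. (MAX if none).
    let mut suffix_min = vec![usize::MAX; glyph_count + 1];
    for g in (0..glyph_count).rev() {
        let own = sources[g].iter().copied().min().unwrap_or(usize::MAX);
        suffix_min[g] = own.min(suffix_min[g + 1]);
    }

    let mut result = Vec::new();
    let (mut char_start, mut glyph_start) = (0usize, 0usize);
    // Largest source index seen since the last cut.
    let mut max_seen: Option<usize> = None;
    for (g, own) in sources.iter().enumerate() {
        max_seen = max_seen.max(own.iter().copied().max());
        let next_min = suffix_min[g + 1];
        if let Some(m) = max_seen {
            // Cut here if all later glyphs come from later characters.
            if next_min != usize::MAX && m < next_min {
                result.push(Segment { chars: char_start..(m + 1), glyphs: glyph_start..(g + 1) });
                char_start = m + 1;
                glyph_start = g + 1;
                max_seen = None;
            }
        }
    }
    // Whatever remains forms the final segment.
    if char_start < char_count || glyph_start < glyph_count {
        result.push(Segment { chars: char_start..char_count, glyphs: glyph_start..glyph_count });
    }
    result
}
```

In the simple case this yields one segment per character. For the non-contiguous ligature from "abmc" it collapses to a single segment covering the whole input and the whole output, which matches the worst case described above.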

@behdad

behdad commented Feb 22, 2021

This is what the HarfBuzz hb_glyph_info_t::cluster is about. I suggest you study what we do there.

Here's the section in our docs:
https://harfbuzz.github.io/clusters.html

See also the following which makes it closer to what you propose:
harfbuzz/harfbuzz#1392
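For readers who have not looked at the HarfBuzz approach, the core idea as described in its cluster documentation is that every glyph carries a cluster value (typically the index of a character in the input), and when glyphs are merged the result takes the minimum of the incoming values. The Rust below is only a sketch of that idea, not HarfBuzz code: `GlyphInfo` loosely mirrors the shape of `hb_glyph_info_t`, while `merge_clusters` and `cluster_chars` are hypothetical helpers written for this example.

```rust
use std::ops::Range;

#[derive(Debug, Clone, PartialEq)]
struct GlyphInfo {
    glyph_id: u16,
    /// Index of the first character of the cluster this glyph belongs to.
    cluster: u32,
}

/// Give every glyph in `start..end` the smallest cluster value in that range.
fn merge_clusters(glyphs: &mut [GlyphInfo], start: usize, end: usize) {
    let min = glyphs[start..end].iter().map(|g| g.cluster).min();
    if let Some(min) = min {
        for glyph in &mut glyphs[start..end] {
            glyph.cluster = min;
        }
    }
}

/// Characters covered by the cluster of the glyph at `index`: from its
/// cluster value up to the next larger cluster value, or the end of text.
fn cluster_chars(glyphs: &[GlyphInfo], index: usize, text_len: u32) -> Range<u32> {
    let start = glyphs[index].cluster;
    let end = glyphs
        .iter()
        .map(|g| g.cluster)
        .filter(|&c| c > start)
        .min()
        .unwrap_or(text_len);
    start..end
}
```

Merging the first three of four glyphs with clusters 0, 1, 2, 3 gives 0, 0, 0, 3, so the first cluster covers characters 0..3 and the last covers 3..4. Unlike the per-glyph source lists discussed earlier, this keeps a single integer per glyph at the cost of making clusters coarser whenever glyphs interact.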

@adrianwong
Member

Thanks for the pointers @behdad, much appreciated.

@LoganDark
Contributor

Are there any unresolved questions here or is it just waiting on an implementation?

@wezm
Contributor

wezm commented Jun 21, 2022

> Are there any unresolved questions here or is it just waiting on an implementation?

I think the approach that the implementation would take is still undecided.

@LoganDark
Contributor

LoganDark commented Jun 21, 2022

> > Are there any unresolved questions here or is it just waiting on an implementation?
>
> I think the approach that the implementation would take is still undecided.

What if you provided a "userdata" field with a trait so the user can decide how to handle splitting and combining?

RawGlyph currently has a generic parameter but it's never propagated into the Infos. Maybe you could make Info generic as well and allow the user to implement some trait to handle how things are propagated from the RawGlyphs?

You could add some sort of Userdata trait bound to RawGlyph. That would be another breaking change of course, but it would be a good one.

() could have an implementation that does nothing in all cases.
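A minimal sketch of what such a trait could look like follows. It is purely hypothetical: `Userdata`, `CharSpan`, `Glyph<T>` and `ligate` are names invented for this example, and `Glyph<T>` is a stand-in rather than the actual definition of `RawGlyph` or `Info`.

```rust
use std::ops::Range;

/// Hypothetical trait letting the caller decide how per-glyph data is
/// propagated when glyphs are combined or split during shaping.
trait Userdata: Clone {
    /// Several glyphs were combined into one, e.g. by a ligature.
    fn combine(parts: &[Self]) -> Self;
    /// One glyph was split into `count` glyphs; `index` says which one this is.
    fn split(&self, index: usize, count: usize) -> Self;
}

/// The unit type tracks nothing, so both operations are no-ops.
impl Userdata for () {
    fn combine(_parts: &[Self]) -> Self {}
    fn split(&self, _index: usize, _count: usize) -> Self {}
}

/// Example payload: the range of characters a glyph came from.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CharSpan {
    start: usize,
    end: usize,
}

impl Userdata for CharSpan {
    fn combine(parts: &[Self]) -> Self {
        let start = parts.iter().map(|p| p.start).min().unwrap_or(0);
        let end = parts.iter().map(|p| p.end).max().unwrap_or(0);
        CharSpan { start, end }
    }
    fn split(&self, _index: usize, _count: usize) -> Self {
        *self // every piece keeps the full span
    }
}

/// Stand-in glyph type carrying user data.
#[derive(Debug, Clone, PartialEq)]
struct Glyph<T: Userdata> {
    id: u16,
    data: T,
}

/// Replace the glyphs in `range` with one ligature glyph, combining their data.
fn ligate<T: Userdata>(glyphs: &mut Vec<Glyph<T>>, range: Range<usize>, ligature_id: u16) {
    let start = range.start;
    let parts: Vec<T> = glyphs.drain(range).map(|g| g.data).collect();
    glyphs.insert(start, Glyph { id: ligature_id, data: T::combine(&parts) });
}
```

With `CharSpan` as the payload, ligating two glyphs spanning characters 0..1 and 1..2 produces one glyph spanning 0..2. With `()` the same call compiles and runs but carries no tracking data, which is the "does nothing in all cases" behaviour suggested above.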
