Integrate harfbuzz for text shaping in fpdf #696

andersonhc · 2023-02-16T16:28:00Z

andersonhc
Feb 16, 2023
Maintainer

I am trying to integrate Harfbuzz (https://en.wikipedia.org/wiki/HarfBuzz) with fpdf as a possible solution to our text shaping problems (diacritics, ligatures, kerning, left-to-right vs right-to-left, etc.)

The harfbuzz project is open-source, MIT licensed and actively developed (https://github.com/harfbuzz/harfbuzz), and the uharfbuzz package with the Cython bindings is available on PyPI (https://github.com/harfbuzz/uharfbuzz), so it's straightforward to use.

My first step was building a proof-of-concept version of fpdf with harfbuzz. I ran into several problems because fpdf is built over a "one character = one glyph" concept that is not compatible with properly shaped text.
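To make the mismatch concrete, here is a minimal stdlib-only sketch. It is not harfbuzz and not code from the proof-of-concept: it only groups each base character with the combining marks that follow it, which is a rough approximation of what a shaper treats as one unit. The helper name `rough_clusters` is made up for this illustration.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: a real shaper such as harfbuzz also handles ligatures,
    reordering and script-specific rules.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char  # attach the mark to the previous base char
        else:
            clusters.append(char)
    return clusters


# "e" + COMBINING ACUTE ACCENT, then "a" + COMBINING DIAERESIS
sample = "e\u0301a\u0308"
print(len(sample))                  # number of code points in the string
print(len(rough_clusters(sample)))  # number of units a reader perceives
```

Code that advances the cursor once per character would place the accents in this string as if they were separate letters, which is the kind of problem reported in the open issues.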

The proof-of-concept version I built is here: https://github.com/andersonhc/fpdf2/tree/harfbuzz

This version is far from being production ready. I added a harfbuzz_text() function that works like text() but uses the shaper. I am attaching some files comparing text with and without harfbuzz, taken from some of our open issues.

There are a lot of questions we need to discuss before moving forward. Some of them are:

Is harfbuzz the right option for text shaping in fpdf? Is the fact that it is a C program with Cython bindings going to limit fpdf use?
How can we integrate the shaper? Use it as default? Optional like markdown?

I'd love to hear @Lucas-C and @gmischler input.

These are the tests I did. The code I used to generate the files was:

from fpdf import FPDF

filename = "test-arabic-rtl.pdf"
font_file = "amiri-regular.ttf"
some_text = "هذه بعض النصوص العربية"

pdf = FPDF()
pdf.add_font("f", fname=font_file)
pdf.add_page()

# Labels, written with a core font
pdf.set_font("Helvetica", "", 14)
pdf.text(x=10, y=10, txt="Only FPDF")
pdf.text(x=10, y=40, txt="With harfbuzz")

# The same text rendered twice: regular text() vs. the shaper
pdf.set_font("f", "", 14)
pdf.text(x=10, y=24, txt=some_text)
pdf.harfbuzz_text(x=10, y=54, txt=some_text)  # only exists in the harfbuzz branch
pdf.output(filename)

Testing with Fira Code (plenty of ligatures):
test-firacode.pdf

Testing hindi text as reported on #365
test-365-hindi.pdf

Testing hebrew from #549
test-549-hebrew.pdf
test-hebrew-549-2.pdf

Testing Tibetan from #679
test-679-tibetan.pdf

Testing arabic right-to-left text
test-arabic-rtl.pdf

mrchoke · 2023-02-19T01:15:38Z

mrchoke
Feb 19, 2023

This is a Thai text test for fpdf2 with harfbuzz. The shaping looks pretty good, but it randomly fails to render some text.

https://github.com/mrchoke/fpdf2-thai-harfbuzz-test

2 replies

andersonhc Feb 21, 2023
Maintainer Author

Can you try again? I think I fixed the problem.
Thanks for helping test this.

mrchoke Feb 21, 2023

Can you try again? I think I fixed the problem. Thanks for helping test this.

Wow, it's perfect now!! It can render Thai shaping.

Thank you @andersonhc, I'll wait for the official release.

Lucas-C · 2023-02-22T08:50:21Z

Lucas-C
Feb 22, 2023
Maintainer

Thank you for your very promising work on this @andersonhc!
I am procrastinating a bit but I will take the time to answer you in depth soon 😊

0 replies

gmischler · 2023-02-22T14:02:38Z

gmischler
Feb 22, 2023
Maintainer

Am I understanding this correctly, in that it currently only produces single lines of text?
If we decide to go the path of using harfbuzz, I think we should ultimately integrate it into our line wrapping system.
For the moment, I could see the single-line approach as a workaround implemented in a subclass (like write_html() used to be). That way we don't create internal code that would become redundant to a final solution.

It looks like we are ending up with redundant font data already. self.current_font.hbfont is the harfbuzz version of the same data as self.current_font.font gathered through fonttools. Would there be a reasonable way for [u]harfbuzz to completely replace fonttools at some point?

We'll have to study the output of harfbuzz in more detail in order to figure out what a full integration could look like.
This will likely involve a hierarchy of classes that replace the individual characters in a Fragment(). Each instance of one of those classes might represent either a single "normal" character, a ligature sequence, a character with (possibly stacked) accents, or other typographic peculiarities and any combinations thereof. If harfbuzz can help us to populate those instances with the right data, then that would be great!

2 replies

andersonhc Feb 22, 2023
Maintainer Author

The current version is a quick proof of concept version, that's why it's only working with single lines and has the redundancies.

Harfbuzz provides the same functionality Fonttools does (font information, glyph widths, subsetting, etc) so we could probably abstract those functions and the user can choose which one to use.

gmischler Feb 22, 2023
Maintainer

The current version is a quick proof of concept version, that's why it's only working with single lines and has the redundancies.

It's definitely a very promising start!
I was just stating my observations/questions, not necessarily complaining... 😉
You've taken on a big task, so proceeding in verifiable steps makes a lot of sense.

One of the next challenges will be to not only just move around glyphs, but also preserve some structural information in the process, which will be needed both for line wrapping and for the ability to copy the original Unicode text from the PDF when e.g. ligatures are displayed in its place. I see that harfbuzz can tell us the width of a given string, which is helpful. We might want to try to preserve such information in as fine a granularity as reasonable (ideally down to the "cluster").
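As a rough illustration of why per-cluster information helps, here is a small sketch of greedy line wrapping that never splits a cluster and still lets us recover the original text of each line. The `ShapedCluster` record, the `wrap_clusters` function and the widths in the demo are all invented for this example; none of them exist in fpdf2 or harfbuzz.

```python
from dataclasses import dataclass


@dataclass
class ShapedCluster:
    """Hypothetical record: the source text of one cluster and its shaped width."""
    text: str     # original Unicode characters, kept for copy/paste
    width: float  # total advance of the cluster's glyphs (made-up units)


def wrap_clusters(clusters, max_width):
    """Greedy wrapping: break only at whitespace clusters, never inside a cluster."""
    lines, line, line_width = [], [], 0.0
    last_space = None  # index in `line` of the most recent breakable space
    for cluster in clusters:
        line.append(cluster)
        line_width += cluster.width
        if cluster.text.isspace():
            last_space = len(line) - 1
        if line_width > max_width and last_space is not None:
            lines.append(line[:last_space])  # close the line before the space
            line = line[last_space + 1:]     # carry the remainder over
            line_width = sum(c.width for c in line)
            last_space = None
    if line:
        lines.append(line)
    return lines


# "ffi" and "fi" stand for ligatures: one cluster each, several source characters.
demo = [
    ShapedCluster("o", 6), ShapedCluster("ffi", 9), ShapedCluster("c", 5),
    ShapedCluster("e", 5), ShapedCluster(" ", 3), ShapedCluster("fi", 6),
    ShapedCluster("t", 4),
]
for wrapped in wrap_clusters(demo, max_width=28):
    print("".join(c.text for c in wrapped))
```

Because each cluster keeps its source characters, the text of a line can be rebuilt exactly even where a single ligature glyph is drawn in place of several characters.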

Harfbuzz provides the same functionality Fonttools does (font information, glyph widths, subsetting, etc) so we could probably abstract those functions and the user can choose which one to use.

That might be an interesting possibility. It would essentially give a choice between a "full featured" vs. a "pure python" version of fpdf2.
With one caveat, though: We currently also use the SVG path parser from fonttools. I don't see harfbuzz offering that at first glance, so either we'd have to require fonttools in any case or we go back to the separate svg.path library.
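Just to sketch what such a choice could look like: the snippet below picks a "full featured" backend when uharfbuzz is importable and falls back to a "pure python" one otherwise. Every name in it (`FontToolsBackend`, `HarfbuzzBackend`, `pick_backend`) is hypothetical and only serves the illustration; nothing like this exists in fpdf2 today.

```python
import importlib.util


class FontToolsBackend:
    """Hypothetical fallback: font metrics and subsetting, no shaping."""
    name = "fonttools"
    supports_shaping = False


class HarfbuzzBackend:
    """Hypothetical backend relying on the compiled uharfbuzz package."""
    name = "uharfbuzz"
    supports_shaping = True


def _module_available(module_name):
    # find_spec() returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


def pick_backend(prefer_shaping=True, module_available=_module_available):
    """Return the shaping backend if wanted and installed, else the fallback."""
    if prefer_shaping and module_available("uharfbuzz"):
        return HarfbuzzBackend()
    return FontToolsBackend()


backend = pick_backend()
print(backend.name, backend.supports_shaping)
```

The `module_available` parameter is only there so the selection logic can be exercised without installing or uninstalling anything.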

Lucas-C · 2023-02-27T13:49:20Z

Lucas-C
Feb 27, 2023
Maintainer

I am trying to integrate Harfbuzz (en.wikipedia.org/wiki/HarfBuzz) with fpdf as a possible solution to our text shaping problems (diacritics, ligatures, kerning, left-to-right vs right-to-left, etc.)

The harfbuzz project is open-source, MIT licensed and actively developed (harfbuzz/harfbuzz), and the uharfbuzz package with the Cython bindings is available on PyPI (harfbuzz/uharfbuzz), so it's straightforward to use.

This sounds really great!
Thank you for looking into this 😊
Based on the issues we had in the past, text shaping is probably one of the most awaited features, and a solution providing better support for it (especially ligatures) would please many fpdf2 users!

harfbuzz seems like a good pick.
Out of curiosity, have you picked it because you used it in the past?
Or after comparison with other libs?

My first step was building a proof-of-concept version of fpdf with harfbuzz. I ran into several problems because fpdf is built over a "one character = one glyph" concept that is not compatible with properly shaped text.

Yes, I see how deep this assumption is nested inside fpdf2...

The proof-of-concept version I built is here: andersonhc/fpdf2@harfbuzz

I am attaching some files comparing text with and without harfbuzz taken from some of our open issues.

Thanks!

I had a look and it's very promising!

the refactoring into fpdf/fonts.py is nice and structures the code better (as a side note, I also touched this module in PR Implement FPDF.table() - close #701 & #723 #703: https://github.com/PyFPDF/fpdf2/blob/table/fpdf/fonts.py#L2607)
it's fine for a PoC, but I see that currently some features have been dropped out, like alias_nb_pages or the warning for missing glyphs
if I understand things correctly, harfbuzz_text() demonstrates well the core changes that we will need to make to adopt HarfBuzz:

buf = hb.Buffer()
...  # perform text shaping based on the selected font & the provided text
for info, pos in zip(buf.glyph_infos, buf.glyph_positions):
    char = chr(self.current_font.subset.pick_by_id(info.codepoint))
    # then use char & pos to insert glyph inside PDF stream
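For reference, a standalone version of that shaping step could look roughly like the sketch below. It follows the calls shown in the uharfbuzz README (Blob, Face, Font, Buffer, shape) and is not the code of the proof-of-concept branch. The function names `shape_text` and `font_units_to_points`, the dict layout, and the assumption that positions come back in font design units (the default font scale) are choices made for this example.

```python
def font_units_to_points(value, units_per_em, font_size_pt):
    """Scale a value expressed in font design units to points."""
    return value * font_size_pt / units_per_em


def shape_text(font_path, text, font_size_pt):
    """Shape `text` with uharfbuzz and return one dict per output glyph."""
    import uharfbuzz as hb  # lazy import: the helper above works without it

    blob = hb.Blob.from_file_path(font_path)
    face = hb.Face(blob)
    font = hb.Font(face)

    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()  # guess script, language and direction
    hb.shape(font, buf, {"kern": True, "liga": True})

    glyphs = []
    for info, pos in zip(buf.glyph_infos, buf.glyph_positions):
        glyphs.append({
            # After shaping, `codepoint` holds a glyph index, not a Unicode value.
            "glyph_id": info.codepoint,
            # Refers back to the position in the input where the cluster starts.
            "cluster": info.cluster,
            "x_advance": font_units_to_points(pos.x_advance, face.upem, font_size_pt),
            "x_offset": font_units_to_points(pos.x_offset, face.upem, font_size_pt),
            "y_offset": font_units_to_points(pos.y_offset, face.upem, font_size_pt),
        })
    return glyphs


# Example call (needs uharfbuzz and a font file on disk):
# for glyph in shape_text("amiri-regular.ttf", "هذه بعض النصوص العربية", 14):
#     print(glyph)
```

The number of returned glyphs can be smaller or larger than the number of input characters, which is why the loop cannot simply map each glyph back to one char.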

A couple of questions on the code of this method:

what is the role of txt_mapped?
in case of ligatures, will a single char be processed in that loop?

Is harfbuzz the right option for text shaping in fpdf? Is the fact that it is a C program with Cython bindings going to limit fpdf use?

I checked the uharfbuzz Pypi package:

it seems regularly updated since 2018: https://pypi.org/project/uharfbuzz/#history
it is not too restrictive regarding Python version (>= 3.5)
it provides wheel packages for Linux, Windows & MacOSX: https://pypi.org/project/uharfbuzz/#files
it seems widely used: 7k package downloads /week https://pepy.tech/project/uharfbuzz
it seems well maintained, with several active contributors. uharfbuzz is also part of the global HarfBuzz project, and maintained by the same people, including Behdad Esfahbod

Overall it looks like a great candidate and I'm fine with introducing it as a dependency of fpdf2!

We already have dependencies that require compiled C code: fonttools (required since v2.5.7), pillow (required only if images are inserted, but that's very often the case I think).
So the fact that harfbuzz is a C-program with Cython bindings is fine with me.

How can we integrate the shaper? Use it as default? Optional like markdown?

IF introducing harfbuzz does not have an impact on performance, I'd be in favor of using it by default.
This would make the life of fpdf2 users a lot easier.

When I started answering here, I initially suggested to make it an optional/peer dependency (like we do already for endesive).
On second thought, given how "structural" the role of this lib would be for fpdf2, I think this would be too much of a cost in terms of code clarity & project maintenance.
In case uharfbuzz can fully replace fonttools, I would agree with this switch in order to limit the number of fpdf2 dependencies, even if it means going back to using svg.path.
But would uharfbuzz really cover all of fonttools features used by fpdf2?

What do you think about this plan @andersonhc & @gmischler?

Also, it seems like fonttools already uses uharfbuzz as an extra dependency: https://github.com/fonttools/fonttools/blob/main/setup.py#L143
I think it may be interesting that fpdf2 defines extra dependencies similarly, for endesive & pillow
(this last paragraph is mostly a note to myself 😊)

1 reply

andersonhc Mar 1, 2023
Maintainer Author

harfbuzz seems like a good pick.
Out of curiosity, have you picked it because you used it in the past?
Or after comparison with other libs?

I have not used it in the past... I was looking at #365, where it was mentioned that Pillow solved the problem using libraqm. When I looked at libraqm, I saw that it uses harfbuzz to do the shaping. I started digging a bit more into what alternatives exist and could be used in Python, and harfbuzz seems to be the most mature alternative that is multiplatform and has Python bindings available.

When I have some more free time I'll post more information on how ligatures and other font features are handled.

gmischler · 2023-02-28T14:41:46Z

gmischler
Feb 28, 2023
Maintainer

My first step was building a proof-of-concept version of fpdf with harfbuzz. I ran into several problems because fpdf is built over a "one character = one glyph" concept that is not compatible with properly shaped text.

Yes, I see how deep this assumption is nested inside fpdf2...

This assumption is currently only present in the Fragment() class, in the few methods that create its instances, and in FPDF._render_styled_text_line(). It should be fairly straightforward to change Fragment.characters from a list of chars to a list of Cluster() instances that encapsulate multi-glyph sequences. I have outlined this concept in several discussions already, and can dig up a few pointers if necessary.
(We need to convert FPDF.text() to also use Fragment()s to make the above explanation entirely true.)

in case of ligatures, will a single char be processed in that loop?

Ligature processing takes two or more chars as input, and produces one or more glyphs as output.
As far as I have seen so far though, with harfbuzz we don't need to worry about many of those details. It eats a string of Unicode chars and spits out a sequence of glyphs, grouped into clusters. And in each cluster, each glyph comes with positioning information that we can use directly. We don't really need to consider whether any (partial) glyph is a base character glyph, the result of ligaturization, or a (possibly stacked) accent sign (all of which apparently can appear in the same cluster).
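To illustrate the idea, here is a small sketch that groups shaper output into clusters and maps each cluster back to the source characters it covers. The `Glyph` and `Cluster` classes, both functions, and all glyph ids and advances are invented for this example; the data mimics what a shaper might return for a ligature followed by an accented letter, and the text mapping assumes left-to-right input with increasing cluster values.

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class Glyph:
    glyph_id: int     # glyph index in the font (made-up values below)
    cluster: int      # cluster value reported by the shaper
    x_advance: float


@dataclass
class Cluster:
    start: int                                  # position in the source text
    glyphs: list = field(default_factory=list)

    @property
    def width(self):
        return sum(g.x_advance for g in self.glyphs)


def group_into_clusters(glyphs):
    """Group consecutive glyphs that share the same cluster value."""
    return [
        Cluster(start=start, glyphs=list(members))
        for start, members in groupby(glyphs, key=lambda g: g.cluster)
    ]


def source_text(text, clusters):
    """Slice of `text` covered by each cluster (left-to-right input only)."""
    starts = [c.start for c in clusters] + [len(text)]
    return [text[a:b] for a, b in zip(starts, starts[1:])]


text = "ffie\u0301"  # "ffi" ligature candidate, then "e" + combining acute
shaped = [
    Glyph(glyph_id=301, cluster=0, x_advance=9.0),  # one ligature glyph for "ffi"
    Glyph(glyph_id=72, cluster=3, x_advance=5.0),   # base "e"
    Glyph(glyph_id=410, cluster=3, x_advance=0.0),  # accent stacked on the "e"
]
clusters = group_into_clusters(shaped)
for cluster, chars in zip(clusters, source_text(text, clusters)):
    print(repr(chars), len(cluster.glyphs), cluster.width)
```

Three characters collapse into one glyph in the first cluster, and two characters map to two glyphs in the second, yet the code treating them never needs to know which case it is dealing with.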

But would uharfbuzz really cover all of fonttools features used by fpdf2?

Given its functionality, it essentially has to provide a superset of fonttools functionality (minus the SVG paths).

What do you think about this plan?

Good plan! 😁

@andersonhc, if you need more details about the inner workings of Fragment() and friends, just holler!

0 replies

Lucas-C · 2023-04-27T05:41:53Z

Lucas-C
Apr 27, 2023
Maintainer

Hi @andersonhc!

Are you still working on this promising integration? 😊

1 reply

andersonhc Jun 1, 2023
Maintainer Author

Hi @andersonhc!

Are you still working on this promising integration? 😊

I was busy with a project but I will resume this PR soon.
I will try to break it into 3 or 4 PRs to make it simpler and avoid changing too much code at once.
