RTL Support in the language like Arabic which are RTL languages #175

Closed
sohailsed opened this issue Jun 7, 2015 · 43 comments
Labels
tabled Viable requests or ideas, but there's no time to work on them right now.

Comments

@sohailsed

Hi,

Based on the discussion here, asciidoctor-pdf does not support RTL at the moment.

I observed the following when rendering a PDF from AsciiDoc:

  1. The characters renders in capital
  2. Because the characters render in capital, a word looks like a series of separate characters, whereas in a language like Arabic the characters of a word are connected to each other. In other words, the first and middle characters of a word should be non-capital; rendered that way, the word shows connected characters rather than separate ones.
  3. Another problem is that the characters within a word run LTR instead of RTL. The words themselves, however, are ordered correctly (RTL).
  4. The equivalent HTML document, created with asciidoctorj, renders correctly and does not have the above problems.

For more history on this issue, look here.

Thanks

@sohailsed
Author

Hi,

One more observation that could help you find the solution:

A PDF rendered without embedding a non-Latin font shows the non-Latin characters correctly, but only in the TOC (table of contents). In other words, the non-Latin characters render well in the TOC, but in the body of the PDF document, Arabic characters do not render at all. (This is with the default fonts and the default YAML theme, without adding non-Latin fonts or changing the YAML.) In summary, Arabic characters render fine in the TOC but not in the other parts of the PDF. I think this difference in behavior could help you isolate the problem and even find the solution.

thanks

@mojavelinux
Member

Thanks for this input @sohailsed! It will prove very valuable in working out the best solution to this problem.

Prawn, the underlying PDF library, does have some support for RTL, but from the threads I've followed, it seems like the team's understanding of RTL is limited. One way or another, we want to make it right in AsciiDoc.

I am curious, how do you write your AsciiDoc content? Do you write that in RTL too? If so, how do you work around some of the LTR bias in the language syntax?

Another thing that might be helpful is if you provided a sample AsciiDoc document and a screenshot of how it is rendered in the browser. That should help get a better understanding of what we're shooting for.

For reference, I found (what seems to be) a nice article on Devanagari fonts that includes some key terminology and assumptions. http://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm (which I realize is different from Arabic, but contributes to the understanding of non-Latin languages).

@mojavelinux mojavelinux added this to the v1.6.0 milestone Jun 25, 2015
@mojavelinux mojavelinux self-assigned this Jun 25, 2015
@sohailsed
Author

Hi, and excuse me for the late response!

I attached the AsciiDoc source and the generated HTML and PDFs. There are two PDFs: one generated with embedded RTL fonts and one generated with the default fonts included in the library. I hope this helps you.

Some facts about these PDFs:

  1. Both of them render RTL text correctly in the TOC.
  2. The one with embedded fonts renders RTL text in the body of the PDF, but the characters are not continuous (they are capitals, whereas they should be lower case in the middle of a word and look like continuous characters, not separate capital characters).
  3. The one without embedded fonts does not include the RTL text at all in the body of the PDF.

Also, this may interest you: I saw a document on the net that described a similar problem. It attributed the problem to Adobe UI components, a few of which support RTL while others do not. In our case, the text displays completely fine, even without embedding RTL fonts, in some parts of the document but not in all of them (it is visible in the TOC section of the PDF and missing elsewhere), which strengthens my guess that we have a similar problem. (This is the situation where we did not use any trick to force RTL to render correctly and depend only on the default behavior of the Adobe components.)

Anyway, I uploaded the files and hope they help you:

Asciidoc File

Generated PDF 1

Generated PDF 2

Generated HTML

Screenshot of HTML

Kind Regards

@mojavelinux
Member

I hope this helps you.

This definitely helps. Thank you for providing these samples.

It looks like we have an issue in the HTML output as well. I think we need to allow the direction to be set and then pass that through to the HTML. If you'd like to file an issue in https://github.com/asciidoctor/asciidoctor, we can address that problem there.

@mojavelinux
Member

There will need to be changes in each of the converters. For the HTML output, please see asciidoctor/asciidoctor#1601 to evaluate RTL support in core and the HTML / DocBook output. Asciidoctor PDF will need to do more work since it has to handle most of the layout itself.

It may be possible to use Asciidoctor to convert to DocBook, then use a2x or fopub to convert to PDF. Then, all you need is basic support in core as likely the DocBook toolchain handles RTL. In short, it will be necessary to explore different avenues as we learn what needs to be done to support it properly. This is new territory for me, but I'm happy to learn about it.

@meisterluk
Contributor

This is my personal opinion and your mileage may vary. I am talking about this on a very fundamental level:

It is very difficult to reinvent paragraph layout, text shaping, etc over and over properly again. I love asciidoc as a nice text input language and would love to see a lot of non-technical people using it instead of MS Word and (to some extent) LaTeX. I see a lot of potential. However, there are still many differences between asciidoctor and professional typesetting.

The Unicode Consortium takes care of specifying algorithms for bidi writing and I think referring/considering their decisions is the way to go. On the other hand I consider paragraph layouting an unsolved issue. It can be incredibly difficult to specify 2D behavior in 1 dimension (text). Text shaping is a solved technical issue as long as it does not come to OpenType Features. My point being:

Harfbuzz is the de-facto standard for text shaping. Firefox, Servo and Chrome use it for the web, Xe(La)TeX nowadays also uses Harfbuzz (for print, obviously). Optionally Pango and/or ICU Paragraph Layout are used to finally form paragraphs.

I can clearly see how asciidoctor already beats Patoline in many aspects, but lacks typographic features compared to SILE typesetter, because SILE is reusing existing libraries like HarfBuzz. I consider those 2 projects as competitors because they also try new approaches for desktop publishing.

To sum it up: I think tackling the RTL question (and associated Unicode questions) as a ruby/prawn-only issue might slow down the project for several years. I suggest considering future options to integrate those other libraries. Specifically, using Harfbuzz would be a very helpful start with little risk. But Harfbuzz does not help with RTL/bidi. Sadly, I cannot give a specific recommendation as to which of the other mentioned libraries best suits the needs of asciidoctor. Talking to developers who work with those libraries on a regular basis might help more. I am just a young typesetting enthusiast.

Hopefully, I am not taking a "small" document format project in too far-fetched a direction, but I would love to see asciidoctor as a choice for a wide range of use cases in digital typesetting.

@mustafa0x

Are there any new developments with regards to this issue? Would using XeLaTeX alleviate this problem? Or is that not even applicable here?

@shahryareiv

@mustafa0x Probably not. For a direct solution, I guess, an RTL layout engine, such as Harfbuzz, should be implemented in Ruby, to be connected to the PDF engine of Asciidoctor-PDF. But I am using Asciidoctor for RTL texts by going through
Asciidoc -> Docbook (through Asciidoctor) -> Tex (through DBLatex) -> PDF (through Xelatex).

@mojavelinux mojavelinux modified the milestones: v1.6.0, v2.x Sep 18, 2019
@mojavelinux
Member

mojavelinux commented Nov 13, 2019

There's no way that Asciidoctor PDF is going to get into text shaping or any of the other typesetting concerns mentioned above. That's the responsibility of a PDF library. And trying to take that on would not only be out of scope, it would likely be insufficient.

I don't understand the observation "The characters renders in capital". Asciidoctor PDF takes what you wrote and it puts that text into the PDF using the font you specify. It's not capitalizing any text.

It is very difficult to reinvent paragraph layout, text shaping, etc over and over properly again.

I agree. This is another strong case for why the future of Asciidoctor PDF is browser-based. The browser already handles most (if not all) of the necessary typesetting for RTL languages, and we can then distill that result to PDF.

What I need to know is what can we reasonably do to support RTL in Asciidoctor PDF that addresses the most glaring problems? And for that, I need a sample AsciiDoc document somewhere that I can test that isn't going to disappear.

@mojavelinux mojavelinux modified the milestones: v2.x, future Nov 13, 2019
@mojavelinux
Member

mojavelinux commented Nov 13, 2019

To be clear, if you want me to modify Asciidoctor PDF to better support RTL, I need a sample document and an explanation for what Asciidoctor PDF is doing wrong. I'll try to fix what I can, as long as it involves alignment or margin calculations. You could also submit a PR with proposed changes to help move things along.

Asciidoctor PDF is not a typesetter and it never aimed to be one. It's a simpler path to creating PDFs. If you need more advanced PDF creation tools, I believe a browser-based solution will address some of it, or you can convert to DocBook and use that toolchain (fopub or dblatex).

It's also important to separate AsciiDoc and the converter. The language itself "supports" RTL. The discussion here is how that carries through to the output document. The question is are we doing what we can do to honor RTL semantics? (or can we, given the capabilities of the PDF library we're using)?

@shahryareiv

I don't understand the observation "The characters renders in capital"

I guess he meant the isolated form of letters in Arabic (and other bidi scripts). See "contextual forms" in https://en.wikipedia.org/wiki/Arabic_script_in_Unicode. However, as you mentioned, this is the PDF renderer's responsibility. The ligature information and glyph selection contexts are stored in fonts that support those languages. If a renderer does not support reading and applying that information (or is not signaled to do so), then it will show each letter in its isolated form. Harfbuzz is a popular text-shaping engine that loads fonts (for various languages, not only bidi) and assigns the right glyphs, but I think it is the job of the renderer to load and use it.
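
To make the "contextual forms" point concrete, here is a minimal Ruby sketch. It relies only on the standard library and on the legacy Arabic Presentation Forms-B code points, using the letter BEH as an arbitrary example; the constant and method names are made up for illustration. A real shaper such as Harfbuzz picks glyphs from the font rather than from these code points, so this only shows that one logical letter has several shapes.

```ruby
# One logical Arabic letter (BEH, U+0628) has four contextual shapes.
# Unicode keeps legacy code points for them in Arabic Presentation Forms-B.
BEH = "\u0628"
BEH_FORMS = {
  isolated: "\uFE8F",
  final:    "\uFE90",
  initial:  "\uFE91",
  medial:   "\uFE92"
}.freeze

# NFKC normalization folds each presentation form back to the logical
# letter, which shows they are the same character in different shapes.
def logical_letter(form)
  form.unicode_normalize(:nfkc)
end

BEH_FORMS.each do |position, glyph|
  puts format("%-8s U+%04X -> U+%04X", position, glyph.ord, logical_letter(glyph).ord)
end
```

A renderer that skips shaping effectively draws the isolated shape for every letter, which matches the "separated characters" described at the top of this thread.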

I have not looked at Prawn's capabilities, but I guess that, compared with Prawn, it is better to invest in the CSS paged media module and its future, especially for bidi languages. There might be slight differences between implementations, but they are not critical. However, we should note that browser rendering at the moment is not targeting paged media and might wait for a long time to see implementation of paged layout features similar to TeX family.

To summarize, I think the proper responsibility of Asciidoctor for bidi text is only to signal the document-wide, block-wide, or inline text directions properly. However, support for different languages is somewhat wider than the bidi signal.

@shahryareiv

shahryareiv commented Nov 13, 2019

One more comment:
I have used the Asciidoc -> DocBook -> LaTeX chain to create nice-looking RTL documents. However, conversion of DocBook to LaTeX is troublesome. One has to use XSLT. There are also features for which I had to use customized solutions (for example, expanding abbreviations, citations, ...). dblatex and fopub do not seem to be supported any more.
My other suggestion (which might need to be discussed in a different place) is to enrich Asciidoctor HTML output with more information, make it use more standard forms, and focus on converting everything (LaTeX, direct PDF, ...) from that single source. There are many more tools available for parsing and manipulating HTML.

@mojavelinux
Member

mojavelinux commented Nov 13, 2019

we should note that browser rendering at the moment is not targeting paged media and might wait for a long time to see implementation of paged layout features similar to TeX family.

What I'm about to say is a little hand wavy, so here goes. The browser has JavaScript and JavaScript+CSS has proven to be able to manipulate layouts in every way imaginable, even compared to print layout. So I have no doubt we can achieve what we want to achieve for pages using JavaScript. I'm not suggesting that we rely on support for paged media natively, because it's understood to be limited and buggy.

@mojavelinux
Member

fopub do(es) not seem to be supported any more

You don't need fopub. It is just a packaged way to run the DocBook toolchain (again, just for simplicity and to provide better styles, which could easily be copied out of that project). There are plenty of other ways to run the DocBook toolchain, and I would recommend those if fopub is too simplistic.

@mojavelinux
Member

to enrich Asciidoctor HTML output with more information

We are starting the process of creating a more semantically rich HTML output. But you don't have to wait on that. The HTML converter can be customized using templates or by extending it, or you can replace it entirely (like asciidoctor-html5s does). Asciidoctor is giving you the tools to make the output document you want with some reasonable defaults. But the whole design of this ecosystem is to empower you to make the output you want.

@shahryareiv

shahryareiv commented Nov 13, 2019

The browser has JavaScript and JavaScript+CSS has proven to be able to manipulate layouts in every way imaginable, even compared to print layout.

There are features in typesetting platforms (such as InDesign or the TeX family) that affect the way text looks on the printed page (or PDF), and eventually the page layout, and we cannot reach that level of manipulation with JavaScript+CSS. Actually, those features should be implemented in the browser engine (WebKit, Gecko, Blink). Some of these features are kerning, font tracking, and expansion. These normally operate at the letter or inter-letter level but affect the whole layout of the document. A good source for some of those features might be the microtype package for LaTeX http://ftp.acc.umu.se/mirror/CTAN/macros/latex/contrib/microtype/microtype.pdf . Also, articles that compare Word and LaTeX sometimes mention differences that can be extended to browsers, such as http://nitens.org/taraborelli/latex.

Practically speaking, we might start with JavaScript+CSS at acceptable quality. My impression is that there is a good level of demand for that, as companies such as Prince (www.princexml.com/) work in that market. However, to reach the professional level of the printing industry there might be a need to access features at the browser engine level (or to develop customized versions of those engines).

@mojavelinux
Member

When I mentioned JavaScript+CSS, I was specifically talking about pagination and page layouts (i.e., paged media). Clearly the browser engine itself needs to handle font rendering such as kerning, expansion, etc. My understanding is that it already does. If not, it should. That's definitely the responsibility of the browser.

Reaching printing industry professional level (however that is defined) is simply not my goal, or the goal of this project. Typesetting is not my passion or expertise. I'm unlikely to ever venture past what the browser engine / PDF library provides. (The thought of developing a custom browser engine gives me the shivers). What I will do is continue working on providing a framework for converting AsciiDoc so anyone who does want to take on that challenge can.

@mojavelinux
Member

I'm still looking for a simple document to test. Prawn does have support for text direction. I just need to understand when to apply it, and where else the margin calculations are assuming ltr instead of considering bidi.

@mustafa0x

The browser has JavaScript and JavaScript+CSS has proven to be able to manipulate layouts in every way imaginable, even compared to print layout.

Possibly related: https://www.pagedmedia.org/paged-js/

I'm still looking for a simple document to test.

A sample Arabic document? You can use something from the Arabic Wikipedia: https://ar.wikipedia.org/wiki/الصفحة_الرئيسية

@mojavelinux
Member

If I make up the use case, it will be fiction. I need a real scenario that comes from someone who understands what it should produce. Ideally that would be a sample AsciiDoc document in an RTL language, the output it produces currently, and the expected output (perhaps created using a different program). Then I'll see how close we can get.

@shahryareiv

shahryareiv commented Nov 15, 2019

asciidoctor.ir.zip

The sample file. It is actually the content of asciidoctor.ir (the site is down at the moment). It contains the adoc file, the generated HTML, and two images. The only difference from a normally generated file is the <body class="article" style="direction:rtl">. The two image files are snapshots of an article in Persian, written in AsciiDoc and output through the HTML and DocBook->LaTeX paths.

@mojavelinux
Member

Thanks @shahryareiv!

@mojavelinux
Member

I notice a lot of \u200c (zero-width non-joiner) characters. Is that normal to add those when writing, or is that something a tool adds?

@shahryareiv

It is normal. Some affixes or multi-word compounds should be separated with a zero-width non-joiner.
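
As a small illustration of why a converter should leave these characters alone, here is a hedged Ruby sketch. The Persian word is only an example (it means "I want"), and the helper names are hypothetical; nothing here is Asciidoctor or Prawn API.

```ruby
ZWNJ = "\u200C" # ZERO WIDTH NON-JOINER

# A prefix and a stem written as one word. The ZWNJ keeps the last letter
# of the prefix from joining the first letter of the stem.
WORD = "\u0645\u06CC#{ZWNJ}\u062E\u0648\u0627\u0647\u0645"

# Count how many non-joiners a piece of text contains.
def zwnj_count(text)
  text.count(ZWNJ)
end

# What a careless "strip invisible characters" step would produce.
# The letters stay the same, but a shaper would now join them differently.
def strip_zwnj(text)
  text.delete(ZWNJ)
end

puts zwnj_count(WORD)
puts WORD.length - strip_zwnj(WORD).length
```

The takeaway is only that the character is part of the author's text, so any processing step between the source and the PDF should pass it through unchanged.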

@shahryareiv

Just note one problem (and probably the only one) with the HTML document: The section numbers, generated by asciidoctor, are in English and not Persian.

@mojavelinux
Member

@shahryareiv Yep, I noticed that. Though that's a core issue. It's possible to patch the Section class to have it generate numbers differently. Though we've long talked about making that an extension point. Either way, Asciidoctor PDF isn't really involved there.
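
For illustration only, the digit conversion itself is small. The sketch below maps ASCII digits in an already formatted section number to Extended Arabic-Indic (Persian) digits, U+06F0 to U+06F9. The method name `persian_sectnum` is hypothetical, and how it would be hooked into the Section class (or a future extension point) is deliberately left out, since that part is not settled in this thread.

```ruby
ASCII_DIGITS   = "0123456789"
# Extended Arabic-Indic digits, the set used for Persian.
PERSIAN_DIGITS = "\u06F0\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9"

# Replace each ASCII digit in a formatted section number such as "2.10."
# with its Persian counterpart. Delimiters are left untouched.
def persian_sectnum(sectnum)
  sectnum.tr(ASCII_DIGITS, PERSIAN_DIGITS)
end

puts persian_sectnum("2.10.3.")
```

Whether the delimiter or the order of the number parts should also change for an RTL document is a separate question that this sketch does not address.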

@mojavelinux
Member

Using the paktype fonts, here's what we get currently:

asciidoctor.ir.pdf

Prawn does support setting the text direction to rtl globally. If I do that, here's what we get:

asciidoctor.ir-rtl.pdf

Right away we can see a problem with mixed text. Prawn is reversing English words. So we'll need to detect those and switch them back.

There's also an issue with margins. I have no idea what's going on there.

@mojavelinux
Member

What's interesting is that none of the rtl logic even hits Asciidoctor PDF. It's all happening down in Prawn. If the text direction is rtl, Prawn needs to scan/chunk the text and flip the non-RTL words back around (something like string =~ /\p{Arabic}/).
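
A rough Ruby sketch of that scan/chunk idea, assuming a word counts as RTL as soon as it contains one Arabic-script character. The method names are made up for illustration and this is not how Prawn works today. Digits and punctuation belong to no script, so this naive version files them under `:ltr`, which the real Unicode Bidirectional Algorithm would not always do.

```ruby
# Classify a single word by script: any Arabic-script character makes it RTL.
def direction_of(word)
  word.match?(/\p{Arabic}/) ? :rtl : :ltr
end

# Split text on whitespace and group neighbouring words that share a
# direction, returning [direction, text] pairs in logical order.
def direction_runs(text)
  text.split
      .chunk_while { |a, b| direction_of(a) == direction_of(b) }
      .map { |words| [direction_of(words.first), words.join(" ")] }
end

sample = "\u0633\u0644\u0627\u0645 light-weight markup \u062F\u0646\u06CC\u0627"
direction_runs(sample).each { |dir, run| puts "#{dir}: #{run}" }
```

The runs come back in logical (typing) order; deciding where each run is placed on the line is the harder part and is not attempted here.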

@mojavelinux
Member

I suggest filing an issue in Prawn to support bidi text (not just rtl). If the direction is bidi, then Prawn will need to chunk the text and set the direction appropriately.

It's possible we could chunk the text in Asciidoctor PDF, at least to handle the really basic stuff. This is similar logic to what we just did for text hyphenation. We scan for non-RTL words and wrap them in a tag that reverses the direction. It's not a perfect solution, but it's a place to start.

There's no doubt that the weak link here is Prawn, and it further proves that the future has to be browser-based rendering (because it's going to produce better results than Prawn).

@shahryareiv

shahryareiv commented Nov 15, 2019

First:
You do not need to use PakType or any language-specific font (actually I recommend not using one, as there might be some peculiarities). Simply use widely used fonts such as Arial, Tahoma, or Google Noto.

The first Prawn output:

1- It does not read font ligature information or apply text shaping algorithms to choose the right glyphs. I know there are projects interfacing Harfbuzz with Ruby. Besides the original implementation in C++, there is a JavaScript implementation available (https://github.com/harfbuzz/harfbuzzjs). There are also Pango interfaces (I am not sure which version supports RTL).

2-Strangely, the table of content in the PDF viewer (mine Preview) shows all titles in the correct format.

The second Prawn output:

1- It just reversed the order of the letters. It does not respect text shaping (right glyphs) or what to do when switching from rtl to ltr or the reverse (Unicode Bidirectional Algorithm http://www.unicode.org/reports/tr9/).

Chunking individual words does not solve the problem. In mixed text one reads from the right; then, when Latin words appear, reading should continue from the leftmost possible position in the line. That is what has not been implemented in the second Prawn output, so for example "light-weight markup" has become "pukram thgiew-thgil" (markup weight-light).
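
The difference can be reproduced in a few lines of Ruby. This is only a toy model of the reordering step, assuming a single line with an RTL base direction and plain ASCII Latin words; the method names are invented for the example, and it is in no way an implementation of the Unicode Bidirectional Algorithm.

```ruby
# What the second output appears to do: mirror every character on the line.
def reverse_whole_line(line)
  line.reverse
end

# Mirror the line, then turn each whole run of Latin words back around, so
# the run keeps its internal left-to-right order at its mirrored position.
def reverse_keeping_latin_runs(line)
  line.reverse.gsub(/[A-Za-z](?:[A-Za-z\- ]*[A-Za-z])?/) { |run| run.reverse }
end

mixed = "\u0633\u0644\u0627\u0645 light-weight markup \u062F\u0646\u06CC\u0627"

puts reverse_whole_line("light-weight markup") # prints "pukram thgiew-thgil"
puts reverse_keeping_latin_runs(mixed)
```

Note that the Latin run is restored as one unit ("light-weight markup"), not word by word. Flipping individual words inside an already mirrored line would give "markup light-weight", the word order noted in parentheses above. Line breaking, digits, and nested directions are all out of scope for this sketch.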

My suggestion is to keep with interfacing or implementing Pango-Harfbuzz (not in Asciidoc but in Prawn). This is of course already implemented in a browser. There are already filed (and closed) issues to support bidi with Prawn: prawnpdf/prawn#68 , prawnpdf/prawn#871, prawnpdf/prawn#921 . There is also a gem https://github.com/cropio/prawn-rtl-support , which I have no idea how mature it is.

@mojavelinux
Member

Simply use widely used fonts such as Arial, Tahoma, or Google Noto.

Aha! That looks much better.

@mojavelinux
Member

Strangely, the table of content in the PDF viewer (mine Preview) shows all titles in the correct format.

That's because Prawn doesn't "draw" that text. It encodes it into the PDF as a hex-encoded string. It's up to the PDF viewer to render that part. Hence why it works out of the box on your system.
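
For the curious, a hex-encoded PDF text string can be sketched like this in Ruby. The PDF format allows text strings (such as outline titles) to be stored as UTF-16BE with a leading byte order mark; whether Prawn produces exactly these bytes is an assumption here, and `pdf_hex_text_string` is a made-up helper name, not a Prawn method.

```ruby
# Encode a title as a PDF hex string: UTF-16BE with a byte order mark,
# written as hexadecimal digits between angle brackets.
def pdf_hex_text_string(text)
  utf16 = "\uFEFF#{text}".encode("UTF-16BE")
  "<#{utf16.unpack1('H*').upcase}>"
end

puts pdf_hex_text_string("A")                        # prints "<FEFF0041>"
puts pdf_hex_text_string("\u0633\u0644\u0627\u0645")
```

The string carries only logical code points and no glyph choices, so shaping and ordering are left entirely to the viewer, which is why the outline looks right even when the page body does not.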

@mojavelinux
Member

My suggestion is to keep with interfacing or implementing Pango-Harfbuzz (not in Asciidoc but in Prawn).

I was exploring if there was something we could do from the Asciidoctor PDF side. It seems the answer is no, not really. While Prawn has a switch for rtl, it's basically worthless because it doesn't do the right thing.

I take your point about integration with Harfbuzz, but this just isn't the place. That has to happen in Prawn. Asciidoctor PDF would be taking on something that is well beyond the scope of converting from AsciiDoc to PDF objects. Otherwise, we'd be recreating a PDF library, and that's just not what we're doing here.

@mojavelinux
Member

Btw, there's a Ruby interface for Harfbuzz right here: https://rubygems.org/gems/harfbuzz

@mojavelinux
Member

If Prawn keeps closing bidi issues, then that's cause for us to start looking elsewhere for a PDF platform on which to build (which is half the discussion here). We have to be able to rely on the PDF library doing the right thing because we are not a PDF library (and I don't want to go there).

@mustafa0x

mustafa0x commented Nov 15, 2019

The issue seems to be more basic than bidi, as simple shaping doesn't work (Arabic and Persian are cursive scripts).

@mojavelinux
Member

Shaping is still the responsibility of the PDF library.

@mustafa0x

mustafa0x commented Nov 15, 2019 via email

@shahryareiv

shahryareiv commented Nov 15, 2019

I just wonder whether something satisfying can be found at https://www.pagedmedia.org. They have polyfills for CSS paged media modules that are not implemented yet.

@shahryareiv

As another option: how about docker-based latex?

@mojavelinux
Member

mojavelinux commented Nov 16, 2019

If someone is able to demonstrate that there's room to add something at the Asciidoctor PDF converter layer, I'm open to reviewing that solution. I won't be spending my time on coming up with a solution since this is not my focus or expertise. The Asciidoctor ecosystem is open, so there is always the possibility of different pathways (latex or asciidoctor-pdf.js). But those aren't appropriate to discuss in this issue tracker. Alternate pathways are better discussed at https://discuss.asciidoctor.org.

(Having said that, I have learned from this thread and will take those concerns into consideration when working on the AsciiDoc language).

@mojavelinux
Member

As I mentioned previously, I looked into this and the amount of work required to get Prawn to handle RTL scripts is extensive and not something I'll be able to achieve. This is yet another reason why we're beginning to explore using web technologies as a way to generate PDFs from AsciiDoc in the future. State of the art RTL support is built into web technologies, and we will be able to leverage that.

Unless someone comes forward with a solution, I don't see this issue going anywhere.

@mojavelinux mojavelinux removed this from the future milestone May 1, 2022
@mojavelinux mojavelinux added the tabled Viable requests or ideas, but there's no time to work on them right now. label May 1, 2022
beniza added a commit to beniza/gundert that referenced this issue May 27, 2022
  - rendering of complex script is not supported
  - asciidoctor/asciidoctor-pdf#175
@gnusupport

This comment was marked as abuse.

6 participants