
Request for UTF-8 support #37

Open
joakim-tjernlund opened this issue Jan 11, 2020 · 16 comments · May be fixed by #49
Comments

@joakim-tjernlund

Any idea if UTF-8 support can be added to this lib?
Preferably without depending on locale stuff from glibc.

@troglobit
Owner

Sure it can, it's just a lot of work since most of the lib is built around the notion that a character is one byte wide. A Unicode character, as far as I know, can be two, three, or four bytes wide.
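
The sequence length is encoded in the lead byte, so telling the widths apart is cheap. A minimal sketch in C, assuming the input is already valid UTF-8:

    /* Number of bytes in a UTF-8 sequence, from its lead byte.
     * Returns 0 for an invalid lead byte. */
    static int utf8_seqlen(unsigned char b)
    {
        if (b < 0x80)
            return 1;                 /* 0xxxxxxx: ASCII */
        if ((b & 0xE0) == 0xC0)
            return 2;                 /* 110xxxxx */
        if ((b & 0xF0) == 0xE0)
            return 3;                 /* 1110xxxx */
        if ((b & 0xF8) == 0xF0)
            return 4;                 /* 11110xxx */

        return 0;                     /* continuation byte or invalid */
    }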

Do you mind telling me the use case? If our interests interlock (which I've noticed quite a few times), I may be able to help in such an endeavor.

@joakim-tjernlund
Author

Today most terminals default to UTF-8, so I think it would be great if editline could do UTF-8 but limit accepted characters to ISO Latin. That would help our use case, where users connect via telnet/ssh/RS-232 and non-ASCII characters would just work.
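
A minimal sketch of that restriction, with a hypothetical helper name (not editline API): decode one UTF-8 sequence and accept it only if it falls in ISO Latin-1 (U+0000..U+00FF):

    #include <stddef.h>

    /* Decode one UTF-8 sequence from s (at most n bytes) and accept it
     * only if it is an ISO Latin-1 code point (U+0000..U+00FF).
     * Returns the code point, or -1 to reject. */
    static int latin1_from_utf8(const unsigned char *s, size_t n)
    {
        if (n >= 1 && s[0] < 0x80)
            return s[0];                              /* ASCII */

        if (n >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
            int cp = (s[0] & 0x1F) << 6 | (s[1] & 0x3F);

            if (cp >= 0x80 && cp <= 0xFF)             /* reject overlong */
                return cp;                            /* Latin-1 */
        }

        return -1;                                    /* reject */
    }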

@troglobit
Owner

Yeah, that's a good limitation. I'll have a look again and try to estimate the amount of work needed.

@troglobit
Owner

Interestingly, quite a lot actually works already. What needs attention is cursor movement, most notably when editing a line: it is hard-coded to one character = one byte. Also, most of the internals use 'int' to store a character during processing, so it shouldn't be too much work to straighten out libeditline \o/
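
The cursor-movement part mostly comes down to the fact that UTF-8 continuation bytes all match 10xxxxxx, so stepping one character left is just skipping over them. A sketch, assuming the line buffer holds valid UTF-8:

    #include <stddef.h>

    /* Move the cursor one character left in a valid UTF-8 buffer by
     * skipping backwards over continuation bytes (10xxxxxx) until the
     * previous lead byte is found. */
    static size_t utf8_prev(const char *buf, size_t pos)
    {
        while (pos > 0 && ((unsigned char)buf[--pos] & 0xC0) == 0x80)
            ;

        return pos;
    }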

@troglobit changed the title from "UTF8 support?" to "Request for UTF-8 support" Jan 12, 2020
@joakim-tjernlund
Author

Any progress on this?

@troglobit
Owner

Nope. I've been redirected/reprioritized at work atm, so I'll have to circle back to this later. Sorry.

@skull-squadron

skull-squadron commented Feb 26, 2020

When there is time, the easiest way to do this would be how another library handled it: duplicate the API (functions, header, and library) and append a suffix such as w to all relevant symbols. Then use something like libunistring, utf8proc, or icu4c to handle conversion from raw bytes to UTF-8, and iterate over the individual 21-bit code points (stored 1 to 4 bytes long). There are numerous edge cases, such as non-/UTF-8 BOMs to discard at the beginning of input, code points spanning reads, and UTF-16 surrogate pairs to (not) handle, so writing a solid parser yourself isn't easy. Deciding which 21-bit values are valid Unicode is even more complicated; it amounts to converting the latest Unicode standard version into optimal conditional statements.
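
As a sketch of that shape only -- these wide variants are hypothetical, invented here for illustration, and do not exist in editline -- the duplicated API might look like:

    #include <wchar.h>

    /* Hypothetical wide-character counterparts of the classic
     * readline()/add_history() entry points. */
    wchar_t *readlinew(const wchar_t *prompt);
    void     add_historyw(const wchar_t *line);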

@troglobit
Owner

@steakknife Thank you for the input! I sort of came to the same conclusion wrt parsing, but I'm really hesitant to make editline depend on other libraries; the whole point of editline is to be small and have as few (no) dependencies as possible. Whenever I circle back to this (or someone else does), your input will be a valuable starting point for further investigation!

@Artoria2e5

Artoria2e5 commented Apr 19, 2021

I doubt there is any need for an extra layer on top of that, let alone any libraries. The one thing we need to get from a UTF-8 string is the width of characters. Every POSIX system has an int wcwidth(wchar_t c), and all we need is to use mbrtowc to feed it. (On Windows you will need to carry your own wcwidth table and conversion function; busybox has a nice slim version.)
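
A minimal sketch of that approach, using only POSIX calls and assuming a UTF-8 locale has been selected with setlocale(LC_CTYPE, ""):

    #include <locale.h>
    #include <wchar.h>

    /* Display width, in terminal columns, of a UTF-8 string, using
     * mbrtowc() to feed wcwidth(). Returns -1 on a decode error. */
    static int utf8_display_width(const char *s, size_t len)
    {
        mbstate_t st = { 0 };
        int cols = 0;

        while (len > 0) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, s, len, &st);

            if (n == (size_t)-1 || n == (size_t)-2)
                return -1;            /* invalid or truncated sequence */
            if (n == 0)
                n = 1;                /* embedded NUL is one byte */

            int w = wcwidth(wc);
            if (w > 0)
                cols += w;            /* combining marks etc. count as 0 */

            s   += n;
            len -= n;
        }

        return cols;
    }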

@troglobit
Owner

Good point. Just remember that BusyBox is GPL; this lib isn't.

@minfrin

minfrin commented Apr 20, 2021

Replxx (BSD license) has an example of calculating the number of Unicode code points in a UTF-8 string:

https://github.com/AmokHuginnsson/replxx/blob/master/examples/util.c#L3
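
In the spirit of that example, the calculation can be reduced to counting every byte that is not a UTF-8 continuation byte; a sketch, assuming valid UTF-8:

    #include <stddef.h>

    /* Number of code points in a NUL-terminated UTF-8 string: count
     * every byte that is not a continuation byte (10xxxxxx). */
    static size_t utf8_strlen(const char *s)
    {
        size_t n = 0;

        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;

        return n;
    }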

@Artoria2e5 linked a pull request Apr 20, 2021 that will close this issue
@timkuijsten

When it comes to supporting different locales, I found this presentation by the OpenBSD maintainer of libedit very helpful: https://czyborra.com/yudit/eurobsdcon2016-utf8.pdf

Supporting UTF-8 might be possible without having to duplicate all interfaces or drag in heavyweights like libiconv.

@skull-squadron

skull-squadron commented Oct 13, 2022

@troglobit To be honest, Unicode (UTF-8, -16LE/BE, -32LE/BE, and UCS-2/4) support, including BOM detection/removal, sanitization, and normalization, is trivial-adjacent. The catch is that 95% of libraries get it wrong and/or don't keep it current, including tools at work used to parse Rust code. (We're in the midst of a refactoring bonanza.) The maintenance required is precision around keeping current with Unicode versions and correctly classifying characters that could fall under multiple domains; the older Unicode references list these concerns more pedantically than newer ones. It's also worth being aware of the runtime behavioral changes imposed by LC_CTYPE, LC_COLLATE, LC_ALL, and LANG on GNU/Linux, POSIX, and other systems. For example, when grepping random binary files for text, better to set LC_ALL=C, or grep is likely to assume UTF-8 is the lingua franca.
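
In C, the knob that matters here is setlocale(); a sketch of honoring the user's environment, assuming a program that wants mbrtowc() to decode UTF-8:

    #include <locale.h>

    int main(void)
    {
        /* Pick up LC_ALL / LC_CTYPE / LANG from the environment so
         * that mbrtowc() decodes multibyte UTF-8; in the default "C"
         * locale every byte is treated as a single-byte character. */
        setlocale(LC_CTYPE, "");

        /* ... line editing ... */
        return 0;
    }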

@skull-squadron

skull-squadron commented Oct 13, 2022

@minfrin

Nice. I recall having to fork hirb-unicode to correctly draw text tables in an ActiveRecord ORM containing Hebrew and Japanese.

    #include <stddef.h>

    /* Encode a Unicode code point c (U+0000..U+10FFFF) as UTF-8.
     * Writes 1-4 bytes to b and returns the sequence length,
     * or 0 if c is out of range. */
    static size_t utf8_encode(unsigned int c, unsigned char b[4])
    {
        if (c < 0x80) {
            b[0] = c;
            return 1;
        } else if (c < 0x0800) {
            b[0] = (c >> 6  & 0x1F) | 0xC0;
            b[1] = (c       & 0x3F) | 0x80;
            return 2;
        } else if (c < 0x010000) {
            b[0] = (c >> 12 & 0x0F) | 0xE0;
            b[1] = (c >> 6  & 0x3F) | 0x80;
            b[2] = (c       & 0x3F) | 0x80;
            return 3;
        } else if (c < 0x110000) {
            b[0] = (c >> 18 & 0x07) | 0xF0;
            b[1] = (c >> 12 & 0x3F) | 0x80;
            b[2] = (c >> 6  & 0x3F) | 0x80;
            b[3] = (c       & 0x3F) | 0x80;
            return 4;
        }

        return 0;
    }

https://herongyang.com/Unicode/UTF-8-UTF-8-Encoding-Algorithm.html

The absolute range is U+0000 to U+10FFFF.

The UTF-16 surrogate range is U+D800 to U+DFFF, so never encode those.
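
Folding those two rules into one check gives the basic scalar-value test; everything beyond it (which code points are assigned, and their properties) is Unicode-version dependent:

    /* Unicode scalar value: within U+0000..U+10FFFF and not a UTF-16
     * surrogate (U+D800..U+DFFF). */
    static int is_unicode_scalar(unsigned int c)
    {
        return c <= 0x10FFFF && (c < 0xD800 || c > 0xDFFF);
    }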

And then it gets trickier.

What is valid depends on the Unicode version. Validation code can be mechanically generated from a list of all valid code points. Here's the latest database; it's XML, but it contains all of the painful-to-find equivalence-class metadata. Simplicity, but not so simple as to be incorrect. ;)

Here's how Microsoft C# / CLR counts UTF-8 code points.

@Artoria2e5

Artoria2e5 commented Oct 15, 2022

Well, uh... I don't think a line-editing program needs to concern itself with BOMs, or to know about any Unicode properties, let alone hard-code ranges of assigned Unicode code points. Heck, do we even need to check for surrogates for "valid" UTF-8 as opposed to WTF-8? If someone types one, it's their fault, and even that could have use on quirky filesystems (looks at Windows). Grapheme editing is also overrated -- being able to navigate individual code points in sequences is actually fun and not very distressing!

@troglobit
Owner

Whoa, a sudden burst of activity in this old thread! 😃

Lots of interesting suggestions coming in, so let me just reiterate: no external dependencies, and keep it simple. No need to go for full-blown Unicode support; it's fine to start with a release that only does UTF-8, possibly even requiring it to be explicitly enabled at build time.
