Skip to content
This repository has been archived by the owner on Jul 14, 2022. It is now read-only.

Pass source and target language to transliterators #14

Open
brawer opened this issue Jan 17, 2019 · 1 comment
Open

Pass source and target language to transliterators #14

brawer opened this issue Jan 17, 2019 · 1 comment

Comments

@brawer
Copy link

brawer commented Jan 17, 2019

Could a variant of osml10n_translit take what source and target language is being transliterated? Currently, the implementation always calls Any-Latin. Although this works as a fallback, ICU would return better quality if you were passing the source and target language. For background, see this talk. Here’s a few examples from the unit tests in Unicode CLDR, which is the upstream source for ICU’s transliterators. The transliteration IDs are in IETF BCP 47-T (RFC 6497) syntax, such as:

  • ja-t-es: Japanese, transliterated from Spanish (test cases, abanilla ⇒ アバニリャ)

  • ja-t-es-419: Japanese, transliterated from Latin American Spanish (test cases, abanilla ⇒ アバニヤ)

  • ka-Latn-t-ka-m0-bgn-1981: Georgian in Latin letters, transliterated from Georgian using the 1981 version of the BGN romanization (test cases, საყოველთაო ⇒ saqovelt’ao)

  • ka-Latn-t-ka-m0-bgn-2009: Georgian in Latin letters, transliterated from Georgian according to the 2009 version of the BGN romanization (test cases, საყოველთაო ⇒ saqʼoveltao)

Implementation detail: With the current ICU version, you’d unfortunately have to mangle BCP47-T identifiers before creating the ICU transliterator. Future ICU versions will expose BCP47-T identifiers to outside callers, but I can’t promise when this will be deployed; there’s higher priority bugs. However, the capability to select particular transforms has always been present in ICU. This name mangling is an implementation detail which could be hidden from callers. My recommendation for osml10n_translit would be to accept an optional parameter with an IETF BCP47-T identifier; their syntax has been standardized in RFC 6497. ICU can be instructed to fall back to Any-Latin in case the requested transliterator is unavailable.

Performance: Consider caching the ICU Transform objects across invocations. In the current implementation, a new icu::Transform gets created and disposed for every transliterated label. This is actually quite expensive; it would be faster to reuse the Transforms. I don’t know whether Postgres is multi-threaded; if it is, it would be necessary to use a mutex around icu::Transform::transliterate() because the current ICU implementation is not thread-safe. (It’s perfectly safe to simultaneously call this method on different icu::Transform instances, but only one single thread at a time should call transliterate() on the same icu::Transform object). Despite the locking, it’ll be much faster to reuse Transform objects across multiple calls. No locking is needed when each thread has its private set of icu::Transform objects; this could be done by putting a std::hash_map<transliterator_id, icu::Transform*> into thread-local storage.

Sorry for not just sending you a patch to implement this change; I know nothing about Postgres internals. But for someone familiar with the codebase this should be quite an easy change, with a noticeable improvement in transliteration quality. However, do tell if you want me to write a C++ function for converting BCP 47-T syntax to ICU’s legacy identifiers; I could do that for you.

@giggls
Copy link
Owner

giggls commented Jan 17, 2019

Did you have a look at my talks? Have a look at the wiki if not. Currently the whole code is based on the implicit assumption that the target language is something written in latin script. It will likely break with any other target language anyway.

So this boils down to the question if there would be a benefit to replace Any by something more specific.

An Optional Parameter for osml10n_translit would only make sense if the geo_transliterate function would also be able to call it in a more specific way. What I can provide there is the country-code where the geographic object (which name we are about to transliterate) is located.

So can you provide a function which can do some mapping between a country-code and the "IETF BCP47-T identifiers" needed?

Concerning Performance: I know that it is not a good Idea to create a new instance of the ICU translator on any call, but I have no idea if the PostgreSQL Interface for stored procedures does provide a way to re-use instances of older calls. This is a plain C Interface after all.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants