Feature Request: Add ability to set custom replacements #54

atlowell-smpl · 2020-03-25T22:14:18Z

This is a feature request to add a library function that sets custom character replacement mappings. The function could be called along these lines:

unidecode.setReplacement('™', "(TM)")

after which subsequent calls to unidecode would replace '™' with "(TM)" rather than the default "(tm)".

For my particular use case, this would be useful for standardizing names across data sources. For instance, if one source has the name "Acme™" and the other has the name "Acme", I could make a call to replace all instances of '™' with "", which allows me to easily compare across sources.

However, I can see many other use cases for this as well. For example, the FAQ mentions that localization is not supported and that certain German characters are transliterated to English text rather than German phonetics. A call like this would let programmers build their own "localization" libraries and define whatever mappings they like.

youssefavx · 2020-05-27T21:25:33Z

I would love this. I'd love to be able to replace something like: [?] with an empty space instead.

avian2 · 2020-05-28T07:02:44Z

I think @atlowell-smpl's use case is something that falls outside of Unidecode's concern. It's easy to do such string replacements yourself before a call to Unidecode.
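As a rough illustration of the "replace first, then call Unidecode" approach described above, here is a minimal sketch. The helper name `pre_replace` and the mapping `GERMAN` are hypothetical and not part of Unidecode. The `ImportError` fallback is only there so the sketch runs without the package installed. It relies on the replacement strings being ASCII, which Unidecode passes through unchanged.

```python
from typing import Mapping

try:
    from unidecode import unidecode  # third-party package "Unidecode"
except ImportError:
    # Assumption for this sketch only: a no-op stand-in so the example
    # still runs when Unidecode is not installed.
    def unidecode(string: str) -> str:
        return string


def pre_replace(text: str, custom: Mapping[str, str]) -> str:
    """Apply custom single-character replacements, then transliterate.

    ``custom`` maps one character to its (ASCII) replacement string.
    """
    # str.maketrans accepts a dict whose keys are single characters and
    # whose values are replacement strings of any length.
    table = str.maketrans(dict(custom))
    return unidecode(text.translate(table))


# Hypothetical mapping for German-style transliteration.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

print(pre_replace("Acme™", {"™": ""}))                  # Acme
print(pre_replace("Müller™", {**GERMAN, "™": "(TM)"}))  # Mueller(TM)
```

Because the custom replacements run before Unidecode sees the text, they take priority over the built-in tables for those characters.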

I agree that the characters that get replaced with a [?] are a problem. It's something that #53 already tried to address (reading my comments there I think I initially misunderstood the goal of that pull request).

Before code for that can be added, the character tables first need to be updated and unknown characters marked (probably with None instead of [?]).

avian2 · 2021-01-08T15:46:21Z

I've implemented the feature to specify custom replacement strings for characters that are unknown to Unidecode. For example, this will use an ASCII space to replace characters that are not present in Unidecode's tables.

>>> unidecode('[\ue000]', errors='replace', replace_str=' ')
'[ ]'

Tatsh · 2024-09-28T23:11:30Z

Hacky way to do this:

from typing import TypeVar, cast

_T = TypeVar("_T")  # generic type narrowed by assert_not_none


def assert_not_none(var: _T | None) -> _T:
    """
    Assert the ``var`` is not None and return it.
    
    This will remove ``| None`` from type ``_T``.
    """
    assert var is not None
    return var


def add_custom_replacement(find: str, replace: str) -> None:
    from unidecode import Cache, unidecode  # noqa: PLC0415
    unidecode(find)  # Force it to load the module
    codepoint = ord(find)
    section = codepoint >> 8
    position = codepoint % 256
    new_section = cast(list[str | None],
                       (Cache[section] if isinstance(Cache[section], list) else
                        (list(assert_not_none(Cache[section])) if Cache[section] is not None else
                         [None for _ in range(position + 1)])))  # convert to mutable type
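    # Pad tables that are shorter than the target position so the index below is valid.
    new_section.extend([None] * (position + 1 - len(new_section)))
    assert len(new_section) > position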
    new_section[position] = replace
    Cache[section] = new_section

Edit: made it type-safe.
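To make the `section` / `position` arithmetic in the snippet above easier to follow, here is a small standalone sketch. The helper name `table_location` is hypothetical. The `x%03x` module naming reflects my understanding of how Unidecode lays out its per-section table files, so treat it as an assumption rather than documented API.

```python
def table_location(char: str) -> tuple[int, int, str]:
    """Return (section, position, assumed table module name) for one character."""
    codepoint = ord(char)
    section = codepoint >> 8    # high bits select the 256-entry table
    position = codepoint % 256  # low byte is the index inside that table
    return section, position, "x%03x" % section


# '™' is U+2122: section 0x21 (33), position 0x22 (34)
print(table_location("™"))  # (33, 34, 'x021')
```

This is why the hack has to copy the whole section into a mutable list before changing one entry: the replacement for a single character lives at one index inside a shared per-section table.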
