emoji: Process emoji keywords for flags. #31

wkyoshida · 2023-04-12T00:34:49Z

Contributor checklist

This pull request is on a separate branch and not the main branch
I have updated the CHANGELOG with a description of my changes for the upcoming release (if necessary)

Description

"🇩🇪": {
  "default": [
    "Flagge"
  ],
  "tts": [
    "Flagge: Deutschland"
  ]
},

Reference the example above ⬆️
We are currently getting emoji keywords from the JSON list under "default". Issue with flag emojis is that the only keyword in that list is the word "Flag" (or "Flagge" in the case of German) and not the corresponding territory of the flag, e.g. "Deutschland". That detail is in the JSON list under "tts". This change adds some quick regex to extract the territory name, "Deutschland" in the example, and add it as another keyword.

Quick note: CLDR has a distinction between flags and tags. Tags are more for subdivision/regional flags. There are only three that are officially supported it seems, England, Scotland, and Wales.

Related issue

I'd say this is perhaps a quick addition to the MVP for the emoji data done for Emoji data for Scribe apps #14

andrewtavis

Only thing that doesn't make sense here is the limited selection of possible tags 🤦‍♂️🤦‍♂️😊 Thanks for this, @wkyoshida! Will generate the data now and use all this as a basis for the numbers in #32 :)

andrewtavis · 2023-04-12T19:42:59Z

@wkyoshida, did a bit of a rookie mistake here and didn't try to run it before merging... 😅 Read it and it made sense 😇 I commented out the code so I could do #32, but then when I try to run gen_emoji_lexicon within gen_emoji_lexicon.ipynb I get the following:

---> 92 emoji_flags = Char.getBinaryPropertySet(UProperty.RGI_EMOJI_FLAG_SEQUENCE)
     93 emoji_tags = Char.getBinaryPropertySet(UProperty.RGI_EMOJI_TAG_SEQUENCE)
     94 regexp_flag_keyword = re.compile(r".*\: (?P<flag_keyword>.*)")

AttributeError: type object 'icu.Char' has no attribute 'getBinaryPropertySet'

Do you have any ideas on this? Really happy to get the emojis in the table so we have some documentation of all the work we're putting in here, btw! 😊

wkyoshida · 2023-04-12T22:33:18Z

Only thing that doesn't make sense here is the limited selection of possible tags

I know! Those three definitely make sense as cases, but there definitely are others that should qualify too.

Do you have any ideas on this?

Hmm, so icu is actually pulled down as PyICU when it is installed. Here it is on pypi. It doesn't look like there's been a new release recently, and the latest is still 2.10.2.

The output of pip list for me gives me:

...
PyICU                        2.10.2
...

So that looks right.

You may recall that PyICU does have some oddities as far as installing though, unfortunately.. There are other things that have to be installed as well. Would you by chance be using a different device or environment? Just want to check so that we cover our bases.

andrewtavis · 2023-04-12T23:08:28Z

I'm on the same device and the same environment supposedly, but then your comment reminds me that randomly my local env was missing some packages the other day 🤷‍♂️ Have it set up to auto activate on entering the dir, so that was a surprise. Would you say I should just reinstall the whole env as a first check?

wkyoshida · 2023-04-12T23:18:49Z

Could perhaps be a good sanity check at least, particularly since packages missing have happened recently too. If the issue is still there after, we can look at what else it could be.

andrewtavis · 2023-04-13T17:13:11Z

Hey @wkyoshida 👋 Deleted and remade the conda env and am still getting the issue 🤷‍♂️ I also am getting:

...
PyICU                        2.10.2
...

from pip list, similarly for conda list 🤔 Any thoughts of what to try next? :)

wkyoshida · 2023-04-13T22:52:17Z

😭
Well... hmm WHAT happened? 😆

Let's see - So when you installed the dependencies for PyICU, did you go with Homebrew to install or one of the other installation methods (from these instructions)? Despite being on Ubuntu, I went with Homebrew. Running brew list pkg-config icu4c, it looks like I have versions 0.29.2_3 and 71.1, respectively.

One thing that I could try is spin up an Ubuntu Docker container to get a clean slate, do some partial installations, and see if I can mimic what you're seeing perhaps.

andrewtavis · 2023-04-14T07:41:37Z

0.29.2_3 and 72.1 on my end 🤔 But then I do think I need to screw around with the C installation a bit more 🤔 Would a dockererized container make sense for Scribe-Data at this point though? Would need to figure out how to get them installed as selectable envs within Jupyter though.

andrewtavis · 2023-04-14T19:49:03Z

I played around with the installation a bit more and I'm still at a standstill with type object 'icu.Char' has no attribute 'getBinaryPropertySet'.

Well... hmm WHAT happened? 😆

This is new since the addition of Char.getBinaryPropertySet. Not sure why it's working on yours and not mine though. A basic thing that we could do for now to make sure it's included is that you could run the process in gen_emoji_lexicon.ipynb to update the JSONs and I can send them over to iOS?

Any other suggestions you have would be great 😊

wkyoshida · 2023-04-15T04:09:49Z

Would a dockererized container make sense for Scribe-Data at this point though?

Hmm I'm not sure. A container could help for development definitely, since we would both be able to use the same OS/environment (other contributors also). However, this wouldn't resolve the issue for users, I think. I'm guessing that the differences are coming from installation process differences due to OS/env. To make Scribe-Data a general-use package, I think that cross-platform usability should be something to aspire for. So kind of a bummer that we're running into issues with PyICU..

I played around with the installation a bit more and I'm still at a standstill... Not sure why it's working on yours and not mine though.

Yeah 😞 I spun up an Ubuntu container so I could get a clean slate to try some things too. Sadly, I wasn't able to actually get a working installation via Homebrew 🤔. I did get it working via the Debian option though. TBH I don't actually recall which installation option I went with months ago, so maybe I actually installed via Debian and then later just ran the brew commands? Sounds like something I might've done.

Also, it does actually seem that a non-insignificant number of people have run into issues installing on Mac sadly - just looking at how more wordy the installation instructions for Mac is. Might notice also that there are links to GitLab Issues with additional asides for Mac too.

For my sanity lol would you be able to share what steps you took? Also, have you tried the Macports option (no idea if it'd be different, but who knows)?

A basic thing that we could do for now to make sure it's included is that you could run the process in gen_emoji_lexicon.ipynb to update the JSONs and I can send them over to iOS?

I guess we could. I would like to figure out what's happening of course, because I am considering that, if PyICU ends up being a dud, a worst-case scenario could be us having to look at a different option to get the same data. So there is a small chance of some unpleasant rework. But I'll get a PR with the JSONs ready tomorrow, and we can keep it on standby if it comes to do that.

andrewtavis · 2023-04-15T10:21:59Z

For my sanity lol would you be able to share what steps you took? Also, have you tried the Macports option (no idea if it'd be different, but who knows)?

Sure thing! And I haven't tried the Macports option, but then looking at it I was like 🤨

I deleted the conda venv that I use conda env remove -n scribe-data-dev
I remade it with conda create --name scribe-data-dev python=3.9 --no-default-packages
I installed some basic packages that I like to work with: ipykernel, jupyter and nb_conda_kernels
I went through and installed the packages with conda or pip depending on where they're found in environment.yml (I know I can do it in one line, but I was going one by one to be cautious)
I think at this point I installed via pip first and then with the suggested way after removing uninstalling, but I just went back and did it all again and installed via pip install --no-binary=:pyicu: pyicu
- I had ran brew install pkg-config icu4c before this
- I also ran export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH" as suggested
- I also ran export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/usr/local/opt/icu4c/lib/pkgconfig" as suggested
- I've tried with both export CC="$(which gcc)" CXX="$(which g++)" and unset CC CXX as options, but I think the second makes more sense in my case

Still the same error: AttributeError: type object 'icu.Char' has no attribute 'getBinaryPropertySet'. Let me know if anything in the above sticks out! 🙏 Only thing I can think of is that conda is problematic in all this. I'm coming from data science, so it's been my go to for environments.

I would like to figure out what's happening of course, because I am considering that, if PyICU ends up being a dud, a worst-case scenario could be us having to look at a different option to get the same data. So there is a small chance of some unpleasant rework. But I'll get a PR with the JSONs ready tomorrow, and we can keep it on standby if it comes to do that.

We'll figure this out somehow! 😊 And thanks for the willingness to send along the files for release purposes 🚀

andrewtavis · 2023-04-16T17:22:32Z

Let's also not worry about getting this data in here right this moment, @wkyoshida :) I definitely have releasing today in mind as I've got loads for next week and also want to make sure the new stuff is up by the time I go to my Swift meetup on Wednesday 😅😄

We can take a look at it all and do some edits for a patch release in the coming weeks 😊 Making an issue for this might make sense just so we have it organized. There's also already some emoji stuff that I'm seeing that we might want to find a way to avoid — specifically the usage data is putting some emojis that some people wouldn't want to see all the time as suggestions (🤰 for woman being the most glaring one, which does make sense as that emoji would likely be used more than 👩, but ya...). Switching it over so that we get the base emojis for sure and then add on to them based on rank would likely be best, if it's possible.

Edit: I'm actually gonna go through and do a removal of some emojis in the process that have to do with relationships and such, as 👨‍❤️‍💋‍👨 is also showing up for man. We're pissing off all sides at this point 😅😊

andrewtavis · 2023-04-16T19:50:55Z

@wkyoshida, I'm gonna put it simply 😅 The creation of emoji_utils.py was one of the worst coding experiences of my life 😄 My autoformat changing emojis into their respective combinations and then undo reformatting and making it even worse meant that I had to restart soooo many times 🤦‍♂️😊 Let me know if you think that anything else should be added though. I think that what's there to be removed and what it leaves us with is ok.

wkyoshida · 2023-04-17T02:18:56Z

Hey @andrewtavis!

So I was actually looking to ready the data yesterday, but then I noticed that the generated .jsons I was getting actually had additional differences beyond simply the flags/tags data. When I generate them, the .jsons do process only for the base emojis. The .jsons that were previously saved in the repo, however, also have emoji variants included. Are the saved .jsons the ones from when you only generate them via the notebook? If so, I guess I'm confused now about this additional ICU behavior difference that we're seeing...

This is perhaps why I was also not understanding when you would point out things like:

Switching it over so that we get the base emojis for sure ... if it's possible.

Cause that IS the behavior that I see already 😆

Making an issue for this might make sense just so we have it organized.

Agreed, I was thinking the same. Wanted to avoid having to, though, if the problem was quick, but it looks like there might be some work here. #33 is a start

There's also already some emoji stuff that I'm seeing that we might want to find a way to avoid — specifically the usage data is putting some emojis that some people wouldn't want to see all the time as suggestions

I understand that some emojis may not be ones that some may not want to see all the time. I'm with you 👍 Although, to play devil's advocate perhaps, there is a reason why these emojis are higher up in the popularity rankings though, no? In the aggregate, people do use them. Also, I'd say that all-round filtering them out does eliminate these emojis from being used in other cases where they could be more universally appropriate. Another thing is simply cultural or contextual differences. The suggested emojis, in many cases, do end up being the same across languages. However, there are instances when they differ. For instance, 💏 does appear for the German "mann", but it also does for the French "couple".

Perhaps what I'm getting at is that we might want to be careful. Excluding something that some people might not want may mean excluding something that others expect. That is why I thought going off the standard (e.g. 🏴󠁧󠁢󠁥󠁮󠁧󠁿 🏴󠁧󠁢󠁳󠁣󠁴󠁿 🏴󠁧󠁢󠁷󠁬󠁳󠁿, tags with official support) and off researched usage (e.g. the popularity data) could be a decent baseline. However, my gut is that we might want to think through custom decisions thoroughly.

The creation of emoji_utils.py was one of the worst coding experiences of my life 😄

Yea.. 😄 I experienced that too as I was doing some of the emoji investigations a while back. Messing with Unicode can get frustrating.

Also, thanks for sharing the installation steps you took! I'll think through them if there's anything that jumps out at me.

andrewtavis · 2023-04-17T07:06:32Z

The .jsons that were previously saved in the repo, however, also have emoji variants included. Are the saved .jsons the ones from when you only generate them via the notebook? If so, I guess I'm confused now about this additional ICU behavior difference that we're seeing...

The current JSONs come from the most recent runs of the process in the notebooks. In thinking about it, 👩 was definitely something that we were getting before all of this, so I think that in matching emojis to remove the "controversial" ones we're actually deleting these as well as 🤰 might match with 👩. I don't know 🤷‍♂️ Assuming that's the case I'll check the process and see if I can remove things in a way that we're using the code and not the emoji.

The only other major thing I've changed is the export_base_rank option, which shouldn't be causing any problems:

Scribe-Data/src/scribe_data/extract_transform/process_unicode.py

Line 39 in 5969cde

export_base_rank=False,

I was getting base emojis before adding that in though, and it's solely to keep the size down as the dictionaries do maintain the rank on export to the JSONs.

Although, to play devil's advocate perhaps, there is a reason why these emojis are higher up in the popularity rankings though, no? In the aggregate, people do use them. Also, I'd say that all-round filtering them out does eliminate these emojis from being used in other cases where they could be more universally appropriate. Another thing is simply cultural or contextual differences. The suggested emojis, in many cases, do end up being the same across languages. However, there are instances when they differ. For instance, 💏 does appear for the German "mann", but it also does for the French "couple".

I 100% agree. Makes sense that some of these would be more popular, and personally speaking removing any of the 💏 style emoji was unnecessary as far as my usage of a keyboard. My intention now is to remove them at a time when I think that they're showing up in cases that the user would not expect them — so for woman, man, etc let's try to get 👩, etc in there — but the intention is to add them back in. They'll also of course be added into the app in scribe-org/Scribe-iOS#258 😊

Perhaps what I'm getting at is that we might want to be careful. Excluding something that some people might not want may mean excluding something that others expect. That is why I thought going off the standard (e.g. 🏴󠁧󠁢󠁥󠁮󠁧󠁿 🏴󠁧󠁢󠁳󠁣󠁴󠁿 🏴󠁧󠁢󠁷󠁬󠁳󠁿, tags with official support) and off researched usage (e.g. the popularity data) could be a decent baseline. However, my gut is that we might want to think through custom decisions thoroughly.

Very much so! The ideality is that we delete emoji_utils.py. Big thing is that when demoing Scribe I also use "woman" as an example for that which would be "die", "la", etc, and I really don't want to have conversations with people about how the 🤰 isn't appropriate. For now at least :) Custom decisions might be made in the final version of this, but that's also the case for Apple as for some of these things they definitely seem to be hand picking the emojis that get suggested.

I'll look into this all a bit more before releasing 😊 You can likely tell that it didn't go through last night, which I'd say was the right decision as last night I was way too tired to do the proper checks. Thanks for thinking through all this with me! 🙏

andrewtavis · 2023-04-17T16:08:13Z

f9df300 did a simple change to make the emoji comparison happen with unicode instead, which then prevented us from losing 👩 and the like. I'm happy with the autosuggestions/completions as of now, but with that understand that this solution is definitely WIP :)

wkyoshida · 2023-05-12T00:29:49Z

Very much so! The ideality is that we delete emoji_utils.py. Big thing is that when demoing Scribe I also use "woman" as an example for that which would be "die", "la", etc, and I really don't want to have conversations with people about how the 🤰 isn't appropriate. For now at least :) Custom decisions might be made in the final version of this, but that's also the case for Apple as for some of these things they definitely seem to be hand picking the emojis that get suggested.
...
I'm happy with the autosuggestions/completions as of now, but with that understand that this solution is definitely WIP :)

Gotchu! That makes sense; I think that the aspect with the demoing concerns are valid. For clarity, I'm definitely not opposed to custom decisions; just wanted to point out to think through them carefully too. With the emoji suggestions being WIP as they are though as you said, agreed that we'll continue to refine.

emoji: Process emoji keywords for flags.

64f199f

andrewtavis self-requested a review April 12, 2023 18:09

andrewtavis approved these changes Apr 12, 2023

View reviewed changes

andrewtavis merged commit 0d8de8a into scribe-org:main Apr 12, 2023

wkyoshida mentioned this pull request Apr 17, 2023

Inconsistent PyICU behavior #33

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

emoji: Process emoji keywords for flags. #31

emoji: Process emoji keywords for flags. #31

wkyoshida commented Apr 12, 2023

andrewtavis left a comment

andrewtavis commented Apr 12, 2023 •

edited

Loading

wkyoshida commented Apr 12, 2023

andrewtavis commented Apr 12, 2023

wkyoshida commented Apr 12, 2023

andrewtavis commented Apr 13, 2023

wkyoshida commented Apr 13, 2023

andrewtavis commented Apr 14, 2023

andrewtavis commented Apr 14, 2023

wkyoshida commented Apr 15, 2023

andrewtavis commented Apr 15, 2023

andrewtavis commented Apr 16, 2023 •

edited

Loading

andrewtavis commented Apr 16, 2023 •

edited

Loading

wkyoshida commented Apr 17, 2023

andrewtavis commented Apr 17, 2023

andrewtavis commented Apr 17, 2023 •

edited

Loading

wkyoshida commented May 12, 2023

emoji: Process emoji keywords for flags. #31

emoji: Process emoji keywords for flags. #31

Conversation

wkyoshida commented Apr 12, 2023

Contributor checklist

Description

Related issue

andrewtavis left a comment

Choose a reason for hiding this comment

andrewtavis commented Apr 12, 2023 • edited Loading

wkyoshida commented Apr 12, 2023

andrewtavis commented Apr 12, 2023

wkyoshida commented Apr 12, 2023

andrewtavis commented Apr 13, 2023

wkyoshida commented Apr 13, 2023

andrewtavis commented Apr 14, 2023

andrewtavis commented Apr 14, 2023

wkyoshida commented Apr 15, 2023

andrewtavis commented Apr 15, 2023

andrewtavis commented Apr 16, 2023 • edited Loading

andrewtavis commented Apr 16, 2023 • edited Loading

wkyoshida commented Apr 17, 2023

andrewtavis commented Apr 17, 2023

andrewtavis commented Apr 17, 2023 • edited Loading

wkyoshida commented May 12, 2023

andrewtavis commented Apr 12, 2023 •

edited

Loading

andrewtavis commented Apr 16, 2023 •

edited

Loading

andrewtavis commented Apr 16, 2023 •

edited

Loading

andrewtavis commented Apr 17, 2023 •

edited

Loading