What constitutes an acceptable keyword? #194

Closed
jacobwhall opened this issue Jul 2, 2021 · 12 comments · Fixed by #226

@jacobwhall

First of all, thank you for maintaining this repository!

I wrote a rudimentary emoji search program using your data, and noticed that, for example, "poop" does not match any of the keywords for 💩:

"💩": [
"pile_of_poo",
"hankey",
"shitface",
"fail",
"turd",
"shit"
],

There are a lot of other poop synonyms listed here, so I feel that "poop" would be an uncontroversial addition. But there are many synonyms for poop, and we might not want to include them all?
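To make the mismatch concrete, here is a minimal sketch of a substring search over the keyword data. The `searchEmoji` helper is hypothetical (it is not part of emojilib), and the sample map contains only the 💩 entry quoted above.

```javascript
// Sample data copied from the 💩 entry quoted above.
const emojiKeywords = {
  "💩": ["pile_of_poo", "hankey", "shitface", "fail", "turd", "shit"],
};

// Hypothetical helper: return every emoji that has at least one keyword
// containing the query as a case-insensitive substring.
function searchEmoji(keywordsByEmoji, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return [];
  return Object.entries(keywordsByEmoji)
    .filter(([, keywords]) =>
      keywords.some((keyword) => keyword.toLowerCase().includes(needle))
    )
    .map(([emoji]) => emoji);
}

console.log(searchEmoji(emojiKeywords, "poo"));  // → ["💩"] (matches "pile_of_poo")
console.log(searchEmoji(emojiKeywords, "poop")); // → [] (no keyword contains "poop")
```

Even with partial matching, "poop" finds nothing, because no existing keyword contains it.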

Another example I ran into was for 📱:

emojilib/dist/emoji-en-US.json

Lines 7259 to 7265 in f3169dc

"📱": [
"mobile_phone",
"technology",
"apple",
"gadgets",
"dial"
],

The first phrase I'd say if you asked me to identify this emoji is "cell phone." However, none of the keywords for this emoji would match "cell." Would it be appropriate to add "cell," "cell_phone," or "cellular_phone?" Are non-official keywords that use underscores OK, or should substrings like "phone" be added as well as "mobile_phone?"

Finally, and I write this sincerely, I'd like to discuss 🍆:

emojilib/dist/emoji-en-US.json

Lines 4269 to 4275 in f3169dc

"🍆": [
"eggplant",
"vegetable",
"nature",
"food",
"aubergine"
],

This emoji is often used to signify a penis. Would it be acceptable to add "dick" or "penis" to the list of associated keywords for this emoji? I think that doing so would better reflect common usage, but might stray too far from Unicode's "intended use" for the emoji (if that's a thing).

I suggest that a section be added to CONTRIBUTING.md or README.md that gives guidance to future contributors about questions like these.

…and that's how I posted a GitHub issue about poop, cell phones, and penises 🤪

@muan
Owner

muan commented Feb 3, 2022

Hey, sorry for the lack of response. I was largely away last year.

TBH I have not thought about this at length, but I agree with what you've written here. If pull requests were sent for these keywords, I'd accept them all.

> I suggest that a section be added to CONTRIBUTING.md or README.md that gives guidance to future contributors about questions like these.

I agree. I'd be happy to accept a PR for this if anyone's willing to send one.

@thdoan

thdoan commented Jul 5, 2022

@jacobwhall I'm planning to fork this and start an emoji autocomplete project also. Have you settled on a fast way to search through the aliases? I was thinking about doing something like a filter, but not sure if there are faster options out there.

UPDATE: I did some performance tests, and I think for best performance I'm going to flatten the arrays into strings -- finding partial text matches in strings is faster than doing the same operation on arrays.

https://jsbench.me/zql58n0oew/1

When doing a partial match on every keystroke, every bit of performance counts ^^.
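A minimal sketch of the flattening idea described above, assuming a hypothetical `buildIndex`/`searchIndex` pair (these names are not from the benchmark). Each keyword array is joined once into a single string, so every keystroke only needs one `includes` call per emoji. Whether this is actually faster depends on the engine and data, so treat the benchmark link as the source for performance numbers.

```javascript
// Sample data copied from the entries quoted earlier in this thread.
const emojiKeywords = {
  "💩": ["pile_of_poo", "hankey", "shitface", "fail", "turd", "shit"],
  "📱": ["mobile_phone", "technology", "apple", "gadgets", "dial"],
};

// Build once: flatten each keyword array into one lowercase string.
// A newline is used as the separator on the assumption that a single-line
// search box never produces one, so a query cannot match across two keywords.
function buildIndex(keywordsByEmoji) {
  return Object.entries(keywordsByEmoji).map(([emoji, keywords]) => ({
    emoji,
    haystack: keywords.join("\n").toLowerCase(),
  }));
}

// Run on every keystroke: one substring test per emoji.
function searchIndex(index, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return [];
  return index
    .filter((entry) => entry.haystack.includes(needle))
    .map((entry) => entry.emoji);
}

const index = buildIndex(emojiKeywords);
console.log(searchIndex(index, "phone")); // → ["📱"]
console.log(searchIndex(index, "cell"));  // → []
```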

@jacobwhall
Author

@thdoan sounds like you've done as much as I have. I wrote an emoji picker in Python that you're welcome to check out. The search works surprisingly well!

@thdoan

thdoan commented Jul 8, 2022

@jacobwhall cool, I'm experimenting with an emoji autocomplete by leveraging the browser's native datalist functionality. However, I've decided to start my emojis map from scratch based on https://emojipedia.org/ (all tedious manual work since they closed their API). We'll see how it goes.

@JoshuaKGoldberg
Collaborator

+1, having docs on this would be great. I'm working on omnidan/node-emoji#132 to bring node-emoji to emojilib@3. The test cases in that draft PR are showing a lot of places where emojilib@3 removed conveniences the library relied on. For example, "heart" shows up in a few emojis, but not ❤️ itself:

"❤️": [
"red_heart",
"love",
"like",
"valentines"
],

I wrote a quick script to find discrepancies:

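// npm i emojilib-2@npm:emojilib@2 emojilib-3@npm:emojilib@3
// Run as an ES module, since the script uses top-level await.
const { lib: emojisV2 } = await import("emojilib-2");
const { default: emojisV3 } = await import("emojilib-3", {
  // Import attributes; older Node.js versions used `assert` here instead of `with`.
  with: { type: "json" },
});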

const missing = [];
const missingIgnoringAliases = [];

for (const [nameV2, detailsV2] of Object.entries(emojisV2)) {
  const detailsV3 = emojisV3[detailsV2.char];
  if (detailsV3?.includes(nameV2)) {
    continue;
  }

  const complaint = { nameV2, detailsV2, detailsV3 };
  missing.push(complaint);

  const primaryAlias = detailsV3?.[0];
  if (
    primaryAlias &&
    !/^(?:flag|two|smiling_face_with)_|_face$/.test(primaryAlias)
  ) {
    missingIgnoringAliases.push(complaint);
  }
}

console.table({
  "Missing in general": missing.length,
  "Missing ignoring a few quick aliases": missingIgnoringAliases.length,
});
┌──────────────────────────────────────┬────────┐
│               (index)                │ Values │
├──────────────────────────────────────┼────────┤
│          Missing in general          │  678   │
│ Missing ignoring a few quick aliases │  456   │
└──────────────────────────────────────┴────────┘

@muan is there a description anywhere of how #178's lists were generated? Or, if not, could you speak to how you generated it?

@muan
Owner

muan commented Sep 22, 2023

> @muan is there a description anywhere of how #178's lists were generated? Or, if not, could you speak to how you generated it?

I believe I had some hacked-together local scripts, so I don't recall the exact differences. But here's what might have happened:

Previously this project was built exclusively for GitHub shortcodes at our internal hackathon, and with v3 I decided to move away from that. So the primary key became the official Unicode names, which would explain why tada was replaced with party popper and poop was replaced by pile of poo.

IIRC, the official name of an emoji sometimes changes between versions too (gun -> water gun), which is why I made the character the key now.

I feel like I would/should have done the work to compare and keep the GitHub shortcodes, but I guess I did not.

So to add them all back, a name/alias comparison between GitHub's set and the Unicode set could potentially do the trick.
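One way such a comparison could look, as a rough sketch. The `findMissingShortcodes` helper is hypothetical, and the sample entries are illustrative: they only assume that the GitHub data is a list of objects with an `emoji` and an `aliases` array, and that the emojilib data maps each character to a keyword array. The 🎉 keyword list below is made up for the example.

```javascript
// Illustrative GitHub-style entries (assumed shape: { emoji, aliases }).
const githubEntries = [
  { emoji: "🎉", aliases: ["tada"] },
  { emoji: "💩", aliases: ["hankey", "poop", "shit"] },
];

// emojilib-style map keyed by character. The 💩 list is the one quoted
// earlier in this thread; the 🎉 list is a made-up placeholder.
const emojilibKeywords = {
  "🎉": ["party_popper", "celebration"],
  "💩": ["pile_of_poo", "hankey", "shitface", "fail", "turd", "shit"],
};

// Hypothetical helper: for each emoji, list the GitHub aliases that are
// not present among the emojilib keywords.
function findMissingShortcodes(entries, keywordsByEmoji) {
  const missing = {};
  for (const { emoji, aliases } of entries) {
    const known = new Set(keywordsByEmoji[emoji] ?? []);
    const absent = aliases.filter((alias) => !known.has(alias));
    if (absent.length > 0) missing[emoji] = absent;
  }
  return missing;
}

console.log(findMissingShortcodes(githubEntries, emojilibKeywords));
// → { "🎉": ["tada"], "💩": ["poop"] }
```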

@JoshuaKGoldberg
Collaborator

JoshuaKGoldberg commented Mar 18, 2024

OK! Sorry for taking so long on this - I wanted to really think through the problem space. As in: what's a "keyword"?

Using the 🛫 emoji as an example, I think there are really 2-3 use cases for emoji keywords:

  • 🆔 Identity: Where keywords can be used as either...
    • 🌕 Full Identity: Terms that are a complete alias or title for the emoji (e.g. airplane_departure)
    • 🌗 Partial Identity: Terms that can be a part of the complete identity of the emoji, but aren't standalone (e.g. airplane, departure)
  • 🔗 Relation: Terms that would relate to the emoji in searching, but aren't part of its identity (e.g. airport, taking)

Ideally I'd propose emojilib separate at least 🆔 identity from 🔗 relation keywords. Some users will want only identity, e.g. node-emoji's :shortcode: replacement. Some users will want the relation ones as well, e.g. general text searches.
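As a sketch of what that separation could look like in data, assuming a hypothetical entry shape with `identity` and `relation` arrays (not an existing emojilib format). The partial identity terms are derived here by splitting full names on underscores, which is an assumption that happens to fit the 🛫 example.

```javascript
// Hypothetical entry shape separating identity from relation keywords.
const airplaneDeparture = {
  emoji: "🛫",
  identity: ["airplane_departure"], // full identity
  relation: ["airport", "taking"],  // related search terms
};

// Derive partial identity terms from multi-word full names.
function partialIdentity(fullNames) {
  const parts = new Set();
  for (const name of fullNames) {
    const words = name.split("_");
    if (words.length > 1) {
      for (const word of words) parts.add(word);
    }
  }
  return [...parts];
}

// Consumers choose how much they want: identity only (e.g. shortcode
// replacement) or identity plus relation (e.g. general text search).
function keywordsFor(entry, { includeRelation = false } = {}) {
  const keywords = [...entry.identity, ...partialIdentity(entry.identity)];
  return includeRelation ? [...keywords, ...entry.relation] : keywords;
}

console.log(keywordsFor(airplaneDeparture));
// → ["airplane_departure", "airplane", "departure"]
console.log(keywordsFor(airplaneDeparture, { includeRelation: true }));
// → ["airplane_departure", "airplane", "departure", "airport", "taking"]
```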

+1 to @muan's suggestion in #194 (comment) of a comparison. I'd say a programmatic approach would be the easiest and least controversial option for emojilib. My proposal would be something like:

  • 🆔 Identity keywords should be sourced from the Unicode standard, Emojipedia also-known-as and title, and platform shortcodes
  • 🔗 Relation keywords should be sourced from the search terms defined for emoji in individual platforms

As for setting up that programmatic approach... we can get halfway there. I made a standalone emojipedia package to scrape & store the Emojipedia data for each emoji. That data includes 🆔 identity shortcodes across Discord, Emojipedia (based on the Unicode standard), GitHub, and Slack.

Looking at the data that's in emojipedia and/or emojilib@3 today on the 🛫 emoji, we can see that there are a lot of 🆔 identity keywords that are only in one of the two datasets but not both:

In Both 🌕
  • Full Keywords: airplane_departure
  • Partial Keywords: airplane, departure

Only in Emojilib 🌗
  • airport, flight, landing

Only in Emojipedia 🌓
  • Full Keywords: aeroplane_taking_off, airplane_taking_off, flight_departure, plane_taking_off
  • Partial Keywords: aeroplane, off, plane, taking

Full comparison on: https://github.com/JoshuaKGoldberg/repros/tree/emojilib-emojipedia-keywords-comparison.

My next task will be trying to similarly source the 🔗 relation keywords programmatically. That way we can make a script that populates emojilib data automatically. 🔗 Relation keywords aren't stored on Emojipedia that I can find, so I plan on trying to find exports of individual platforms' emoji libraries such as https://github.com/github/gemoji.

@JoshuaKGoldberg
Collaborator

Update: I have a proposal for your review now @muan! 🙌

Preview the full proposal of changes here: Proposed-all.html.

This follows what I proposed in the last comment: that emojilib's keywords be sourced from all associated words in Emojilib/Unicode and platforms we can scrape from. The ones I could easily access were: "fluemoji" (Fluent UI / Windows), "gemoji" (GitHub), and "twemoji" (Twitter).

Using 🛫 as an example, here's what that would look like:

Current:
  • airplane_departure, airport, flight, landing

Proposed:
  • aeroplane, aeroplane_taking_off, airplane, airplane_departure, airplane_taking_off, check-in, departure, departures, flight, flight_departure, off, plane, plane_taking_off, taking, vehicle

Proposed Changes:
  • ➕ Added: aeroplane, aeroplane_taking_off, airplane, airplane_taking_off, check-in, departure, departures, flight_departure, off, plane, plane_taking_off, taking, vehicle
  • ➖ Removed: airport, landing
  • ✔️ Unchanged: airplane_departure, flight

Full comparison and proposal tables on: https://github.com/JoshuaKGoldberg/repros/tree/emojilib-platforms-keywords-comparison.
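The added/removed/unchanged breakdown for 🛫 can be reproduced with a small set comparison. This is only a sketch with a hypothetical `diffKeywords` helper, not the script used to generate the proposal; the two input lists are the current and proposed 🛫 keywords from this comment.

```javascript
// Current and proposed 🛫 keywords, as listed in this comment.
const current = ["airplane_departure", "airport", "flight", "landing"];
const proposed = [
  "aeroplane", "aeroplane_taking_off", "airplane", "airplane_departure",
  "airplane_taking_off", "check-in", "departure", "departures", "flight",
  "flight_departure", "off", "plane", "plane_taking_off", "taking", "vehicle",
];

// Hypothetical helper: classify keywords as added, removed, or unchanged.
function diffKeywords(currentKeywords, proposedKeywords) {
  const currentSet = new Set(currentKeywords);
  const proposedSet = new Set(proposedKeywords);
  return {
    added: proposedKeywords.filter((keyword) => !currentSet.has(keyword)),
    removed: currentKeywords.filter((keyword) => !proposedSet.has(keyword)),
    unchanged: currentKeywords.filter((keyword) => proposedSet.has(keyword)),
  };
}

const diff = diffKeywords(current, proposed);
console.log(diff.added.length); // → 13
console.log(diff.removed);      // → ["airport", "landing"]
console.log(diff.unchanged);    // → ["airplane_departure", "flight"]
```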

Unless directed otherwise, I'll send a big PR updating the keywords in this repo... soon. Hopefully later this month.


Note that the following emojis have significantly fewer keywords in the proposed changes:

  • 🐦 went from 6 keywords to 1: bird
  • 🛃 went from 4 keywords to 1: customs
  • 🏜️ went from 4 keywords to 1: desert
  • 🐬 went from 9 keywords to 2: dolphin, flipper
  • 🐘 went from 6 keywords to 1: elephant
  • 🦍 went from 4 keywords to 1: gorilla
  • ⛰️ went from 4 keywords to 1: mountain
  • 🐙 went from 7 keywords to 1: octopus
  • ❇️ went from 6 keywords to 2: *, sparkle

None of the platforms in emoji-platform-data have more than 1-2 keywords for them. Adding a richer platform would fill those missing keywords back in. For example, searching the native macOS emoji picker for "sea" includes 🐙 in the results. I added emoji-platform-data issues labeled platform support.

@jacobwhall
Author

Thank you for your work on this @JoshuaKGoldberg

> Note that the following emojis have significantly fewer keywords in the proposed changes

I suggest that we integrate individual keyword contributions into this new workflow. I think it's worth retaining the keywords from this project for the example emojis you provided. Contributions to this project could continue to add common-sense keywords that may have been overlooked by Unicode, Emojipedia, etc.

@JoshuaKGoldberg
Collaborator

Makes sense! I sent #226 as a draft for reference that only augments, rather than removes.
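For illustration, an augment-only merge can be sketched as a de-duplicated union that keeps every existing keyword and appends whatever the platform data adds. The `augmentKeywords` helper below is hypothetical and is not taken from #226; the sample lists are the current and proposed 🛫 keywords from earlier in this thread.

```javascript
// Current and proposed 🛫 keywords from earlier in this thread.
const current = ["airplane_departure", "airport", "flight", "landing"];
const proposed = [
  "aeroplane", "aeroplane_taking_off", "airplane", "airplane_departure",
  "airplane_taking_off", "check-in", "departure", "departures", "flight",
  "flight_departure", "off", "plane", "plane_taking_off", "taking", "vehicle",
];

// Hypothetical helper: keep all existing keywords in their original order,
// then append proposed keywords that are not already present.
function augmentKeywords(currentKeywords, proposedKeywords) {
  return [...new Set([...currentKeywords, ...proposedKeywords])];
}

const merged = augmentKeywords(current, proposed);
console.log(merged.length);              // → 17
console.log(merged.includes("airport")); // → true (nothing is removed)
console.log(merged.slice(0, 4));
// → ["airplane_departure", "airport", "flight", "landing"]
```

A `Set` preserves insertion order, so the existing keywords stay first and in place.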

@yannickgloster

Is there any indication of when #226 will move out of draft and be merged? I'm interested in seeing this resolved for omnidan/node-emoji#132.

@pimjansen

Any progress on this, guys? I like the idea of having a strict workflow here instead of random keyword proposals.

@muan muan closed this as completed in #226 Sep 30, 2024