Centralizing the emoji keyword generation logic #379

Ekikereabasi-Nk · 2024-10-16T08:21:36Z

Contributor checklist

This pull request is on a separate branch and not the main branch

Description

This PR addresses issue #359 by centralizing the emoji keyword generation logic and simplifying the language-specific scripts to enhance maintainability, flexibility, and modularity across the project. The changes are structured into parts, focusing on minimizing code duplication and improving support for language-specific customizations like gender, region and grouped Languages (Hindustani (which contains both Urdu and Hindi) or Norwegian (which includes both Bokmål and Nynorsk) suggested by @KesharwaniArpita.

Centralize the Core Logic

The core functionality for generating emoji keywords, which is duplicated across most language folders, has now been moved to a single centralized file: src/scribe_data/unicode/generate_emoji_keyword.py

This file now handles the core logic, including the generation of emoji keywords and the export of formatted data. The centralized approach ensures that all shared functionality resides in one place, reducing the risk of inconsistencies.

Centralized Function: The generate_emoji_keyword function now resides in this centralized location. This function
takes the language, emojis_per_keyword, and file_path as core parameters and performs the keyword generation by
calling gen_emoji_lexicon and subsequently exporting the data using export_formatted_data.
Shared Logic: The logic for generating the emoji lexicon and handling the export process has been consolidated into
this centralized file, eliminating duplication of this logic across the language-specific modules.

Handle Language-Specific Customizations (Including Grouped Languages)

The generate_emoji_keyword function has been enhanced to accommodate language-specific customizations, including grouped languages.

Grouped Language Support: Languages like Hindustani (which contains both Urdu and Hindi) and Norwegian (which includes both Bokmål and Nynorsk) are now available.
- The function allows the processing of grouped languages by providing a list of sub-languages to process.
- If no sub-languages are specified, all sub-languages in the grouped language will be processed by default.
Customizations for Gender and Region:
- Additional optional parameters such as gender and region are now available.
- This functionality provides flexibility to customize emojis according to specific requirements (e.g., male/female emojis,
  regional differences like US or JP variants).

Modify Language-Specific Files

Each language emoji_keyword folder with a generate_emoji_keyword.py file under language_data_extraction can be updated by using:

The files can now import the centralized logic from the new file generate_emoji_keyword.py from the src/scribe_data/unicode/generate_emoji_keyword.py from from scribe_data.unicode.generate_emoji_keyword import generate_emoji_keyword.
The only language-specific customization left in these files is the setting of basic parameters like LANGUAGE, emojis_per_keyword, and path variables.

Refactor process_unicode.py

The process_unicode.py file has been refactored to integrate the centralized generate_emoji_keyword logic:

Centralized Emoji Lexicon Generation: Shared logic for emoji generation now resides in the gen_emoji_lexicon function, making it more consistent across language modules.
Support for Grouped Languages: The function handles grouped languages like Hindustani (Hindi and Urdu) and Norwegian (Bokmål and Nynorsk), processing sub-languages individually.
Gender and Region Customizations: Parameters for gender and region allow emoji customization based on cultural and demographic factors.
#ISSUE_NUMBER
-Simplify project's emoji keyword functionality #359

github-actions · 2024-10-16T08:22:07Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

Ekikereabasi-Nk · 2024-10-16T08:41:04Z

CC @andrewtavis @KesharwaniArpita @catreedle @DeleMike @VNW22. Please do review and contribute. Thank you

andrewtavis · 2024-10-16T23:23:19Z

src/scribe_data/unicode/generate_emoji_keyword.py

+    """
+    Generate emoji keywords for a specified language, with optional support for grouped languages (e.g., Hindustani = Hindi + Urdu).
+
+    Parameters:


As a start here, @Ekikereabasi-Nk, could I ask you to change the function docstrings to match those in the rest of the project :)

Alright @andrewtavis will do so. Thank you.

andrewtavis · 2024-10-16T23:24:05Z

src/scribe_data/unicode/generate_emoji_keyword.py

+    - sub_languages (list): A list of specific sub-languages for grouped languages (e.g., ["Hindi", "Urdu"]). If not provided, all sub-languages in the group will be processed.
+    """
+
+    # Define grouped languages and their sub-languages


And own line comments should have a period at the end of them. Inline comments don't need capitalization of the first word or punctuation, but those that are their own line are sentences.

catreedle · 2024-10-17T08:19:27Z

src/scribe_data/unicode/generate_emoji_keyword.py

+
+def generate_emoji_keyword(
+    language,
+    emojis_per_keyword,


I suggest setting emojis_per_keyword=3 as a default value. This way, we can reduce the need to specify the parameter each time. I think for most cases 3 would be a suitable default. Of course, this would still allow customization when necessary.

catreedle · 2024-10-17T08:40:45Z

...ibe_data/language_data_extraction/Hindustani/Hindi/emoji_keywords/generate_emoji_keywords.py

+# Set up the argument parser
+parser = argparse.ArgumentParser(
+    description="Generate emoji keywords for a specific language."
+)
+parser.add_argument(
+    "--file-path", required=True, help="Path to save the generated emoji keywords."
+)
+parser.add_argument(
+    "--sub-languages",
+    nargs="*",
+    help="List of specific sub-languages to process (e.g., Hindi Urdu). If omitted, all sub-languages will be processed.",
+)
+parser.add_argument(
+    "--gender",
+    choices=["male", "female", "neutral"],
+    help="Specify the gender for emoji customization.",
+)
+parser.add_argument(
+    "--region", help="Specify the region for emoji customization (e.g., US, IN)."
+)
+
+# Parse the command-line arguments


can we set up a common argument parsing function for this?

Thanks @catreedle I will take note

welcome! I'm not really sure of the differences between individual languages to decide if it is plausible, but it is something to consider. :)

@catreedle this will mean to create a modular function for the common parser file, most likely in a separate file that can be imported to each generate_emoji_keyword file for all language. Also @SethiShreya mention the is a parser function somewhere within the codebase, I have not seen it. Please you can help point it out for me to study. So I will create another file in the Unicode folder for this parser function related to generate_emoji_keyword for each language. Thank you

I can't find it either...
maybe something like this could work?

def setup_argument_parser(): parser = argparse.ArgumentParser(description="Generate emoji keywords for a specific language.") parser.add_argument( "--file-path", required=True, help="Path to save the generated emoji keywords." ) ... return parser

andrewtavis · 2024-10-18T01:15:10Z

Quick note being sent to all the testing PRs, if updates are needed now that #402 has been merged, then it'd be great to get those updates to the branch :) If no updates are needed, then let me know 😊

VNW22 · 2024-10-18T09:23:11Z

hey @Ekikereabasi-Nk @andrewtavis @catreedle
After reviewing the CLDR dataset, I explored the relevant scripts and verified emoji support across the available languages. Below are the languages that do not have emoji support:

Arabic
Czech
Esperanto
Gurmukhi
Hausa
Hebrew
Japanese
Kurmaji
Latin
Nigerian
Shimukhi
Slovak
Tajik
Tamil
Ukrainian
Urdu
Yoruba
The rest of the languages in the dataset appear to have emoji support.

Please feel free to recheck my findings or let me know if I missed anything.

andrewtavis · 2024-10-18T09:29:23Z

Amazing, @VNW22! What's the code for this like? Would we be able to make a workflow to make sure that there's no directory for emoji keywords directory for those language that don't have support? Might be something to add to the project structure check :) Are there also languages that do have support that we currently don't have in the project?

VNW22 · 2024-10-18T10:09:24Z

Amazing, @VNW22! What's the code for this like? Would we be able to make a workflow to make sure that there's no directory for emoji keywords directory for those language that don't have support? Might be something to add to the project structure check :) Are there also languages that do have support that we currently don't have in the project?

This is the python script i used;
import os
import re

def find_languages_with_emoji_support(root_dir):
languages_with_emoji = []

language_pattern = re.compile(r'LANGUAGE\s*=\s*"(.*?)"', re.IGNORECASE)

for root, _, files in os.walk(root_dir):
    for file in files:
        if file.endswith(".py"):  
            file_path = os.path.join(root, file)

            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()

                    if "gen_emoji_lexicon" in content or "emoji-keywords" in content:
                        match = language_pattern.search(content)
                        if match:
                            language = match.group(1)
                            languages_with_emoji.append(language)
                        else:
                            print(f"Warning: No language defined in {file_path}")
            except Exception as e:
                print(f"Error reading {file_path}: {e}")

return languages_with_emoji

root_directory = "./cldr-json/cldr-localenames-full/"

languages = find_languages_with_emoji_support(root_directory)
if languages:
print("Languages with emoji support:", languages)
else:
print("No languages with emoji support found.")

The code for finding languages without was not working, so I opted for this.
There are sooo many languages that do have support that are not in the project. The project has covered like 40%.

andrewtavis · 2024-10-18T10:28:04Z

Nice 😊 Let's definitely add this to the project structure workflow check once this is done. We also have functionality to update the CLDR data, so maybe new languages will be added and then this workflow will tell us to add emoji support 😇

Ekikereabasi-Nk · 2024-10-18T12:28:59Z

hey @Ekikereabasi-Nk @andrewtavis @catreedle After reviewing the CLDR dataset, I explored the relevant scripts and verified emoji support across the available languages. Below are the languages that do not have emoji support:

Arabic Czech Esperanto Gurmukhi Hausa Hebrew Japanese Kurmaji Latin Nigerian Shimukhi Slovak Tajik Tamil Ukrainian Urdu Yoruba The rest of the languages in the dataset appear to have emoji support.

Please feel free to recheck my findings or let me know if I missed anything.

Thank you @VNW22. @KesharwaniArpita I have update the the Centralizing the emoji keyword generation logic by creating a common_arg_parser function that can be imported for every languages that CLDR dataset support, please look at it. @catreedle Please check the recent commit and review. Thank you

catreedle · 2024-10-18T14:37:49Z

src/scribe_data/unicode/common_arg_parser.py

+def setup_arg_parser():
+    parser = argparse.ArgumentParser(
+        description="Generate emoji keywords for a specific language."
+    )
+    parser.add_argument(
+        "--file-path", required=True, help="Path to save the generated emoji keywords."
+    )
+    parser.add_argument(
+        "--sub-languages",
+        nargs="*",
+        help="List of specific sub-languages to process (e.g., Hindi Urdu). If omitted, all sub-languages will be processed.",
+    )
+    parser.add_argument(
+        "--gender",
+        choices=["male", "female", "neutral"],
+        help="Specify the gender for emoji customization.",
+    )
+    parser.add_argument(
+        "--region", help="Specify the region for emoji customization (e.g., US, IN)."
+    )
+    parser.add_argument(
+        "--emojis-per-keyword",
+        type=int,
+        help="Number of emojis to generate per keyword.",
+    )
+    return parser


this looks great! thank you for taking my suggestion into account. :)

Looks great @Ekikereabasi-Nk

Thanks for your work!!!

…r modularity

Ekikereabasi-Nk · 2024-10-19T23:17:28Z

Hi @andrewtavis @SethiShreya @KesharwaniArpita please review. Also check #440 if they are compatible. Thank you

KesharwaniArpita · 2024-10-20T08:08:44Z

src/scribe_data/unicode/process_unicode.py

+        This function serves as a foundational component for generating
+        emojis tailored to specific user demographics, and implementing it in a
+        modular fashion will support future enhancements and maintenance.
+        """


Makes Sense!! Does that mean there's a lot more scope for emoji generation by expanding the demographics? Just confirming. Actually sounds interesting. Can we do anything in reference to this? Maybe expand the emoji generation? What do you say @Ekikereabasi-Nk and @andrewtavis .
Thank you @Ekikereabasi-Nk for the work.

SethiShreya · 2024-10-21T05:51:24Z

@Ekikereabasi-Nk I was reading the generate_emoji_functionality file it still have args, and one more thing I saw is you are generating emoji_keywords for all the languages right? Why? Wasnt we suppose to get the parameters from main file and then generate for the language provided, @andrewtavis did I understand it right?

andrewtavis · 2024-10-24T14:44:20Z

Closing this in favor of #440, @Ekikereabasi-Nk :) Really appreciate the work here! Was just a bit easier to integrate the other PR as it directly addressed the current functionality without expanding into new features.

andrewtavis self-requested a review October 16, 2024 10:46

andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 16, 2024

This was referenced Oct 16, 2024

Expand Kurmanji data queries #229

Closed

Create Scribe-Data Yoruba data process queries #217

Closed

andrewtavis reviewed Oct 16, 2024

View reviewed changes

KesharwaniArpita mentioned this pull request Oct 17, 2024

Centralized Emoji Keyword Functionality call for All Languages #397

Merged

1 task

catreedle reviewed Oct 17, 2024

View reviewed changes

Ekikereabasi-Nk force-pushed the centralized-emoji-function branch from 254e9dc to 3c41c6d Compare October 17, 2024 14:42

Ekikereabasi-Nk force-pushed the centralized-emoji-function branch from 520d2b1 to 281b735 Compare October 18, 2024 10:54

catreedle reviewed Oct 18, 2024

View reviewed changes

Ekikereabasi-Nk added 10 commits October 19, 2024 23:53

Update generate_emoji_keyword.py with additional changes

8238e27

Update generate_emoji_keywords.py and created common_arg_parser.py fo…

8b05fb8

…r modularity

Update generate_emoji_keyword.py with additional changes

036c60d

Update src/scribe_data/unicode/generate_emoji_keyword.py

53206be

Updated centralized emoji logic

a72ab70

Update generate_emoji_keyword.py with additional changes

1209620

Update generate_emoji_keyword.py with additional changes

c152a31

Update generate_emoji_keyword.py with additional changes

f0feaac

Update generate_emoji_keywords.py and created common_arg_parser.py fo…

a733955

…r modularity

Update generate_emoji_keyword.py with additional changes

f602dc3

Ekikereabasi-Nk force-pushed the centralized-emoji-function branch from 85767b1 to f602dc3 Compare October 19, 2024 23:09

Delete Hindustani emoji_keyword.py file

84ef9a1

KesharwaniArpita reviewed Oct 20, 2024

View reviewed changes

Ekikereabasi-Nk mentioned this pull request Oct 21, 2024

Move language_data_extraction under /wikidata and lower case all languages #446

Closed

2 tasks

andrewtavis closed this Oct 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Centralizing the emoji keyword generation logic #379

Centralizing the emoji keyword generation logic #379

Ekikereabasi-Nk commented Oct 16, 2024 •

edited

Loading

github-actions bot commented Oct 16, 2024 •

edited by andrewtavis

Loading

Ekikereabasi-Nk commented Oct 16, 2024

andrewtavis Oct 16, 2024

Ekikereabasi-Nk Oct 17, 2024

andrewtavis Oct 16, 2024 •

edited

Loading

catreedle Oct 17, 2024

andrewtavis Oct 17, 2024

catreedle Oct 17, 2024

Ekikereabasi-Nk Oct 17, 2024

catreedle Oct 17, 2024

Ekikereabasi-Nk Oct 17, 2024 •

edited

Loading

catreedle Oct 17, 2024

andrewtavis commented Oct 18, 2024

VNW22 commented Oct 18, 2024

andrewtavis commented Oct 18, 2024

VNW22 commented Oct 18, 2024

andrewtavis commented Oct 18, 2024

Ekikereabasi-Nk commented Oct 18, 2024

catreedle Oct 18, 2024

KesharwaniArpita Oct 18, 2024

KesharwaniArpita Oct 18, 2024

Ekikereabasi-Nk commented Oct 19, 2024

KesharwaniArpita Oct 20, 2024

SethiShreya commented Oct 21, 2024

andrewtavis commented Oct 24, 2024

Centralizing the emoji keyword generation logic #379

Centralizing the emoji keyword generation logic #379

Conversation

Ekikereabasi-Nk commented Oct 16, 2024 • edited Loading

Contributor checklist

Description

Centralize the Core Logic

Handle Language-Specific Customizations (Including Grouped Languages)

Modify Language-Specific Files

Refactor process_unicode.py

github-actions bot commented Oct 16, 2024 • edited by andrewtavis Loading

Thank you for the pull request!

Maintainer checklist

Ekikereabasi-Nk commented Oct 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewtavis Oct 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Ekikereabasi-Nk Oct 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrewtavis commented Oct 18, 2024

VNW22 commented Oct 18, 2024

andrewtavis commented Oct 18, 2024

VNW22 commented Oct 18, 2024

andrewtavis commented Oct 18, 2024

Ekikereabasi-Nk commented Oct 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Ekikereabasi-Nk commented Oct 19, 2024

Choose a reason for hiding this comment

SethiShreya commented Oct 21, 2024

andrewtavis commented Oct 24, 2024

Ekikereabasi-Nk commented Oct 16, 2024 •

edited

Loading

github-actions bot commented Oct 16, 2024 •

edited by andrewtavis

Loading

andrewtavis Oct 16, 2024 •

edited

Loading

Ekikereabasi-Nk Oct 17, 2024 •

edited

Loading