Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Centralizing the emoji keyword generation logic #379

Conversation

Ekikereabasi-Nk
Copy link
Contributor

@Ekikereabasi-Nk Ekikereabasi-Nk commented Oct 16, 2024

Contributor checklist


Description

This PR addresses issue #359 by centralizing the emoji keyword generation logic and simplifying the language-specific scripts to enhance maintainability, flexibility, and modularity across the project. The changes are structured into parts, focusing on minimizing code duplication and improving support for language-specific customizations like gender, region and grouped Languages (Hindustani (which contains both Urdu and Hindi) or Norwegian (which includes both Bokmål and Nynorsk) suggested by @KesharwaniArpita.

Centralize the Core Logic

The core functionality for generating emoji keywords, which is duplicated across most language folders, has now been moved to a single centralized file: src/scribe_data/unicode/generate_emoji_keyword.py

This file now handles the core logic, including the generation of emoji keywords and the export of formatted data. The centralized approach ensures that all shared functionality resides in one place, reducing the risk of inconsistencies.

  • Centralized Function: The generate_emoji_keyword function now resides in this centralized location. This function
    takes the language, emojis_per_keyword, and file_path as core parameters and performs the keyword generation by
    calling gen_emoji_lexicon and subsequently exporting the data using export_formatted_data.

  • Shared Logic: The logic for generating the emoji lexicon and handling the export process has been consolidated into
    this centralized file, eliminating duplication of this logic across the language-specific modules.

Handle Language-Specific Customizations (Including Grouped Languages)

The generate_emoji_keyword function has been enhanced to accommodate language-specific customizations, including grouped languages.

  • Grouped Language Support: Languages like Hindustani (which contains both Urdu and Hindi) and Norwegian (which includes both Bokmål and Nynorsk) are now available.

    • The function allows the processing of grouped languages by providing a list of sub-languages to process.
    • If no sub-languages are specified, all sub-languages in the grouped language will be processed by default.
  • Customizations for Gender and Region:

    • Additional optional parameters such as gender and region are now available.
    • This functionality provides flexibility to customize emojis according to specific requirements (e.g., male/female emojis,
      regional differences like US or JP variants).

Modify Language-Specific Files

Each language emoji_keyword folder with a generate_emoji_keyword.py file under language_data_extraction can be updated by using:

  • The files can now import the centralized logic from the new file generate_emoji_keyword.py from the src/scribe_data/unicode/generate_emoji_keyword.py from from scribe_data.unicode.generate_emoji_keyword import generate_emoji_keyword.
  • The only language-specific customization left in these files is the setting of basic parameters like LANGUAGE, emojis_per_keyword, and path variables.

Refactor process_unicode.py

The process_unicode.py file has been refactored to integrate the centralized generate_emoji_keyword logic:

  • Centralized Emoji Lexicon Generation: Shared logic for emoji generation now resides in the gen_emoji_lexicon function, making it more consistent across language modules.

  • Support for Grouped Languages: The function handles grouped languages like Hindustani (Hindi and Urdu) and Norwegian (Bokmål and Nynorsk), processing sub-languages individually.

  • Gender and Region Customizations: Parameters for gender and region allow emoji customization based on cultural and demographic factors.

  • #ISSUE_NUMBER
    -Simplify project's emoji keyword functionality #359

Copy link

github-actions bot commented Oct 16, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@Ekikereabasi-Nk
Copy link
Contributor Author

CC @andrewtavis @KesharwaniArpita @catreedle @DeleMike @VNW22. Please do review and contribute. Thank you

@andrewtavis andrewtavis self-requested a review October 16, 2024 10:46
@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 16, 2024
"""
Generate emoji keywords for a specified language, with optional support for grouped languages (e.g., Hindustani = Hindi + Urdu).

Parameters:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a start here, @Ekikereabasi-Nk, could I ask you to change the function docstrings to match those in the rest of the project :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright @andrewtavis will do so. Thank you.

- sub_languages (list): A list of specific sub-languages for grouped languages (e.g., ["Hindi", "Urdu"]). If not provided, all sub-languages in the group will be processed.
"""

# Define grouped languages and their sub-languages
Copy link
Member

@andrewtavis andrewtavis Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And own line comments should have a period at the end of them. Inline comments don't need capitalization of the first word or punctuation, but those that are their own line are sentences.


def generate_emoji_keyword(
language,
emojis_per_keyword,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest setting emojis_per_keyword=3 as a default value. This way, we can reduce the need to specify the parameter each time. I think for most cases 3 would be a suitable default. Of course, this would still allow customization when necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed :)

Comment on lines 25 to 46
# Set up the argument parser
parser = argparse.ArgumentParser(
description="Generate emoji keywords for a specific language."
)
parser.add_argument(
"--file-path", required=True, help="Path to save the generated emoji keywords."
)
parser.add_argument(
"--sub-languages",
nargs="*",
help="List of specific sub-languages to process (e.g., Hindi Urdu). If omitted, all sub-languages will be processed.",
)
parser.add_argument(
"--gender",
choices=["male", "female", "neutral"],
help="Specify the gender for emoji customization.",
)
parser.add_argument(
"--region", help="Specify the region for emoji customization (e.g., US, IN)."
)

# Parse the command-line arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we set up a common argument parsing function for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @catreedle I will take note

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

welcome! I'm not really sure of the differences between individual languages to decide if it is plausible, but it is something to consider. :)

Copy link
Contributor Author

@Ekikereabasi-Nk Ekikereabasi-Nk Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@catreedle this will mean to create a modular function for the common parser file, most likely in a separate file that can be imported to each generate_emoji_keyword file for all language. Also @SethiShreya mention the is a parser function somewhere within the codebase, I have not seen it. Please you can help point it out for me to study. So I will create another file in the Unicode folder for this parser function related to generate_emoji_keyword for each language. Thank you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find it either...
maybe something like this could work?

def setup_argument_parser():
    parser = argparse.ArgumentParser(description="Generate emoji keywords for a specific language.")
    parser.add_argument(
        "--file-path", required=True, help="Path to save the generated emoji keywords."
    )
    ...
    
    
    return parser

@andrewtavis
Copy link
Member

Quick note being sent to all the testing PRs, if updates are needed now that #402 has been merged, then it'd be great to get those updates to the branch :) If no updates are needed, then let me know 😊

@VNW22
Copy link
Contributor

VNW22 commented Oct 18, 2024

hey @Ekikereabasi-Nk @andrewtavis @catreedle
After reviewing the CLDR dataset, I explored the relevant scripts and verified emoji support across the available languages. Below are the languages that do not have emoji support:

Arabic
Czech
Esperanto
Gurmukhi
Hausa
Hebrew
Japanese
Kurmaji
Latin
Nigerian
Shimukhi
Slovak
Tajik
Tamil
Ukrainian
Urdu
Yoruba
The rest of the languages in the dataset appear to have emoji support.

Please feel free to recheck my findings or let me know if I missed anything.

@andrewtavis
Copy link
Member

Amazing, @VNW22! What's the code for this like? Would we be able to make a workflow to make sure that there's no directory for emoji keywords directory for those language that don't have support? Might be something to add to the project structure check :) Are there also languages that do have support that we currently don't have in the project?

@VNW22
Copy link
Contributor

VNW22 commented Oct 18, 2024

Amazing, @VNW22! What's the code for this like? Would we be able to make a workflow to make sure that there's no directory for emoji keywords directory for those language that don't have support? Might be something to add to the project structure check :) Are there also languages that do have support that we currently don't have in the project?

This is the python script i used;
import os
import re

def find_languages_with_emoji_support(root_dir):
languages_with_emoji = []

language_pattern = re.compile(r'LANGUAGE\s*=\s*"(.*?)"', re.IGNORECASE)

for root, _, files in os.walk(root_dir):
    for file in files:
        if file.endswith(".py"):  
            file_path = os.path.join(root, file)

            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()

                    if "gen_emoji_lexicon" in content or "emoji-keywords" in content:
                        match = language_pattern.search(content)
                        if match:
                            language = match.group(1)
                            languages_with_emoji.append(language)
                        else:
                            print(f"Warning: No language defined in {file_path}")
            except Exception as e:
                print(f"Error reading {file_path}: {e}")

return languages_with_emoji

root_directory = "./cldr-json/cldr-localenames-full/"

languages = find_languages_with_emoji_support(root_directory)
if languages:
print("Languages with emoji support:", languages)
else:
print("No languages with emoji support found.")

The code for finding languages without was not working, so I opted for this.
There are sooo many languages that do have support that are not in the project. The project has covered like 40%.

@andrewtavis
Copy link
Member

Nice 😊 Let's definitely add this to the project structure workflow check once this is done. We also have functionality to update the CLDR data, so maybe new languages will be added and then this workflow will tell us to add emoji support 😇

@Ekikereabasi-Nk
Copy link
Contributor Author

hey @Ekikereabasi-Nk @andrewtavis @catreedle After reviewing the CLDR dataset, I explored the relevant scripts and verified emoji support across the available languages. Below are the languages that do not have emoji support:

Arabic Czech Esperanto Gurmukhi Hausa Hebrew Japanese Kurmaji Latin Nigerian Shimukhi Slovak Tajik Tamil Ukrainian Urdu Yoruba The rest of the languages in the dataset appear to have emoji support.

Please feel free to recheck my findings or let me know if I missed anything.

Thank you @VNW22. @KesharwaniArpita I have update the the Centralizing the emoji keyword generation logic by creating a common_arg_parser function that can be imported for every languages that CLDR dataset support, please look at it. @catreedle Please check the recent commit and review. Thank you

Comment on lines 25 to 50
def setup_arg_parser():
parser = argparse.ArgumentParser(
description="Generate emoji keywords for a specific language."
)
parser.add_argument(
"--file-path", required=True, help="Path to save the generated emoji keywords."
)
parser.add_argument(
"--sub-languages",
nargs="*",
help="List of specific sub-languages to process (e.g., Hindi Urdu). If omitted, all sub-languages will be processed.",
)
parser.add_argument(
"--gender",
choices=["male", "female", "neutral"],
help="Specify the gender for emoji customization.",
)
parser.add_argument(
"--region", help="Specify the region for emoji customization (e.g., US, IN)."
)
parser.add_argument(
"--emojis-per-keyword",
type=int,
help="Number of emojis to generate per keyword.",
)
return parser
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks great! thank you for taking my suggestion into account. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @Ekikereabasi-Nk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work!!!

@Ekikereabasi-Nk
Copy link
Contributor Author

Hi @andrewtavis @SethiShreya @KesharwaniArpita please review. Also check #440 if they are compatible. Thank you

This function serves as a foundational component for generating
emojis tailored to specific user demographics, and implementing it in a
modular fashion will support future enhancements and maintenance.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes Sense!! Does that mean there's a lot more scope for emoji generation by expanding the demographics? Just confirming. Actually sounds interesting. Can we do anything in reference to this? Maybe expand the emoji generation? What do you say @Ekikereabasi-Nk and @andrewtavis .
Thank you @Ekikereabasi-Nk for the work.

@SethiShreya
Copy link
Contributor

@Ekikereabasi-Nk I was reading the generate_emoji_functionality file it still have args, and one more thing I saw is you are generating emoji_keywords for all the languages right? Why? Wasnt we suppose to get the parameters from main file and then generate for the language provided, @andrewtavis did I understand it right?

@andrewtavis
Copy link
Member

Closing this in favor of #440, @Ekikereabasi-Nk :) Really appreciate the work here! Was just a bit easier to integrate the other PR as it directly addressed the current functionality without expanding into new features.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted Accepted as a part of Hacktoberfest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants