Title: Comprehensive expansion of Ukrainian lexeme extraction queries (Issue #237 fixed) #424

Merged: 2 commits, Oct 19, 2024

Collins-Webdev
Contributor

This pull request substantially expands our Ukrainian language data extraction pipeline. It extends our SPARQL queries to capture a more complete morphological paradigm of Ukrainian lexemes across multiple parts of speech. The technical specifics, by query file:

  1. Verbs 🔠 (query_verbs.sparql):

    • Implemented extraction of finite verb forms:
      * Present tense: 1st, 2nd, 3rd person singular (wd:Q192613 + wd:Q21714344/wd:Q51929049/wd:Q51929074 + wd:Q110786)
      * Past tense: masculine, feminine, neuter singular (wd:Q1240211 + wd:Q499327/wd:Q1775415/wd:Q1775461 + wd:Q110786)
    • Added imperative mood: 2nd person singular (wd:Q22716 + wd:Q51929049 + wd:Q110786)
    • Retained infinitive form extraction (wd:Q179230)
  2. Nouns 📚 (query_nouns.sparql):

    • Extended singular case paradigm:
      * Genitive (wd:Q146233), Dative (wd:Q145599), Accusative (wd:Q146078)
      * Instrumental (wd:Q192997), Locative (wd:Q202142)
    • Maintained plural nominative (wd:Q131105 + wd:Q146786) and gender (wdt:P5185) extraction
  3. Adjectives 🏷️ (NEW: query_adjectives.sparql):

    • Implemented comprehensive adjectival paradigm:
      * Singular nominative: masculine (wd:Q499327), feminine (wd:Q1775415), neuter (wd:Q1775461)
      * Plural nominative (wd:Q146786)
    • Included degree forms: comparative (wd:Q14169499) and superlative (wd:Q1817208)
  4. Adverbs 🔄 (NEW: query_adverbs.sparql):

    • Established query for adverbial extraction:
      * Base form (lemma)
      * Comparative (wd:Q14169499) and superlative (wd:Q1817208) degrees
  5. Prepositions 📍 (query_prepositions.sparql):

    • Optimized existing query structure
    • Enhanced case association extraction (wdt:P5713)
  6. Proper Nouns 👤 (query_proper_nouns.sparql):

    • Significantly expanded case paradigm for singular:
      * Nominative (lemma), Genitive (wd:Q146233), Dative (wd:Q145599)
      * Accusative (wd:Q146078), Instrumental (wd:Q192997), Locative (wd:Q202142)
    • Crucially added Vocative case (wd:Q185077), essential for direct address in Ukrainian
    • Retained plural nominative (wd:Q131105 + wd:Q146786) and gender (wdt:P5185) extraction
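
To illustrate how a feature combination such as "present tense + 1st person + singular" could translate into query text, here is a minimal Python sketch that assembles one OPTIONAL pattern. The helper name `optional_form_block` and the variable name `presFPS` are hypothetical and are not taken from the merged `.sparql` files; the QIDs are copied from the list above and have not been independently verified. The predicates `ontolex:lexicalForm`, `ontolex:representation` and `wikibase:grammaticalFeature` are the ones Wikidata uses for lexeme forms.

```python
def optional_form_block(var_name, feature_qids):
    """Build a SPARQL OPTIONAL block that binds ?<var_name> to the
    representation of a form carrying all of the given grammatical features.

    Hypothetical helper for illustration; not part of the PR's files.
    """
    form_var = f"?{var_name}Form"
    # Comma-separated objects require the form to have every listed feature.
    features = ", ".join(f"wd:{qid}" for qid in feature_qids)
    return (
        "OPTIONAL {\n"
        f"  ?lexeme ontolex:lexicalForm {form_var} .\n"
        f"  {form_var} ontolex:representation ?{var_name} ;\n"
        f"    wikibase:grammaticalFeature {features} .\n"
        "}"
    )


# Present tense + 1st person + singular, QIDs as listed in the PR description.
block = optional_form_block("presFPS", ["Q192613", "Q21714344", "Q110786"])
print(block)
```

Calling the helper once per cell of a paradigm (for example the three persons of the present tense) would produce one OPTIONAL block per output column, which is one plausible way to keep the queries uniform across parts of speech.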

Technical implementation details:

  • Utilized OPTIONAL clauses for all non-lemma forms to ensure query robustness
  • Implemented consistent use of wikibase:grammaticalFeature for form specification
  • Employed REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") for lexeme ID extraction
  • Utilized wikibase:label service for human-readable labels where applicable
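
The effect of two of these choices on the returned data can be sketched in Python. The snippet below mirrors, on the client side, what the `REPLACE(STR(?lexeme), ...)` expression does inside the query, and shows why OPTIONAL forms need tolerant handling: in the SPARQL JSON results format, a variable left unbound by an OPTIONAL clause is simply absent from that row. The function names, the lexeme ID `L12345` and the form value `example` are placeholders made up for this sketch, and it assumes a query that returns the raw `?lexeme` URI.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def lexeme_id(uri):
    """Strip the Wikidata entity prefix, leaving only the lexeme ID.

    Note: SPARQL's REPLACE treats its pattern as a regex, while str.replace
    is literal; for ordinary entity URIs the result is the same.
    """
    return uri.replace(ENTITY_PREFIX, "")


def flatten_binding(binding, variables):
    """Turn one SPARQL JSON result row into a flat dict.

    Variables missing from the row (unmatched OPTIONAL) become "".
    """
    row = {}
    for var in variables:
        value = binding.get(var, {}).get("value", "")
        row[var] = lexeme_id(value) if var == "lexeme" else value
    return row


# Placeholder row: the lexeme has an infinitive but no imperative form.
sample = {
    "lexeme": {"type": "uri", "value": "http://www.wikidata.org/entity/L12345"},
    "infinitive": {"type": "literal", "value": "example"},
}
row = flatten_binding(sample, ["lexeme", "infinitive", "imperative"])
print(row)
```

Because every non-lemma form is OPTIONAL, a lexeme with an incomplete paradigm still yields a row, with the missing cells left empty rather than the whole lexeme being dropped.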

This enhancement significantly broadens our morphological coverage of Ukrainian, providing a rich dataset for advanced NLP tasks, including but not limited to:

  • Morphological analysis and generation
  • Named Entity Recognition (NER) with case-sensitive features
  • Machine Translation with deep grammatical understanding
  • Linguistic research on Ukrainian morphosyntax

I've tested these queries on the Wikidata Query Service (https://query.wikidata.org/) to check that they run efficiently and return accurate results. However, I welcome careful review, particularly focusing on:

  1. Correctness of Wikidata QIDs for grammatical features
  2. Query efficiency and potential for optimization
  3. Completeness of morphological paradigms for each part of speech
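
For the first review point, one possible aid is a small offline check that pulls every `wd:Q…` identifier out of a query and compares it with the label the description claims for it. The sketch below is only that: the table `CLAIMED_FEATURES` repeats a few QID and label pairs as stated in this description (it is not an independent source of truth), and the function name `audit_qids` is hypothetical. `wd:Q42` is included in the sample text as a deliberate stray ID, since Q42 is the item for Douglas Adams and not a grammatical feature.

```python
import re

# QID -> label as claimed in the PR description (not independently verified).
CLAIMED_FEATURES = {
    "Q192613": "present tense",
    "Q21714344": "first person",
    "Q51929049": "second person",
    "Q51929074": "third person",
    "Q110786": "singular",
    "Q146786": "plural",
    "Q131105": "nominative",
    "Q185077": "vocative",
}

QID_PATTERN = re.compile(r"\bwd:(Q\d+)\b")


def audit_qids(query_text, known=CLAIMED_FEATURES):
    """Map each QID used in the query text to its claimed label,
    flagging any QID that is missing from the table for manual review."""
    found = sorted(set(QID_PATTERN.findall(query_text)), key=lambda q: int(q[1:]))
    return {qid: known.get(qid, "UNKNOWN: check on Wikidata") for qid in found}


# Sample fragment with one stray ID (wd:Q42) to show the flagging.
fragment = "wikibase:grammaticalFeature wd:Q192613, wd:Q21714344, wd:Q110786, wd:Q42 ."
report = audit_qids(fragment)
for qid, label in report.items():
    print(qid, label)
```

A check like this only catches IDs that are absent from the table; confirming that each listed QID really denotes the intended feature still requires looking the item up on Wikidata.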

This pull request represents a significant stride towards a more nuanced and comprehensive representation of Ukrainian in our data pipeline. I'm eager to discuss any suggestions for further refinements or expansions to our linguistic feature set.


github-actions bot commented Oct 18, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@github-actions github-actions bot left a comment

First PR Commit Check

  • The commit messages for the remote branch of a new contributor should be checked to make sure their email is set up correctly so that they receive credit for their contribution
    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo

@Collins-Webdev mentioned this pull request Oct 18, 2024 (8 tasks)
@andrewtavis added the hacktoberfest-accepted label (Accepted as a part of Hacktoberfest) Oct 18, 2024
@andrewtavis self-requested a review October 18, 2024 17:14
Member

@andrewtavis andrewtavis left a comment

Really appreciate the quality of this PR, @Collins-Webdev 😊 Thanks so much for the care you put into the queries. We have a new SPARQL query writing guide that might give you a bit more information to improve, but really a great first PR :)

@andrewtavis andrewtavis merged commit 659cd69 into scribe-org:main Oct 19, 2024
5 checks passed