
Remove articles from machine translation process #96

Closed
andrewtavis opened this issue Mar 9, 2024 · 18 comments
Labels: data (Relates to data or Wikidata), help wanted (Extra attention is needed)

Comments

@andrewtavis (Member) commented Mar 9, 2024

Languages: All languages

Description

One thing that's coming from the new machine translation process in #81 and #88 is that we're routinely getting articles included in the translations. One way of fixing this is querying the articles from Wikidata for each language and then, for each key, removing the article and the following space if the translation starts with one. There could also be an option to remove these from translation outputs, but I'm personally not sure about this.

Happy to discuss and implement or help implement this!

andrewtavis added the help wanted, blocked (Another issue is blocking), and data labels on Mar 9, 2024
@andrewtavis (Member Author)

CC @wkyoshida and @shashank-iitbhu via our discussion in the sync :)

@shashank-iitbhu (Contributor)

We can add a helper function like remove_articles and then remove articles after each batch is translated.
I tried this for English articles:

english_articles = ["a ", "an ", "the "]
translated_words = [remove_articles(translation, english_articles) for translation in translated_words]

remove_articles function:

def remove_articles(translation, articles):
    # Articles include a trailing space (e.g. "a "), so a match also strips
    # the space that follows the article.
    for article in articles:
        if translation.lower().startswith(article):
            return translation[len(article):]
    return translation

Results:
[screenshot of sample output]

@shashank-iitbhu (Contributor)

We could also consider writing a separate script to remove articles, implementing it after the translation process is complete. First, we'll need to query Wikidata to retrieve the articles for the Scribe languages. I'll look into this.

@andrewtavis, what do you think would be the best approach here?

@andrewtavis (Member Author)

I'd say getting the articles from Wikidata for each of the languages makes more sense so it's easier for us to add new languages later on, @shashank-iitbhu :) Thanks for your consideration here!

@andrewtavis (Member Author)

And let's save the articles without the extra space at the end of them and then do a comparison that includes the space. That makes a bit more sense to me than storing "a " as something we're checking against, and it fits better if we're getting the articles from Wikidata rather than writing them ourselves.
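As a small sketch of that comparison (a hypothetical helper; `articles` would hold the bare article strings returned from Wikidata):

import re  # not required here, just stdlib; shown for clarity that nothing else is needed

def remove_leading_article(translation, articles):
    # Articles are stored without trailing spaces (e.g. "a", "an", "the");
    # the space is added only when comparing against the translation.
    for article in articles:
        prefix = f"{article} "
        if translation.lower().startswith(prefix):
            return translation[len(prefix):]

    return translation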

@andrewtavis (Member Author)

Another thing to think about here, @shashank-iitbhu, is that some of the values we're getting out are capitalized... This would 100% be a different issue, but maybe we can look into it once we have the lexeme IDs in the scripts and can check whether a word is a proper noun or not. Maybe we should add whether it's a proper noun to the noun queries, now that I think of it? 🤔 This would allow us to lowercase the regular noun outputs (except for German, as all nouns in German are capitalized).
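Purely as a sketch of that casing idea (the proper-noun flag here is hypothetical and would come from the noun queries if we add it):

def normalize_noun_case(noun, language, is_proper_noun):
    # Keep proper nouns as they are, and keep German nouns capitalized,
    # since every German noun is written with an initial capital.
    if is_proper_noun or language == "German":
        return noun

    return noun.lower()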

@andrewtavis (Member Author)

I'll assign this to you, @shashank-iitbhu, and please let me know if further information is needed!

@shashank-iitbhu (Contributor)

> And let's save the articles without the extra space at the end of them and then do a comparison that includes the space. That makes a bit more sense to me than storing "a " as something we're checking against, and it fits better if we're getting the articles from Wikidata rather than writing them ourselves.

I believe the optimal approach would be to split the translations into words, and then check for articles in the first word of the split. Afterward, we can concatenate the words back together. This way, we can avoid storing articles with an added space (e.g., "a ") and can directly use the articles we obtain from Wikidata.

def remove_articles(translation, articles):
    words = translation.split()
    if words and words[0].lower() in articles:
        return ' '.join(words[1:])
    return translation
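For example, with hypothetical article values (in practice these would come from Wikidata):

articles = {"a", "an", "the"}

print(remove_articles("The tree", articles))  # -> "tree"
print(remove_articles("an apple", articles))  # -> "apple"
print(remove_articles("Germany", articles))   # -> "Germany" (unchanged)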

@andrewtavis (Member Author)

Makes total sense, @shashank-iitbhu 😊 Thanks for the suggestion!

@andrewtavis (Member Author)

Via the sync, would be good to add in a query for articles for this. Happy to support, @shashank-iitbhu!

@andrewtavis (Member Author)

Hey @shashank-iitbhu 👋 We're looking for issues for new contributors now. Hope it's ok that I unassign so someone else can pick it up! Also hope you're well! 😊

@shashank-iitbhu (Contributor)

> Hey @shashank-iitbhu 👋 We're looking for issues for new contributors now. Hope it's ok that I unassign so someone else can pick it up! Also hope you're well! 😊

Yes, please assign to someone else. 😄

@axif0 (Contributor) commented Jul 10, 2024

Thanks @andrewtavis, I think there are two ways it can be fixed:

  1. The hardcoded way: " a", "a ", " a ". We basically search for all the articles in a line and delete them if found.
  2. Use spaCy when translating (a rough sketch below).

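A rough sketch of what option 2 could look like (this assumes the relevant spaCy model, e.g. en_core_web_sm, is installed; a separate model would be needed per language):

import spacy

nlp = spacy.load("en_core_web_sm")  # one model per language would be needed

def strip_leading_determiner(text):
    doc = nlp(text)
    # Articles are tagged as determiners (DET) by spaCy.
    if len(doc) > 1 and doc[0].pos_ == "DET":
        return doc[1:].text
    return text

print(strip_leading_determiner("the tree"))  # -> "tree"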

Which one do you think is better?

@andrewtavis (Member Author)

Hey @axif0 👋 My initial inclination here would be to use Wikidata to get the articles 🤔 But then using spaCy might be a better idea, as the library would handle things and we wouldn't be making unneeded API calls for something that a package can handle. The question is, though: what's spaCy's language coverage for this feature? As we're trying to cover a lot of languages, with many not being "common" within NLP tooling, it might make sense to leverage Wikidata :)

Here's an idea for the process:

from scribe_data.wikidata.wikidata_utils import sparql

def get_all_language_articles(language):
    # SPARQL query template.
    query_language_template = """
    # tool: scribe-data
    SELECT
        ?article

    WHERE {{
      VALUES ?language {{ wd:{} }}
      ?lexeme dct:language ?language ;
          wikibase:lexicalCategory wd:Q2865743 ;
          wikibase:lemma ?article .
    }}
    """

    # Replace {} in the query template with the language QID.
    query = query_language_template.format(language)

    sparql.setQuery(query)
    results = sparql.query().convert()

    return results["results"]

We'd also need to include Q3813849 along with Q2865743 (the former covers indefinite articles and the latter the definite articles that are already queried for). I could help you finalize the results a bit, but generally we'd pass a language QID like Q1860 for English, and it would return a list of indefinite and definite articles that we could then remove from translations by checking for their presence and trimming the leftover whitespace :)

Your thoughts would be appreciated, @axif0!

andrewtavis removed the blocked label on Jul 10, 2024
@axif0 (Contributor) commented Jul 10, 2024

Hello @andrewtavis, thank you for your kind reply.

from SPARQLWrapper import JSON

from scribe_data.wikidata.wikidata_utils import sparql

def get_all_french_articles(qid):
    query = f"""
    SELECT DISTINCT ?article WHERE {{
      VALUES ?language {{ wd:{qid} }}   
      ?lexeme dct:language ?language ;
              wikibase:lexicalCategory ?category ;
              wikibase:lemma ?lemma .
      VALUES ?category {{ wd:Q2865743 wd:Q3813849 }}  # Definite and indefinite articles

      # Include both lemmas and forms
      {{
        ?lexeme wikibase:lemma ?article .
      }} UNION {{
        ?lexeme ontolex:lexicalForm ?form .
        ?form ontolex:representation ?article .
      }}
    }}
    """

    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    articles = [result["article"]["value"] for result in results["results"]["bindings"]]
    return articles

qid = "Q1860"   
articles = get_all_french_articles(qid)
for article in articles:
    print(article)

The output for English is a, an, the. We can get the QID from language_metadata.json.
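A possible way to look that up (the path and JSON schema here are assumptions and may need adjusting to the actual file):

import json

with open("scribe_data/resources/language_metadata.json", encoding="utf-8") as f:
    metadata = json.load(f)

# Assumed schema: a "languages" list of entries with "language" and "qid" keys.
qid_by_language = {entry["language"]: entry["qid"] for entry in metadata["languages"]}

print(qid_by_language.get("english"))  # e.g. "Q1860"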

One question: do we consider partitive articles (QID Q576670)?

@andrewtavis (Member Author)

Hey @axif0! Great work so far! And yes, including Partitive Articles is a great thought 😊 Let's definitely include those :)

Quick note: it's always good to add syntax highlighting to your code on GitHub, which can be done by adding the language or file type (python or py, for instance) to the first set of backticks :) So like this:

    ```py
    # This would have Python syntax highlighting, but it's within backticks :)
    from scribe_data import *
    ```

@andrewtavis (Member Author)

Feel free to apply the function you have there to all of the machine translation files, and then we should be good for a PR! 🚀 Thanks for this :) :)
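A rough sketch of applying it across the translation outputs (the file location and JSON layout are assumptions; remove_articles is the split-based helper from earlier in the thread, repeated here so the snippet is self-contained):

import json
from pathlib import Path

# Assumption: one JSON file of {word: translation} pairs per language.
translation_dir = Path("scribe_data_json_export/translations")
articles = {"a", "an", "the"}  # in practice queried from Wikidata per language

def remove_articles(translation, articles):
    words = translation.split()
    if words and words[0].lower() in articles:
        return " ".join(words[1:])
    return translation

for file_path in translation_dir.glob("*.json"):
    with open(file_path, encoding="utf-8") as f:
        translations = json.load(f)

    cleaned = {word: remove_articles(t, articles) for word, t in translations.items()}

    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, ensure_ascii=False, indent=2)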

@andrewtavis (Member Author)

Closed via #175 😊 Thanks for all the efforts here, @axif0! 🚀 There's doubtless still a bit left to do, but we can do the final touches in #70 when we actually run the translation process :)
