
Remove articles from machine translation process #96

Closed
andrewtavis opened this issue Mar 9, 2024 · 18 comments
Labels: data (Relates to data or Wikidata), help wanted (Extra attention is needed)

Comments

@andrewtavis (Member) commented Mar 9, 2024

Languages: All languages

Description

One thing that's coming from the new machine translation process in #81 and #88 is that we're routinely getting articles included in the translations. One way of fixing this is querying the articles from Wikidata for each language and then, for each key, removing the article and the following space if the translation starts with one. There could also be an option to remove these from translation outputs, but I'm personally not sure about this.

Happy to discuss and implement or help implement this!

andrewtavis added the help wanted, blocked (Another issue is blocking), and data labels on Mar 9, 2024
@andrewtavis (Member Author)

CC @wkyoshida and @shashank-iitbhu via our discussion in the sync :)

@shashank-iitbhu (Contributor)

We can add a helper function like remove_articles and then remove articles after each batch is translated.
I tried this for English articles:

english_articles = ["a ", "an ", "the "]
translated_words = [remove_articles(translation, english_articles) for translation in translated_words]

remove_articles function:

def remove_articles(translation, articles):
    # Articles include a trailing space (e.g. "a "), so a match also strips
    # the space that follows the article.
    for article in articles:
        if translation.lower().startswith(article):
            return translation[len(article):]
    return translation

Results:
[screenshot of sample output]

@shashank-iitbhu (Contributor)

We could also consider writing a separate script to remove articles, implementing it after the translation process is complete. First, we'll need to query Wikidata to retrieve the articles for the Scribe languages. I'll look into this.

@andrewtavis, what do you think would be the best approach here?

@andrewtavis (Member Author)

I'd say getting the articles from Wikidata for each of the languages makes more sense so it's easier for us to add new languages later on, @shashank-iitbhu :) Thanks for your consideration here!

@andrewtavis (Member Author)

And let's save the articles without the extra space at the end of them and then do a comparison that includes the space. That makes a bit more sense to me than storing "a " as something we're checking against, and it fits better if we're getting the articles from Wikidata rather than writing them ourselves.
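As a small sketch of that comparison (a hypothetical helper; `articles` would hold the bare article strings returned from Wikidata):

import re  # not required here, just stdlib; shown for clarity that nothing else is needed

def remove_leading_article(translation, articles):
    # Articles are stored without trailing spaces (e.g. "a", "an", "the");
    # the space is added only when comparing against the translation.
    for article in articles:
        prefix = f"{article} "
        if translation.lower().startswith(prefix):
            return translation[len(prefix):]

    return translation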

@andrewtavis (Member Author)

Another thing to think about here, @shashank-iitbhu, is that some of the values we're getting out are capitalized... This would 100% be a different issue, but maybe we can look into it once we have the lexeme IDs in the scripts and can check whether a word is a proper noun or not. Maybe we should add whether it's a proper noun to the noun queries, now that I think of it? 🤔 This would allow us to lowercase the regular noun outputs (except for German, as all nouns in German are capitalized).
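Purely as a sketch of that casing idea (the proper-noun flag here is hypothetical and would come from the noun queries if we add it):

def normalize_noun_case(noun, language, is_proper_noun):
    # Keep proper nouns as they are, and keep German nouns capitalized,
    # since every German noun is written with an initial capital.
    if is_proper_noun or language == "German":
        return noun

    return noun.lower()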

@andrewtavis (Member Author)

I'll assign this to you, @shashank-iitbhu, and please let me know if further information is needed!

@shashank-iitbhu (Contributor)

> And let's save the articles without the extra space at the end of them and then do a comparison that includes the space. That makes a bit more sense to me than storing "a " as something we're checking against, and it fits better if we're getting the articles from Wikidata rather than writing them ourselves.

I believe the optimal approach would be to split the translations into words, and then check for articles in the first word of the split. Afterward, we can concatenate the words back together. This way, we can avoid storing articles with an added space (e.g., "a ") and can directly use the articles we obtain from Wikidata.

def remove_articles(translation, articles):
    words = translation.split()
    if words and words[0].lower() in articles:
        return ' '.join(words[1:])
    return translation
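For example, with hypothetical article values (in practice these would come from Wikidata):

articles = {"a", "an", "the"}

print(remove_articles("The tree", articles))  # -> "tree"
print(remove_articles("an apple", articles))  # -> "apple"
print(remove_articles("Germany", articles))   # -> "Germany" (unchanged)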

@andrewtavis (Member Author)

Makes total sense, @shashank-iitbhu 😊 Thanks for the suggestion!

@andrewtavis (Member Author)

Via the sync, would be good to add in a query for articles for this. Happy to support, @shashank-iitbhu!

@andrewtavis (Member Author)

Hey @shashank-iitbhu 👋 We're looking for issues for new contributors now. Hope it's ok that I unassign so someone else can pick it up! Also hope you're well! 😊

@shashank-iitbhu (Contributor)

> Hey @shashank-iitbhu 👋 We're looking for issues for new contributors now. Hope it's ok that I unassign so someone else can pick it up! Also hope you're well! 😊

Yes, please assign to someone else. 😄

@axif0 (Contributor) commented Jul 10, 2024

Thanks @andrewtavis, I think there are two ways it can be fixed:

  1. The hardcoded way: " a", "a ", " a ". We basically search for all the articles in a line and delete them if found.
  2. Use spaCy when translating (a rough sketch below).

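A rough sketch of what option 2 could look like (this assumes the relevant spaCy model, e.g. en_core_web_sm, is installed; a separate model would be needed per language):

import spacy

nlp = spacy.load("en_core_web_sm")  # one model per language would be needed

def strip_leading_determiner(text):
    doc = nlp(text)
    # Articles are tagged as determiners (DET) by spaCy.
    if len(doc) > 1 and doc[0].pos_ == "DET":
        return doc[1:].text
    return text

print(strip_leading_determiner("the tree"))  # -> "tree"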

Which one do you think is better?

@andrewtavis (Member Author)

Hey @axif0 👋 My initial inclination here would be to use Wikidata to get the articles 🤔 But then using spaCy might be a better idea, as the library would handle things and we wouldn't be making unneeded API calls for something that a package can handle. The question is, though: what's spaCy's language coverage for this feature? As we're trying to cover a lot of languages, with many not being "common" within NLP tooling, it might make sense to leverage Wikidata :)

Here's an idea for the process:

from scribe_data.wikidata.wikidata_utils import sparql

def get_all_language_articles(language):
    # SPARQL query template.
    query_language_template = """
    # tool: scribe-data
    SELECT
        ?article

    WHERE {{
      VALUES ?language {{ wd:{} }}
      ?lexeme dct:language ?language ;
          wikibase:lexicalCategory wd:Q2865743 ;
          wikibase:lemma ?article .
    }}
    """

    # Replace {} in the query template with the language QID.
    query = query_language_template.format(language)

    sparql.setQuery(query)
    results = sparql.query().convert()

    return results["results"]

We'd also need to include Q3813849 along with Q2865743 (the former covers indefinite articles and the latter the definite articles that are already queried for). I could help you finalize the results a bit, but generally we'd pass a language QID like Q1860 for English, and it would return a list of indefinite and definite articles that we could then remove from translations by checking for their presence and trimming the leftover whitespace :)

Your thoughts would be appreciated, @axif0!

andrewtavis removed the blocked label on Jul 10, 2024
@axif0 (Contributor) commented Jul 10, 2024

Hello @andrewtavis, thank you for your kind reply.

from SPARQLWrapper import JSON

from scribe_data.wikidata.wikidata_utils import sparql

def get_all_french_articles(qid):
    query = f"""
    SELECT DISTINCT ?article WHERE {{
      VALUES ?language {{ wd:{qid} }}   
      ?lexeme dct:language ?language ;
              wikibase:lexicalCategory ?category ;
              wikibase:lemma ?lemma .
      VALUES ?category {{ wd:Q2865743 wd:Q3813849 }}  # Definite and indefinite articles

      # Include both lemmas and forms
      {{
        ?lexeme wikibase:lemma ?article .
      }} UNION {{
        ?lexeme ontolex:lexicalForm ?form .
        ?form ontolex:representation ?article .
      }}
    }}
    """

    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    articles = [result["article"]["value"] for result in results["results"]["bindings"]]
    return articles

qid = "Q1860"   
articles = get_all_french_articles(qid)
for article in articles:
    print(article)

The output for English is a, an, the. We can get the QID from language_metadata.json.
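A possible way to look that up (the path and JSON schema here are assumptions and may need adjusting to the actual file):

import json

with open("scribe_data/resources/language_metadata.json", encoding="utf-8") as f:
    metadata = json.load(f)

# Assumed schema: a "languages" list of entries with "language" and "qid" keys.
qid_by_language = {entry["language"]: entry["qid"] for entry in metadata["languages"]}

print(qid_by_language.get("english"))  # e.g. "Q1860"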

One question: do we consider partitive articles (QID Q576670)?

@andrewtavis (Member Author)

Hey @axif0! Great work so far! And yes, including Partitive Articles is a great thought 😊 Let's definitely include those :)

Quick note: it's always good to add syntax highlighting to your code on GitHub, which can be done by adding the language or file type (python or py, for instance) to the first set of backticks :) So like this:

    ```py
    # This would have Python syntax highlighting, but it's within backticks :)
    from scribe_data import *
    ```

@andrewtavis (Member Author)

Feel free to apply the function you have there to all of the machine translation files, and then we should be good for a PR! 🚀 Thanks for this :) :)
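A rough sketch of applying it across the translation outputs (the file location and JSON layout are assumptions; remove_articles is the split-based helper from earlier in the thread, repeated here so the snippet is self-contained):

import json
from pathlib import Path

# Assumption: one JSON file of {word: translation} pairs per language.
translation_dir = Path("scribe_data_json_export/translations")
articles = {"a", "an", "the"}  # in practice queried from Wikidata per language

def remove_articles(translation, articles):
    words = translation.split()
    if words and words[0].lower() in articles:
        return " ".join(words[1:])
    return translation

for file_path in translation_dir.glob("*.json"):
    with open(file_path, encoding="utf-8") as f:
        translations = json.load(f)

    cleaned = {word: remove_articles(t, articles) for word, t in translations.items()}

    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, ensure_ascii=False, indent=2)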

@andrewtavis (Member Author)

Closed via #175 😊 Thanks for all the efforts here, @axif0! 🚀 There's doubtless still a bit left to do, but we can do the final touches in #70 when we actually run the translation process :)
