Remove articles from machine translation process #96
Comments
CC @wkyoshida and @shashank-iitbhu via our discussion in the sync :)
We can add a helper function like:

```python
english_articles = ["a ", "an ", "the "]

def remove_articles(translation, articles):
    # Strip a leading article (including its trailing space) from the translation.
    for article in articles:
        if translation.lower().startswith(article):
            return translation[len(article):]
    return translation

translated_words = [remove_articles(translation, english_articles) for translation in translated_words]
```
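To illustrate how the prefix check behaves, here are a few hypothetical inputs (the trailing space in each article is what prevents false matches):

```python
print(remove_articles("the apple", english_articles))  # -> "apple"
print(remove_articles("An orange", english_articles))  # -> "orange"
print(remove_articles("theater", english_articles))    # -> "theater" (no space after "the", so no match)
```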
We could also consider writing a separate script to remove articles, implementing it after the translation process is complete. First, we'll need to query Wikidata to retrieve the articles for the Scribe languages. I'll look into this. @andrewtavis, what do you think would be the best approach here?
I'd say getting the articles from Wikidata for each of the languages makes more sense so it's easier for us to add new languages later on, @shashank-iitbhu :) Thanks for your consideration here!
And let's save the articles without the extra space at the end of them and then do a comparison that includes the space. Makes a bit more sense to me than storing them as `["a ", "an ", "the "]`.
Another thing to think about here, @shashank-iitbhu, is that some of the values we're getting out are capitalized... Would 100% be a different issue, but maybe we can look into things once we have the lexeme IDs in the scripts and we can check to see if it's a proper noun or not. Maybe we should add whether it's a proper noun to the noun queries, now that I think of it? 🤔 This would allow us to lowercase the regular noun outputs (except for German, as all nouns in German are capitalized).
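A minimal sketch of that lowercasing step, assuming a hypothetical `is_proper_noun` flag coming back from the noun queries:

```python
def normalize_noun_case(noun, language, is_proper_noun):
    # Proper nouns keep their capitalization, as do all German nouns.
    if is_proper_noun or language == "German":
        return noun
    return noun.lower()
```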
I'll assign this to you, @shashank-iitbhu, and please let me know if further information is needed!
I believe the optimal approach would be to split the translations into words and then check for articles in the first word of the split. Afterward, we can concatenate the words back together. This way, we can avoid storing articles with an added space (e.g., `"the "` instead of `"the"`):

```python
def remove_articles(translation, articles):
    # Drop the first word if it's an article, then rejoin the rest.
    words = translation.split()
    if words and words[0].lower() in articles:
        return ' '.join(words[1:])
    return translation
```
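With this version, the articles can be stored without trailing spaces. A quick illustrative check on hypothetical inputs:

```python
english_articles = ["a", "an", "the"]  # no trailing spaces needed now

print(remove_articles("the apple", english_articles))  # -> "apple"
print(remove_articles("theater", english_articles))    # -> "theater" (only whole first words match)
```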
Makes total sense, @shashank-iitbhu 😊 Thanks for the suggestion!

Via the sync, would be good to add in a query for articles for this. Happy to support, @shashank-iitbhu!

Hey @shashank-iitbhu 👋 We're looking for issues for new contributors now. Hope it's ok that I unassign so someone else can pick it up! Also hope you're well! 😊

Yes, please assign to someone else. 😄
Thanks @andrewtavis, I think there are two ways it can be fixed:

1. Query Wikidata for each language's articles and strip them from the translations.
2. Use spaCy to detect and remove the articles.

Which one do you think is better?
Hey @axif0 👋 My initial inclination here would be to use Wikidata to get the articles 🤔 But then using spaCy might be a better idea, as the library would handle things and we wouldn't be making unneeded API calls for something that a package can handle. The question is, though: what's spaCy's language coverage for this feature? As we're trying to cover a lot of languages, with many not being "common" within NLP tooling, it might make sense to leverage Wikidata :) Here's an idea for the process:

```python
from scribe_data.wikidata.wikidata_utils import sparql

def get_all_language_articles(language):
    # SPARQL query template (braces doubled so str.format leaves them in place).
    query_language_template = """
    # tool: scribe-data
    SELECT
        ?article
    WHERE {{
        VALUES ?language {{ wd:{} }}
        ?lexeme dct:language ?language ;
            wikibase:lexicalCategory wd:Q2865743 ;
            wikibase:lemma ?article .
    }}
    """

    # Replace {} in the query template with the language's QID.
    query = query_language_template.format(language)
    sparql.setQuery(query)
    results = sparql.query().convert()

    return results["results"]
```

We'd also need to include [...]. Your thoughts would be appreciated, @axif0!
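A minimal sketch for flattening those results into a plain list of article strings, assuming the standard SPARQLWrapper JSON shape (`results["results"]["bindings"]`) and using German's QID `Q188` for illustration:

```python
def get_article_list(language_qid):
    # Flatten the SPARQL JSON results into a list of article lemmas.
    results = get_all_language_articles(language_qid)
    return [binding["article"]["value"] for binding in results["bindings"]]

german_articles = get_article_list("Q188")  # e.g. ["der", "die", "das", ...]
```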
Hello @andrewtavis, thank you for your kind reply.

Output for [...]

There is a question: do we consider [...]?
Hey @axif0! Great work so far! And yes, including [...]. Quick note: it's always good to add syntax highlighting to your code on GitHub, which can be done by adding the language or file type (e.g. `python`) directly after the opening three backticks.
Feel free to apply the function you have there to all of the machine translation files, and then we should be good for a PR! 🚀 Thanks for this :) :) |
Description
One thing that's coming from the new machine translation process in #81 and #88 is that we're routinely getting articles included in the translations. One way of fixing this is querying the articles from Wikidata for each language and then, for each key, removing the article and the space after it if it's at the start of the translation. There could also be an option to remove these from translation outputs, but I personally am not sure about this.
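A rough sketch of that flow, reusing the hypothetical `get_article_list` helper from the discussion above:

```python
def remove_leading_articles(translations, language_qid):
    # Strip a leading article and the space after it from each translation value.
    articles = {a.lower() for a in get_article_list(language_qid)}
    cleaned = {}
    for key, translation in translations.items():
        words = translation.split()
        if words and words[0].lower() in articles:
            translation = " ".join(words[1:])
        cleaned[key] = translation
    return cleaned
```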
Happy to discuss and implement or help implement this!