Add cross-language translation data #23

andrewtavis · 2022-11-05T15:20:20Z

Terms

I have searched open and closed data issues
I agree to follow Scribe-Data's Code of Conduct

Languages

All languages

Description

This issue is the interim step to adding full translation support to Scribe apps. What it entails is finding accurate enough models on 🤗 Hugging Face to translate between all currently supported languages and English. The format_translations.py script for each language will then need to be edited to run each model over a basic corpus to generate seven different translations.json files per language.

This will in turn allow an option for choosing which base language to translate from to be added to the Scribe-iOS menu, which will be developed in scribe-org/Scribe-iOS#16.

Completing this issue will unblock Add English keyboard Scribe-iOS#7, as users will be able to decide which language to translate from.


andrewtavis · 2022-11-05T16:44:15Z

Included in this issue is:

- Finding viable translation models for each direction of each language pair
- Updating the format_translation.py files to generate one file for each input language
- An exploration of running the translation models on Google Colab or another platform that will give us GPU access in order to speed the model runs up
- Updating the readmes with translation ranges of the least and most translations available
- Editing the data_table.txt output to read average translations

andrewtavis · 2022-11-05T16:54:46Z

We should only use models that can translate both ways. The following are the language models from 🤗 Hugging Face that we can use to generate the translations (checked when implemented):

abhijeet78880 · 2023-04-14T11:56:37Z

hey..👋🏻
@andrewtavis I have gone through the issue and done some research on it, but could you please explain it in more detail?

andrewtavis · 2023-04-14T13:05:30Z

Hey @abhijeet78880 👋 Yes I definitely expected that more information would be needed. This is a tough one, but as I said just needs some persistence 😊

Will write more later today! :)

andrewtavis · 2023-04-14T16:53:43Z

Ok, sooooo :) There's plenty to do here, with the first part of it being some research you/we could do. This comment is an ongoing list of all machine translation models available from 🤗 Hugging Face that we're using (I just edited it to add links to the models). As of now we only translate from English to the language that a user is typing in, but we want to expand this so that the user can translate from any of Scribe's supported languages to any other. This will be an option in the new menu we'll build in this iOS issue, with the designs for that being found here.

Once we have a translation model we make a file like this one that translates from English to German. In this file we load a JSON of words in the source language (for now only English) that we query from Wikidata and make a list of words from it. We then set the model variable to one of the models we're documenting in the comment mentioned above, and in the for loop we translate the words one by one and add them all to a new JSON. This JSON is then used in Scribe apps to provide a translation 😊 It's not perfect, but it's what we're doing for now 🙃
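The flow described above could be sketched roughly like this. It's a minimal sketch and not the actual Scribe-Data file: the function names are made up for illustration, the model name Helsinki-NLP/opus-mt-en-de is an assumption for the English to German case, and the translator is passed in as a plain callable so that any model from the models comment could be plugged in.

```python
import json
from typing import Callable, Dict, Iterable


def translate_words(
    words: Iterable[str], translate: Callable[[str], str]
) -> Dict[str, str]:
    """Translate each distinct word once and collect the results."""
    translations: Dict[str, str] = {}
    for word in words:
        if word not in translations:
            translations[word] = translate(word)
    return translations


def opus_mt_translator(
    model_name: str = "Helsinki-NLP/opus-mt-en-de",  # assumed model name
) -> Callable[[str], str]:
    """Wrap a Hugging Face translation pipeline (requires `transformers`)."""
    from transformers import pipeline  # lazy import: only needed for real runs

    translator = pipeline("translation", model=model_name)
    return lambda text: translator(text)[0]["translation_text"]


def save_translations(translations: Dict[str, str], path: str) -> None:
    """Write the word-to-translation mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(translations, f, ensure_ascii=False, indent=2)
```

A real run would then look something like `save_translations(translate_words(words, opus_mt_translator()), "translations.json")`, where `words` is the list built from the Wikidata JSON.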

What would be great is if you/we could look and find the missing models that we need for the new translations. A lot of these can just be Helsinki-NLP/opus-mt models, which are giant machine translation (hence mt) models from researchers in Helsinki. As we're using Helsinki-NLP/opus-mt-en-fr for English to French, we can use Helsinki-NLP/opus-mt-fr-en for French to English 🚀
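To make the naming pattern concrete, here's a small sketch that builds the candidate Helsinki-NLP/opus-mt names for every direction of every language pair. The helper names are hypothetical, and the output is only a list of names to look up: it doesn't check that a model actually exists for a given pair.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def opus_mt_name(source_iso: str, target_iso: str) -> str:
    """Build the conventional opus-mt model name for one translation direction."""
    return f"Helsinki-NLP/opus-mt-{source_iso}-{target_iso}"


def candidate_models(iso_codes: List[str]) -> Dict[Tuple[str, str], str]:
    """Map every ordered (source, target) pair to its candidate model name."""
    return {
        (source, target): opus_mt_name(source, target)
        for source, target in permutations(iso_codes, 2)
    }
```

Each generated name would still need to be checked by hand on 🤗 Hugging Face before it goes into the models comment.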

At this point I think it's best if I check in with you and see how the above sounds. If you want to contribute in a simple way at first, going through and finding links for what models we'd need would be best. I could then do a more in depth explanation of the translation file from before and show you how to set up new ones for each of the models we find. We'd then run them, and bam we'd have translation data that we'd then reference and thus give users the option to translate from Spanish to German and all the other options 😊

This has been a lot! Again, just let me know how it sounds and feel free to ask questions. Everything we write here will help make things easier going forward :) :)

andrewtavis · 2023-04-20T07:52:42Z

@abhijeet78880, FYI I also made #35 just now which might be a nice first issue for you 😊 That's checking if the word Scribe is in the database for a given language and adding it if not :)

andrewtavis · 2023-04-23T08:58:40Z

Note that I've updated the models comment with further models we could use from Helsinki-NLP. There do appear to be some holes in their translation model coverage, so for some pairs we'll need to look harder for other models.

At this point we'd be ready to start copying over some translation files 😊 For this the script from any Helsinki-NLP translation file can be used and the model name just needs to be changed :)

andrewtavis · 2023-06-12T23:08:59Z

The following two models could help us plug some of the holes in the above translation coverage:

andrewtavis · 2023-06-21T22:23:58Z

Neither of the above models was what we were looking for. After playing around with T5 a bit more and getting some very subpar German-Portuguese translations, I was able to get some strong results from a dummy JSON dataset using m2m100_418M. I'd say that the small model is enough for our purposes, as the single word or small phrase translation that we're doing isn't going to be improved by a larger model (or only marginally, as larger models would be taking advantage of contextual information that our short input strings would lack).

m2m100_418M should be able to handle all the missing language pairs for Scribe keyboard languages. A general thought might be to create a data pipeline that would use it as the sole model and then just switch the input and output languages as well as the input data during the run. Another thing to factor in is that for now the outputs I was getting were capitalized, as I'm assuming the model is expecting and returning a sentence. This can be remedied by the metadata that comes from Wikidata though, as we'll be querying a base translation corpus that includes word type. We would thus know if a word is a proper noun that needs to be capitalized (or any noun in German), and could otherwise just lowercase it.
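Here's a rough sketch of what switching languages on a single m2m100_418M model and fixing the capitalization afterwards might look like. It assumes the `transformers` classes for M2M100 and the facebook/m2m100_418M checkpoint name. The word type labels ("noun", "proper_noun") are hypothetical stand-ins for whatever the Wikidata metadata ends up providing.

```python
def load_m2m100(name: str = "facebook/m2m100_418M"):
    """Load the model and tokenizer (requires `transformers` and a download)."""
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained(name)
    tokenizer = M2M100Tokenizer.from_pretrained(name)
    return model, tokenizer


def m2m100_translate(text, source_iso, target_iso, model, tokenizer) -> str:
    """Translate with one model by switching source and target languages."""
    tokenizer.src_lang = source_iso
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(target_iso)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


def normalize_case(translation: str, word_type: str, target_iso: str) -> str:
    """Undo sentence-style capitalization using the word type (labels assumed)."""
    if word_type == "proper_noun":
        return translation  # keep the capitalization the model returned
    if word_type == "noun" and target_iso == "de":
        return translation[:1].upper() + translation[1:]  # German nouns
    return translation.lower()
```

Note that lowercasing a whole phrase would also lowercase any proper noun inside it, so multi-word outputs would likely need a finer rule than this.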

Will continue to fine tune the current example and then present the results at the next Scribe Weekly 😊

nyfz18 · 2023-08-22T14:35:32Z

Hi! Sorry for the delay -- I had to figure out a bunch of stuff on my end. Where should I start?

andrewtavis · 2023-08-26T17:38:25Z

No stress on a delay, @nyfz18! Sorry for mine as well :) Let me organize some stuff and I'll send along some pseudocode for how this would be written as I said I would 😊 Generally the steps would be:

- I'll write some SPARQL scripts to get words from each language
  - We'll call those scripts to update the data each time before the translations are done
- I'll write a base script so that we can pass a language to the process like 'English'
  - This allows us to test it more easily
- We then add the following steps:
  - For any language that we're passing, or all languages if we haven't passed anything:
    - Run the script to update the data for this language (source language)
    - Load in the appropriate translation model
    - Translate from the source language to all the other languages
    - As we translate we want to update a dictionary
    - Save this dictionary to a JSON
    - Continue to the next language (if necessary)
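As a starting point, the steps above could be sketched like this. It's only a sketch under some assumptions: `load_words` stands in for running the SPARQL update for a source language, `translate` stands in for whichever model gets loaded, and the function name and output file name are invented for illustration.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def run_translations(
    all_isos: List[str],
    load_words: Callable[[str], List[str]],
    translate: Callable[[str, str, str], str],
    output_dir: Path,
    source_isos: Optional[List[str]] = None,
) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Translate each source language's words into all other languages."""
    results: Dict[str, Dict[str, Dict[str, str]]] = {}
    output_dir.mkdir(parents=True, exist_ok=True)

    # Use the languages that were passed, or all of them if none were.
    for source in source_isos or all_isos:
        words = load_words(source)  # placeholder for the data update step
        translations: Dict[str, Dict[str, str]] = {word: {} for word in words}

        for target in all_isos:
            if target == source:
                continue
            for word in words:
                # Update the dictionary as we translate.
                translations[word][target] = translate(word, source, target)

        # Save one JSON per source language (file name is an assumption).
        out_path = output_dir / f"{source}_translations.json"
        out_path.write_text(
            json.dumps(translations, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        results[source] = translations

    return results
```

Passing the data loading and translation steps in as callables means the loop can be tested with dummy functions before any model is downloaded.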

Do you have any questions on the above, @nyfz18? Btw I messaged on Matrix to see if a check-in call would help for this 🙃

nyfz18 · 2023-08-27T15:42:16Z

Okay, sounds good. I sort of understand, but a check in call might be more helpful!

andrewtavis · 2023-09-08T09:57:46Z

Hey there @nyfz18! 👋 You now have extract_transform/translate.py at your disposal that loads in the model, checks arguments if you'd like to pass them, and prints out the ISO codes at the end. Following the working code there's also some pseudocode that outlines the steps we discussed in the call 😊 Let me know if you have any questions/comments!
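For anyone reading along, the argument checking and ISO code printing could look roughly like the sketch below. This is not the contents of extract_transform/translate.py: the language list, the function names and the argument layout are all assumptions made for illustration.

```python
import argparse
from typing import Dict, List, Optional

# Assumed set of Scribe languages with their ISO 639-1 codes.
LANGUAGE_ISOS: Dict[str, str] = {
    "english": "en",
    "french": "fr",
    "german": "de",
    "italian": "it",
    "portuguese": "pt",
    "russian": "ru",
    "spanish": "es",
    "swedish": "sv",
}


def resolve_languages(names: Optional[List[str]]) -> Dict[str, str]:
    """Return the ISO codes for the given names, or all if none were given."""
    if not names:
        return dict(LANGUAGE_ISOS)
    unknown = [name for name in names if name.lower() not in LANGUAGE_ISOS]
    if unknown:
        raise ValueError(f"Unsupported language(s): {', '.join(unknown)}")
    return {name.lower(): LANGUAGE_ISOS[name.lower()] for name in names}


def main(argv: Optional[List[str]] = None) -> List[str]:
    """Parse optional language names and print their ISO codes."""
    parser = argparse.ArgumentParser(description="Translate Scribe data.")
    parser.add_argument(
        "languages", nargs="*", help="Source languages, e.g. English"
    )
    args = parser.parse_args(argv)
    isos = list(resolve_languages(args.languages).values())
    print(isos)
    return isos
```

Calling `main(["English"])` would limit the run to one source language, which keeps test runs short.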

andrewtavis · 2023-09-18T17:51:43Z

@nyfz18, we'll be doing the conversion of the JSON data that's being produced here in the new issue #46. @lillian-mo will work on that one 🚀

andrewtavis · 2024-02-27T00:44:31Z

Closing this issue as individual ones have been made for each language that can be worked on as a part of Google Summer of Code ☀️ Thanks all for the discussion here! Help on the individual issues would be welcome 😊

andrewtavis added the help wanted and data labels Nov 5, 2022

andrewtavis mentioned this issue Nov 5, 2022

Add English keyboard scribe-org/Scribe-iOS#7

Closed

11 tasks

wkyoshida mentioned this issue Apr 11, 2023

Convert JSON data to SQLite scribe-org/Scribe-iOS#96

Closed

2 tasks

andrewtavis mentioned this issue Apr 13, 2023

Auxiliary verbs for German perfect conjugations #10

Open

2 tasks

andrewtavis added the -priority- label Apr 13, 2023

andrewtavis mentioned this issue Apr 19, 2023

Allow for in-app download of data scribe-org/Scribe-iOS#89

Open

2 tasks

This was referenced May 21, 2023

Switch base translation language based on menu selection scribe-org/Scribe-iOS#307

Open

Menu item to choose which language to translate from scribe-org/Scribe-iOS#255

Open

andrewtavis self-assigned this Jul 17, 2023

andrewtavis mentioned this issue Sep 7, 2023

Update translations baseline data process #44

Closed

2 tasks

andrewtavis assigned nyfz18 Sep 8, 2023

andrewtavis added a commit that referenced this issue Sep 8, 2023

#39 #44 #23 formatting of files and adding words to translate

da6b5c4

andrewtavis added a commit that referenced this issue Sep 8, 2023

#44 #23 update changelog with plans for cross-language translation

10e2d0d

andrewtavis mentioned this issue Sep 18, 2023

Add translation files into the SQLite database process #46

Closed

2 tasks

andrewtavis closed this as completed Feb 27, 2024
