Machine learning project for predicting loanwords
When speakers of a given language borrow words from another language, they will modify those words to conform to the phonology of their native language. They will substitute sounds that do not exist in their language with the ones they deem the closest, modify syllables to conform to their language's syllable structure, et cetera. For example, consider these borrowings into Japanese:
Loanword | Romanization | Original Word |
---|---|---|
アイスクリーム | aisu kurīmu | English ice cream |
アパート | apāto | English apartment |
アルバイト | arubaito | German Arbeit "work" |
ラッコ | rakko | Ainu rakko "sea otter" |
トマト | tomato | Nahuatl tomatl "tomato" (via European languages) |
Let us examine the first borrowing, "ice cream" -> "aisu kurīmu". A Japanese syllable must be of the form (C)(j)V(Q/N), where C is a consonant, j is the palatal glide, V is a vowel, Q is the first half of a geminate consonant, and N is a generic nasal. The English word "ice" /aɪs/ does not conform to this syllable structure because of its coda /-s/, so it is repaired to /aisu/ through the epenthesis (insertion) of a /u/. Note that this changes the segment /s/ from a coda (syllable-final consonant) into an onset (syllable-initial consonant).
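The syllable template can be checked mechanically. Below is a minimal sketch, assuming a simplified romanized alphabet in which `Q` and `N` are written as abstract uppercase symbols and long vowels are written as doubled vowels. The consonant inventory and the name `is_well_formed` are illustrative assumptions, not part of the project code.

```python
import re

# Simplified inventories (an assumption for illustration, not a full analysis).
CONSONANTS = "kgsztdnhbpmrw"
VOWELS = "aiueo"

# One syllable: optional consonant, optional glide j, a vowel, optional Q or N.
SYLLABLE = f"[{CONSONANTS}]?j?[{VOWELS}][QN]?"
WORD = re.compile(f"(?:{SYLLABLE})+")


def is_well_formed(word: str) -> bool:
    """Return True if the whole word parses as a sequence of (C)(j)V(Q/N) syllables."""
    return WORD.fullmatch(word) is not None


for form in ["ais", "aisu", "kriim", "kuriimu", "raQko"]:
    print(form, is_well_formed(form))
```

Under these assumptions, the unrepaired forms `ais` and `kriim` are rejected, while the repaired forms `aisu` and `kuriimu` are accepted.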
Similarly, the English word "cream" /kɹiːm/ contains a coda /-m/, which is again repaired by epenthesis of a /u/. It also contains a consonant cluster /kɹ/, which is invalid in Japanese and is likewise repaired by epenthesis of a /u/. Several questions immediately arise: why is epenthesis of a /u/ the most common repair strategy? Why could it not have been a different vowel, such as /a/ or /i/? And why is epenthesis the strategy at all, when one could have instead deleted the offending consonants and produced a form like */ai kīmu/? Why is the final /m/ in "cream" repaired at all when a syllable-final nasal is tolerated in Japanese? (This last question has a more ready answer: /m/ has a place specification, while the Japanese syllable-final nasal generally lacks a place feature. But the question remains in a more abstract form: why not repair the segment by eliminating the place feature, yielding a form like */aisu kurīn/?)
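The epenthesis repair described above can be written as a toy rule-based baseline. This is only a sketch under strong assumptions: it inserts a fixed vowel after every consonant that is not followed by a vowel, and it ignores the glide, the geminate Q, and the nasal N. The function name `epenthesize` and its parameters are hypothetical, and the project's actual model is a learned one rather than this rule.

```python
VOWELS = set("aiueo")


def is_vowel(segment: str) -> bool:
    """A segment counts as a vowel if its base symbol is a vowel (so 'iː' qualifies)."""
    return segment[0] in VOWELS


def epenthesize(segments, vowel="u"):
    """Insert `vowel` after every consonant that is not directly followed by a vowel."""
    repaired = []
    for index, segment in enumerate(segments):
        repaired.append(segment)
        if is_vowel(segment):
            continue
        following = segments[index + 1] if index + 1 < len(segments) else None
        if following is None or not is_vowel(following):
            repaired.append(vowel)
    return repaired


print("".join(epenthesize(["a", "i", "s"])))        # coda /-s/ gets a vowel
print("".join(epenthesize(["k", "r", "iː", "m"])))  # cluster and coda both repaired
```

Passing a different `vowel` argument makes it easy to generate the unattested alternatives that the questions above ask about.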
TODO finish copy
- Identify a source of compute for this project, e.g. Google Colab or a professor's servers (used Google Colab)
- Identify which language to study in this project (will be doing Japanese — its phonotactics and loanword acquisition are well-studied. Furthermore, Japanese has rather simple phonotactics, so it seems like an easier task for a model to learn; and it has copious amounts of well-documented recent loanwords.)
- Find a corpus representing naturalistic speech in this language (BCCWJ seems good)
- Find lists of loanwords with etymologies for this language, perhaps by scraping Wiktionary
- Design an autoencoder model to learn this task. RNN seems well-suited, but a CNN may be useful for long distance effects that could be relevant: vowel harmony in Turkish, Lyman's law in Japanese. A transformer may be well-suited but seems overkill. (used RNN)
- Summarize results in a writeup
- The Balanced Corpus of Contemporary Written Japanese (BCCWJ) hosted here.
- The word frequency list (BCCWJ_frequencylist_suw_ver1_0.zip) and its manual (BCCWJ_frequencylist_manual_ver1_0.pdf) are available here
- https://japanesetactics.com/english-gairaigo-list-learn-301-japanese-words-in-10-minutes
- Idea: there may be some unconventional sources of data, such as transcriptions of demon names in Shin Megami Tensei. In general, video games and anime may provide a source of nativizations and may reflect more recent trends that older loans like the above do not. On the other hand, it is not clear that romanizations of, say, names use the same phonological processes as the loans above, which were not as intentional.
- scraping Japanese Wiktionary, such as lists of Gairaigo?
- https://github.com/bootphon/phonemizer to convert Japanese to IPA
- https://github.com/dmort27/panphon to convert IPA to feature vectors
- https://github.com/dmort27/epitran as another way to convert text to IPA (it can also produce feature vectors by way of panphon)
- https://lingpy.org/ to align words
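To illustrate what the alignment step is for, here is a minimal standard-library sketch of global alignment (Needleman-Wunsch with unit costs) between a source form and its loanword. It is a stand-in for illustration only and does not use or reproduce the LingPy API; the function name `align` and the gap symbol are assumptions.

```python
def align(source, target, gap="-"):
    """Globally align two segment lists with unit edit costs (Needleman-Wunsch)."""
    n, m = len(source), len(target)
    # cost[i][j] is the minimal cost of aligning source[:i] with target[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((source[i - 1], gap))  # segment deleted in the loan
            i -= 1
        else:
            pairs.append((gap, target[j - 1]))  # segment inserted in the loan
            j -= 1
    return pairs[::-1]


pairs = align(list("kriim"), list("kuriimu"))
inserted = [loan for src, loan in pairs if src == "-"]
print(pairs)
print(inserted)
```

Segments paired with the gap symbol on the source side are the ones the loanword has inserted, which is how epenthetic vowels could be read off aligned word pairs.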