The data is provided as tab-delimited text files, with separate files for IPA transcriptions, romanizations, and ASCII romanizations. Multiple transcriptions and/or romanizations of the same word are given as separate entries for that word. Fully duplicate entries are removed.
File names are Wiktionary language codes.
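As a rough illustration of how such a file might be consumed, here is a minimal Python sketch. It assumes a two-column layout (word, then transcription or romanization), which is not spelled out above, so treat the column order as an assumption. The names `read_entries` and `group_by_word` and the sample rows are hypothetical.

```python
import csv
import io

def read_entries(handle):
    """Yield (word, value) pairs from a tab-delimited file object.

    Assumes column 0 is the word and column 1 is its transcription
    or romanization (an assumption, not a documented layout).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]

def group_by_word(pairs):
    """Collect every value listed for a word, since a word may span several entries."""
    grouped = {}
    for word, value in pairs:
        grouped.setdefault(word, []).append(value)
    return grouped

# Hypothetical sample rows standing in for a real file.
sample = io.StringIO("read\t/ɹiːd/\nread\t/ɹɛd/\ncat\t/kæt/\n")
entries = group_by_word(read_entries(sample))
```

With a real file, the `io.StringIO` object would be replaced by something like `open(path, encoding="utf-8", newline="")`, where `path` points to the file for the language code of interest.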
Wiktionary data is collected partly using a Wiktextract dump of the English Wiktionary and partly with a custom tool by Tamila Krashtan. Clipped transcriptions (such as /-səɹi/ in /ˈdʒænəˌzeɹi/, /-səɹi/) are skipped. The data is provided in a canonically decomposed form.
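Canonical decomposition corresponds to Unicode normalization form NFD, which matters when comparing strings from this data against text from other sources. The sketch below uses only Python's standard `unicodedata` module. The helper name `is_nfd` and the sample string are illustrative.

```python
import unicodedata

def is_nfd(text):
    """Return True if the text is already in canonically decomposed form (NFD)."""
    return unicodedata.normalize("NFD", text) == text

# A precomposed "é" (one code point) versus its decomposed form
# ("e" followed by a combining acute accent).
composed = "caf\u00e9"
decomposed = unicodedata.normalize("NFD", composed)
```

Normalizing external text to NFD before looking it up in the data avoids mismatches between visually identical strings.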
ASCII romanizations are identical to the romanizations found in Wiktionary, except for being additionally normalized (using AnyAscii) to only contain lowercase Latin letters (a-z) and spaces.
Both ASCII and non-ASCII romanizations are currently filtered to be at most 20 (non-combining) characters long, which helps make the data much cleaner. This constraint applies to the words being romanized as well.
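The length constraint can be sketched roughly as follows. This is not the filter actually used: it assumes that "non-combining" means a canonical combining class of zero, as reported by `unicodedata.combining`, and the names `base_length`, `keep_entry`, and `MAX_LENGTH` are hypothetical.

```python
import unicodedata

MAX_LENGTH = 20  # limit stated in the text above

def base_length(text):
    """Count characters whose canonical combining class is 0.

    Assumption: this is what "non-combining" means for the filter.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

def keep_entry(word, romanization, limit=MAX_LENGTH):
    """Keep an entry only if both the word and its romanization fit the limit."""
    return base_length(word) <= limit and base_length(romanization) <= limit
```

Because the data is decomposed, a letter with diacritics counts once under this scheme: `base_length("cafe\u0301")` counts the four base letters and ignores the combining accent.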
CMU Dictionary transcriptions were converted into IPA with a straightforward[1] algorithm: see conversion chart. Note that AH in unstressed syllables is represented as ə, and ER as ɚ. These are the only instances of vowel reduction applied.
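To make the reduction rule concrete, here is a small sketch of a phone-by-phone conversion. It is not the conversion actually used, and the table is a tiny hypothetical subset rather than the referenced chart. It assumes CMU-style stress digits (0 for unstressed), maps stressed AH to ʌ and stressed ER to ɝ (both assumptions), and leaves out stress marks and syllabification entirely.

```python
# Hypothetical, deliberately incomplete phone table for illustration only.
PHONE_TO_IPA = {
    "B": "b", "D": "d", "K": "k", "N": "n", "S": "s", "T": "t",
    "AE": "æ", "IY": "i",
}

def convert_phone(phone):
    """Convert one ARPABET phone, such as 'AH0' or 'T', to an IPA symbol."""
    if phone[-1].isdigit():
        base, stress = phone[:-1], phone[-1]
    else:
        base, stress = phone, None
    if base == "AH":
        # Reduced to schwa only when unstressed; ʌ otherwise (assumed).
        return "ə" if stress == "0" else "ʌ"
    if base == "ER":
        # Reduced to ɚ only when unstressed; ɝ otherwise (assumed).
        return "ɚ" if stress == "0" else "ɝ"
    return PHONE_TO_IPA[base]

def convert(pronunciation):
    """Convert a space-separated CMU pronunciation to an IPA string without stress marks."""
    return "".join(convert_phone(p) for p in pronunciation.split())
```

For example, `convert("B AH1 T ER0")` keeps the stressed AH as a full vowel and reduces only the final unstressed ER.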
[1] That is, apart from the syllabification bit.