We built a corpus for the Kazakh language from the Wikipedia dump (https://dumps.wikimedia.org/kkwiki/). We used WikiExtractor (https://github.com/attardi/wikiextractor) to parse the data and nltk to build n-grams.
A total of 21 million words were collected, of which almost 600 thousand are distinct word forms (different derivations).
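
The sketch below illustrates how word counts and n-grams of this kind could be computed from WikiExtractor output. It is not the exact script used to build the corpus. The regex tokenizer, the helper names `tokenize` and `ngram_counts`, and the sample text are assumptions made for illustration. `nltk.util.ngrams` is the only nltk call used, and a pure-Python fallback is included so the example also runs without nltk installed.

```python
import re
from collections import Counter

try:
    from nltk.util import ngrams  # n-gram helper from nltk
except ImportError:
    # Fallback (assumption: nltk may be missing); yields the same tuples.
    def ngrams(sequence, n):
        seq = list(sequence)
        return zip(*(seq[i:] for i in range(n)))

# WikiExtractor wraps each article in <doc ...> ... </doc> marker lines.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")


def tokenize(text):
    """Lowercase the text and return its word tokens, skipping <doc> marker lines."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        # [^\W\d_]+ matches runs of Unicode letters, including Kazakh Cyrillic.
        tokens.extend(re.findall(r"[^\W\d_]+", line.lower()))
    return tokens


def ngram_counts(tokens, n):
    """Count n-grams over the token stream.

    Simplification: n-grams may span sentence and line boundaries.
    """
    return Counter(ngrams(tokens, n))


# Hypothetical sample in WikiExtractor's output format.
sample = """<doc id="1" url="https://kk.wikipedia.org/wiki?curid=1" title="Қазақстан">
Қазақстан Орталық Азияда орналасқан мемлекет.
Қазақстан астанасы Астана қаласы.
</doc>"""

tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print("total words:", len(tokens))
print("distinct word forms:", len(unigrams))
print("most common bigrams:", bigrams.most_common(3))
```

On a full dump, the same functions could be applied file by file to the extracted output, summing the `Counter` objects to obtain corpus-wide totals.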
The corpus is available at: https://drive.google.com/drive/folders/1A5xfmSaf3JW4b9F8dTNKkao090SUwPJ6?usp=sharing