This tokenizer was trained on the Arabic subset of the Wikipedia dataset and is based on the WordPiece algorithm.
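
A minimal sketch of how a WordPiece tokenizer like this one could be trained with the Hugging Face `tokenizers` library; the corpus file name, vocabulary size, and special tokens below are illustrative assumptions, not the exact settings used for this tokenizer.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build a WordPiece model with an unknown-token placeholder
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,  # assumed value; the actual vocabulary size is not stated above
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# "arwiki.txt" is a hypothetical plain-text export of the Arabic Wikipedia dump
tokenizer.train(files=["arwiki.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

# Quick check: encode an Arabic sentence and inspect the subword pieces
encoding = tokenizer.encode("مرحبا بالعالم")
print(encoding.tokens)
```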
References:

[1] Fast WordPiece Tokenization, arXiv:2012.15524.

[2] Wikimedia Foundation, "Wikimedia Downloads", https://dumps.wikimedia.org