We develop the Classical Language Toolkit (CLTK) because we believe it is revolutionizing the study of the ancient world. It is doing so by removing barriers to entry for those doing natural language processing (NLP) in Classical languages (namely, the surviving literature of the entirety of Eurasia and north Africa, from roughly 3000 B.C. to A.D. 1500).
Due to how academic disciplines have evolved over the past 200 years, our earliest civilizations are often studied in isolation from one another. This is tragic, for today we know that the ancient world – from Rome to Mesopotamia to India to China – consisted of deeply interconnected networks of ideas, technologies, art, and beliefs. As a framework for multidisciplinary research, the CLTK will help scholars discover the commonalities of what were once thought disparate cultures.
As software, the CLTK is a suite of NLP tools suited to the special needs of ancient languages. We have three goals. The most basic is to offer low-level libraries for doing NLP in particular Classical languages (e.g., Ancient Greek, Sanskrit). Developed with an extensible architecture, our code is easily hacked to support new languages. Second, the CLTK offers tools for students and scholars to do reproducible scientific research. For instance, it has version-controlled linguistic corpora and a suite of functions for stylometrics. Third, it is a framework for multidisciplinary language research. With pre-trained models (such as Word2Vec for vector space models), we provide easy-to-use tools to capture the transmission and evolution of knowledge, from the earliest human societies to the dawn of the modern era.
See our Project ideas page for a list of tasks that are suited to three months' work for a beginning-to-intermediate programmer. Most work needs to be written in Python or JavaScript, of which an intermediate-level knowledge is expected. For all projects, we ask that you carefully read and follow our GSoC proposal template.
For a project focused on a Classical language, it is a requirement that you have an intermediate-level knowledge of that language. If you do not know much about NLP, that is fine; the mentors can work with you on this. If you know a Classical language that is not yet supported well by the CLTK (e.g., Hebrew, Sanskrit, Tibetan, Chinese), you may follow the pattern set by the current Greek and Latin libraries. See the docs for an idea of what functionalities we have for various languages.
For those not familiar with an ancient language, we would like to see a web project that builds on the work (both in JavaScript) of two previous GSoC students. In 2016, a text reading environment was created. In 2017, a text-serving backend and annotation framework, both specifically built to handle the diverse conventions of pre-modern texts, were added to this reading environment. For 2018, we would like to see the work of the previous summers fully abstracted into libraries and published to the npm registry. See the Project ideas page for full technical details.
If you are unsure of how to contribute, please reach out to us on Gitter or Google Groups.