Get smart translations of sentences for various languages, including literal translations for and information on the inflections of words.

twaslowski/grammr

grammr

grammr is a language learning tool that aims to make understanding the grammar of a variety of primarily Indo-European languages easier.

About

While there are too many great language learning apps to count, standardized grammatical references for languages can be hard to come by. It is possible to find dictionaries and inflection tables for different languages by googling; however, there is no standardized tool so far that does this for a variety of languages.

The aim of this project is to create a unified API to provide several tools for language learners across multiple languages:

  • Translations of texts, including the literal translations of individual words
  • Morphological analysis of words, including their part-of-speech tags and features
  • Inflection (conjugation and declension) of words
  • Anki Flashcard export to make learning easier

I have more ideas for this project, which you can find in the Features and Roadmap sections.

This project, therefore, does not take a didactic approach to learning languages, and should not be compared to an app like Duolingo; it rather aims to be a comprehensive, open reference tool that can be arbitrarily extended for different languages.

The biggest issue right now is data visualization. For prototyping and personal use, I've created a Telegram Bot; however, there is only so much information that fits into a message. Therefore, the creation of a frontend is the highest priority for this project at the moment.

[Example Image]

You can try the bot here:

https://t.me/lingolift_bot

Features

  • Translation of texts.
  • Literal translations for each word in a text.
  • Morphological analysis of each word in a text.
  • Creation of inflections for words.

The morphological analysis is done using spaCy, and contains the following information for each word:

  • The part of speech (POS) of a word.
  • The features of a word, such as Case, Number, Gender, Tense and Person (depending on the word type), as well as its lemma, i.e. its base form. For more information on this, see the Universal Dependencies documentation.
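
As a rough illustration of what the feature information looks like: Universal Dependencies writes features as Name=Value pairs joined by "|" (for example Case=Nom|Number=Sing), and spaCy renders a token's morphology as a string in the same shape. The sketch below parses such a string into a dictionary using only the standard library. The function name parse_features is hypothetical and is not part of grammr or spaCy.

```python
from __future__ import annotations


def parse_features(feats: str) -> dict[str, str]:
    """Parse a UD-style feature string such as 'Case=Nom|Number=Sing'."""
    features: dict[str, str] = {}
    # An empty string or "_" (the CoNLL-U placeholder) means "no features".
    if feats in ("", "_"):
        return features
    for pair in feats.split("|"):
        # Split each pair at the first "=" into feature name and value.
        name, _, value = pair.partition("=")
        features[name] = value
    return features


if __name__ == "__main__":
    print(parse_features("Case=Gen|Gender=Masc|Number=Plur"))
```

A dictionary keeps lookups such as features.get("Case") simple; words without a given feature just lack the key.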

I'm looking to add more features in the near future. These may include:

  • The ability to arbitrarily add languages (solved via Kubernetes ConfigMaps)
  • A solid frontend, in addition to the Telegram bot currently available
  • Inflection tables
  • Anki flash card export
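
The export format for Anki is not decided yet. One simple option, sketched here as an assumption, is a tab-separated text file: Anki's text import accepts plain text with one note per line and fields separated by tabs. The function name to_anki_tsv and the front/back layout are hypothetical.

```python
import csv
import io


def to_anki_tsv(pairs):
    """Render (front, back) pairs as tab-separated text, one note per line."""
    buffer = io.StringIO()
    # csv.writer handles quoting if a field ever contains a tab or newline.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for front, back in pairs:
        writer.writerow([front, back])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hand-written sample cards based on the README's example phrase.
    print(to_anki_tsv([("Ich bin ein Student", "I am a student"),
                       ("bin", "am")]), end="")
```

Writing the returned string to a .txt file would give something that can be loaded through Anki's import dialog; richer formats (decks, note types) would need a different approach.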

Technical Roadmap

Beyond the features mentioned above, I'm also looking to make the technology stack more robust. I believe that this application should be able to cope with growing complexity, so I'm investing a lot of time and energy into keeping the architecture as clean as possible.

Features and improvements will include:

  • A proper CI/CD pipeline, including automated deployment (halfway there)
  • Adding several more languages, which means creating additional sidecars for morphological analysis.
  • Building a frontend, possibly a web app, possibly cross-platform apps.
  • Benchmarking prompts to catch quality regressions.
  • Extracting the Telegram bot entirely from the core service. It was useful for prototyping, but should not be tied to the main application.

On the Frontend

I really like a project that is very similar to this one: @SalahEddineGhamri's textile. Unfortunately, the only way to interact with textile is through the CLI, which makes it inaccessible to a large number of people. I would prefer to build the frontend for this project in some JS flavour (Angular, React, Vue etc.) to create a more accessible web application with portability potential.

Reference image from Textile, taken from the original repository

Related projects

Listed here are projects that I am either using or would consider integrating into this project.

  • spaCy: Morphological analysis for a variety of languages.
  • pymorphy3: Fork of the currently unmaintained pymorphy2. Provides inflections for Russian.

Interesting related projects:

  • textile: Inspiration for a potential UI
  • mathigatti/spanish_inflections: Inflections for Spanish
  • DuyguA/DEMorphy: Inflections for German. Technically, this library only provides morphological analysis, but it does use a comprehensive lexicon under the hood, so creating inflections should be possible.
  • TimoBechtel/satzbau: Creation of natural language German texts, including declension and conjugation.
  • verbecc: Verb conjugation for a variety of Romance languages.

Running

I tried to make running the project yourself as straightforward as possible. What you'll need:

  • A Telegram bot token. You can get one by talking to the BotFather.
  • An OpenAI API key. You can get one by signing up here.

Given those, you can run the project with a local configuration or by deploying the packaged Helm chart. My recommendation is using an .envrc file so you always have your environment variables handy, but do whatever works best for you.

By running ./scripts/run.sh, a Postgres container and the Python sidecar running spaCy will be launched. The Spring application will be run via mvn spring-boot:run.

If you choose to run ./scripts/deploy.sh, a Helm chart will be created and deployed to your Kubernetes cluster. This will require a running Kubernetes cluster (you could try using Minikube or k3s) and Helm to be installed. You should set a DATASOURCE_PASSWORD environment variable, which will be used to create a secret in the cluster for authentication with Postgres.

Developing

PRs are welcome! I'm happy to help you get started with the project, so feel free to reach out.

You can run unit tests with ./scripts/unit-test.sh and integration tests with ./scripts/integration-test.sh. Alternatively, run both with ./scripts/qa.sh. This will also ensure the most accurate test coverage report is generated, which you can access at target/site/jacoco/index.html.

Domain Language

Different projects use different terms to describe similar concepts. For instance, while pymorphy2 refers to the root form of a word as the normal form, spaCy refers to it as the lemma. I've tried to standardize the terms into a unified domain language that is to be used across the application and in the APIs it exposes.

The design is a work in progress, but I do think it offers some genuine value.

  • A Token refers to a single word of a phrase. It contains:

    • a source_text
    • a lemma
    • a pos (part of speech, such as NOUN; see the Universal Dependencies reference)
    • a feature_set (a set of features, such as NUMBER=PLURAL and CASE=GEN; see the Universal Dependencies reference)
    • an optional ancestor (a reference to another Token in the phrase that it relates to)

  • A Phrase refers to a collection of Tokens that form a sentence.

  • A SemanticTranslation is a Phrase that is a translation of another Phrase. The meaning of the original phrase is preserved as well as possible in the translation. It consists of a source_phrase and a target_phrase.

  • A LiteralTranslation is a direct, literal translation of a phrase. This can help users better understand how phrases are constructed. It consists of the source phrase and a collection of TokenTranslations, each of which is a key-value pair of a token from the source phrase and its directly translated counterpart. For example, "Ich bin ein Student" would contain four translated tokens: (Ich, I), (bin, am), (ein, a), (Student, student).

  • Tokens are aggregated through the process of literally translating and grammatically analyzing phrases. These processes return sets of TokenTranslations and TokenMorphologies respectively, which are coalesced into Tokens that ultimately make up a Phrase. Therefore, an Analysis consists of a set of Tokens holding their literal translation and grammatical analysis, as well as the semantic translation of the phrase.
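
To make the coalescing step more concrete, here is a minimal Python sketch of the terms above. It is not the actual implementation (the core service is a Spring application). The class layout, the coalesce function and the choice to join both result sets on the source text are assumptions made for illustration, and the morphology values in the example are hand-written rather than real spaCy output.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class TokenTranslation:
    source_text: str   # the word as it appears in the source phrase
    translation: str   # its literal translation


@dataclass(frozen=True)
class TokenMorphology:
    source_text: str
    lemma: str
    pos: str
    feature_set: FrozenSet[str] = frozenset()


@dataclass
class Token:
    source_text: str
    translation: Optional[str] = None
    lemma: Optional[str] = None
    pos: Optional[str] = None
    feature_set: FrozenSet[str] = frozenset()
    ancestor: Optional[Token] = None  # optional related Token in the phrase


def coalesce(translations: Iterable[TokenTranslation],
             morphologies: Iterable[TokenMorphology]) -> List[Token]:
    """Merge translation and morphology results into Tokens."""
    # Assumption: both result sets can be joined on the source text.
    # A phrase that repeats a word would need a positional index instead.
    morphology_by_text = {m.source_text: m for m in morphologies}
    tokens = []
    for t in translations:
        m = morphology_by_text.get(t.source_text)
        tokens.append(Token(
            source_text=t.source_text,
            translation=t.translation,
            lemma=m.lemma if m else None,
            pos=m.pos if m else None,
            feature_set=m.feature_set if m else frozenset(),
        ))
    return tokens


if __name__ == "__main__":
    # Hand-written sample data for "Ich bin ein Student".
    translations = [
        TokenTranslation("Ich", "I"), TokenTranslation("bin", "am"),
        TokenTranslation("ein", "a"), TokenTranslation("Student", "student"),
    ]
    morphologies = [
        TokenMorphology("Ich", "ich", "PRON", frozenset({"CASE=NOM"})),
        TokenMorphology("bin", "sein", "AUX", frozenset({"NUMBER=SING"})),
        TokenMorphology("ein", "ein", "DET"),
        TokenMorphology("Student", "Student", "NOUN", frozenset({"CASE=NOM"})),
    ]
    for token in coalesce(translations, morphologies):
        print(token.source_text, token.translation, token.lemma, token.pos)
```

A token that has a translation but no morphology still becomes a Token, with its grammatical fields left empty, so a partial analysis does not drop words from the phrase.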

Part of Speech & Features

I've decided to use the Universal Dependencies as reference for my part-of-speech and feature definitions as well as the domain language defined above.
