grammr is a language learning tool that aims to make the grammar of a variety of (primarily Indo-European) languages easier to understand.
While there are too many great language learning apps to count, standardized grammatical references for languages can be hard to come by. Dictionaries and inflection tables for individual languages can be found by googling; however, there is so far no standardized tool that provides them across a variety of languages.
The aim of this project is to create a unified API to provide several tools for language learners across multiple languages:
- Translations of texts, including the literal translations of individual words
- Morphological analysis of words, including their part-of-speech tags and features
- Inflection (conjugation and declension) of words
- Anki Flashcard export to make learning easier
I have more ideas for this project, which you can find in the Features and Roadmap sections.
This project, therefore, does not take a didactic approach to learning languages, and should not be compared to an app like Duolingo; it rather aims to be a comprehensive, open reference tool that can be arbitrarily extended for different languages.
The biggest issue right now is data visualization. For prototyping and personal use, I've created a Telegram Bot; however, there is only so much information that fits into a message. Therefore, the creation of a frontend is the highest priority for this project at the moment.
You can try the bot here. Its features currently include:
- Translation of texts.
- Literal translations for each word in a text.
- Morphological analysis of each word in a text.
- Creation of inflections for words.
The morphological analysis is done using spaCy and contains the following information for each word (see the sketch below the list):
- The part of speech (POS) of the word.
- The features of the word, such as Case, Number, Gender, Tense, and Person (depending on the word type), as well as its lemma, i.e. its base form. For more information, see the Universal Dependencies.
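To illustrate, here is a minimal sketch of the kind of output spaCy produces. It assumes the small German model has been installed via `python -m spacy download de_core_news_sm`; grammr itself may use different models per language.

```python
import spacy

# Load the small German pipeline (an assumption for this example).
nlp = spacy.load("de_core_news_sm")
doc = nlp("Ich bin ein Student")

for token in doc:
    # token.pos_ is the UD part-of-speech tag, token.lemma_ the base form,
    # and token.morph the UD feature set, e.g. Mood=Ind|Number=Sing|Person=1|Tense=Pres.
    print(token.text, token.pos_, token.lemma_, token.morph)
```

`token.morph.to_dict()` turns the feature string into a plain mapping, which is convenient when serializing the analysis over an API.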
I'm looking to add more features in the near future. These may include:
- The ability to arbitrarily add languages (solved via ConfigMaps)
- A solid frontend, in addition to the Telegram bot currently available
- Inflection tables
- Anki flash card export
Beyond the features mentioned above, I'm also looking to make the technology stack more robust. I believe this application should be able to scale in complexity, so I'm investing a lot of time and energy into keeping the architecture as clean as possible.
Features and improvements will include:
- A proper CI/CD pipeline, including automated deployment (halfway there)
- Adding several more languages, which means creating additional sidecars for morphological analysis
- Building a frontend, possibly as a web app, possibly as cross-platform apps
- Benchmarking to guard against quality regressions in prompts
- Extracting the Telegram bot entirely from the core service. It was useful for prototyping, but should not be tied to the main application.
I really like a project that is very similar to this one: @SalahEddineGhamri's textile. Unfortunately, the only way to interact with textile is through the CLI, making it inaccessible to a huge number of people. I would prefer to create the frontend for this project in some JS flavour (Angular, React, Vue, etc.) to build a more accessible web application with portability potential.
Listed here are projects that I am either using or would consider integrating into this project.
- spaCy: Morphological analysis for a variety of languages.
- pymorphy3: Fork of the currently unmaintained pymorphy2. Provides inflections for Russian (see the sketch below).
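As a rough illustration of how pymorphy3 handles Russian inflection (grammeme names follow the OpenCorpora tagset it inherited from pymorphy2):

```python
import pymorphy3

morph = pymorphy3.MorphAnalyzer()

# parse() returns candidate analyses, most probable first.
word = morph.parse("стол")[0]

# Inflect to nominative plural; inflect() returns None if the
# grammeme combination is impossible for this word.
plural = word.inflect({"plur", "nomn"})
print(plural.word)  # столы
```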
Interesting related projects:
- textile: Inspiration for a potential UI
- mathigatti/spanish_inflections: Inflections for Spanish
- DuyguA/DEMorphy: Inflections for German. Technically, this library only provides morphological analysis, but it does use a comprehensive lexicon under the hood, so creating inflections should be possible.
- TimoBechtel/satzbau: Creation of natural language German texts, including declension and conjugation.
- verbecc: Verb conjugation for a variety of Romance languages.
I tried to make running the project yourself as straightforward as possible. What you'll need:
- A Telegram bot token. You can get one by talking to the BotFather.
- An OpenAI API key. You can get one by signing up here.
Given those, you can run the project with a local configuration or by deploying the packaged Helm chart. My recommendation is using an `.envrc` file so you always have your environment variables handy, but do whatever works best for you.
Running `./scripts/run.sh` launches a Postgres container and the Python sidecar running spaCy. The Spring application is then run via `mvn spring-boot:run`.
If you choose to run `./scripts/deploy.sh`, a Helm chart will be created and deployed to your Kubernetes cluster. This requires a running Kubernetes cluster (you could try Minikube or k3s) and Helm to be installed. You should set a `DATASOURCE_PASSWORD` environment variable, which will be used to create a secret in the cluster for authentication with Postgres.
PRs are welcome! I'm happy to help you get started with the project, so feel free to reach out.
You can run unit tests with `./scripts/unit-test.sh` and integration tests with `./scripts/integration-test.sh`. Alternatively, run both with `./scripts/qa.sh`. This will also ensure the most accurate test coverage report is generated, which you can access at `target/site/jacoco/index.html`.
Different projects use different terms to describe similar concepts. For instance, while pymorphy2 refers to the root form of a word as the `lexeme`, spaCy refers to it as the `lemma`. I've tried to standardize the terms used in a unified domain language that is used across the application and in the APIs it exposes. The design is a work in progress, but I do think it offers some genuine value.
- `Token` refers to a singular word of a phrase that contains its literal translation and morphological analysis.
- `Phrase` refers to a collection of `Tokens` that form a sentence.
- A `SemanticTranslation` is a `Phrase` that is a translation of another `Phrase`. The meaning of the original phrase is preserved as well as possible in the translation. It consists of a `source_phrase` and a `target_phrase`.
- A `LiteralTranslation` is a direct, literal translation of a phrase. This can help users better understand how phrases are constructed. It consists of the source phrase and a collection of `TokenTranslations`, which are key-value pairs of tokens from the source phrase and their directly translated counterparts. For example, "Ich bin ein Student" would contain four translated tokens: `(Ich, I)`, `(bin, am)`, `(ein, a)`, `(Student, student)`.
- `Tokens` are aggregated through the process of literally translating and grammatically analyzing phrases. These processes return sets of `TokenTranslations` and `TokenMorphology` respectively, which are coalesced into `Tokens` that ultimately make up a `Phrase`. Therefore, an `Analysis` consists of a set of `Tokens` holding their literal translation and grammatical analysis, as well as the semantic translation of the phrase. A sketch of this model follows below.
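To make those relationships concrete, here is an illustrative sketch of the domain language as Python dataclasses. The actual application is written in Java, so the exact field names here are assumptions rather than the real API:

```python
from dataclasses import dataclass

@dataclass
class TokenMorphology:
    lemma: str                 # base form, e.g. "sein" for "bin"
    pos: str                   # Universal Dependencies POS tag, e.g. "AUX"
    features: dict[str, str]   # UD features, e.g. {"Person": "1", "Tense": "Pres"}

@dataclass
class Token:
    text: str                  # the word as it appears in the phrase
    translation: str           # its literal (word-for-word) translation
    morphology: TokenMorphology

@dataclass
class Phrase:
    tokens: list[Token]        # a collection of Tokens forming a sentence

@dataclass
class SemanticTranslation:
    source_phrase: Phrase
    target_phrase: Phrase      # meaning-preserving translation of the source

@dataclass
class Analysis:
    tokens: list[Token]        # coalesced from TokenTranslations and TokenMorphology
    semantic_translation: SemanticTranslation
```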
I've decided to use the Universal Dependencies as a reference for my part-of-speech and feature definitions, as well as for the domain language defined above.
- `Part of Speech` (POS) is a grammatical category of words that have similar grammatical properties. Read more: https://universaldependencies.org/u/pos/index.html
- `Features` are morphological and syntactic properties of words. They are used to distinguish between different grammatical categories of words. Read more: https://universaldependencies.org/u/feat/index.html
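For reference, UD writes features as `Key=Value` pairs joined by `|`, e.g. `Case=Nom|Gender=Masc|Number=Sing`. A small, purely illustrative helper for parsing that notation into a mapping:

```python
def parse_ud_features(feats: str) -> dict[str, str]:
    """Parse a UD feature string like "Case=Nom|Number=Sing" into a dict."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

print(parse_ud_features("Case=Nom|Gender=Masc|Number=Sing"))
# {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing'}
```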