A tutoring web platform which allows professors and students to collaborate:
- professors can post exam questions, exercises, and give advice to students
- students can answer questions, rate each other's answers, and simulate exams.
This repository is structured in the following way:
- a `client` directory for the ReactJS frontend
- an `ml` directory for the PyTorch auto-moderation ML pipelines
- a `server` directory for the FastAPI backend
You can run the whole package in one command: `docker compose up -d`, after having installed Docker Desktop on your macOS or Windows machine, or the `docker.io` package on your Linux machine. (Update: Docker Desktop is now also available on Linux.)
This command will bring up all the Docker containers needed to run the project in development mode, plus some bonuses, exposed to you as follows:
| Service | Address |
|---|---|
| ReactJS Frontend Client | http://localhost:3000 |
| Backend APIs | http://localhost:8000/v1 |
| Backend APIs Documentation | http://localhost:8000/docs |
| Mongo Express web UI for MongoDB | http://localhost:8081 |
| Backend APIs (reverse proxied by nginx) | http://localhost/v1 |
| Backend static resources (reverse proxied by nginx) | http://localhost/static |
| MongoDB | mongodb://root:example@localhost:27017 |
It may be useful to use the Docker Desktop dashboard to open a container's service port directly or to follow a specific container's logs, as shown below:
*(Screenshots: the Containers view and a Single Container Logs view in Docker Desktop.)*
If you're on Linux, lazydocker is an excellent CLI alternative to display the same kind of information.
To develop the client stand-alone:
- Install Node.js
- `cd client`
- `npm i`
- `npm start`
If you intend to commit any changes, please also run `npm i` in the root (`skeeelled`) directory, so that the git pre-commit hooks can run correctly.
At the moment we have configured husky to lint all the files residing in the `client` directory with `prettier`.
All your changed files should be properly linted by this procedure to avoid unnecessary merge conflicts.
The latest commit to the `master` branch gets automatically deployed to our staging instance 👉 here 👈.
To develop the backend, we recommend running `docker compose up -d` as suggested above, since both MongoDB and FastAPI are required, and Mongo Express may be useful.
If you really want to go down the FastAPI stand-alone route (and note that you will have to run a MongoDB instance separately), you can run the following commands:
- Install Python
- `cd server`
- `python3 -m venv venv`
- `source venv/bin/activate` (or `venv\Scripts\activate` on Windows)
- `pip install -r requirements.txt`
- `uvicorn main:app --host=0.0.0.0 --reload`
If you're using PyCharm Professional, I suggest adding this configuration:
*(Screenshots: the Edit Configurations and Add Configuration dialogs in PyCharm.)*
The API documentation is automatically made available by FastAPI at http://localhost:8000/docs, but if you want to add any details you are free to do so by adding a docstring such as:
```python
@app.get("/v1")
def index():
    """
    This API lets you test if the backend is working properly.
    This is an info docstring that will also appear at the Swagger link.
    """
    return {"msg": "You successfully reached API v1"}
```
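Once the backend is running, you can check that same endpoint from Python. This is a minimal sketch, assuming the `requests` package is installed and the server is listening on port 8000:

```python
import requests

# Call the versioned API root exposed by the FastAPI backend.
response = requests.get("http://localhost:8000/v1")
print(response.status_code)  # 200 if the backend is up
print(response.json())       # {"msg": "You successfully reached API v1"}
```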
A machine-learning model is deployed to automatically monitor and moderate the comment activity of each post. Its primary objective is to prevent users from abusing the forum by posting profanity, blasphemy, hate speech, or other toxic content in the comments section. This is achieved by performing Sentiment Analysis, a common form of text classification used in Natural Language Processing (an interdisciplinary subset of Machine Learning) where one of multiple sentiment labels (e.g. anger, fear, joy, etc.) is inferred by the model based on the content of the text.
The resulting mini-library, called `Modalii` - *Moderatore Automatico per le Lingue Italiana & Inglese* (Automatic Moderator for the Italian & English Languages), uses the HuggingFace Transformers library to assemble the full pipeline of the moderator. The models hosted in the library are imported locally and fine-tuned on purposely-built datasets for toxicity detection in STEM-oriented comments.
Given an input string representing the comment to be classified, the library offers a high-level `infer` function to retrieve its sentiment:
| Full-input pipeline |
|---|
| comment ➡️ `Polyglot` ➡️ `ProfanityDetector` ➡️ `ToxicityDetector` ➡️ sentiment |
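A minimal usage sketch is shown below; the exact import path and return format of `infer` are assumptions here, so check the module for the actual signature:

```python
# Hypothetical usage sketch of the high-level infer function.
# The import path and the shape of the return value are assumptions,
# not the library's documented API.
from Modalii import infer

comment = "This exercise is completely useless and so are you."
sentiment = infer(comment)  # runs Polyglot -> ProfanityDetector -> ToxicityDetector
print(sentiment)            # e.g. a normalised toxicity score for the comment
```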
The `Modalii` library uses a binary-classification paradigm where a normalised `score` variable represents the above sentiment and contains the degree of toxicity detected within the comment. The actions taken by the library are either to `pass` the comment or to `quarantine` it, according to a specific banning policy acting on the `score` returned by the pipeline.
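Conceptually, the banning policy is a threshold on the normalised score. The sketch below only illustrates the idea; the cut-off value is made up and is not the policy actually used by the library:

```python
def banning_policy(score: float, threshold: float = 0.5) -> str:
    """Illustrative only: map a normalised toxicity score to an action.

    The 0.5 threshold is a placeholder, not Modalii's real banning policy.
    """
    return "quarantine" if score >= threshold else "pass"

print(banning_policy(0.12))  # "pass"
print(banning_policy(0.87))  # "quarantine"
```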
The `Polyglot` model is a language-identification class which discerns the language in which the comment is written (English, Italian, or neither). Depending on the identified language, a `ProfanityDetector` class is instantiated: a simple Bag of Words (BoW) model, based on two language-specific vocabularies, that detects the presence of any blasphemous word and automatically bans the comment. Finally, if the comment passes the `purity_check` performed by the latter, a `ToxicityDetector` class takes care of the final hate-speech/toxicity assessment. This step is also language-specific, with one model for the Italian language and another for the English language. While the latter is used "as is", the former is fine-tuned on an ad-hoc built and manually labelled dataset (see Datasets section) using the `train` high-level function provided by the library.
| Training pipeline |
|---|
| `Builder` ➡️ `Reddit` ➡️ `Facebook` ➡️ `Loader` |
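To give an idea of what the fine-tuning step looks like, the sketch below uses the standard HuggingFace `Trainer` API on a tiny toy dataset; the checkpoint name, data and hyperparameters are placeholders and not necessarily what `train` uses internally:

```python
# Rough fine-tuning sketch with the HuggingFace Trainer API; the checkpoint
# name, toy data and hyperparameters are assumptions, not Modalii's internals.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "dbmdz/bert-base-italian-uncased"  # assumed Italian base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tiny toy dataset standing in for the scraped and manually labelled comments.
raw = Dataset.from_dict({
    "text": ["Ottima spiegazione, grazie!", "Sei un idiota, smettila di scrivere."],
    "label": [0, 1],  # 0 = clean, 1 = toxic
})
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length"))

args = TrainingArguments(output_dir="finetuned-ita-toxicity", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=tokenized).train()
```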
We remark that the correct use of the library's APIs is to call `infer` on string variables. Should the user want to use `infer` on a dataset (`.csv`, `.json`, etc.), then some minimal pre-processing (not provided by the library) is required. An example of such pre-processing is given in `test.py`.
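As a rough illustration of the kind of pre-processing meant here (not the code in `test.py`; the column name and the `infer` import are assumptions), one could iterate over a CSV with pandas and feed each comment string to `infer`:

```python
# Illustrative pre-processing sketch, assuming a CSV with a "comment" column
# and the hypothetical Modalii.infer import shown earlier.
import pandas as pd
from Modalii import infer

df = pd.read_csv("comments.csv")
df["comment"] = df["comment"].astype(str).str.strip()  # ensure plain strings
results = [infer(text) for text in df["comment"]]
```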
Both the `Polyglot` and the two language-specific `ToxicityDetector` classes are based on HuggingFace-hosted Transformers. Simply put, a `Transformer`, in the context of deep learning, is a combination of two mechanisms:
- a Sequence-2-Sequence translation of the input text (also known as encoder-decoder architecture);
- an Attention mechanism that focuses on the important keywords of a given input sequence.

For our purposes we identify each hosted model with:
- a `Tokenizer` which takes the input text and vectorises it (according to a pre-defined dictionary) to make it digestible for the input layer of the neural network;
- a `Classifier` that elaborates the input through the Attention mechanism discussed above and returns a vector with the probabilities of each label.

These deep neural networks are very large architectures that revolutionised the world of automatic text interaction, including the tasks required by present-day chatbots, translators, word masking, etc. The reader might refer to this well-written article for a preliminary but thorough introduction to the architecture.
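The split between tokenizer and classifier can be illustrated with a generic HuggingFace sequence-classification model; the checkpoint below is a public example model and not necessarily one of the models Modalii uses:

```python
# Generic tokenizer + classifier example with HuggingFace Transformers.
# The checkpoint is an illustrative public model, not necessarily Modalii's.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)   # text -> token ids
classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("This answer is surprisingly helpful!", return_tensors="pt")
with torch.no_grad():
    logits = classifier(**inputs).logits
probs = torch.softmax(logits, dim=-1)  # probability for each label
print(probs)
```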
We previously mentioned that `Modalii` comes with a secondary high-level `train` function that provides a straightforward fine-tuning facility for the Italian `ToxicityDetector` model. The fine-tuning is performed on a specific dataset composed of comments scraped from `Reddit` threads and `Facebook` groups. The `Builder` class is automatically instantiated by the training pipeline if neither `ITA-Dataset.json` nor `ENG-Dataset.json` is present in the top-level directory (where `Modalii.py` is located); this class scrapes both `Reddit` and `Facebook` comments found in the Subreddits and Groups (respectively) that are specified by the `source` variable in the initialiser of each class. The user can change those to their own liking to build a separate dataset (notice that the newly scraped dataset will not be automatically labelled). Finally, the `Loader` class takes care of the interaction between the scraped `.json` dataset and the HuggingFace Transformer model.
To be able to use the `Builder` class the user must have active Reddit and Facebook accounts. The `Facebook` class is based on the facebook-scraper library and uses the `get_posts` API to retrieve all the comments in the posts of the groups specified in the `source` variable. The user must ensure they are logged in to their account to use the library; no further actions are required. On the contrary, the `Reddit` class is based on praw, which requires the `reddit_client_id`, `reddit_client_secret_key` and `reddit_user_agent` to be manually provided when it is instantiated. These are imported from a `.env` file in which they are specified. The user can retrieve them by creating an app via the button at the bottom of the page here, naming it however they like, and using the information provided there:
| .env |
|---|
| reddit_client_id = 'code-you-find-below-username' |
| reddit_client_secret_key = 'code-you-find-on-"secret"-tab' |
| reddit_user_agent = 'name-you-gave-to-the-app' |
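A sketch of how these values might be loaded and passed to praw is shown below; the use of python-dotenv, the subreddit name, and the exact loading code inside Modalii are assumptions based on the `.env` keys above:

```python
# Illustrative sketch: load Reddit credentials from .env and build a praw client.
# Assumes python-dotenv is available; Modalii's actual loading code may differ.
import os
import praw
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

reddit = praw.Reddit(
    client_id=os.getenv("reddit_client_id"),
    client_secret=os.getenv("reddit_client_secret_key"),
    user_agent=os.getenv("reddit_user_agent"),
)

# Example: iterate over the comments of a few hot submissions in a subreddit.
for submission in reddit.subreddit("learnmath").hot(limit=5):
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        print(comment.body)
```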
The training is also monitored using the Weights & Biases tool. To be able to use it, the user must create an account with the library and retrieve their API key:
| .env |
|---|
| api_key = 'wand-key-given-on-the-prompt' |
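For reference, logging in to Weights & Biases with a key read from `.env` typically looks like the sketch below; the project name and the exact way Modalii initialises the run are assumptions:

```python
# Illustrative Weights & Biases login/initialisation sketch.
# Project name and run configuration are placeholders, not Modalii's values.
import os
import wandb
from dotenv import load_dotenv

load_dotenv()
wandb.login(key=os.getenv("api_key"))           # authenticate with the stored API key
run = wandb.init(project="modalii-finetuning")  # hypothetical project name
run.log({"example_metric": 0.0})                # training metrics get logged here
run.finish()
```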