Specification

Import SentencePiece to use for tokenizing our data
Prepare the dataset of Hacker News titles and upvote scores
- Obtain the data from the database postgres://arcanum:c1af7a45f6a59656cc89fbc4@pg.mlx.institute:5432/arcanum
- Tokenise the titles using SentencePiece
Implement and train an architecture to obtain word embeddings in the style of the word2vec paper https://arxiv.org/pdf/1301.3781.pdf using either the *continuous bag of words (CBOW) or Skip-gram model (or both).
Implement a regression model to predict a Hacker News upvote score from the pooled average of the word embeddings in each title.
Extension : train your word embeddings on a different dataset, such as

Take in a Hacker News title
Convert it to a list of token embeddings using our word2vec architecture
Take the average of those embeddings (this is called average pooling and it is actually quite a crude technique; we will see how you can do better next week with RNNs).
Pass this averaged embedding through a series of hidden layers with widths and activation functions of your choice.
Pass the result through an output layer, which should be a linear layer with a single neuron, in order to product a single number representing the network's prediction for the upvote score.
Compare the predicted score with the true score (the label ) via an Mean Square Error loss function.

The suggested Workflow will consist of 4 main Steps:

Develop your FastAPI serve that provides inference for your model locally on your laptop.
Turn your application into a Docker Image and push to DockerHub.
Pull your image from DockerHub, either on your local machine or the server from which inference will be done, and then instanciate a container.
Whenever you have a new version of your image with your model, tear down your old container and start another.

Provide feedback