- Import SentencePiece to tokenise our data
- Prepare the dataset of Hacker News titles and upvote scores
- Obtain the data from the database
  `postgres://arcanum:[email protected]:5432/arcanum`
- Tokenise the titles using SentencePiece (a minimal sketch follows this list)
- Implement and train an architecture to obtain word embeddings in the style of the word2vec paper https://arxiv.org/pdf/1301.3781.pdf using either the continuous bag-of-words (CBOW) or skip-gram model (or both); see the sketch after this list.
- Implement a regression model to predict a Hacker News upvote score from the pooled average of the word embeddings in each title.
- Extension: train your word embeddings on a different dataset, such as
- More Hacker News content, such as comments
- A completely different corpus of text, like (some of) Wikipedia
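As a starting point, here is a minimal sketch of tokenising the titles with SentencePiece. The file names (`titles.txt`, `hn_titles`), the vocabulary size, and the choice of a BPE model are illustrative assumptions, not requirements:

```python
# Minimal SentencePiece sketch: train a subword model on one title per line,
# then tokenise a title into ids and pieces. File names and vocab_size are
# illustrative assumptions.
import sentencepiece as spm

# Train a BPE model over the corpus of titles.
spm.SentencePieceTrainer.train(
    input="titles.txt",        # one Hacker News title per line
    model_prefix="hn_titles",  # writes hn_titles.model and hn_titles.vocab
    vocab_size=10000,
    model_type="bpe",
)

# Load the trained model and tokenise a title.
sp = spm.SentencePieceProcessor(model_file="hn_titles.model")
title = "Show HN: A tiny word2vec in PyTorch"
print(sp.encode(title, out_type=int))  # token ids
print(sp.encode(title, out_type=str))  # subword pieces
```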
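For the word2vec step, a minimal skip-gram sketch in PyTorch might look like the following (negative sampling is omitted for brevity; `vocab_size` and `embedding_dim` are illustrative assumptions):

```python
# Minimal skip-gram sketch: predict a context-word id from its centre-word id.
# Negative sampling is omitted; a full-vocabulary softmax is used instead.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embedding_dim)  # centre-word embeddings
        self.out_proj = nn.Linear(embedding_dim, vocab_size)     # scores over context words

    def forward(self, centre_ids: torch.Tensor) -> torch.Tensor:
        # centre_ids: (batch,) -> logits over the vocabulary: (batch, vocab_size)
        return self.out_proj(self.in_embed(centre_ids))

vocab_size, embedding_dim = 10_000, 128
model = SkipGram(vocab_size, embedding_dim)
loss_fn = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step on random (centre, context) id pairs.
centre = torch.randint(0, vocab_size, (32,))
context = torch.randint(0, vocab_size, (32,))
optimiser.zero_grad()
loss = loss_fn(model(centre), context)
loss.backward()
optimiser.step()
```

After training, `model.in_embed` holds the word embeddings you will reuse in the regression model.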
- Take in a Hacker News title
- Convert it to a list of token embeddings using our word2vec architecture
- Take the average of those embeddings (this is called average pooling and it is actually quite a crude technique; we will see how you can do better next week with RNNs).
- Pass this averaged embedding through a series of hidden layers with widths and activation functions of your choice.
- Pass the result through an output layer, which should be a linear layer with a single neuron, in order to produce a single number representing the network's prediction for the upvote score.
- Compare the predicted score with the true score (the label) via a Mean Squared Error (MSE) loss function. A sketch of this regression head follows the list.
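Here is a minimal sketch of that regression head: average-pool the token embeddings, then pass the pooled vector through an MLP ending in a single linear neuron, trained with MSE. The hidden width, learning rate, and the use of a plain `nn.Embedding` stand-in for your pretrained word2vec weights are illustrative assumptions:

```python
# Minimal upvote-regressor sketch: average pooling over token embeddings,
# then hidden layers and a single-neuron linear output, trained with MSE.
import torch
import torch.nn as nn

class UpvoteRegressor(nn.Module):
    def __init__(self, embeddings: nn.Embedding, hidden_dim: int = 64):
        super().__init__()
        self.embeddings = embeddings  # swap in your pretrained word2vec embeddings
        self.mlp = nn.Sequential(
            nn.Linear(embeddings.embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single output neuron: predicted score
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> pooled: (batch, embedding_dim)
        pooled = self.embeddings(token_ids).mean(dim=1)  # average pooling
        return self.mlp(pooled).squeeze(-1)

# One toy training step against the true upvote scores.
model = UpvoteRegressor(nn.Embedding(10_000, 128))
loss_fn = nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
titles = torch.randint(0, 10_000, (32, 12))  # batch of tokenised titles
scores = torch.rand(32) * 100                # true upvote scores
optimiser.zero_grad()
loss = loss_fn(model(titles), scores)
loss.backward()
optimiser.step()
```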
The suggested workflow consists of four main steps:
- Develop your FastAPI server that provides inference for your model locally on your laptop (see the sketch after this list).
- Turn your application into a Docker image and push it to Docker Hub.
- Pull your image from Docker Hub, either on your local machine or on the server where inference will run, and then instantiate a container.
- Whenever you have a new version of your image with your model, tear down your old container and start another.
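For the first step, a minimal FastAPI sketch might look like the following. The `load_model` and `tokenise` names are hypothetical placeholders for your own model-loading and tokenisation code:

```python
# Minimal FastAPI inference sketch. Wiring in the trained regressor and the
# SentencePiece tokeniser is left to your own code; the commented lines show
# where those hypothetical helpers would plug in.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# model = load_model("upvote_regressor.pt")  # hypothetical: load trained weights

class Title(BaseModel):
    title: str

@app.post("/predict")
def predict(item: Title) -> dict:
    # token_ids = tokenise(item.title)   # hypothetical: SentencePiece ids for the title
    # score = model(token_ids).item()    # predicted upvote score
    score = 0.0                          # placeholder until the model is wired in
    return {"title": item.title, "predicted_upvotes": score}

# Run locally with: uvicorn main:app --reload
```

From there, the usual loop for the remaining steps is `docker build` and `docker push` on your laptop, then `docker pull` and `docker run` on the target machine, with `docker stop` and `docker rm` to tear down the old container when a new image version is ready.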