gpt_the_office

Background

We are creating a simple GPT (Generative Pre-trained Transformer) that can learn to write text given a dataset. The motivation behind this repo is to create scripts from The Office. The first iteration utilizes character-level tokens. Inspiration came from Andrej Karpathy and his video here. For simplicity's sake, this will utilize one GPU.

Quick Start

Setting up the environment

We need to make sure that our environment has all the required packages to run this repo. We shall create a virtual environment to ensure this.

This assumes you have Python 3 with pip on a Linux OS.

  1. gh repo clone shassen14/gpt_the_office (or however you would like to clone the repo)
  2. cd gpt_the_office
  3. sh set_environment.sh or bash set_environment.sh

Here is a gif to showcase a similar procedure and expected output: Setting up the environment with required libraries

If one is not running Linux, then do something similar to the following:

  1. gh repo clone shassen14/gpt_the_office (or however you would like to clone the repo)
  2. cd gpt_the_office
  3. pip3 install virtualenv (if you don't already have virtualenv installed)
  4. python3 -m venv ./venv to create your new environment (called 'venv' here)
  5. source venv/bin/activate to enter the virtual environment
  6. pip3 install -r requirements.txt to install the requirements in the current environment
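Regardless of OS, a quick sanity check inside the activated environment can confirm that PyTorch (the core dependency for this repo) installed correctly. This is just an optional check, not a repo script:

```python
# Optional sanity check: run inside the activated venv.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")  # expect True if a GPU is set up
```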

Obtaining the dataset

We need a dataset to train our gpt model and to validate that it is generating "correct" text. We shall download a set from an online source before training a model.

  1. Ensure one is in the virtual environment with source venv/bin/activate
  2. python3 ./data/prep_char_data.py

This will create a character-level dataset directory containing meta.pkl, script.txt, training.bin, and val.bin. The following gif shows the steps and example terminal outputs: Preparing The Office character-wise token data

Here is an example of the created dataset directory: An example of contents inside the_office_char directory

This will either download the dataset or confirm it is already there. One can edit data/data_config.py to change file names and download another dataset, either one recommended in the comments or one's own.
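For reference, a character-level prep step along these lines typically maps each unique character to an integer id, splits the encoded text, and writes everything to disk. The sketch below illustrates that idea only; it is not the contents of data/prep_char_data.py, and the data/the_office_char paths are assumptions based on the outputs listed above:

```python
import pickle
import numpy as np

# Read the raw screenplay text (path assumed from the outputs above).
with open("data/the_office_char/script.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Build the character-level vocabulary and lookup tables.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# Encode the full text and split it ~90/10 into training and validation sets.
ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)
split = int(0.9 * len(ids))
ids[:split].tofile("data/the_office_char/training.bin")
ids[split:].tofile("data/the_office_char/val.bin")

# Save vocabulary metadata so generation can decode ids back into characters.
with open("data/the_office_char/meta.pkl", "wb") as f:
    pickle.dump({"vocab_size": len(chars), "stoi": stoi, "itos": itos}, f)
```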

Training

We shall train a gpt model on the dataset downloaded in the previous section.

  1. Ensure one is in the virtual environment with source venv/bin/activate
  2. python3 train.py

The terminal output shows the estimated training loss, validation loss, and an example text from that checkpoint. Here is an example terminal output from training a model for ~10 minutes (sped up): Training character-wise token gpt from The Office dataset directly

A .pt model should have been saved in the params directory like the following: An example of a pytorch model saved
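Under the hood, training boils down to a standard next-character prediction loop. The sketch below shows the general shape of such a loop end to end; note that it uses a trivial stand-in "bigram" model (a single embedding table) rather than this repo's actual GPT, and the file paths are assumptions:

```python
import os
import numpy as np
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
block_size, batch_size = 64, 32

# Load the encoded training split written by the prep step (path assumed).
data = torch.from_numpy(
    np.fromfile("data/the_office_char/training.bin", dtype=np.uint16).astype(np.int64)
)

def get_batch():
    # Sample random contiguous chunks; targets are the inputs shifted by one character.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)

# Stand-in model: a plain embedding "bigram" predictor, NOT this repo's GPT.
vocab_size = int(data.max()) + 1
model = nn.Embedding(vocab_size, vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(1000):
    x, y = get_batch()
    logits = model(x)  # (batch, block, vocab_size)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: train loss {loss.item():.4f}")

# Checkpoint to the params directory, mirroring the .pt file train.py saves.
os.makedirs("params", exist_ok=True)
torch.save(model.state_dict(), "params/example.pt")
```

The repo's actual train.py additionally estimates validation loss and prints sample text at each checkpoint, as the gif above shows.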

Generation

Once we have trained a gpt model, we can create some screenplays by generating text.

  1. Ensure one is in the virtual environment with source venv/bin/activate
  2. python3 generation.py

The terminal output shows text printing line by line, utilizing the model saved from the training section. Here is an example: Generating The Office text from the gpt model trained

The generated text shown in the terminal should also be saved as a .txt file in the examples directory, such as the following: An example of a .txt file saved from generation
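Conceptually, generation is just repeated sampling: feed the model the current character, sample the next character id from the softmax distribution, decode it via meta.pkl, and append. Here is a minimal sketch continuing the stand-in bigram model from the training sketch (again, not this repo's actual GPT or generation.py):

```python
import pickle
import torch
import torch.nn as nn

# Load the decode table saved by the prep step (path assumed).
with open("data/the_office_char/meta.pkl", "rb") as f:
    meta = pickle.load(f)
itos, vocab_size = meta["itos"], meta["vocab_size"]

# Load the stand-in checkpoint from the training sketch above.
model = nn.Embedding(vocab_size, vocab_size)
model.load_state_dict(torch.load("params/example.pt"))
model.eval()

idx = torch.zeros(1, dtype=torch.long)  # seed with token id 0
chars = []
with torch.no_grad():
    for _ in range(500):  # generate 500 characters
        logits = model(idx[-1:])                 # predict next char from the last one
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1).squeeze(-1)
        idx = torch.cat([idx, nxt])
        chars.append(itos[int(nxt)])

print("".join(chars))
```

The repo's generation.py additionally saves its output to the examples directory, as noted above.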

Future Plans

  • Collect different levels of tokens via tiktoken (what GPT-2 uses)

  • Create a gpt model based on Attention Is All You Need

  • Save parameters

  • Organize parameters

  • Organize the gpt model directory; might make it one file?

  • Have a train.py for training and a generate.py for running models given parameters

  • Need to rename main.py to something else -> train.py

  • Save parameters midway through training, and have the ability to resume training if stopped

  • Simplify environment setup

  • Things are getting more serious. name cfg files with context

  • Web scrape for The Office screenplay

  • Add a readme in data

  • Add a tutorial section to the readme where people can create models of their own

  • Add a readme in config

  • Add a readme in models

  • Give context text to the generation step

  • Need better visualization (wandb?). This will help with understanding the parameters under the hood

  • Use argparse to override some variables. The main one is to resume or start training

  • Might split configs differently. Optimizer params split with model creation params. Haven't decided yet

  • Might include docker as well for environment setup?

  • Instead of print statements, use the logging library

  • Maybe one day when I'm blessed to have a GPU cluster, I'll write this utilizing PyTorch's FSDP for parallel computing
