This repository contains code for the paper Bottom-Up and Top-Down: Predicting Personality with Psycholinguistic and Language Model Features, published at the IEEE International Conference on Data Mining (ICDM) 2020.
This is a set of experiments, written in TensorFlow and PyTorch, that explore automated personality detection using language models on the Essays dataset (labelled with Big-Five personality traits) and the Kaggle MBTI dataset.
Pull this repository from GitLab via:
git clone [email protected]:ml-automated-personality-detection/personality.git
See requirements.txt for the list of required packages, which can be installed via:
pip install -r requirements.txt
First, run the LM extractor, which passes the dataset through the language model and stores the embeddings (of all layers) in a pickle file. Creating this 'new dataset' saves a lot of compute time, because the language model runs only once and the hyperparameter search for the finetuning network can then work directly on the stored embeddings. Before running the code, create a pkl_data folder in the repo folder. All arguments are optional; passing none runs the extractor with its default values.
python LM_extractor.py -dataset_type 'essays' -token_length 512 -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data'
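The caching pattern described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in LM_extractor.py: the function `fake_layer_embeddings`, the file name `essays-demo.pkl`, and the layout of the stored dictionary are all hypothetical stand-ins. A real run would replace the stand-in with a forward pass through the language model.

```python
import os
import pickle
import tempfile


def fake_layer_embeddings(text, n_layers=3, dim=4):
    """Hypothetical stand-in for a language model forward pass.

    Returns one fixed-size vector per layer so the caching logic can be
    shown without downloading a model.
    """
    base = float(len(text))
    return [[base + layer] * dim for layer in range(n_layers)]


def cache_embeddings(texts, labels, op_dir, name="essays-demo.pkl"):
    """Compute per-layer embeddings once and store them in a pickle file."""
    os.makedirs(op_dir, exist_ok=True)  # creates the output folder if it is missing
    records = {
        "embeddings": [fake_layer_embeddings(t) for t in texts],
        "labels": labels,
    }
    path = os.path.join(op_dir, name)
    with open(path, "wb") as f:
        pickle.dump(records, f)
    return path


def load_embeddings(path):
    """Reload the cached features instead of re-running the model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Write into a temporary directory so the demo leaves the repo untouched.
demo_root = tempfile.mkdtemp()
demo_path = cache_embeddings(
    ["I enjoy meeting people.", "I prefer quiet evenings."],
    [1, 0],
    os.path.join(demo_root, "pkl_data"),
)
cached = load_embeddings(demo_path)
```

Because the expensive step happens once, every later training run under this pattern only pays the cost of reading the pickle file.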
Next, run the finetuning network, which is currently an MLP.
python finetuneNet.py
LM_extractor.py
On an RTX 2080 GPU, the -embed 'bert-base' extractor takes about 2m 30s and 'bert-large' takes about 5m 30s.
On a CPU, the 'bert-base' extractor takes about 25m.
finetuneNet.py
On an RTX 2080 GPU, running for 15 epochs (with no cross-validation) takes between 5s and 60s, depending on the MLP architecture.
If you find this repo useful for your research, please cite it using the following BibTeX entry:
@inproceedings{mehtabottom,
title={Bottom-Up and Top-Down: Predicting Personality with Psycholinguistic and Language Model Features},
author={Mehta, Yash and Fatehi, Samin and Kazameini, Amirmohammad and Stachl, Clemens and Cambria, Erik and Eetemadi, Sauleh},
booktitle={Proceedings of the International Conference of Data Mining},
organization={IEEE},
year={2020}
}