This codebase is a re-implementation of the ACCESS model for controllable text generation using control tokens for four features:
- dependency length
- word frequency
- length (number of characters)
- Levenshtein similarity ratio
The model is a Transformer implemented with fairseq.
A prerequisite of training is extensive preprocessing with feature extraction.
Create a conda environment with Python 3.6.
Run `pip install -e .` in the main repository directory. This installs all requirements from `requirements.txt` and makes the functions in the `utils`, `generation` and `with_fairseq` subdirectories importable.
Alternatively, you can install the requirements directly with `pip install -r requirements.txt`.
If the spaCy language models cause issues during installation, comment out the corresponding lines in `requirements.txt` and try installing with pip again.
Please note that installing only the requirements (instead of running `pip install -e .`) will lead to issues, at least for the scripts in the `additional_scripts` folder.
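For example, the whole installation can look like this (the environment name is arbitrary):

```bash
# Create and activate a Python 3.6 environment (name is illustrative)
conda create -n access-reimpl python=3.6
conda activate access-reimpl

# Install the repository and its requirements
pip install -e .
```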
Afterwards, download the spaCy language models separately:
- `python -m spacy download en_core_web_sm` for English
- `python -m spacy download de_core_news_sm` for German
Other dependencies: `easse` (install it following its instructions), `tensorboardX` (with pip), and the `nltk` stopwords (`python -m nltk.downloader stopwords`).
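A sketch of one way to install these extras, assuming `easse` is installed from its GitHub repository (follow the easse instructions if they differ):

```bash
# easse (assumed repository location; see its README for the authoritative steps)
git clone https://github.com/feralvam/easse.git
cd easse && pip install . && cd ..

# tensorboardX via pip
pip install tensorboardX

# nltk stopwords
python -m nltk.downloader stopwords
```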
Download the fastText embeddings from this site and put the unzipped file into the directory:
- `data_auxiliary/en` for English
- `data_auxiliary/de` for German
The unpreprocessed training, validation and test sets should be placed in `data/en` or `data/de` with the file names `train.[src|tgt]`, `valid.[src|tgt]` and `test.[src|tgt]`.
In order to use the 'sentencepiece' tokenizer (as in the original ACCESS code), you need to place a trained SentencePiece tokenizer in the `data_auxiliary/[en|de]` folder. We use the pretrained SentencePiece tokenizers from here: download their trained LM for English / German following the instructions in the readme (section "Training Language Models") and place the downloaded `[en|de].sp.model` file in the `data_auxiliary/[en|de]` folder (you do not need the `[en|de].arpa.bin` file).
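For English, the resulting layout before preprocessing should then look roughly like this (the German setup mirrors it under the `de` directories; the embedding file name is only a placeholder):

```
data/en/train.src
data/en/train.tgt
data/en/valid.src
data/en/valid.tgt
data/en/test.src
data/en/test.tgt
data_auxiliary/en/<unzipped fastText embedding file>
data_auxiliary/en/en.sp.model
```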
The preprocessing involves several steps:
- **Prepare word frequency ranks:**
  Run the function `write_ranks_into_file` from the script in `utils/prepare_word_embeddings_frequency_ranks.py`. This should create a JSON file in `data_auxiliary/en` (or `/de`) with `word: rank` pairs.
- **Prepare a configuration file for preprocessing:**
  The configuration file should follow the example in `configs/preprocess_dummy.yaml`. Arguments in the preprocessing config:
  - `experiment_id`: ignore this
  - `lang`: the language of the dataset; only `en` and `de` are supported for now
  - `features`: a list of features to be extracted from the dataset. The options are `dependency`, `frequency`, `length`, `levenshtein`.
  - `analyze_features`: Boolean. If true, histograms of the feature values are plotted.
  - `overwrite`: ignore this

  It is best to run the preprocessing with feature extraction for all four features once, because it is time-consuming; features that are not used in the experiments can be removed from the files afterwards. A sketch of such a config is shown after this list.
- **Prepare the output directory:**
  Create the directory where the newly preprocessed data will be stored for the requested features (and the parent directories if needed). The directory name should consist of the names of the extracted features in alphabetical order, separated by `_`, and it should be created at `[repository_directory]/data_preprocessed/[lang]`. For example, use `mkdir` to create `data_preprocessed/en/frequency_length` if you plan to preprocess English data and extract the frequency and length features.
- **Preprocess the corpus:**
  Run the preprocessing script:
  `python preprocess.py --config configs/[config_file].yaml --tokenizer [tokenizer_type] --shard [shard_name]`
  where
  - `config_file`: name of the configuration file
  - `tokenizer_type`: optional; either 'spacy' (tokenization with the spaCy language model) or 'sentpiece' (SentencePiece tokenization); default is 'sentpiece'
  - `shard_name`: optional; can be used to process only a part of the corpus instead of processing the source and target files of the train, test and validation splits together. See the Wiki for more information about its usage.
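As a reference for the arguments above, a minimal sketch of a preprocessing config might look like the following; the exact keys and format should be taken from `configs/preprocess_dummy.yaml`, and the values here are only illustrative:

```yaml
# Illustrative preprocessing config; follow configs/preprocess_dummy.yaml for the exact format
experiment_id: 1          # ignored
lang: en                  # 'en' or 'de'
features:                 # any subset of the four supported features
  - dependency
  - frequency
  - length
  - levenshtein
analyze_features: false   # set to true to plot histograms of the feature values
overwrite: false          # ignored
```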
Depending on the dataset size and the selected features, this step might take some time, so it is best to run it with `nohup` or `screen`.
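For example (the config and log file names are illustrative):

```bash
nohup python preprocess.py --config configs/preprocess_dummy.yaml > preprocess.log 2>&1 &
```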
Once finished, the preprocessed files should be located in a subdirectory of `data/en` or `data/de`. For example, if the features [dependency, frequency] were selected, the preprocessed files will be in `dependency_frequency/`.
For training, first prepare the configuration file following the example in `configs/preprocess_train_generate_example.yaml`.
Arguments in the training config:
- `preprocess`: Boolean. Whether to create a vocabulary.
- `train`: Boolean. Whether to train a model.
- `generate`: Boolean. Whether to run inference, i.e. generate with a trained model.
- `experiment_id`: integer. The ID of the current experiment.
- `features_requested`: list of str. The features for which the output should be controlled.
- `language`: `en` or `de`
- `arch`: currently only the Transformer is implemented
- `optimizer`: currently only Adam is supported
- `batch_size`: integer. Batch size during training.
- `update_freq`: integer. Number of batches over which gradients are accumulated before the parameters are updated. If set to 1, the model is updated after each batch; a larger value increases the effective batch size to `batch_size` * `update_freq`.
- `test_batch_size`: integer. Batch size during testing.
- `max_epochs`: integer. Maximum number of epochs to train the model.
- `patience`: integer. If validation performance does not improve for `patience` successive validations, early stopping is triggered. Set to -1 to disable early stopping.
- `beam_size`: integer. Beam size for inference at the testing step.
- `lr`: float. Learning rate.
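A minimal sketch of such a training config, using the keys described above (the exact format should follow `configs/preprocess_train_generate_example.yaml`; all values are illustrative):

```yaml
# Illustrative training config; follow configs/preprocess_train_generate_example.yaml for the exact format
preprocess: true           # build the vocabulary
train: true                # train a model
generate: true             # generate with the trained model afterwards
experiment_id: 1
features_requested:
  - dependency
  - frequency
language: en
arch: transformer          # only the Transformer is implemented
optimizer: adam            # only Adam is supported
batch_size: 32
update_freq: 2             # effective batch size = batch_size * update_freq
test_batch_size: 32
max_epochs: 100
patience: 10               # -1 disables early stopping
beam_size: 5
lr: 0.0003
```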
Run the script with `python fairseq_preprocess_train_generate.py --config configs/preprocess_train_generate_example.yaml`.
Note that the logging information for the training is not redirected to a file but printed to the command line. To store it in a file, add `> log_file.txt` at the end of the command.
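For example, to train in the background and keep the log (the log file name is illustrative):

```bash
nohup python fairseq_preprocess_train_generate.py --config configs/preprocess_train_generate_example.yaml > log_file.txt 2>&1 &
```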
See the Wiki for optional helpers for hyperparameter tuning.
Note that the original ACCESS code uses early stopping based on the SARI metric, but this implementation does not.
In the step above, the generation/inference uses the gold control token values. In contrast, this step allows for a free choice of control token values.
The control token values are passed as arguments to the script `generate.py`.
The source file should contain one sentence per line. Before running the script, make sure the source file is named `test.txt`.
Required arguments of the script:
- `experiment-id`: experiment ID of the trained model, i.e. the name of the folder in `./experiments` containing the `checkpoint` folder with the trained checkpoint (`checkpoint_best.pt`) and the vocabulary files created during training
- `data-dir`: directory containing the source `test.txt` file
- `language`: the language of the source file and the model, either `en` or `de`
- `beam`: beam size
Optional arguments (control token values):
- `dependency`
- `frequency`
- `length`
- `levenshtein`
An example of running this script:
`python generate.py --experiment-id 5 --data-dir data_generation --language en --beam 5 --length 0.95 --frequency 0.75 --levenshtein 0.75`
Note: By default, the text in the source `test.txt` file is tokenized with the SentencePiece tokenizer. If the model used for generation was trained on data preprocessed with spaCy, add `--spacy` at the end of the command.
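For example, reusing the illustrative values from above for a model trained on spaCy-preprocessed data:

```bash
python generate.py --experiment-id 5 --data-dir data_generation --language en --beam 5 --length 0.95 --frequency 0.75 --levenshtein 0.75 --spacy
```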
The folder `additional_scripts` contains scripts that are not necessary for training and using the simplification model but that might be helpful for inspecting the data, plotting, etc.
The Wiki provides further information about the individual scripts and their usage.