Add bare metal and docker scripts/instructions to the bert-large-fp32-training package
dmsuehir authored and Kasravi, Kam D committed Jul 2, 2020
1 parent 755d932 commit 4023d9a
Showing 5 changed files with 331 additions and 7 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -10,3 +10,4 @@
test_data/
download_glue_data.py
data/
output/
@@ -9,15 +9,175 @@ should be downloaded as mentioned in the [Google bert repo](https://github.com/g

Refer to google reference page for [checkpoints](https://github.com/google-research/bert#pre-trained-models).

## Datasets

### Pretrained models

Download and extract the pretrained BERT model checkpoints from the
[google bert repo](https://github.com/google-research/bert#pre-trained-models).
Set the `CHECKPOINT_DIR` environment variable to the extracted directory
when running the example scripts.
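
For example, a minimal sketch of this step (assuming the uncased whole-word-masking
BERT large checkpoint; the exact download link is listed in the Google bert repo):

```
# Assumption: using the uncased whole-word-masking BERT large checkpoint
# linked from the Google bert repo; adjust the URL for a different model.
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip
export CHECKPOINT_DIR=$PWD/wwm_uncased_L-24_H-1024_A-16
```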

For training from scratch, Wikipedia and BookCorpus need to be downloaded
and pre-processed.
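
A minimal sketch of the pre-processing step, assuming the `create_pretraining_data.py`
script from the Google bert repo and an already-extracted plain-text corpus
(the file names below are placeholders):

```
# Assumption: wiki_and_books.txt uses the format expected by the Google bert
# repo script (one sentence per line, blank lines between documents).
python create_pretraining_data.py \
  --input_file=wiki_and_books.txt \
  --output_file=pretraining_data.tfrecord \
  --vocab_file=$CHECKPOINT_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128
```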

### GLUE data

[GLUE data](https://gluebenchmark.com/tasks) is used when running BERT
classification training. Download and unpack the GLUE data by running
[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e).
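
A minimal sketch, assuming the gist is saved locally as `download_glue_data.py`
and that its `--data_dir` and `--tasks` flags are available:

```
# Assumption: the gist's download script accepts --data_dir and --tasks.
python download_glue_data.py --data_dir $HOME/glue --tasks all
export DATASET_DIR=$HOME/glue
```

The classifier example script expects the MRPC task to end up under `$DATASET_DIR/MRPC`.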

### SQuAD data

The Stanford Question Answering Dataset (SQuAD) files can be downloaded
from the [Google bert repo](https://github.com/google-research/bert#squad-11).
The three files (`train-v1.1.json`, `dev-v1.1.json`, and `evaluate-v1.1.py`)
should be downloaded to the same directory. Set the `DATASET_DIR` environment
variable to point to that directory when running BERT fine tuning using the
SQuAD data.
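
A minimal sketch (the `train-v1.1.json` and `dev-v1.1.json` URLs below are the
SQuAD 1.1 files linked from the Google bert repo; verify them before use):

```
mkdir -p $HOME/squad && cd $HOME/squad
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
# Also save evaluate-v1.1.py (linked from the same Google bert repo section)
# into this directory.
export DATASET_DIR=$HOME/squad
```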

## Example Scripts

| Script name | Description |
|-------------|-------------|
| [`fp32_classifier_training.sh`](fp32_classifier_training.sh) | This script fine-tunes the bert base model on the Microsoft Research Paraphrase Corpus (MRPC), which contains only 3,600 examples. Download the [bert base pretrained model](https://github.com/google-research/bert#pre-trained-models) and set the `CHECKPOINT_DIR` to that directory. The `DATASET_DIR` should point to the [GLUE data](#glue-data). |
| [`fp32_squad_training.sh`](fp32_squad_training.sh) | This script fine-tunes bert using the SQuAD data. Download the [bert large pretrained model](https://github.com/google-research/bert#pre-trained-models) and set the `CHECKPOINT_DIR` to that directory. The `DATASET_DIR` should point to the [SQuAD data files](#squad-data). |
| [`fp32_training_single_node.sh`](fp32_training_single_node.sh) | This script is used by the single-node Kubernetes job to run bert classifier training. |
| [`fp32_training_multi_node.sh`](fp32_training_multi_node.sh) | This script is used by the Kubernetes pods to run bert classifier training across multiple nodes using mpirun and horovod. |

These examples can be run in the following environments:
* [Bare metal](#bare-metal)
* [Docker](#docker)
* [Kubernetes](#kubernetes)

## Bare Metal

To run on bare metal, the following prerequisites must be installed in your environment:
* Python 3
* [intel-tensorflow==2.1.0](https://pypi.org/project/intel-tensorflow/)
* numactl
* git

Once the above dependencies have been installed, download and untar the model
package, set environment variables, and then run an example script. See the
[datasets](#datasets) and [list of example scripts](#example-scripts) for more
details on the different options.

The snippet below shows an example running with a single instance:
```
wget https://ubit-artifactory-or.intel.com/artifactory/list/cicd-or-local/model-zoo/bert-large-fp32-training.tar.gz
tar -xvf bert-large-fp32-training.tar.gz
cd bert-large-fp32-training
export CHECKPOINT_DIR=<path to the pretrained bert model directory>
export DATASET_DIR=<path to the dataset being used>
export OUTPUT_DIR=<directory where checkpoints and log files will be saved>
# Run a script for your desired usage
./examples/<script name>.sh
```

To run distributed training (one MPI process per socket) for better throughput,
set the `MPI_NUM_PROCESSES` var to the number of sockets to use. Note that the
global batch size is `mpi_num_processes * train_batch_size`, and sometimes the
learning rate needs to be adjusted for convergence. By default, the script uses
square root learning rate scaling.
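
As a hypothetical example, using the classifier example script's defaults of
`train_batch_size=32` and a 2e-5 base learning rate:

```
# Hypothetical two-socket run: one MPI process per socket.
export MPI_NUM_PROCESSES=2
# Global batch size = 2 * 32 = 64
# Square root scaling: effective learning rate = 2e-5 * sqrt(2), about 2.8e-5
```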

For fine-tuning tasks like BERT, state-of-the-art accuracy can be achieved via
parallel training without synchronizing gradients between MPI workers. The
`mpi_workers_sync_gradients=[True/False]` option controls whether the MPI
workers sync gradients. By default it is set to `False`, meaning the workers
train independently and the best-performing training results are picked in the
end. To enable gradient synchronization, set `mpi_workers_sync_gradients` to
`True` in the BERT options. To modify the BERT options, edit the example `.sh`
script or call the `launch_benchmark.py` script directly with your preferred
args.
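
The sketch below shows what a direct call might look like, based on the
classifier example script in this package; passing `mpi_workers_sync_gradients=True`
after the `--` model-args separator is an assumption, not a verified interface:

```
# Sketch only: mirrors examples/fp32_classifier_training.sh and adds the
# mpi_workers_sync_gradients option after the `--` separator (assumed placement).
python benchmarks/launch_benchmark.py \
  --model-name=bert_large \
  --precision=fp32 \
  --mode=training \
  --framework=tensorflow \
  --batch-size=32 \
  --mpi_num_processes=${MPI_NUM_PROCESSES} \
  --output-dir=$OUTPUT_DIR \
  -- train-option=Classifier \
     task-name=MRPC \
     do-train=true \
     do-eval=true \
     data-dir=$DATASET_DIR/MRPC \
     vocab-file=$CHECKPOINT_DIR/vocab.txt \
     config-file=$CHECKPOINT_DIR/bert_config.json \
     init-checkpoint=$CHECKPOINT_DIR/bert_model.ckpt \
     mpi_workers_sync_gradients=True
```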

To run with multiple instances, these additional dependencies will need to be
installed in your environment:
* openmpi-bin
* openmpi-common
* openssh-client
* openssh-server
* libopenmpi-dev
* horovod==0.19.1
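
The commands below show one way to install them (a sketch, assuming a
Debian/Ubuntu host with pip available):

```
# Assumption: Debian/Ubuntu package names; run as root or with sudo.
apt-get update && apt-get install -y openmpi-bin openmpi-common \
    openssh-client openssh-server libopenmpi-dev
python -m pip install horovod==0.19.1
```

Then download the package, set the environment variables (including
`MPI_NUM_PROCESSES`), and run an example script: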

```
wget https://ubit-artifactory-or.intel.com/artifactory/list/cicd-or-local/model-zoo/bert-large-fp32-training.tar.gz
tar -xvf bert-large-fp32-training.tar.gz
cd bert-large-fp32-training
export CHECKPOINT_DIR=<path to the pretrained bert model directory>
export DATASET_DIR=<path to the dataset being used>
export OUTPUT_DIR=<directory where checkpoints and log files will be saved>
export MPI_NUM_PROCESSES=<number of sockets to use>
# Run a script for your desired usage
./examples/<script name>.sh
```

## Docker

The BERT large FP32 training container includes the scripts and libraries
needed to run BERT large FP32 fine tuning. To run one of the example scripts
using this container, you'll need to provide volume mounts for the pretrained
model, the dataset, and an output directory where log and checkpoint files
will be written.

The snippet below shows an example running with a single instance:
```
CHECKPOINT_DIR=<path to the pretrained bert model directory>
DATASET_DIR=<path to the dataset being used>
OUTPUT_DIR=<directory where checkpoints and log files will be saved>
docker run \
--env CHECKPOINT_DIR=${CHECKPOINT_DIR} \
--env DATASET_DIR=${DATASET_DIR} \
--env OUTPUT_DIR=${OUTPUT_DIR} \
--env http_proxy=${http_proxy} \
--env https_proxy=${https_proxy} \
--volume ${CHECKPOINT_DIR}:${CHECKPOINT_DIR} \
--volume ${DATASET_DIR}:${DATASET_DIR} \
--volume ${OUTPUT_DIR}:${OUTPUT_DIR} \
--privileged --init -it \
amr-registry.caas.intel.com/aipg-tf/model-zoo:2.1.0-language-modeling-bert-large-fp32-training \
/bin/bash examples/<script name>.sh
```

To run distributed training (one MPI process per socket) for better throughput,
set the `MPI_NUM_PROCESSES` var to the number of sockets to use. Note that the
global batch size is `mpi_num_processes * train_batch_size`, and sometimes the
learning rate needs to be adjusted for convergence. By default, the script uses
square root learning rate scaling.

For fine-tuning tasks like BERT, state-of-the-art accuracy can be achieved via
parallel training without synchronizing gradients between MPI workers. The
`mpi_workers_sync_gradients=[True/False]` option controls whether the MPI
workers sync gradients. By default it is set to `False`, meaning the workers
train independently and the best-performing training results are picked in the
end. To enable gradient synchronization, set `mpi_workers_sync_gradients` to
`True` in the BERT options. To modify the BERT options, edit the example `.sh`
script or call the `launch_benchmark.py` script directly with your preferred
args.

The snippet below shows an example running distributed training with Docker:
```
CHECKPOINT_DIR=<path to the pretrained bert model directory>
DATASET_DIR=<path to the dataset being used>
OUTPUT_DIR=<directory where checkpoints and log files will be saved>
MPI_NUM_PROCESSES=<number of sockets to use>
docker run \
--env CHECKPOINT_DIR=${CHECKPOINT_DIR} \
--env DATASET_DIR=${DATASET_DIR} \
--env OUTPUT_DIR=${OUTPUT_DIR} \
--env MPI_NUM_PROCESSES=${MPI_NUM_PROCESSES} \
--env http_proxy=${http_proxy} \
--env https_proxy=${https_proxy} \
--volume ${CHECKPOINT_DIR}:${CHECKPOINT_DIR} \
--volume ${DATASET_DIR}:${DATASET_DIR} \
--volume ${OUTPUT_DIR}:${OUTPUT_DIR} \
--privileged --init -it \
amr-registry.caas.intel.com/aipg-tf/model-zoo:2.1.0-language-modeling-bert-large-fp32-training \
/bin/bash examples/<script name>.sh
```

## Kubernetes

Download and untar the bert large FP32 training package:
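
The same package download shown in the bare metal section can be used here:

```
wget https://ubit-artifactory-or.intel.com/artifactory/list/cicd-or-local/model-zoo/bert-large-fp32-training.tar.gz
tar -xvf bert-large-fp32-training.tar.gz
cd bert-large-fp32-training
```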
@@ -77,7 +237,7 @@ The distributed training algorithm is handled by mpirun.
The command to run an MPIJob is shown below:

```
-kubectl -k bert_large_fp32_training/examples/k8s/mlops/multi-node apply
+kubectl -k bert-large-fp32-training/examples/k8s/mlops/multi-node apply
```

Within the multi-node use case, a number of kustomize processing directives are enabled.
@@ -128,7 +288,7 @@ the /workspace/bert-large-fp32-training/examples/fp32_training_single_node.sh co
The command to run a pod is shown below:

```
-kubectl -k bert_large_fp32_training/examples/k8s/mlops/single-node apply
+kubectl -k bert-large-fp32-training/examples/k8s/mlops/single-node apply
```

Within the single-node use case, the same number of kustomize processing directives are enabled as the multi-node.
@@ -174,7 +334,7 @@ kubectl logs -f $(kubectl get pods -oname|grep training|cut -c5-)
Removing this MPIJob (and stopping training) is done by running:

```
-kubectl -k bert_large_fp32_training/examples/k8s/mlops/multi-node delete
+kubectl -k bert-large-fp32-training/examples/k8s/mlops/multi-node delete
```

Remove the mpi-operator after running the example by running the following command with
@@ -0,0 +1,81 @@
#!/usr/bin/env bash
#
# Copyright (c) 2020 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

MODEL_DIR=${MODEL_DIR-$PWD}

echo 'MODEL_DIR='$MODEL_DIR
echo 'OUTPUT_DIR='$OUTPUT_DIR
echo 'CHECKPOINT_DIR='$CHECKPOINT_DIR
echo 'DATASET_DIR='$DATASET_DIR

if [[ -z $OUTPUT_DIR ]]; then
  echo "The required environment variable OUTPUT_DIR has not been set" >&2
  exit 1
fi

# Create the output directory, if it doesn't already exist
mkdir -p $OUTPUT_DIR

# Create an array of input directories that are expected and then verify that they exist
declare -A input_dirs
input_dirs[CHECKPOINT_DIR]=${CHECKPOINT_DIR}
input_dirs[DATASET_DIR]=${DATASET_DIR}

for i in "${!input_dirs[@]}"; do
  var_name=$i
  dir_path=${input_dirs[$i]}

  if [[ -z $dir_path ]]; then
    echo "The required environment variable $var_name is empty" >&2
    exit 1
  fi

  if [[ ! -d $dir_path ]]; then
    echo "The $var_name path '$dir_path' does not exist" >&2
    exit 1
  fi
done

mpi_num_proc_arg=""

if [[ -n $MPI_NUM_PROCESSES ]]; then
  mpi_num_proc_arg="--mpi_num_processes=${MPI_NUM_PROCESSES}"
fi

# Launch BERT large classifier fine-tuning on the GLUE MRPC task via the
# model zoo's launch_benchmark.py script.
python ${MODEL_DIR}/benchmarks/launch_benchmark.py \
  --model-name=bert_large \
  --precision=fp32 \
  --mode=training \
  --framework=tensorflow \
  --batch-size=32 \
  ${mpi_num_proc_arg} \
  --output-dir=$OUTPUT_DIR \
  -- train-option=Classifier \
     task-name=MRPC \
     do-train=true \
     do-eval=true \
     data-dir=$DATASET_DIR/MRPC \
     vocab-file=$CHECKPOINT_DIR/vocab.txt \
     config-file=$CHECKPOINT_DIR/bert_config.json \
     init-checkpoint=$CHECKPOINT_DIR/bert_model.ckpt \
     max-seq-length=128 \
     learning-rate=2e-5 \
     num-train-epochs=30 \
     optimized_softmax=True \
     experimental_gelu=False \
     do-lower-case=True

@@ -0,0 +1,82 @@
#!/usr/bin/env bash
#
# Copyright (c) 2020 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

MODEL_DIR=${MODEL_DIR-$PWD}

echo 'MODEL_DIR='$MODEL_DIR
echo 'OUTPUT_DIR='$OUTPUT_DIR
echo 'CHECKPOINT_DIR='$CHECKPOINT_DIR
echo 'DATASET_DIR='$DATASET_DIR

if [[ -z $OUTPUT_DIR ]]; then
  echo "The required environment variable OUTPUT_DIR has not been set" >&2
  exit 1
fi

# Create the output directory, if it doesn't already exist
mkdir -p $OUTPUT_DIR

# Create an array of input directories that are expected and then verify that they exist
declare -A input_dirs
input_dirs[CHECKPOINT_DIR]=${CHECKPOINT_DIR}
input_dirs[DATASET_DIR]=${DATASET_DIR}

for i in "${!input_dirs[@]}"; do
  var_name=$i
  dir_path=${input_dirs[$i]}

  if [[ -z $dir_path ]]; then
    echo "The required environment variable $var_name is empty" >&2
    exit 1
  fi

  if [[ ! -d $dir_path ]]; then
    echo "The $var_name path '$dir_path' does not exist" >&2
    exit 1
  fi
done

mpi_num_proc_arg=""

if [[ -n $MPI_NUM_PROCESSES ]]; then
  mpi_num_proc_arg="--mpi_num_processes=${MPI_NUM_PROCESSES}"
fi

# Launch BERT large SQuAD fine-tuning via the model zoo's launch_benchmark.py script.
python ${MODEL_DIR}/benchmarks/launch_benchmark.py \
  --model-name=bert_large \
  --precision=fp32 \
  --mode=training \
  --framework=tensorflow \
  --batch-size=24 \
  --output-dir $OUTPUT_DIR \
  -- train_option=SQuAD \
     vocab_file=$CHECKPOINT_DIR/vocab.txt \
     config_file=$CHECKPOINT_DIR/bert_config.json \
     init_checkpoint=$CHECKPOINT_DIR/bert_model.ckpt \
     do_train=True \
     train_file=$DATASET_DIR/train-v1.1.json \
     do_predict=True \
     predict_file=$DATASET_DIR/dev-v1.1.json \
     learning_rate=3e-5 \
     num_train_epochs=2 \
     max_seq_length=384 \
     doc_stride=128 \
     output_dir=./large \
     optimized_softmax=True \
     experimental_gelu=False \
     do_lower_case=True

@@ -48,12 +48,12 @@ if [[ ! -d $DATASET_DIR ]]; then
exit 1
fi

-BERT_BASE_DIR=$DATASET_DIR/dataset/bert_official/MRPC
+BERT_BASE_DIR=$DATASET_DIR/dataset/bert_large_wwm/wwm_uncased_L-24_H-1024_A-16
GLUE_DIR=$DATASET_DIR/dataset/bert_official

python benchmarks/launch_benchmark.py \
--model-name=bert_large \
---precision=bfloat16 \
+--precision=fp32 \
--mode=training \
--framework=tensorflow \
--batch-size=32 \
@@ -70,5 +70,5 @@ python benchmarks/launch_benchmark.py \
num-train-epochs=30 \
output-dir=/tmp/mrpc_output/ \
optimized_softmax=True \
-experimental_gelu=True \
+experimental_gelu=False \
do-lower-case=True
