diff --git a/.gitignore b/.gitignore
index e13d7d768..e8f8cc26e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,4 @@
 test_data/
 download_glue_data.py
 data/
+output/
diff --git a/examples/language_modeling/tensorflow/bert_large/training/fp32/README.md b/examples/language_modeling/tensorflow/bert_large/training/fp32/README.md
index bf0eb6cc3..85cb87b70 100644
--- a/examples/language_modeling/tensorflow/bert_large/training/fp32/README.md
+++ b/examples/language_modeling/tensorflow/bert_large/training/fp32/README.md
@@ -9,15 +9,175 @@ should be downloaded as mentioned in the [Google bert repo](https://github.com/g
 Refer to google reference page for [checkpoints](https://github.com/google-research/bert#pre-trained-models).
 
+## Datasets
+
+### Pretrained models
+
+Download and extract the bert pretrained model checkpoints from the
+[google bert repo](https://github.com/google-research/bert#pre-trained-models).
+The extracted directory should be set as the `CHECKPOINT_DIR` environment
+variable when running the example scripts.
+
+For training from scratch, Wikipedia and BookCorpus need to be downloaded
+and pre-processed.
+
+### GLUE data
+
+[GLUE data](https://gluebenchmark.com/tasks) is used when running BERT
+classification training. Download and unpack the GLUE data by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e).
+
+### SQuAD data
+
+The Stanford Question Answering Dataset (SQuAD) files can be downloaded
+from the [Google bert repo](https://github.com/google-research/bert#squad-11).
+The three files (`train-v1.1.json`, `dev-v1.1.json`, and `evaluate-v1.1.py`)
+should be downloaded to the same directory. Set the `DATASET_DIR` to point to
+that directory when running bert fine-tuning using the SQuAD data. An
+illustrative set of staging commands is shown after the example scripts table
+below.
+
 ## Example Scripts
 
 | Script name | Description |
 |-------------|-------------|
-| [`fp32_training_multi_node.sh`](fp32_training_multi_node.sh) | This script is used by the Kubernetes pods to run training across multiple nodes using mpirun and horovod. |
+| [`fp32_classifier_training.sh`](fp32_classifier_training.sh) | This script fine-tunes the bert base model on the Microsoft Research Paraphrase Corpus (MRPC), which contains only 3,600 examples. Download the [bert base pretrained model](https://github.com/google-research/bert#pre-trained-models) and set the `CHECKPOINT_DIR` to that directory. The `DATASET_DIR` should point to the [GLUE data](#glue-data). |
+| [`fp32_squad_training.sh`](fp32_squad_training.sh) | This script fine-tunes bert using SQuAD data. Download the [bert large pretrained model](https://github.com/google-research/bert#pre-trained-models) and set the `CHECKPOINT_DIR` to that directory. The `DATASET_DIR` should point to the [SQuAD data files](#squad-data). |
+| [`fp32_training_single_node.sh`](fp32_training_single_node.sh) | This script is used by the single node Kubernetes job to run bert classifier training. |
+| [`fp32_training_multi_node.sh`](fp32_training_multi_node.sh) | This script is used by the Kubernetes pods to run bert classifier training across multiple nodes using mpirun and horovod. |
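+
+The commands below are an illustrative sketch (not part of the original
+instructions) of one way to stage the pretrained checkpoints and the SQuAD
+files described in the [datasets](#datasets) section and point the environment
+variables at them. The checkpoint and SQuAD URLs are the ones linked from the
+Google bert repo and the SQuAD site and may change; the local paths are
+arbitrary examples.
+
+```
+# Illustrative only: BERT-Large, Uncased checkpoint linked from the Google bert repo
+wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
+unzip uncased_L-24_H-1024_A-16.zip -d $HOME/bert
+export CHECKPOINT_DIR=$HOME/bert/uncased_L-24_H-1024_A-16
+
+# Illustrative only: SQuAD v1.1 files, all in one directory
+mkdir -p $HOME/squad
+wget -P $HOME/squad https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
+wget -P $HOME/squad https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
+# evaluate-v1.1.py should also be copied into this directory (see the Google bert repo for the link)
+export DATASET_DIR=$HOME/squad
+```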
 
 These examples can be run in the following environments:
+* [Bare metal](#bare-metal)
+* [Docker](#docker)
 * [Kubernetes](#kubernetes)
 
+## Bare Metal
+
+To run on bare metal, the following prerequisites must be installed in your environment:
+* Python 3
+* [intel-tensorflow==2.1.0](https://pypi.org/project/intel-tensorflow/)
+* numactl
+* git
+
+Once the above dependencies have been installed, download and untar the model
+package, set environment variables, and then run an example script. See the
+[datasets](#datasets) and [list of example scripts](#example-scripts) for more
+details on the different options.
+
+The snippet below shows an example running with a single instance (illustrative
+values for the environment variable placeholders are noted at the end of the
+snippet):
+```
+wget https://ubit-artifactory-or.intel.com/artifactory/list/cicd-or-local/model-zoo/bert-large-fp32-training.tar.gz
+tar -xvf bert-large-fp32-training.tar.gz
+cd bert-large-fp32-training
+
+CHECKPOINT_DIR=
+DATASET_DIR=
+OUTPUT_DIR=
+
+# Run a script for your desired usage
+./examples/
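+
+# Illustrative only (not part of the original snippet): if the checkpoints and
+# SQuAD data were staged as in the sketch under the example scripts table, the
+# placeholders above might be filled in as:
+#   CHECKPOINT_DIR=$HOME/bert/uncased_L-24_H-1024_A-16
+#   DATASET_DIR=$HOME/squad
+#   OUTPUT_DIR=$HOME/bert-output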