Merge pull request #252 from foundation-model-stack/v100rc1_release
olson-ibm authored Jul 19, 2024
2 parents 16c2a1d + dcabbe7 commit 4ed1bc4
Showing 70 changed files with 2,218 additions and 440 deletions.
41 changes: 37 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,25 @@
# FMS HF Tuning

- [Installation](#installation)
- [Data format](#data-format)
- [Supported Models](#supported-models)
- [Training](#training)
- [Single GPU](#single-gpu)
- [Multiple GPUs with FSDP](#multiple-gpus-with-fsdp)
- [Tuning Techniques](#tuning-techniques)
- [LoRA Tuning Example](#lora-tuning-example)
- [Prompt Tuning](#prompt-tuning)
- [Fine Tuning](#fine-tuning)
- [FMS Acceleration](#fms-acceleration)
- [Inference](#inference)
- [Running a single example](#running-a-single-example)
- [Running multiple examples](#running-multiple-examples)
- [Inference Results Format](#inference-results-format)
- [Changing the Base Model for Inference](#changing-the-base-model-for-inference)
- [Validation](#validation)
- [Trainer Controller Framework](#trainer-controller-framework)
- [More Examples](#more-examples)

This repo provides basic tuning scripts with support for specific models. The repo relies on Hugging Face `SFTTrainer` and PyTorch FSDP. Our approach to tuning is:
1. Models are loaded from Hugging Face `transformers` or the [foundation-model-stack](https://github.com/foundation-model-stack/foundation-model-stack) -- models are either optimized to use `Flash Attention v2` directly or through `SDPA`
2. Hugging Face `SFTTrainer` for the training loop
@@ -25,7 +45,7 @@ pip install fms-hf-tuning[aim]

If you wish to use [fms-acceleration](https://github.com/foundation-model-stack/fms-acceleration), you need to install it.
```
pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework
pip install fms-hf-tuning[fms-accel]
```
`fms-acceleration` is a collection of plugins that accelerate fine-tuning / training of large models, as part of the `fms-hf-tuning` suite. For more details, see [this section below](#fms-acceleration).

@@ -106,6 +126,7 @@ export CUDA_VISIBLE_DEVICES=0

# --tokenizer_name_or_path is optional; if omitted, the tokenizer from --model_name_or_path is used.
python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
@@ -129,6 +150,7 @@ export CUDA_VISIBLE_DEVICES=0

# --tokenizer_name_or_path is optional; if omitted, the tokenizer from --model_name_or_path is used.
python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
@@ -173,7 +195,8 @@ tuning/sft_trainer.py \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--response_template "\n### Response:" \
--dataset_text_field "output"
--dataset_text_field "output" \
--tokenizer_name_or_path $MODEL_PATH # This field is optional and if not specified, tokenizer from model_name_or_path will be used
```
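The fallback described in the comment on `--tokenizer_name_or_path` can be pictured with a small sketch. `resolve_tokenizer_path` is a hypothetical helper written for illustration, not a function from `fms-hf-tuning`, and the parser below covers only the two flags discussed here. It mirrors the stated rule that the tokenizer is taken from `--model_name_or_path` when `--tokenizer_name_or_path` is omitted.

```python
import argparse


def resolve_tokenizer_path(model_name_or_path, tokenizer_name_or_path=None):
    """Hypothetical helper: fall back to the model path when no tokenizer path is given."""
    return tokenizer_name_or_path or model_name_or_path


# Minimal parser with only the two flags discussed above (an assumed subset of the real CLI).
parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path", required=True)
parser.add_argument("--tokenizer_name_or_path", default=None)

# The tokenizer flag is left out here, so the model path is reused.
args = parser.parse_args(["--model_name_or_path", "/models/base"])
tokenizer_path = resolve_tokenizer_path(args.model_name_or_path, args.tokenizer_name_or_path)
print(tokenizer_path)
```

Passing both flags would simply return the explicit tokenizer path instead.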

To summarize, you can pick either `python` for single-GPU jobs or `accelerate launch` for multi-GPU jobs. The following tuning techniques can be applied:
@@ -205,6 +228,7 @@ Example command to run:
```bash
# --tokenizer_name_or_path is optional; if omitted, the tokenizer from --model_name_or_path is used.
python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 40 \
@@ -323,7 +347,7 @@ python tuning/sft_trainer.py \
--response_template "\n### Label:" \
--dataset_text_field "output" \
--peft_method pt \
--tokenizer_name_or_path $MODEL_PATH
--tokenizer_name_or_path $MODEL_PATH \
--prompt_tuning_init "RANDOM" \
--prompt_tuning_init_text "From the following input, identify target sentiment of following types: neutral, negative, positive"
```
@@ -358,6 +382,7 @@ accelerate launch \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
@@ -389,7 +414,7 @@ Equally you can pass in a JSON configuration for running tuning. See [build doc]

To access `fms-acceleration` features the `[fms-accel]` dependency must first be installed:
```
$ pip install https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework
$ pip install fms-hf-tuning[fms-accel]
```

Furthermore, the required `fms-acceleration` plugin must be installed. This is done via the command line utility `fms_acceleration.cli`. To show available plugins:
@@ -516,6 +541,14 @@ python main.py \

The above runs several tasks with `hendrycksTest-*` being MMLU.

## Trainer Controller Framework

Trainer controller is a framework for controlling the trainer loop using user-defined rules and metrics.

This framework helps users define rules that capture scenarios such as stopping criteria for an ongoing training run (e.g., validation loss reaching a certain target, validation loss increasing with each epoch, or training loss increasing over the last 100 steps).

For details on how to set custom stopping criteria and perform custom operations, see [examples/trainercontroller_configs/Readme.md](examples/trainercontroller_configs/Readme.md).

## More Examples

[Prompt Tuning on Twitter Complaints](examples/prompt_tuning_twitter_complaints/README.md)
5 changes: 2 additions & 3 deletions architecture_records/001-trainer-controller-framework.md
@@ -98,7 +98,7 @@ controller-metrics:
- name: loss
class: Loss
controllers:
- name: loss-controller
- name: loss_controller
triggers:
- on_log
rule: "loss < 1.0"
@@ -107,9 +107,8 @@ controllers:
```
We follow the below naming convention for the above trainer controller configuration:
1. `-` could be used in the case of key names, and name of the metric, operation or controller. This is usually to break multiple words of a name phrase.
1. Python convention for [class name](https://visualgit.readthedocs.io/en/latest/pages/naming_convention.html#classes).
1. `_` are used for events and control actions.
1. `_` should be used between words in keys, values, events and control actions.

For defining custom handler classes, we have an interface defined as an abstract class as shown below, with two abstract methods, namely: `validate()` to define the validation conditions, and `compute()` to compute the metric. The `compute()` returns an `Any` type. While it could be any value, developers should keep in mind that it should be only key-value pairs that are used in the rule(s) defined in the configuration.
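The project's actual abstract class is not reproduced in this excerpt. As a rough illustration only, the shape described in the paragraph (an abstract class with `validate()` and `compute()`) might look like the sketch below. The names `SketchMetricHandler` and `SketchLoss`, the constructor argument, and the `logs` keyword are all assumptions made for this example and should not be read as the framework's API.

```python
import abc
from typing import Any


class SketchMetricHandler(abc.ABC):
    """Hypothetical stand-in for the handler interface; not the project's class."""

    def __init__(self, name: str):
        self.name = name

    @abc.abstractmethod
    def validate(self) -> bool:
        """Return True when the handler's preconditions are met."""

    @abc.abstractmethod
    def compute(self, **kwargs) -> Any:
        """Return the value(s) that rules in the configuration will reference."""


class SketchLoss(SketchMetricHandler):
    """Illustrative metric exposing the latest training loss from a log record."""

    def validate(self) -> bool:
        # Nothing to check in this toy example.
        return True

    def compute(self, **kwargs) -> Any:
        # `logs` is an assumed keyword carrying the trainer's most recent log entry.
        logs = kwargs.get("logs", {})
        return logs.get("loss")


metric = SketchLoss(name="loss")
value = metric.compute(logs={"loss": 0.8})
print(metric.name, value)
```

A rule such as `loss < 1.0` would then be evaluated against the value returned under the metric's name.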

2 changes: 1 addition & 1 deletion examples/prompt_tuning_twitter_complaints/README.md
@@ -51,7 +51,7 @@ tuning/sft_trainer.py \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--eval_strategy "no" \
--save_strategy "epoch" \
--learning_rate 1e-5 \
--weight_decay 0. \
97 changes: 92 additions & 5 deletions examples/trainercontroller_configs/Readme.md
@@ -1,5 +1,92 @@
# How-To
To use one of these files with the trainer, execute the `sft_trainer.py` with the following option:
```
--trainer_controller_config_file "examples/trainercontroller_configs/<file-name>"
```
# Trainer controller

Trainer controller is a framework for controlling the trainer loop using user-defined rules and metrics.

### Motivation

This framework helps users define rules that capture scenarios such as stopping criteria for an ongoing training run (e.g., validation loss reaching a certain target, validation loss increasing with each epoch, or training loss increasing over the last 100 steps).

### Usage
*Note: Evaluation loss and validation loss are the same.*
1. The trainer controller is driven by a configuration file (illustrated below) that the user supplies at the start of training. Here is a sample of how to enable the trainer controller for a training job by passing the path to an existing configuration, `epoch-level-eval-loss-below-threshold.yaml` in the `./examples/trainercontroller_configs` directory, with the flag `--trainer_controller_config_file`:
```shell
python ./tuning/sft_trainer.py \
...
--trainer_controller_config_file "$EXAMPLE_CONFIGS/epoch-level-eval-loss-below-threshold.yaml" \
...
...
```

1. For this usage illustration, we could use the `epoch-level-eval-loss-below-threshold.yaml` in the `./examples/trainercontroller_configs` directory as shown below:
```yaml
controller_metrics:
- name: trainer_state
class: TrainingState
- name: evalmetric
class: EvalMetrics
controllers:
- name: epoch_level_eval_loss_below_threshold
triggers:
- on_epoch_end
rule: 'evalmetric["eval_loss"] < 2.25 and trainer_state["epoch"] > 2'
operations:
- hfcontrols.should_training_stop
```
Here is a brief primer on the above configuration. More details could be found [here](./architecture_records/001-trainer-controller-framework.md).
- *Description:* The above configuration stops the training when the **evaluation loss** drops below 2.25 after two epochs.
- *Metrics:* The configuration uses two metrics listed under the `controller_metrics` section. One is named `evalmetric`, which uses a built-in metric class called `EvalMetrics` to expose the evaluation loss, and the other (`trainer_state`) uses `TrainingState` to expose the current epoch. Both are referenced in the `rule` shown above. Other metrics could be used in place of `evalmetric` and `trainer_state`. Here is a list of supported metric classes:
- `Loss`: Exposes the **training loss** after every `on_log` event. See more on trainer events [here](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/callback#transformers.TrainerCallback).
- `TrainingState`: This metric exposes the **trainer state** (more on trainer state can be found [here](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/callback#transformers.TrainerState)). [Here](tests/data/trainercontroller/loss_on_threshold_with_trainer_state.yaml) is an example configuration which uses both the `TrainingState` and `Loss` metrics.
- `EvalMetrics`: This metric exposes all the evaluation metrics used in the training job (e.g., evaluation/validation loss). [Here](tests/data/trainercontroller/exposed_metrics.yaml) is an example configuration which uses the `EvalMetrics` metric.
- `HistoryBasedMetric`: This metric exposes a moving **window** of evaluation metrics and training loss. It is useful to create rules on a history of values (i.e. evaluation metrics and training loss). Following are some examples which illustrate how this metric could be used:
- [epoch-level-eval-loss-patience.yaml](tests/data/trainercontroller/epoch-level-eval-loss-patience.yaml): This configuration performs a threshold test for evaluation loss with a **patience threshold** of 2. To illustrate patience: suppose the evaluation loss lower threshold is 2 and the patience threshold is 3. The trainer controller then takes no action (e.g., stopping training) as long as the rule (evaluation loss lower than 2) has been true no more than three consecutive times.
- [non-decreasing-training-loss.yaml](tests/data/trainercontroller/non-decreasing-training-loss.yaml): This configuration compares the first and last values of a window of training loss samples and determines if the training loss has increased or not. If there is an increase, the training is stopped.
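The patience behaviour can be sketched as a counter of consecutive rule hits. This is a minimal illustration under one assumption: the controller tolerates the rule being true `patience_threshold` consecutive times and acts on the next consecutive hit. `PatienceSketch` is a hypothetical class, and the framework's own counting may differ in detail.

```python
class PatienceSketch:
    """Hypothetical patience counter; the framework's own logic may differ."""

    def __init__(self, patience_threshold: int):
        self.patience_threshold = patience_threshold
        self.consecutive_hits = 0

    def should_act(self, rule_outcome: bool) -> bool:
        # A false outcome breaks the streak and resets the counter.
        if not rule_outcome:
            self.consecutive_hits = 0
            return False
        self.consecutive_hits += 1
        # Tolerate up to `patience_threshold` consecutive hits, then act.
        return self.consecutive_hits > self.patience_threshold


patience = PatienceSketch(patience_threshold=2)
outcomes = [True, False, True, True, True]
decisions = [patience.should_act(outcome) for outcome in outcomes]
print(decisions)
```

In this run the single `False` resets the streak, so the action only fires on the last outcome, after three consecutive hits.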

Let us use the example below to understand the usage:
```yaml
controller_metrics:
- name: history_window
class: HistoryBasedMetric
arguments:
window_size: 2
controllers:
- name: epoch_level_eval_loss_patience
triggers:
- on_epoch_end
rule: len(history_window["metrics"]) > 0 and history_window["metrics"]["eval_loss"][-1] > 2
patience:
patience_threshold: 2
operations:
- hfcontrols.should_training_stop
```
In the above YAML, the name used for the `HistoryBasedMetric` is `history_window`. Here is a short primer on defining rules using the `HistoryBasedMetric`:
1. Treat the `history_window` as a python dictionary. The structure of the data in this dictionary is:
```yaml
{
"metrics": {
"global_step": [...],
"epoch": [...],
"eval_loss": [...],
"user_eval_metric_1": [...],
"user_eval_metric_2": [...],
...
},
"training_loss": {
"global_step": [...],
"epoch": [...],
"loss": [...],
}
}
```
1. To access the first value in the window of the evaluation metric `eval_loss`, use `history_window["metrics"]["eval_loss"][0]`. In the above YAML, the last element is accessed as `history_window["metrics"]["eval_loss"][-1]`.
1. Similarly, `history_window["metrics"]["global_step"][0]` is the global step at which this evaluation metric was generated, and `history_window["metrics"]["epoch"][0]` is the corresponding epoch.
1. The same approach is used to access the training loss (i.e. `history_window["training_loss"]["loss"][0]` gives the first training loss).
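To make the indexing concrete, here is a small sketch of a moving window that produces a dictionary of the shape described above. `HistoryWindowSketch` is a hypothetical class built on `collections.deque`; it is not the framework's `HistoryBasedMetric`, and the `window_size` key is included only because the example configurations reference it in their rules.

```python
from collections import deque


class HistoryWindowSketch:
    """Hypothetical moving window shaped like the dictionary described above."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        # Each series keeps only the most recent `window_size` values.
        self._metrics = {
            key: deque(maxlen=window_size) for key in ("global_step", "epoch", "eval_loss")
        }
        self._training = {
            key: deque(maxlen=window_size) for key in ("global_step", "epoch", "loss")
        }

    def add_eval(self, global_step, epoch, eval_loss):
        self._metrics["global_step"].append(global_step)
        self._metrics["epoch"].append(epoch)
        self._metrics["eval_loss"].append(eval_loss)

    def add_training(self, global_step, epoch, loss):
        self._training["global_step"].append(global_step)
        self._training["epoch"].append(epoch)
        self._training["loss"].append(loss)

    def as_dict(self):
        return {
            "metrics": {key: list(values) for key, values in self._metrics.items()},
            "training_loss": {key: list(values) for key, values in self._training.items()},
            "window_size": self.window_size,
        }


window = HistoryWindowSketch(window_size=2)
for epoch, loss in [(1, 2.6), (2, 2.4), (3, 2.1)]:
    window.add_eval(global_step=epoch * 100, epoch=epoch, eval_loss=loss)

history_window = window.as_dict()
print(history_window["metrics"]["eval_loss"][0])   # oldest value still in the window
print(history_window["metrics"]["eval_loss"][-1])  # most recent value
```

With a window of size 2, the first evaluation has already been dropped by the time the third one arrives.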

- *Trigger:* There is also a trigger event to decide when the `rule` needs to be evaluated. This event has to be one of the trainer events listed [here](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/callback#transformers.TrainerCallback).
- *Rule:* The `rule` is a Python expression which can use the metric names (e.g. `evalmetric` and `trainer_state` in the above case) to define a boolean condition. When the condition evaluates to True, the operation(s) listed in `operations` are triggered.
- *Operation:* The `operations` section lists the operations that could be performed when the `rule` is satisfied (i.e. condition becomes True). Currently, we support only one type of operation class `HFControls` (In this particular example, the class and corresponding operation name `hfcontrols` are not specified explicitly as they are considered default and can be omitted). The `HFControls` class supports all operations listed below. More on these operations can be found [here](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/callback#transformers.TrainerControl).
- `hfcontrols.should_training_stop`: Stops the training.
- `hfcontrols.should_epoch_stop`: Interrupts the current epoch.
- `hfcontrols.should_save`: Saves the model at the current step.
- `hfcontrols.should_evaluate`: Evaluates the model at the current step.
- `hfcontrols.should_log`: Logs at the current step.
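The interplay of trigger, rule, and operation can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's implementation: `ControlSketch` is a hypothetical stand-in for the flags on Hugging Face's `TrainerControl`, the rule is evaluated with plain `eval` (the framework's own evaluator is not shown in this excerpt), and `run_controller` is an invented helper.

```python
from dataclasses import dataclass


@dataclass
class ControlSketch:
    """Hypothetical stand-in for the flags on Hugging Face's `TrainerControl`."""

    should_training_stop: bool = False
    should_epoch_stop: bool = False
    should_save: bool = False
    should_evaluate: bool = False
    should_log: bool = False


def evaluate_rule(rule: str, metrics: dict) -> bool:
    """Evaluate a rule string against metric values (plain `eval`, builtins removed)."""
    return bool(eval(rule, {"__builtins__": {}, "len": len}, dict(metrics)))


def run_controller(controller: dict, event: str, metrics: dict, control: ControlSketch) -> ControlSketch:
    # Skip controllers that are not triggered by this event.
    if event not in controller["triggers"]:
        return control
    if evaluate_rule(controller["rule"], metrics):
        for operation in controller["operations"]:
            # "hfcontrols.should_training_stop" -> set the flag of the same name.
            _, flag = operation.split(".", 1)
            setattr(control, flag, True)
    return control


controller = {
    "name": "epoch_level_eval_loss_below_threshold",
    "triggers": ["on_epoch_end"],
    "rule": 'evalmetric["eval_loss"] < 2.25 and trainer_state["epoch"] > 2',
    "operations": ["hfcontrols.should_training_stop"],
}
metrics = {"evalmetric": {"eval_loss": 2.1}, "trainer_state": {"epoch": 3}}
control = run_controller(controller, "on_epoch_end", metrics, ControlSketch())
print(control.should_training_stop)
```

Plain `eval` is used here only to keep the sketch short; it should not be applied to untrusted rule strings.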
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
controller_metrics:
- name: trainer_state
class: TrainingState
- name: evalmetric
class: EvalMetrics
controllers:
- name: epoch_level_eval_loss_below_threshold
triggers:
- on_epoch_end
rule: evalmetric['eval_loss'] < 2.25 and trainer_state["epoch"] > 2
operations:
- hfcontrols.should_training_stop
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
controller_metrics:
- name: eval_loss_window
class: HistoryBasedMetric
arguments:
window_size: 2
controllers:
- name: epoch_level_eval_loss_patience
triggers:
- on_epoch_end
rule: len(eval_loss_window["metrics"]) > 0 and eval_loss_window["metrics"]["eval_loss"][-1] > 2
patience:
patience_threshold: 2
operations:
- hfcontrols.should_training_stop
14 changes: 14 additions & 0 deletions examples/trainercontroller_configs/epoch-level-eval-loss.yaml
@@ -0,0 +1,14 @@
controller_metrics:
- name: trainer_state
class: TrainingState
- name: eval_loss_window
class: HistoryBasedMetric
arguments:
window_size: 1
controllers:
- name: epoch_level_eval_loss
triggers:
- on_epoch_end
rule: len(eval_loss_window["metrics"]) > 0 and eval_loss_window["metrics"]["eval_loss"][-1] > 2.2 and trainer_state["epoch"] > 3
operations:
- hfcontrols.should_training_stop
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
controller_metrics:
- name: training_loss_window
class: HistoryBasedMetric
arguments:
window_size: 1
controllers:
- name: epoch_level_stop_on_training_loss_below_threshold
triggers:
- on_log
rule: len(training_loss_window["training_loss"]["loss"]) == training_loss_window["window_size"] and training_loss_window["training_loss"]["loss"][0] < 2.2 and training_loss_window["training_loss"]["epoch"][0] > 2
operations:
- hfcontrols.should_training_stop
14 changes: 14 additions & 0 deletions examples/trainercontroller_configs/epoch-level-training-loss.yaml
@@ -0,0 +1,14 @@
controller_metrics:
- name: trainer_state
class: TrainingState
- name: training_loss_window
class: HistoryBasedMetric
arguments:
window_size: 1
controllers:
- name: epoch_level_training_loss
triggers:
- on_epoch_end
rule: training_loss_window["training_loss"]["loss"][-1] > 2 and trainer_state["epoch"] > 3
operations:
- hfcontrols.should_training_stop
4 changes: 2 additions & 2 deletions examples/trainercontroller_configs/loss.yaml
@@ -1,8 +1,8 @@
controller-metrics:
controller_metrics:
- name: loss
class: Loss
controllers:
- name: loss-controller
- name: loss_controller
triggers:
- on_log
rule: loss < 1.0
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
controller_metrics:
- name: training_loss_window
class: HistoryBasedMetric
arguments:
window_size: 5
controllers:
- name: stop_on_training_loss_not_decreasing
triggers:
- on_log
rule: len(training_loss_window["training_loss"]["loss"]) == training_loss_window["window_size"] and training_loss_window["training_loss"]["loss"][0] < training_loss_window["training_loss"]["loss"][-1]
operations:
- hfcontrols.should_training_stop
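The rule in this configuration fires once the window is full and its oldest training loss is lower than its newest. A minimal sketch of that comparison, using a hypothetical `loss_not_decreasing` function and made-up loss values, is:

```python
def loss_not_decreasing(losses, window_size):
    """True once the window is full and its oldest loss is below its newest."""
    # Checking the length first avoids indexing into a window that is still empty.
    return len(losses) == window_size and losses[0] < losses[-1]


window_size = 5
not_full = loss_not_decreasing([2.0, 1.9, 1.8], window_size)            # window not yet full
falling = loss_not_decreasing([2.0, 1.9, 1.8, 1.7, 1.6], window_size)   # loss still falling
rising = loss_not_decreasing([1.6, 1.7, 1.8, 1.9, 2.1], window_size)    # loss has risen
print(not_full, falling, rising)
```

Only the first and last entries are compared, so fluctuations inside the window do not affect the outcome.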
12 changes: 12 additions & 0 deletions examples/trainercontroller_configs/thresholded-training-loss.yaml
@@ -0,0 +1,12 @@
controller_metrics:
- name: training_loss_window
class: HistoryBasedMetric
arguments:
window_size: 1
controllers:
- name: stop_on_training_loss_not_decreasing
triggers:
- on_log
rule: training_loss_window["training_loss"]["loss"][-1] > 2.2
operations:
- hfcontrols.should_training_stop
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -28,7 +28,7 @@ classifiers=[
dependencies = [
"numpy>=1.26.4,<2.0",
"accelerate>=0.20.3,<0.40",
"transformers>=4.34.1,<=4.40.2,!=4.38.2",
"transformers>4.41,<5.0",
"torch>=2.2.0,<3.0",
"sentencepiece>=0.1.99,<0.3",
"tokenizers>=0.13.3,<1.0",
@@ -44,6 +44,8 @@ dependencies = [
dev = ["wheel>=0.42.0,<1.0", "packaging>=23.2,<24", "ninja>=1.11.1.1,<2.0", "scikit-learn>=1.0, <2.0", "boto3>=1.34, <2.0"]
flash-attn = ["flash-attn>=2.5.3,<3.0"]
aim = ["aim>=3.19.0,<4.0"]
fms-accel = ["fms-acceleration>=0.1"]


[tool.setuptools.packages.find]
exclude = ["tests", "tests.*"]