
Add APIs to offload states of model, optimizer, and engine #6011

Merged: 28 commits merged into master on Sep 27, 2024

Conversation

tohtana
Contributor

@tohtana tohtana commented Aug 16, 2024

This PR adds the following APIs to offload model, optimizer, and engine states.

def offload_states(self,
                   include: Container[OffloadStateTypeEnum] = None,
                   device: OffloadDeviceEnum = OffloadDeviceEnum.cpu,
                   pin_memory: bool = True,
                   non_blocking: bool = False) -> None:
    """Move the ZeRO optimizer buffers to the specified device.

    Arguments:
        include: Optional. The set of states to offload. If not provided, all states are offloaded.
        device: Optional. The device to move the ZeRO optimizer buffers to.
        pin_memory: Optional. Whether to pin the memory of the offloaded states.
        non_blocking: Optional. Whether to offload the states asynchronously.
    """
    ...
def offload_states_back(self, non_blocking: bool = False) -> None:

Here is the typical usage.

# Offload after forward, backward, and step
model.offload_states()
# Do something requiring a lot of device memory
...
# Load states back to device memory
model.offload_states_back()
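As a minimal end-to-end sketch of where these calls sit in a training loop (the tiny model, config, and dummy loss below are placeholders of mine, not part of this PR; a ZeRO stage 3 config is assumed, matching the ZeRO-3 code paths these APIs touch):

```python
import deepspeed
import torch

# Hypothetical minimal config; values here are only for illustration.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

model = torch.nn.Linear(4096, 4096)  # stand-in for a real model
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

for _ in range(10):
    x = torch.randn(1, 4096, dtype=torch.bfloat16, device=engine.device)
    loss = engine(x).float().pow(2).mean()  # dummy loss
    engine.backward(loss)
    engine.step()

    # Offload after forward/backward/step, run something memory-hungry
    # (e.g. generation with a second model), then load the states back.
    engine.offload_states()
    ...  # device memory is available for other work here
    engine.offload_states_back()
```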

You can selectively offload states to balance the offloading overhead and memory saving.

model.offload_states(include=set([OffloadStateTypeEnum.hp_params, OffloadStateTypeEnum.opt_states]), device=OffloadDeviceEnum.cpu)
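A sketch of the same call with asynchronous copies (the import paths are my assumption; the enum names follow the snippet above):

```python
from deepspeed.accelerator import get_accelerator
from deepspeed.runtime.zero.offload_config import OffloadDeviceEnum, OffloadStateTypeEnum

# Offload only the FP32 params and optimizer states; leave the low-precision
# params/grads on the device to keep the reload cheap.
model.offload_states(include={OffloadStateTypeEnum.hp_params,
                              OffloadStateTypeEnum.opt_states},
                     device=OffloadDeviceEnum.cpu,
                     pin_memory=True,
                     non_blocking=True)

# With non_blocking=True the copies may still be in flight; synchronize
# before treating the freed device memory as available.
get_accelerator().synchronize()
```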

Performance (4.3B parameters / 4x A100)

  • Environment (4x A100, benchmark script)
    • Average device-to-host transfer bandwidth: 2.45 GB/s, aggregated: 9.79 GB/s
    • Average host-to-device transfer bandwidth: 11.05 GB/s, aggregated: 44.19 GB/s
  • Memory (allocated by PyTorch)
    • Before offloading: 18.2 GB
    • After offloading: 17.7 MB
  • Time (benchmark script; each cell shows offloading time / loading time)

python output_table.py

| # | pin_memory=0 non_blocking=0 | pin_memory=0 non_blocking=1 | pin_memory=1 non_blocking=0 | pin_memory=1 non_blocking=1 |
|---|---|---|---|---|
| 1 | 4.34 / 3.42 | 4.99 / 2.37 | 6.5 / 2.42 | 6.0 / 2.39 |
| 2 | 9.9 / 3.28 | 5.1 / 2.34 | 6.21 / 2.42 | 6.25 / 2.45 |
| 3 | 9.92 / 3.19 | 6.71 / 2.35 | 6.33 / 2.38 | 5.93 / 2.42 |
| 4 | 9.55 / 2.82 | 7.11 / 2.39 | 6.9 / 2.38 | 6.5 / 2.43 |
| 5 | 4.4 / 3.35 | 6.04 / 2.41 | 6.26 / 2.41 | 6.32 / 2.47 |
| 6 | 4.4 / 3.57 | 6.58 / 2.42 | 6.88 / 2.4 | 6.35 / 2.43 |
| 7 | 9.51 / 3.12 | 6.9 / 2.39 | 6.9 / 2.39 | 6.46 / 2.4 |
| 8 | 4.77 / 3.64 | 6.69 / 2.39 | 7.39 / 2.42 | 6.56 / 2.46 |
| 9 | 9.5 / 3.07 | 7.18 / 2.42 | 6.67 / 2.39 | 7.38 / 2.46 |
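The numbers above come from the linked benchmark script; below is only a minimal sketch of how similar measurements could be taken (the accelerator calls and the `model` handle are assumptions on my part, not the benchmark script itself):

```python
import time

from deepspeed.accelerator import get_accelerator

def timed(fn):
    """Run fn and return elapsed wall-clock seconds, synchronizing around it."""
    get_accelerator().synchronize()
    start = time.time()
    fn()
    get_accelerator().synchronize()
    return time.time() - start

alloc_before = get_accelerator().memory_allocated()
offload_sec = timed(lambda: model.offload_states(pin_memory=True, non_blocking=True))
alloc_after = get_accelerator().memory_allocated()
reload_sec = timed(lambda: model.offload_states_back(non_blocking=True))

print(f"allocated: {alloc_before / 2**30:.1f} GB -> {alloc_after / 2**20:.1f} MB, "
      f"offload {offload_sec:.2f}s / load {reload_sec:.2f}s")
```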

TODO:

  • Enable offloading to NVMe storage -> NVMe support is non-trivial; I suggest adding it in a separate PR
  • [DONE] Discard buffer (and recreate it) instead of offloading. We don't need to restore the contiguous buffer for reduce.
  • [DONE] Check pin_memory improves performance or not

@tohtana
Contributor Author

tohtana commented Sep 4, 2024

@tjruwase Added the document.

@kfertakis

kfertakis commented Sep 12, 2024

Hi @tohtana ,

Thank you for your work. I've been trying the new APIs to test model offloading in a multi-model deployment (e.g., deepspeed-chat) as part of #5620. The API initially works in offloading a model and reducing GPU memory, but after bringing the model back and completing the first training iteration (i.e., after the optimiser states have been updated), I get a RuntimeError: param {} still in flight exception when trying to offload the model again. I wanted to ask whether this comes from a misuse of the API on my end, or whether you could provide some further context. I'm providing the relevant stack trace below. Thank you again.

[rank0]: Traceback (most recent call last):
[rank0]:   File "training_script.py", line 173, in gen_function
[rank0]:     self.model_engine.offload_states()
[rank0]:   File "/home/user/DeepSpeed/deepspeed/runtime/engine.py", line 3710, in offload_states
[rank0]:     self.optimizer.offload_states(include=include, device=device, pin_memory=pin_memory, non_blocking=non_blocking)
[rank0]:   File "/home/user/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2794, in offload_states
[rank0]:     self.empty_partition_cache()
[rank0]:   File "/home/user/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2785, in empty_partition_cache
[rank0]:     self.parameter_offload.empty_partition_cache()
[rank0]:   File "/home/user/DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 181, in empty_partition_cache
[rank0]:     self.partition_all_parameters()
[rank0]:   File "/home/user/DeepSpeed/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/home/user/DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 159, in partition_all_parameters
[rank0]:     self.get_param_coordinator(training=self.module.training).release_and_reset_all(self.module)
[rank0]:   File "/home/user/DeepSpeed/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/home/user/venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/user/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 412, in release_and_reset_all
[rank0]:     raise RuntimeError(f"param {param.ds_summary()} still in flight")
[rank0]: RuntimeError: param {'id': 1, 'status': 'INFLIGHT', 'numel': 4198400, 'ds_numel': 4198400, 'shape': (2050, 2048), 'ds_shape': (2050, 2048), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([4198400])} still in flight

@tohtana
Contributor Author

tohtana commented Sep 12, 2024

Thank you for reporting, @kfertakis!

I have an example script showing the usage of the APIs. Can you try this?
I suspect that ZeRO3 fails to clean up the partitioning status for some models. I would like to clarify whether your issue is model-specific or not.

@kfertakis

So I tested the issue again with various models, and it seems the problem is related to model size: it does not occur for smaller models (i.e., <= 1B params, e.g., gpt2, gpt2-medium) but it does for bigger ones (e.g., OPT-1.3B, Mistral-7B). Is there anything I could do to investigate and debug it further? By the way, I should mention that I'm testing this in a single-node, single-GPU configuration (i.e., a single worker), so ZeRO3 should not have to partition data across other workers. I will also test the benchmark you referenced with an artificially larger model size.

Thanks again.

@tohtana
Contributor Author

tohtana commented Sep 17, 2024

Hi @kfertakis, I tried this example with a 4B model but it worked. Can you try this in your environment?
It would also be great if you could offer us a simple repro.

@tjruwase
Contributor

> …in flight exception when trying to offload the model again. I thus wanted to ask whether you think this has something to do with a misuse of the API from my end or if you could provide some further context. I'm providing the relevant stack trace below: Thank you again.

@tohtana, I wonder if it is useful to expose validate_device() functionality as a deepspeed utility, so that clients can check/confirm the offload status at arbitrary points in their code?

def validate_device(model, device: torch.device, include) -> None:

Similar to how see_memory_usage enables inspection of HBM/DRAM usage at any point, we could provide mechanisms for inspecting offload status. Perhaps we need something like see_offload_status that displays the mapping of params, grads, and optimizer states to {HBM, DRAM, NVMe}.

@kfertakis, I would love to get your thoughts as well on whether any of the above would be useful. Thanks!
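A rough sketch of what such a check could look like (the ds_tensor handling is an assumption of mine; a real utility would likely also need to walk the ZeRO optimizer's internal buffers):

```python
import torch

def validate_device(model: torch.nn.Module, device: torch.device, include=None) -> None:
    """Hypothetical check that offloadable parameter states reside on `device`."""
    # `include` is accepted for signature parity but ignored in this sketch.
    for name, param in model.named_parameters():
        # Under ZeRO-3 the partitioned shard is kept in `param.ds_tensor`.
        tensor = param.ds_tensor if hasattr(param, "ds_tensor") else param
        if tensor.device.type != device.type:
            raise RuntimeError(f"{name} is on {tensor.device}, expected {device}")
```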

@kfertakis

Hey, thanks for the comments.

@tohtana, I've tried the example you provided and it does seem to work, so I'm sharing a fork of the DeepSpeed-Examples repo to showcase the problem. I've modified the DeepSpeed-Chat code to use offload_states. After you prepare an environment with the right deepspeed version for the new API and also install DeepSpeed-Chat, you can run the following:

deepspeed --num_gpus=1 ./applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py --actor_model_name_or_path facebook/opt-1.3b --critic_model_name_or_path facebook/opt-1.3b --actor_zero_stage 3 --critic_zero_stage 3 --num_padding_at_beginning 1 --data_path Dahoas/rm-static --per_device_generation_batch_size 2 --per_device_training_batch_size 2 --generation_batches 1 --ppo_epochs 1 --max_answer_seq_len 512 --max_prompt_seq_len 512 --gradient_accumulation_steps 1 --actor_dropout 0.0 --deepspeed --dtype bf16 --enable_hybrid_engine --offload_test

This should lead to the RuntimeError: param {} still in flight that I mentioned. Any thoughts on this would be much appreciated.

@tjruwase thanks for the reference. Current problem aside, I can see how such helper functions could be useful in the future for ensuring consistency. Thanks.

@tohtana
Contributor Author

tohtana commented Sep 21, 2024

Hi @kfertakis, thank you for sharing the repro. It seems that the actual issue is related to ZeRO3's prefetching.

I opened #6557 as a workaround to address this issue. Can you try the branch tohtana/clean_up_prefetch_param? It also includes the offloading APIs. You can just switch to it.

@kfertakis

Hi @tohtana,

Thank you for your work. I tried your branch and the issue seems to be fixed. I will continue testing and will raise any new issues, but for now the offload_states API seems to be working as expected. Many thanks.

@kfertakis

I also wanted to ask whether the offloading functionality could be extended in the future to support the DeepSpeedCPUAdam optimizer, besides FusedAdam, so that a model whose optimizer is already offloaded to the CPU can also be offloaded. Thank you

@tohtana
Contributor Author

tohtana commented Sep 27, 2024

> I wonder if it is useful to expose validate_device() functionality as a deepspeed utility, so that clients can check/confirm the offload status at arbitrary points in their code?

@tjruwase Let me address this in another PR after this one is merged.

@tohtana tohtana added this pull request to the merge queue Sep 27, 2024
@tohtana
Contributor Author

tohtana commented Sep 27, 2024

Thank you @kfertakis for validating the fix.

> I also wanted to ask whether the offloading functionality could be extended in the future to support the DeepSpeedCPUAdam optimizer, besides FusedAdam, so that a model whose optimizer is already offloaded to the CPU can also be offloaded. Thank you

Let me consider how to do this. Please feel free to open a new issue to track it as I am going to merge this PR first.

Merged via the queue into master with commit 047bcf6 Sep 27, 2024
14 checks passed
@kfertakis

Thank you @tohtana for completing and merging the feature. I've opened two additional requests, #6595 and #6596, to track the relevant extensions we discussed above. Thanks.
