Add support for data parallel QLoRA training via DeepSpeed Zero stages 0, 1 and 2. #3728
Conversation
if self.model.trained_using_adapter:
    adapter_ref = ray.put(dist_strategy.extract_adapter_weights_for_serialization(self.model))

optimization_stage = _get_optimization_stage_from_trainer_config(self.trainer_kwargs)
hmm I wonder if this could be moved into the dist_strategy base class
Overall looks good, just a few nits. One question that didn't quite fit anywhere: are there any additional validation checks we should add for DS config? Bounding DS stages, DS/qlora, etc.
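One concrete shape that check could take is sketched below; the function name, the dict-style config access, and the key paths are hypothetical, not the existing Ludwig validation API:

```python
VALID_ZERO_STAGES = {0, 1, 2, 3}


def check_deepspeed_config(backend_config: dict, llm_config: dict) -> None:
    """Hypothetical auxiliary validation for a DeepSpeed strategy config."""
    strategy = backend_config.get("trainer", {}).get("strategy", {})
    if strategy.get("type") != "deepspeed":
        return

    # Bound the ZeRO stage to the values DeepSpeed actually supports.
    stage = strategy.get("zero_optimization", {}).get("stage", 3)
    if stage not in VALID_ZERO_STAGES:
        raise ValueError(f"DeepSpeed ZeRO stage must be one of {sorted(VALID_ZERO_STAGES)}, got {stage}.")

    # QLoRA is only wired up for stages 0-2 in this PR, so flag the combination early.
    if llm_config.get("quantization") is not None and stage == 3:
        raise ValueError("Quantization-based (QLoRA) fine-tuning is not supported with ZeRO stage 3.")
```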
stages, we load the base model back. For LLMs, this recreates either the base model or the PEFT model, depending
on whether a PEFT adapter was specified.
"""
if self.zero_optimization_stage != 3:
nit: This condition shows up in a few places, and it's run as either `stage <= 2` or `stage != 3`. Do we want to pick one as the canonical form? Would it make sense to move the condition itself into a `DeepSpeedStrategy` property?
@jeffkinnison I think this is a good call out. One more option is to refactor so that `DeepSpeedStrategy` is a base class and `DeepSpeedStage3` is its own subclass that overrides some of the methods, or something to that effect. Will take a look!
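A minimal sketch of both options; the property name `uses_stage_3`, the `DeepSpeedStage3` subclass body, and the value of `DEFAULT_ZERO_OPTIMIZATION` are illustrative assumptions, not code from this PR:

```python
DEFAULT_ZERO_OPTIMIZATION = {"stage": 3}  # assumed default; the real constant lives in the DeepSpeed strategy module


class DeepSpeedStrategy:  # base classes omitted in this sketch
    def __init__(self, zero_optimization=None, **kwargs):
        self.zero_optimization = zero_optimization or DEFAULT_ZERO_OPTIMIZATION
        self.zero_optimization_stage = self.zero_optimization.get("stage", 3)

    @property
    def uses_stage_3(self) -> bool:
        # Single canonical form of the scattered "stage <= 2" / "stage != 3" checks.
        return self.zero_optimization_stage == 3


class DeepSpeedStage3(DeepSpeedStrategy):
    # Alternative refactor: stage-3-specific behavior (e.g. gathering partitioned
    # parameters before serialization) lives in overrides here instead of branches.
    ...
```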
""" | ||
backend_type = _get_backend_type_from_config(config_obj) | ||
deepspeed_optimization_strategy = _get_deepspeed_optimization_stage_from_config(config_obj) | ||
if backend_type == "ray" and deepspeed_optimization_strategy is not None and deepspeed_optimization_strategy <= 2: | ||
# If using deepspeed stage 0, 1 or 2, we only load the model into memory once we're actually inside | ||
# of the training workers. | ||
return False | ||
# If using local backend or deepspeed stage 3, we load the model into memory upon class initialization. | ||
return True |
Going to push this into the backend class, and maybe consider passing the initialized `self.backend` object in the `LudwigModel` class into the `create_model` function so it gets propagated here; it shouldn't live here like this.
@@ -57,6 +59,7 @@ def __init__(

        super().__init__(**kwargs)
        self.zero_optimization = zero_optimization or DEFAULT_ZERO_OPTIMIZATION
        self.zero_optimization_stage = self.zero_optimization.get("stage", 3)
Why default to 3?
It's because in the worst case, it makes sense to assume that the model does not fit into a single GPU's memory, so we need to do model parallel + data parallel training. It also assumes no quantization-based fine-tuning, which I feel is a fair assumption to make. All of this is to say that we want a config as simple as this to "just work":
model_type: llm
base_model: ...
input_features: ...
output_features: ...
trainer:
  type: finetune
backend:
  type: ray
  trainer:
    strategy:
      type: deepspeed
to have the highest chance of succeeding irrespective of LLM model size. This is the worst-case scenario: full fine-tuning without an adapter or quantization. Both of those only reduce the model size, so this is the most useful configuration to default to when neither is set in the config for LLM fine-tuning.
Let me know if this makes sense. I'll also add a comment in the DeepSpeed class explaining why we default to stage 3.
Not a full review yet, but something I see coming up repeatedly is references to the optimization stage outside of the DeepSpeed strategy. This is a red flag that we should be moving this code into the DistributedStrategy interface rather than coupling these two different abstractions (backend and strategy) together.
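One shape that decoupling could take, as a rough sketch (the `lazy_model_initialization` hook and these simplified classes are hypothetical, not this PR's code):

```python
class DistributedStrategy:
    def lazy_model_initialization(self) -> bool:
        # Default for local/DDP training: materialize the model on the driver up front.
        return False


class DeepSpeedStrategy(DistributedStrategy):
    def __init__(self, zero_optimization_stage: int = 3):
        self.zero_optimization_stage = zero_optimization_stage

    def lazy_model_initialization(self) -> bool:
        # Stages 0-2 only need the weights inside the training workers, so the
        # backend can defer loading without ever inspecting the stage itself.
        return self.zero_optimization_stage <= 2
```

The backend then just asks `strategy.lazy_model_initialization()` and never branches on DeepSpeed internals.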
-model_ref = ray.put(dist_strategy.extract_model_for_serialization(self.model))
+optimization_stage = _get_optimization_stage_from_trainer_config(self.trainer_kwargs)
+model_ref = ray.put(
+    dist_strategy.extract_model_for_serialization(self.model, optimization_stage=optimization_stage)
Shouldn't the `dist_strategy` already know the optimization stage? This is coupling the DistributedStrategy interface with the DeepSpeed optimization stage, which is not desirable. It would be better to keep this internal to the strategy itself.
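Roughly like the sketch below, where the stage-dependent branch stays inside the strategy and the original single-argument signature is preserved (the branch bodies are illustrative placeholders):

```python
class DeepSpeedStrategy:
    def __init__(self, zero_optimization_stage: int = 3):
        self.zero_optimization_stage = zero_optimization_stage

    def extract_model_for_serialization(self, model):
        # The strategy reads its own ZeRO stage; callers never pass it in.
        if self.zero_optimization_stage <= 2:
            # Each worker holds a full replica, so the CPU copy of the model can be shipped as-is.
            return model.cpu()
        # Stage 3 partitions parameters across workers; stage-3-specific handling would go here.
        return model
```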
dist_model = distributed.prepare_for_inference(model)
if adapter_ref and (distributed_optimization_stage and distributed_optimization_stage <= 2):
This can be pushed into the DistributedStrategy, again, so we don't need to couple everything to DeepSpeed.
This PR adds support for data parallel QLoRA training using DeepSpeed Stages 0, 1, and 2.
As a refresher, here is what each DeepSpeed ZeRO stage corresponds to:
- Stage 0: no partitioning; plain data-parallel training (equivalent to DDP).
- Stage 1: optimizer states are partitioned across workers.
- Stage 2: optimizer states and gradients are partitioned across workers.
- Stage 3: optimizer states, gradients, and model parameters are all partitioned across workers.
With this PR, you can now perform QLoRA-based training on larger-than-memory datasets. You can find a full example in `ludwig/examples/llm_qlora_data_parallel`. For example, you can now use a config like the following to train across:
- n single-GPU workers on a single multi-GPU node
- n workers

In particular, since this uses DeepSpeed Stage 2 with Ray, it lets you stream datasets into memory at training time per-worker.
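For reference, a rough sketch of the kind of config this unlocks; the `base_model`, feature names, and the `adapter`, `quantization`, and `zero_optimization` sections below are illustrative assumptions rather than the exact config attached to the PR:

```yaml
model_type: llm
base_model: meta-llama/Llama-2-7b-hf  # assumed example base model
adapter:
  type: lora
quantization:
  bits: 4
input_features:
  - name: prompt
    type: text
output_features:
  - name: response
    type: text
trainer:
  type: finetune
backend:
  type: ray
  trainer:
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 2
```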
This is a snapshot of GPU utilization on a single-node, 4-GPU pod using 4x A5000s: