Unable to train llama-7b on a machine with two Tesla T4 GPUs using Ray #3783
Hi @Ragul-Ramdass -- thank you for reporting this issue and the one in #3784 -- please give us a few business days to look into it and get back to you. Thank you.
I'm facing the exact same issue with both strategies (deepspeed and ddp). Below are the conda environment and model.yaml for reference:

requirement.txt:
absl-py==2.0.0
...

model.yaml:
model_type: llm
quantization:
adapter:
prompt:
input_features:
output_features:
trainer:
preprocessing:
backend:
Hello @alexsherstinsky - Kind follow-up on this thread. Is there any workaround to resolve this issue?
@SanjoySahaTigerAnalytics Yes, there was! We discussed this as a team, and I received direction on how to troubleshoot it in our own environment (containing the required number of GPUs). I am planning to do this starting tomorrow and into next week. I will provide my findings for you here in the comments. Thank you very much for your patience.
Hello @alexsherstinsky - Thank you very much for prioritizing it. Will wait for your response.
Hello @alexsherstinsky - Kind follow-up on this thread. Please let us know in case there is any luck.
@SanjoySahaTigerAnalytics -- sorry for the delay; this has been escalated to the team. Someone will investigate and respond soon. Thank you again for your patience.
Hi @SanjoySahaTigerAnalytics! Apologies for the late response from our end. The reason you're running into issues is that 4-bit quantization isn't supported with DeepSpeed stage 3, which is what Ludwig defaults to when the Ray backend is used.

To solve this issue, there are three options in total, each of which has its own tradeoffs - the right solution will depend on your goal:

1. Set backend to local instead of Ray

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
quantization:
  bits: 4
adapter:
  type: lora
prompt:
  template: |
    ### Instruction:
    {Instruction}
    ### Context:
    {Context}
    ### Input:
    {Input}
    ### Response:
input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048
output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048
trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01
backend:
  type: local

This will perform naive model parallel training: your 4-bit Llama-2 model will be sharded across both of your GPUs, but no data parallel training takes place. Training will likely be slower than training on just one of your two T4 GPUs, because there is overhead in passing intermediate states between GPU 1 and GPU 2 on every forward and backward pass. However, it will not run into any issues, and it is the path I recommend for now; see the launch sketch below.
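To make this concrete, here is a minimal sketch of launching option 1 through Ludwig's Python API; the dataset path is a placeholder, and running ludwig train from the CLI works equally well:

# Sketch: with backend.type set to local, Ludwig shards the 4-bit model
# across all visible GPUs (naive model parallelism, no data parallelism).
from ludwig.api import LudwigModel

model = LudwigModel(config="model.yaml")  # the option-1 config above
model.train(dataset="train.csv")          # placeholder dataset path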
2. Use DeepSpeed Stage 3 without quantization

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
adapter:
  type: lora
prompt:
  template: |
    ### Instruction:
    {Instruction}
    ### Context:
    {Context}
    ### Input:
    {Input}
    ### Response:
input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048
output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048
trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 3
        offload_optimizer:
          device: cpu
          pin_memory: true
      bf16:
        enabled: true

This will perform data parallel + model parallel training across both of your GPUs. Under the surface, it shards your model across both GPU devices and also shards the data across the total number of workers. During each forward pass, a few all-gather and all-reduce operations propagate model states to each of the GPUs, and similar collectives compute gradients and update the weights during the backward pass. This can also be a bit slow, but it works nicely for larger models.

The drawback, as I said earlier, is that DeepSpeed Stage 3 unfortunately doesn't work with quantized models such as 4-bit models. Stage 3 shards the model weights, but it is opinionated about all layers sharing the same data type, and it particularly dislikes nf4/int8 formats mixed with fp16 LoRA layers (the snippet below illustrates this). For that reason, you'll notice that I removed the quantization section from this config.
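If you're curious about that dtype mix, here is a quick sketch for inspecting it; this assumes the transformers and bitsandbytes packages are installed, and it reuses the model path from the configs above:

from collections import Counter

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit: the linear weights are stored as packed
# uint8 (nf4), while norms and any LoRA layers stay in fp16/fp32.
model = AutoModelForCausalLM.from_pretrained(
    "/root/CodeLlama-7b-Python-hf",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
print(Counter(p.dtype for p in model.parameters()))
# Prints a mix such as {torch.uint8: ..., torch.float16: ...} -- this
# non-uniformity is what DeepSpeed Stage 3's weight sharding rejects.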
3. Use 4-bit quantization with DeepSpeed Stage 2

model_type: llm
base_model: /root/CodeLlama-7b-Python-hf
quantization:
  bits: 4
adapter:
  type: lora
prompt:
  template: |
    ### Instruction:
    {Instruction}
    ### Context:
    {Context}
    ### Input:
    {Input}
    ### Response:
input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048
output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048
trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 2

DeepSpeed Stage 2 doesn't do any sharding of model weights - just the gradients and optimizer state. Since 4-bit quantized Llama-2-7b fits on a single T4 GPU, this essentially performs Distributed Data Parallel (DDP) style training, with the side benefit of sharding the gradients and optimizer states across GPUs (and the option to offload them to CPU if needed). This would be the ideal solution for training Llama-7b on a machine with 2 T4 GPUs, but it is not currently supported by Ludwig. We have an active PR (#3728) that we're hoping to merge into Ludwig master by the end of this week, or early next week at the latest. Stay tuned!

Parting thoughts

For now, I would recommend going with approach 1 and setting CUDA_VISIBLE_DEVICES to either just a single GPU or both GPUs, depending on what you'd like - I expect that a single GPU will actually train faster in this case, but it is worth checking; see the sketch below. The last thing I want to mention is that in your config, you have max_sequence_length set to 2048 for both the input and output features, so a single example can use up to 4096 tokens in total; on a single GPU, expect to fit only around 2048 tokens. Hope this helps unblock you!
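As a concrete sketch of that recommendation (the dataset path is again a placeholder), pin the visible devices before anything initializes CUDA:

import os

# Expose only GPU 0 to PyTorch/Ludwig; use "0,1" to shard across both T4s.
# This must run before torch initializes CUDA, hence before the imports below.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from ludwig.api import LudwigModel

model = LudwigModel(config="model.yaml")  # the option-1 (local backend) config
model.train(dataset="train.csv")          # placeholder dataset path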
Hello @arnavgarg1 - Thank you very much for looking into this. With Option 1, it failed with a CUDA OOM error. I have changed the config as below (instead of using 2048 for the input and output features each, using a combined total of 4096 across both) and set PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:128:
With Option 2, it failed with the error: RecursionError: maximum recursion depth exceeded in comparison.
Thanks for reporting the results back! I may know the cause of both, but just want to check - may I ask which versions of Torch and Ray you're using?
Thanks @arnavgarg1 for the quick response. Below is my conda environment:
Hello @arnavgarg1 - Kind follow-up on this. In the meantime, when I executed with the config below, the process completed successfully on both infrastructure configurations.
Is there any way to make training succeed with max_sequence_length set to 4096 (merging both input and output)? As you mentioned, a single GPU will support only up to 2048. But is a 4096 context length achievable via multi-GPU? Config:
Does Ray still not work with quantization? Any ideas?
Hi,
I'm trying to run distributed training of llama-7b on a VM with two Tesla T4 GPUs, using Ray with DeepSpeed as the strategy. I'm facing the following error: "Could not pickle object as excessively deep recursion required."
My current OS is Ubuntu 20.04
Python version: 3.10.13
model.yaml:
Environment:
Can you guide me in solving this?
Thanks in advance!!