After I run pruning.sh, the command line prints "Starting training..." and the config dump, but no further output appeared even after running overnight.
I'm running the program on two 4090 GPUs; each shows 16.6 GB of memory in use but draws only 25 W, as if the program were not running at all.
This makes me think it is not simply running slowly, but that something else is going on.
Could the warning "The memory monitor only works on CUDA devices, but the model is on cpu." be causing the problem?
Can you help me figure out the cause?
Below is my output:
/opt/conda/lib/python3.10/site-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
warnings.warn(
/opt/conda/lib/python3.10/site-packages/composer/callbacks/memory_monitor.py:94: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
Logging config...
data_local: LLM-Shearing/llmshearing/data/for_prune
data_remote: null
tokenizer_name: LLM-Shearing/llmshearing/meta-llama/Llama-2-7b-hf
max_seq_len: 512
global_seed: 17
run_name: llama2_7b_pruning_scaling_doremi_to2.7b_sl512
model:
  name: mosaic_llama2_7b
  path: LLM-Shearing/llmshearing/meta-llama/Llama-2-7b-hf/mosaic-7B/state_dict.pt
  init_device: cpu
  tokenizer_name: ${tokenizer_name}
  d_model: 4096
  n_heads: 32
  n_layers: 32
  intermediate_size: 11008
  max_seq_len: ${max_seq_len}
  vocab_size: 32000
  init_std: 0.02
  attn_pdrop: 0.0
  resid_pdrop: 0.0
  emb_pdrop: 0.0
  attn_impl: flash
  rms_norm_eps: 1.0e-05
  l0_module:
    start_sparsity: 0.0
    target_sparsity: 0.5
    pruning_modules:
    - head
    - intermediate
    - layer
    - hidden
    lagrangian_warmup_steps: 640ba
    target_model:
      d_model: 2560
      n_layers: 32
      n_heads: 20
      intermediate_size: 6912
      vocab_size: 32000
    eval_target_model: false
  set_names:
  - cc
  - github
  - book
  - stackexchange
  - wiki
  - arxiv
  - c4-rp
tokenizer:
  type: hftokenizer
  args:
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: true
  num_workers: 0
  prefetch_factor: null
  persistent_workers: false
eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: eval_merge
    shuffle: false
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: false
  num_workers: 8
scheduler:
  t_warmup: 320ba
  alpha_f: 0.1
optimizer:
  lr: 0.0001
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 3200ba
eval_interval: 50ba
eval_subset_num_batches: 1000
global_train_batch_size: 2
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: DEFAULT
  activation_checkpointing: true
  activation_cpu_offload: false
  verbose: false
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  memory_monitor: {}
  lr_monitor: {}
loggers:
  wandb:
    project: pruning
    name: ${run_name}
    entity: pruning
    init_kwargs:
      mode: offline
      dir: Shear-output/test_release_pruning_full/llama2_7b_pruning_scaling_doremi_to2.7b_sl512
      project: pruning
      name: llama2_7b_pruning_scaling_doremi_to2.7b_sl512
      entity: pruning
save_interval: 3200ba
save_folder: Shear-output/test_release_pruning_full/llama2_7b_pruning_scaling_doremi_to2.7b_sl512
eval_first: false
autoresume: false
dist_timeout: 1800.0
n_gpus: 2
device_train_batch_size: 1
device_train_grad_accum: 1
n_params: 6738773034
Starting training...
******************************
Config:
enabled_algorithms/GradientClipping: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 2
num_nodes: 1
rank_zero_seed: 17
******************************
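A note on the two warnings at the top of the output: with init_device: cpu the model is still on the CPU when Composer's MemoryMonitor and SpeedMonitor are set up, which is why the memory monitor complains and why the speed monitor cannot resolve a GPU name ("gpu_flop count not found for None"). Both warnings are benign and would not, by themselves, explain a hang. As the MFU warning itself suggests, it can be silenced by passing gpu_flops_available explicitly. The sketch below assumes a Composer version whose SpeedMonitor accepts that argument; the 165e12 value is only an approximate RTX 4090 dense bf16 peak, not a number taken from this repository.

# Sketch only, not code from LLM-Shearing: silence the MFU warning by giving
# SpeedMonitor an explicit peak-throughput figure, as the warning suggests.
from composer.callbacks import SpeedMonitor

# ~165 TFLOPS is an approximate dense bf16 peak for an RTX 4090; treat it as a
# placeholder and substitute the figure for your own hardware.
speed_monitor = SpeedMonitor(window_size=10, gpu_flops_available=165e12)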
Seems that you are having a hanging issue. Could you refer to this issue and see if the solution helps?
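The linked issue is not reproduced here, so the exact fix is not shown in this thread. Independently of it, a quick way to tell whether the hang is in inter-GPU communication rather than in LLM-Shearing itself is a bare torch.distributed smoke test. Multi-GPU RTX 4090 machines are frequently reported to stall inside NCCL peer-to-peer transfers, and launching with NCCL_P2P_DISABLE=1 is a commonly cited workaround; that is a general observation, not necessarily the solution in the referenced issue. A minimal sketch, assuming PyTorch with NCCL support and a hypothetical file name nccl_check.py:

# Minimal NCCL sanity check; not part of LLM-Shearing.
# Run with:  torchrun --nproc_per_node=2 nccl_check.py
# If this also stalls, the hang is at the NCCL/driver level rather than in pruning.sh.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank contributes a tensor of ones; after all_reduce each element
    # should equal the world size on every rank.
    x = torch.ones(4, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this test completes but pruning.sh still hangs, the problem is more likely in the training setup itself (data loading, FSDP configuration) than in the GPU interconnect.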
Thanks, that solved my problem.
One more question: what is the minimum amount of GPU memory needed to run this pruning algorithm on Llama2? A 4090 doesn't seem to have enough memory, and using multiple GPUs doesn't solve the problem.
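For what it's worth, here is a rough back-of-the-envelope estimate rather than an official requirement. During pruning the full 7B model (n_params: 6738773034 in the log above) is being trained, so besides the bf16 weights and gradients there is fp32 optimizer and master-weight state to shard across GPUs. The arithmetic below is a sketch under the assumptions spelled out in its comments, with activation memory ignored entirely:

# Back-of-the-envelope memory estimate under stated assumptions: bf16 sharded
# weights and gradients, fp32 master weights, fp32 Adam moments, perfectly even
# FULL_SHARD partitioning, activations ignored. A sketch, not an official
# figure from the LLM-Shearing authors.
n_params = 6_738_773_034  # n_params reported in the config dump above

bytes_per_param = (
    2      # bf16 weights
    + 2    # bf16 gradients
    + 4    # fp32 master weights (mixed-precision assumption)
    + 8    # fp32 Adam first and second moments
)

for n_gpus in (2, 4, 8):
    per_gpu_gib = n_params * bytes_per_param / n_gpus / 1024**3
    print(f"{n_gpus} GPUs: ~{per_gpu_gib:.0f} GiB of sharded state per GPU, before activations")

On this arithmetic, two cards would each need to hold roughly 50 GiB of sharded state, which already exceeds a 4090's 24 GB before any activation memory is counted, so either substantially more GPUs or some form of CPU offloading would be needed.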