Hi there, thanks for the nice repo! I am trying to run the example with a 0.5B model on a single H100 (80 GB), but I get an out-of-memory error while the actor_rollout worker is initializing, before the training loop even starts. Please see the traceback below. I later tried 8x H100 and hit the same OOM at the same stage. Has anyone encountered a similar situation?
(WorkerDict pid=3566775) before init cache memory allocated: 5.96926464GB, reserved: 6.079643648GB
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffff9697576811089192d478044801000000
Worker ID: fa30bc39b193709af627f4217b9fda48f637379f135aadd5c4e8d12e
Node ID: 077b4d57b1d8eb65baa427ab8b008991673f442e6639386460c71c2e
Worker IP address: 10.112.9.31
Worker port: 38471
Worker PID: 3566775
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Error executing job with overrides: ["data.train_files=['/users/miaolu/data/gsm8k/train.parquet']", "data.val_files=['/users/miaolu/data/gsm8k/test.parquet']", 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=512', 'data.max_response_length=256', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size=1', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1', 'actor_rollout_ref.rollout.tensor_model_parallel_size=1', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4', 'critic.optim.lr=1e-5', 'critic.model.path=Qwen/Qwen2.5-0.5B-Instruct', 'critic.ppo_micro_batch_size=1', 'algorithm.kl_ctrl.kl_coef=0.001', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=1', 'trainer.nnodes=1', 'trainer.save_freq=10', 'trainer.test_freq=10', 'trainer.total_epochs=15', 'trainer.logger=[console]']

Traceback (most recent call last):
  File "/projects/m000069/miaolu/git/self-correction-verl/verl/trainer/main_ppo.py", line 101, in main
    run_ppo(config)
  File "/projects/m000069/miaolu/git/self-correction-verl/verl/trainer/main_ppo.py", line 109, in run_ppo
    ray.get(main_task.remote(config, compute_score))
  File "/users/miaolu/.conda/envs/self/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/users/miaolu/.conda/envs/self/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/users/miaolu/.conda/envs/self/lib/python3.11/site-packages/ray/_private/worker.py", line 2667, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/users/miaolu/.conda/envs/self/lib/python3.11/site-packages/ray/_private/worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayActorError): ray::main_task() (pid=3566187, ip=10.112.9.31)
  File "/projects/m000069/miaolu/git/self-correction-verl/verl/trainer/main_ppo.py", line 194, in main_task
    trainer.init_workers()
  File "/projects/m000069/miaolu/git/self-correction-verl/verl/trainer/ppo/ray_trainer.py", line 521, in init_workers
    self.actor_rollout_wg.init_model()
  File "/projects/m000069/miaolu/git/self-correction-verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: create_colocated_worker_cls.<locals>.WorkerDict
    actor_id: 9697576811089192d478044801000000
    pid: 3566775
    name: DWe8AhWorkerDict_0:0
    namespace: e18e86fd-8e8c-4939-829d-c679651b90b8
    ip: 10.112.9.31
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
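For reference, these are the memory-related overrides from the run above, pulled out of the override list. My (possibly wrong) understanding is that gpu_memory_utilization controls how much GPU memory the vLLM rollout engine pre-allocates at init time, while the micro-batch knobs only matter once training / log-prob computation starts:

```bash
# Memory-related overrides from the failing run above (same values as in the log).
# Note (my assumption, please correct me): gpu_memory_utilization is the fraction
# of GPU memory vLLM reserves when the rollout engine is initialized; the
# *_micro_batch_size settings only take effect later in the training loop.
actor_rollout_ref.rollout.gpu_memory_utilization=0.4
actor_rollout_ref.actor.ppo_micro_batch_size=1
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4
critic.ppo_micro_batch_size=1
trainer.n_gpus_per_node=1
```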
Same here, haven't found the root cause yet. It usually happens when micro_batch_size is set to a small number. The ref_policy_wg.compute_ref_log_prob step looks a bit suspicious, since the job usually gets stuck there. @eric-haibin-lin Any insights?
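Since the Ray error lists the host OOM killer as possible root cause (1), one thing worth checking after a crash is whether the worker was killed for host RAM rather than GPU memory. A sketch of how I'd check on the node (standard Linux commands, nothing verl-specific; dmesg may need root on some clusters):

```bash
# Look for OOM-killer activity around the time the Ray worker (PID 3566775 above) died.
dmesg -T | grep -i -E 'out of memory|oom-kill' | tail -n 20

# On systemd-based nodes, the kernel log can also be read via journalctl:
journalctl -k --since "1 hour ago" | grep -i oom
```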
@liuzuxin Hi, are you also seeing the same error message here?
In my case I partially solved the problem. I have to use Slurm to request GPUs from a cluster, and if I add the --exclusive flag during allocation (so the whole node is exclusively mine), the example runs without problems. It is still a bit strange, though: when I request GPUs without --exclusive (even 8x H100, i.e. the whole node) and no other jobs are running on it, it still fails. A rough sketch of the two allocations is below.
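Illustrative only; the partition/GRES names are placeholders for my cluster, and the explanation in the comments is just my guess at why --exclusive helps:

```bash
# Worked: exclusive allocation, so the job owns all CPU cores and host RAM on the node.
salloc --nodes=1 --gres=gpu:h100:8 --exclusive

# Failed with the OOM at init: same GPUs, but without --exclusive Slurm only grants
# the default CPU/host-memory share, which (my guess) may be too small for
# Ray + vLLM initialization even though the GPUs themselves are idle.
salloc --nodes=1 --gres=gpu:h100:8
```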