[QUESTION] Backend nccl does not support reduce_scatter_tensor_coalesced, how can I solve it? #1369

TeddLi opened this issue Jan 30, 2025 · 0 comments

**Your question**
Pretraining with `pretrain_gpt.py` crashes during gradient synchronization with the traceback below (preceded by an unrelated `fp8` deprecation warning). How can I solve this?
```
/workspace/megatron/megatron/core/models/gpt/gpt_layer_specs.py:77: UserWarning: The fp8 argument in "get_gpt_layer_with_transformer_engine_spec" has been deprecated and will be removed soon. Please update your code accordingly.
  warnings.warn(
[rank7]: Traceback (most recent call last):
[rank7]:   File "/workspace/megatron/pretrain_gpt.py", line 300, in <module>
[rank7]:     pretrain(
[rank7]:   File "/workspace/megatron/megatron/training/training.py", line 386, in pretrain
[rank7]:     iteration, num_floating_point_operations_so_far = train(
[rank7]:   File "/workspace/megatron/megatron/training/training.py", line 1478, in train
[rank7]:     train_step(forward_step_func,
[rank7]:   File "/workspace/megatron/megatron/training/training.py", line 766, in train_step
[rank7]:     losses_reduced = forward_backward_func(
[rank7]:   File "/workspace/megatron/megatron/core/pipeline_parallel/schedules.py", line 1877, in forward_backward_pipelining_without_interleaving
[rank7]:     config.finalize_model_grads_func(
[rank7]:   File "/workspace/megatron/megatron/core/distributed/finalize_model_grads.py", line 225, in finalize_model_grads
[rank7]:     model_chunk.finish_grad_sync()
[rank7]:   File "/workspace/megatron/megatron/core/distributed/distributed_data_parallel.py", line 447, in finish_grad_sync
[rank7]:     bucket_group.finish_grad_sync()
[rank7]:   File "/workspace/megatron/megatron/core/distributed/param_and_grad_buffer.py", line 368, in finish_grad_sync
[rank7]:     self.start_grad_sync()
[rank7]:   File "/workspace/megatron/megatron/core/distributed/param_and_grad_buffer.py", line 306, in start_grad_sync
[rank7]:     with stream_context, _coalescing_manager(communication_group, async_ops=async_op) as cm:
[rank7]:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
[rank7]:     next(self.gen)
[rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2031, in _coalescing_manager
[rank7]:     work = group.reduce_scatter_tensor_coalesced(outputs, inputs, reduce_opts)
[rank7]: RuntimeError: Backend nccl does not support reduce_scatter_tensor_coalesced
[rank4]: Traceback (most recent call last):
```
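
For context, here is a minimal sketch of the failing call path, reconstructed from the traceback rather than taken from Megatron-LM: per-bucket `dist.reduce_scatter_tensor` calls issued inside `torch.distributed`'s private `_coalescing_manager`, which the NCCL process group dispatches as a single `reduce_scatter_tensor_coalesced` (traceback line in `distributed_c10d.py` above). The bucket sizes and script layout are invented for illustration; run it with `torchrun --nproc_per_node=<num_gpus>`. The commented alternative shows what a non-coalesced, per-bucket fallback might look like; it is an assumption, not the project's fix.

```python
# Minimal repro sketch (reconstructed, not Megatron-LM code).
# Launch: torchrun --nproc_per_node=<num_gpus> repro.py
import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import _coalescing_manager


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    device = torch.device("cuda")

    # Two "gradient buckets"; sizes are arbitrary but the input length
    # must be divisible by the world size for reduce_scatter_tensor.
    inputs = [torch.ones(world * 1024, device=device) for _ in range(2)]
    outputs = [torch.empty(1024, device=device) for _ in range(2)]

    # Coalesced path, as in param_and_grad_buffer.py's start_grad_sync():
    # on a backend without reduce_scatter_tensor_coalesced, leaving this
    # context raises the RuntimeError shown in the traceback above.
    with _coalescing_manager(group=dist.group.WORLD, async_ops=True) as cm:
        for out, inp in zip(outputs, inputs):
            dist.reduce_scatter_tensor(out, inp)
    cm.wait()

    # Hypothetical non-coalesced fallback (an assumption, not the
    # project's fix): one async reduce_scatter_tensor per bucket.
    # works = [dist.reduce_scatter_tensor(o, i, async_op=True)
    #          for o, i in zip(outputs, inputs)]
    # for w in works:
    #     w.wait()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```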