
KINETO_USE_DAEMON=1 torchrun Cannot initialize CUDA without ATen_cuda library. #271

Open
hi20240217 opened this issue Jun 23, 2024 · 2 comments


@hi20240217

cat << EOT > linear_model_example.py

import math
import os

import torch
import torch.profiler
import torch.distributed as dist

# Set up the default process group (NCCL backend) and bind this rank to its GPU.
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
rank = dist.get_rank()
torch.cuda.set_device(local_rank)

if not dist.is_available() or not dist.is_initialized():
    print("dist init error")

# Fit y = sin(x) with a third-order polynomial, entirely on the GPU.
dtype = torch.float
device = torch.device("cuda:0")
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)
p = torch.tensor([1, 2, 3], device=device)
xx = x.unsqueeze(-1).pow(p)  # shape (2000, 3): columns are x, x^2, x^3
print(xx.device)
model = torch.nn.Sequential(torch.nn.Linear(3, 1).to(device), torch.nn.Flatten(0, 1))
loss_fn = torch.nn.MSELoss(reduction="sum")

learning_rate = 1e-6
for t in range(200000):
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
    if t % 10000 == 99:
        print(t, loss.item())
    model.zero_grad()
    loss.backward()
    # Manual SGD update.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

linear_layer = model[0]
print(
    f"Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3"
)
EOT
export KINETO_USE_DAEMON=1
torchrun -m --nnodes=1 --nproc_per_node=1 linear_model_example

error:

[rank0]: RuntimeError: Cannot initialize CUDA without ATen_cuda library. PyTorch splits its backend into two shared libraries: a CPU library and a CUDA library; this error has occurred because you are trying to use some CUDA functionality, but the CUDA library has not been loaded by the dynamic linker for some reason.  The CUDA library MUST be loaded, EVEN IF you don't directly use any symbols from the CUDA library! One common culprit is a lack of -Wl,--no-as-needed in your link arguments; many dynamic linkers will delete dynamic library dependencies if you don't depend on any of their symbols.  You can check if this has occurred by using ldd on your binary to see if there is a dependency on *_cuda.so library.
[rank0]:[W623 08:38:04.381995712 ProcessGroupNCCL.cpp:1161] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
@briancoutinho
Contributor

Yes, we've seen this issue; it's a static initialization order problem. You can work around it by delaying the profiler initialization.
Can you try adding the env var KINETO_DAEMON_INIT_DELAY_S=3?
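For reference, a minimal sketch of applying that workaround to the reproduction above (the 3-second delay is just the value suggested here; it may need tuning if the profiler still initializes too early):

export KINETO_USE_DAEMON=1
export KINETO_DAEMON_INIT_DELAY_S=3
torchrun -m --nnodes=1 --nproc_per_node=1 linear_model_example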

@briancoutinho
Contributor

Checking back to see if we can close this?
