Hi, I ran into the same problem and eventually solved it.
I think you need to check whether your CUDA version matches the one this project expects.
This project uses torch 1.12.1, which means your CUDA version must be one of 10.2, 11.3, or 11.6.
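A quick way to check for this kind of mismatch is sketched below. It is only a rough sketch under some assumptions: the `SUPPORTED_CUDA` table and the helper names (`parse_nvcc_release`, `cuda_matches`, `describe_torch`) are made up for illustration and are not part of FastFold or PyTorch, and the table lists only the torch release mentioned above. The torch attributes it reads (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`) are standard.

```python
import re

# Assumption: only the torch release named in this thread is listed here;
# extend the table yourself for other releases.
SUPPORTED_CUDA = {"1.12.1": ("10.2", "11.3", "11.6")}


def parse_nvcc_release(nvcc_output):
    """Return the 'X.Y' toolkit release from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_matches(torch_version, toolkit_version):
    """True if the toolkit release is listed for the given torch release."""
    base = torch_version.split("+")[0]  # drop local tags such as '+cu113'
    return toolkit_version in SUPPORTED_CUDA.get(base, ())


def describe_torch():
    """Report what the installed torch build was compiled against, if present."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    return "torch {} built for CUDA {}, GPU visible: {}".format(
        torch.__version__, torch.version.cuda, torch.cuda.is_available()
    )


if __name__ == "__main__":
    print(describe_torch())
    # Line copied from the `nvcc --version` output posted in this issue.
    release = parse_nvcc_release("Cuda compilation tools, release 11.8, V11.8.89")
    print("nvcc release:", release)
    print("listed for torch 1.12.1:", cuda_matches("1.12.1", release))
```

For the `nvcc` output posted in this issue, the parsed release is 11.8, which is not in the list for torch 1.12.1, so the check comes back false.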
Greetings!
Following the instructions, I've completed an installation and everything seemed to work including the generation of the MSA.
Specifically, I've done the recommended conda installation, the pip installation of triton, and the local download/unpack of the datasets.
Per my reading, the remainder of the instructions (e.g. Docker) seemed optional, so I jumped directly to trying inference.sh.
However, I'm hitting a repeatable Runtime CUDA error.
Since the same error occurs when I try the benchmark run, I'll paste the output for that at the bottom.
Keeping an eye on the VRAM, this does not seem to be a case of running out of memory on the GPU (an RTX 3090).
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Any advice!?
Best wishes,
-Chris
(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ torchrun --nproc_per_node=1 perf.py --msa-length 128 --res-length 256
[08/25/23 10:33:06] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[08/25/23 10:33:07] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Traceback (most recent call last):
File "perf.py", line 187, in <module>
main()
File "perf.py", line 152, in main
layer_inputs = attn_layers[lyr_idx].forward(*layer_inputs, node_mask, pair_mask)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/evoformer.py", line 65, in forward
m = self.msa(m, z, msa_mask)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 143, in forward
node = self.MSARowAttentionWithPairBias(node, pair, node_mask_row)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 63, in forward
b = F.linear(Z, self.linear_b_weights)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 429806) of binary: /home/csnow/anaconda3/envs/fastfold/bin/python
Traceback (most recent call last):
File "/home/csnow/anaconda3/envs/fastfold/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
perf.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-08-25_10:33:10
host : icestorm
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 429806)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html