
NotImplementedError: Cannot copy out of meta tensor; no data! #50

Open
zdaoguang opened this issue Feb 10, 2023 · 9 comments

@zdaoguang

zdaoguang commented Feb 10, 2023

Hi,

I am using the DeepSpeed framework to speed up inference of BLOOM 7.1B, as shown below:

deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom-7b1

But I got the following error:

(bloom) xxx@HOST-xxx:~/projects/transformers-bloom-inference/bloom-inference-scripts$ bash run_deepspeed.sh 
[2023-02-10 17:46:16,148] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-10 17:46:16,202] [INFO] [runner.py:548:main] cmd = /home/caojunzhi/anaconda3/envs/chatgpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-ds-inference.py --name bigscience/bloom-7b1
[2023-02-10 17:46:19,604] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-10 17:46:19,604] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-10 17:46:19,604] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-10 17:46:19,604] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-10 17:46:19,604] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-10 17:46:23,455] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom-7b1
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 33951.40it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 8339.85it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 7358.43it/s]
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 26572.10it/s]
[2023-02-10 17:46:33,775] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-10 17:46:33,778] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-10 17:46:33,779] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data/xxx/.cache/torch_extensions/py310_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.1198277473449707 seconds
[2023-02-10 17:46:34,344] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 32, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False}
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0038442611694335938 seconds
Loading 2 checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00,  9.94s/it]checkpoint loading time at rank 0: 21.33984684944153 sec
Loading 2 checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.67s/it]
Traceback (most recent call last):
  File "/data/xxx/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 181, in <module>
    model = deepspeed.init_inference(
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 129, in __init__
    self.module.to(device)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-02-10 17:46:57,652] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 25235
[2023-02-10 17:46:57,653] [ERROR] [launch.py:324:sigkill_handler] ['/home/caojunzhi/anaconda3/envs/chatgpt/bin/python', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', 'bigscience/bloom-7b1'] exits with return code = 1
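
For context, the traceback above points at the script's deepspeed.init_inference call (line 181 of bloom-ds-inference.py): the model carries meta tensors (weights with no data) and DeepSpeed is expected to materialize the real weights before the engine moves the module to the GPU. Below is a minimal, hypothetical sketch of that pattern, not the exact script; the checkpoint path and arguments are placeholders.

import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom-7b1")

# The model is materialized on the "meta" device first, i.e. with no real weight data.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# DeepSpeed is then supposed to load the real weights during kernel injection.
# If that step is skipped or fails, InferenceEngine.__init__ later calls
# self.module.to(device) on meta tensors, which raises
# "NotImplementedError: Cannot copy out of meta tensor; no data!".
model = deepspeed.init_inference(
    model,
    mp_size=1,                         # deprecated in 0.8.0 (use tensor_parallel.tp_size)
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    # checkpoint="checkpoints.json",   # the real script points this at the BLOOM shards
)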

My main conda environment is:

accelerate               0.16.0
deepspeed                0.8.0
deepspeed-mii            0.0.2
huggingface-hub          0.12.0
tokenizers               0.12.1
torch                    1.13.1
transformers             4.26.0

My nvidia-smi info is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0    37W / 250W |   1253MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   37C    P0    40W / 250W |   2411MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   33C    P0    24W / 250W |      4MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Can you help me solve this bug? Thank you very much!

@mayank31398
Collaborator

This is a bug in DeepSpeed. Can you report it there?
Also, FYI: DS-inference doesn't work with PyTorch 1.13.1 yet.
I would suggest falling back to 1.12.1.
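
A quick way to confirm which torch / DeepSpeed build the launcher will actually pick up before re-running (a trivial check using only standard version attributes):

import torch
import deepspeed

print("torch:", torch.__version__)             # should show 1.12.1 after the downgrade
print("built with CUDA:", torch.version.cuda)  # CUDA toolkit torch was compiled against
print("deepspeed:", deepspeed.__version__)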

@zdaoguang
Author

zdaoguang commented Feb 12, 2023

Thanks for your reply. When I downgraded torch to 1.12.1 and switched CUDA to the matching version (10.2.89), the previous error indeed disappeared, but a new one appeared, as shown below.

[2023-02-12 10:19:51,085] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-12 10:19:51,252] [INFO] [runner.py:548:main] cmd = /usr/local/tools/Python-3.10.9/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-ds-inference.py --name /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 10:19:53,867] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-12 10:19:53,868] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-12 10:19:53,868] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-12 10:19:53,868] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-12 10:19:53,868] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-12 10:19:56,839] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 10:20:01,592] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-12 10:20:01,594] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-12 10:20:01,594] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /root/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu102/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] //usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o 
FAILED: gelu.cuda.o 
//usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00006b7b_00000000-6_gelu.cpp1.ii".
[2/4] //usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o 
FAILED: relu.cuda.o 
//usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined

1 error detected in the compilation of "/tmp/tmpxft_00006b7c_00000000-6_relu.cpp1.ii".
[3/4] //usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o 
FAILED: layer_norm.cuda.o 
//usr/local/cuda-10.2/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(520): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(409): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_size"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(414): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(421): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(422): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_size"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(447): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
          detected during:
            instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=2, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=2, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=6, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=6, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]" 
(191): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=1, maxThreads=256]" 
(165): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=2, maxThreads=256]" 
(167): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=4, maxThreads=256]" 
(169): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=8, maxThreads=256]" 
(171): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=1, threadsPerGroup=16, maxThreads=256]" 
(173): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=4, threadsPerGroup=256, maxThreads=256]" 
(178): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=8, threadsPerGroup=256, maxThreads=256]" 
(181): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=12, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=12, threadsPerGroup=256, maxThreads=256]" 
(184): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=16, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
          detected during:
            instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=float, unRoll=16, threadsPerGroup=256, maxThreads=256]" 
(187): here
            instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=float]" 
(199): here

7 errors detected in the compilation of "/tmp/tmpxft_00006b7d_00000000-6_layer_norm.cpp1.ii".
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/zandaoguang/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 183, in <module>
    model = deepspeed.init_inference(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 126, in __init__
    self._apply_injection_policy(config)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 339, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 792, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1061, in replace_module
    replaced_module, _ = _replace_module(model, policy)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1088, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1088, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1078, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 782, in replace_fn
    new_module = replace_with_policy(child,
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 473, in replace_with_policy
    new_module = transformer_inference.DeepSpeedTransformerInference(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 53, in __init__
    inference_cuda_module = builder.load()
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 462, in load
    return self.jit_load(verbose)
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'
[2023-02-12 10:20:03,879] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 27397
[2023-02-12 10:20:03,880] [ERROR] [launch.py:324:sigkill_handler] ['/usr/local/tools/Python-3.10.9/bin/python3.10', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', '/home/zandaoguang/downloads/bloom-7b1'] exits with return code = 1

My conda environment (Python 3.10.9), via pip list:

Package            Version
------------------ ----------
accelerate         0.16.0
certifi            2022.12.7
charset-normalizer 3.0.1
deepspeed          0.8.0
filelock           3.9.0
hjson              3.1.0
huggingface-hub    0.12.0
idna               3.4
ninja              1.11.1
numpy              1.24.2
packaging          23.0
pip                22.3.1
psutil             5.9.4
py-cpuinfo         9.0.0
pydantic           1.10.4
PyYAML             6.0
regex              2022.10.31
requests           2.28.2
setuptools         65.5.0
tokenizers         0.12.1
torch              1.12.1
tqdm               4.64.1
transformers       4.26.0
typing_extensions  4.4.0
urllib3            1.26.14

The nvcc -V result is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

Can you help me solve it? Thanks.

@mayank31398
Collaborator

mayank31398 commented Feb 12, 2023

I am not really sure. I haven't seen this before, but it seems like CUDA is not able to compile some kernels in DeepSpeed.
I am using CUDA 11.6 with 8x A100 80GB GPUs.
Can you try switching to CUDA 11.6?
If not, there is a Dockerfile that is tested and works fine.

However, you will need to modify it a bit for the standalone script.
I am using it for the inference server.

@zdaoguang
Author

Actually, I can only use CUDA 10.2; when I use other CUDA versions, I get the following error:

[2023-02-12 16:48:25,193] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-12 16:48:25,352] [INFO] [runner.py:508:main] cmd = /home/caojunzhi/anaconda3/envs/chatgpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 bloom-ds-inference.py --name /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 16:48:27,793] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-12 16:48:27,793] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-12 16:48:27,793] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-12 16:48:27,793] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-12 16:48:27,793] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-12 16:48:30,664] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 16:48:35,960] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.6, git-hash=unknown, git-branch=unknown
[2023-02-12 16:48:35,963] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-12 16:48:35,963] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Traceback (most recent call last):
  File "/data/zandaoguang/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 183, in <module>
    model = deepspeed.init_inference(
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 124, in __init__
    self._apply_injection_policy(config)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 349, in _apply_injection_policy
    replace_transformer_layer(client_module,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 881, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1139, in replace_module
    replaced_module, _ = _replace_module(model, policy)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1166, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1166, in _replace_module
    _, layer_id = _replace_module(child, policies, layer_id=layer_id)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1156, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 871, in replace_fn
    new_module = replace_with_policy(child,
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 454, in replace_with_policy
    new_module = transformer_inference.DeepSpeedTransformerInference(
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 53, in __init__
    inference_cuda_module = builder.load()
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 459, in load
    return self.jit_load(verbose)
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 474, in jit_load
    assert_no_cuda_mismatch()
  File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 100, in assert_no_cuda_mismatch
    raise Exception(
Exception: Installed CUDA version 11.1 does not match the version torch was compiled with 10.2, unable to compile cuda/cpp extensions without a matching cuda version.
[2023-02-12 16:48:37,805] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15223
[2023-02-12 16:48:37,805] [ERROR] [launch.py:324:sigkill_handler] ['/home/caojunzhi/anaconda3/envs/chatgpt/bin/python', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', '/home/zandaoguang/downloads/bloom-7b1'] exits with return code = 1

The pip list result is:

Package                  Version
------------------------ ----------
accelerate               0.15.0
aiohttp                  3.8.3
aiosignal                1.3.1
anyio                    3.6.2
asttokens                2.2.1
async-timeout            4.0.2
asyncio                  3.4.3
attrs                    22.2.0
backcall                 0.2.0
certifi                  2022.12.7
charset-normalizer       2.1.1
click                    8.1.3
comm                     0.1.2
datasets                 2.9.0
debugpy                  1.6.6
decorator                5.1.1
deepspeed                0.7.6
deepspeed-mii            0.0.4
dill                     0.3.6
executing                1.2.0
fastapi                  0.89.1
filelock                 3.9.0
Flask                    2.2.2
Flask-API                3.0.post1
Flask-Cors               3.0.10
frozenlist               1.3.3
fsspec                   2023.1.0
grpcio                   1.51.1
grpcio-tools             1.50.0
gunicorn                 20.1.0
h11                      0.14.0
hjson                    3.1.0
huggingface-hub          0.10.1
idna                     3.4
ipdb                     0.13.11
ipykernel                6.21.0
ipython                  8.9.0
itsdangerous             2.1.2
jedi                     0.18.2
Jinja2                   3.1.2
joblib                   1.2.0
jupyter_client           8.0.2
jupyter_core             5.2.0
MarkupSafe               2.1.2
matplotlib-inline        0.1.6
multidict                6.0.4
multiprocess             0.70.14
ninja                    1.11.1
numpy                    1.24.1
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
packaging                23.0
pandas                   1.5.3
parso                    0.8.3
pexpect                  4.8.0
pickleshare              0.7.5
Pillow                   9.4.0
pip                      23.0
platformdirs             2.6.2
prompt-toolkit           3.0.36
protobuf                 4.21.12
psutil                   5.9.4
ptyprocess               0.7.0
pure-eval                0.2.2
py-cpuinfo               9.0.0
pyarrow                  11.0.0
pydantic                 1.10.2
Pygments                 2.14.0
python-dateutil          2.8.2
pytz                     2022.7.1
PyYAML                   6.0
pyzmq                    25.0.0
regex                    2022.10.31
requests                 2.28.2
responses                0.18.0
sacremoses               0.0.53
sentencepiece            0.1.97
setuptools               65.6.3
six                      1.16.0
sniffio                  1.3.0
stack-data               0.6.2
starlette                0.22.0
tokenizers               0.12.1
tomli                    2.0.1
torch                    1.12.1
torchvision              0.13.1
tornado                  6.2
tqdm                     4.64.1
traitlets                5.9.0
transformers             4.25.1
typing_extensions        4.4.0
urllib3                  1.26.14
uvicorn                  0.19.0
wcwidth                  0.2.6
Werkzeug                 2.2.2
wheel                    0.37.1
xxhash                   3.2.0
yarl                     1.8.2

The nvcc -V result is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

The nvidia-smi result is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   33C    P0    28W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   33C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

It feels like a version issue, but I've tried to keep the versions the same as in your Dockerfile. Have you run into this problem before? Thank you again.

@mayank31398
Collaborator

I think your environment has CUDA 11.1 installed, while torch was compiled against CUDA 10.2.
Can you install a torch build that matches the CUDA version you have installed?
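As a sketch (the exact wheel tag here just mirrors the versions pinned in this repo's Dockerfile, so treat it as an example rather than a requirement), with a CUDA 11.6 toolkit that would be:

pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

With a CUDA 11.1 toolkit you would instead need a +cu111 wheel of whichever torch release provides one; the important part is that torch.version.cuda agrees with the nvcc -V version DeepSpeed picks up.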

@wohenniubi

Hi @mayank31398, I ran into a similar issue when using the DeepSpeed framework to speed up inference of BLOOM (bigscience/bloom). Could you please take a look? Many thanks.

The cmd is shown below:
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

The log is listed as follows:

[root@7656ea32130c transformers-bloom-inference]# deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[2023-03-09 06:03:27,119] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-09 06:03:30,403] [INFO] [runner.py:508:main] cmd = /opt/conda/envs/inference/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-devel-2.12.10-1+cuda11.6
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl-2.12.10-1+cuda11.6
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-devel
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_VERSION=2.12.10
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-03-09 06:03:32,070] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-03-09 06:03:32,070] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-03-09 06:03:32,070] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-03-09 06:03:32,070] [INFO] [launch.py:162:main] dist_world_size=8
[2023-03-09 06:03:32,070] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-03-09 06:03:34,840] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 573/573 [00:00<00:00, 41.8kB/s]
/cos/HF_cache/models--bigscience--bloom/snapshots/ea51bbb9a58423efb336e2d6c900a8b3dc64b2eb
[2023-03-09 06:03:44,806] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.6, git-hash=unknown, git-branch=unknown
[2023-03-09 06:03:44,807] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,807] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu116/transformer_inference...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/dequantize.cu -o dequantize.cuda.o
[2/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o
[3/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu -o transform.cuda.o
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(56): warning #177-D: variable "lane" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(95): warning #177-D: variable "half_dim" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(112): warning #177-D: variable "vals_half" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(113): warning #177-D: variable "output_half" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(130): warning #177-D: variable "lane" was declared but never referenced

[4/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu -o apply_rotary_pos_emb.cuda.o
[5/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu -o softmax.cuda.o
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu(275): warning #177-D: variable "alibi_offset" was declared but never referenced

/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu(430): warning #177-D: variable "warp_num" was declared but never referenced

[6/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o
[7/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o
[8/9] c++ -MMD -MF pt_binding.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp -o pt_binding.o
[9/9] c++ pt_binding.o gelu.cuda.o relu.cuda.o layer_norm.cuda.o softmax.cuda.o dequantize.cuda.o apply_rotary_pos_emb.cuda.o transform.cuda.o -shared -lcurand -L/opt/conda/envs/inference/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o transformer_inference.so
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.044259786605835 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.013221502304077 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 23.90252995491028 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.007809162139893 seconds
Time to load transformer_inference op: 24.017361402511597 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 23.906622886657715 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.007588863372803 seconds
Time to load transformer_inference op: 24.015749216079712 seconds
[2023-03-09 06:04:09,565] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False}
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06393146514892578 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...Time to load transformer_inference op: 0.061557769775390625 seconds

No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.061757564544677734 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06235527992248535 seconds
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06160426139831543 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06882047653198242 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06495046615600586 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.07005953788757324 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05634450912475586 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05931544303894043 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06092071533203125 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05466651916503906 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.058559417724609375 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05735135078430176 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05769968032836914 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06432437896728516 seconds
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 6: 0.0035653114318847656 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 4: 0.0038051605224609375 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 3: 0.0014710426330566406 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 119, in <module>
    main()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 115, in main
    benchmark_end_to_end(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 48, in benchmark_end_to_end
    model, initialization_time = run_and_log_time(partial(ModelDeployment, args=args, grpc_allowed=False))
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/utils/utils.py", line 152, in run_and_log_time
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 7: 0.002664327621459961 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
    results = execs()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/model_handler/deployment.py", line 54, in __init__
    self.model = get_model_class(args.deployment_framework)(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/models/ds_inference.py", line 53, in __init__
Traceback (most recent call last):
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 119, in <module>
    main()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 115, in main
    benchmark_end_to_end(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 48, in benchmark_end_to_end
    model, initialization_time = run_and_log_time(partial(ModelDeployment, args=args, grpc_allowed=False))
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/utils/utils.py", line 152, in run_and_log_time
    results = execs()
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/model_handler/deployment.py", line 54, in __init__
    self.model = get_model_class(args.deployment_framework)(args)
  File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/models/ds_inference.py", line 53, in __init__
    self.model = deepspeed.init_inference(
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
        self.model = deepspeed.init_inference(engine = InferenceEngine(model, config=ds_inference_config)

  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 127, in __init__
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
    self.module.to(device)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1749, in to
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 127, in __init__
    self.module.to(device)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    return self._apply(convert)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    module._apply(fn)
  File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
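
For what it's worth (my own reading, not something stated in the repo), the exception itself is just the generic PyTorch behaviour for tensors on the meta device, which carry shape and dtype but no storage. It can be reproduced without DeepSpeed:

python -c "import torch; torch.empty(1, device='meta').to('cpu')"

which raises the same NotImplementedError. Combined with the "Loading 0 checkpoint shards: 0it" lines above, this makes me suspect that no weights were actually loaded, so the parameters were still on the meta device when DeepSpeed tried to move the module to the GPU.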

Here, I use the docker image generated by the Dockerfile from https://github.com/huggingface/transformers-bloom-inference/blob/main/Dockerfile. The pip list output is:

Package            Version
------------------ ------------
accelerate         0.16.0
anyio              3.6.2
certifi            2022.12.7
charset-normalizer 3.1.0
click              8.1.3
deepspeed          0.7.6
fastapi            0.89.1
filelock           3.9.0
Flask              2.2.3
Flask-API          3.0.post1
grpcio             1.51.3
grpcio-tools       1.50.0
gunicorn           20.1.0
h11                0.14.0
hjson              3.1.0
huggingface-hub    0.12.1
idna               3.4
importlib-metadata 6.0.0
itsdangerous       2.1.2
Jinja2             3.1.2
MarkupSafe         2.1.2
ninja              1.11.1
numpy              1.24.2
packaging          23.0
pip                23.0.1
protobuf           4.22.1
psutil             5.9.4
py-cpuinfo         9.0.0
pydantic           1.10.2
PyYAML             6.0
regex              2022.10.31
requests           2.28.2
setuptools         65.6.3
sniffio            1.3.0
starlette          0.22.0
tokenizers         0.13.2
torch              1.12.1+cu116
tqdm               4.65.0
transformers       4.26.1
typing_extensions  4.5.0
urllib3            1.26.14
uvicorn            0.19.0
Werkzeug           2.2.3
wheel              0.38.4
zipp               3.15.0

The nvidia-smi shows:

 +-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:27:00.0 Off |                    0 |
| N/A   32C    P0    69W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:2A:00.0 Off |                    0 |
| N/A   29C    P0    66W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:51:00.0 Off |                    0 |
| N/A   31C    P0    69W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   33C    P0    63W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:9E:00.0 Off |                    0 |
| N/A   32C    P0    65W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:A4:00.0 Off |                    0 |
| N/A   30C    P0    63W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C7:00.0 Off |                    0 |
| N/A   29C    P0    64W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   32C    P0    66W / 400W |     35MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The nvcc -V result is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

@mayank31398
Collaborator

The dockerfile works out of the box.
Can you give it a shot?

@wohenniubi

Many thanks for your prompt response @mayank31398

The dockerfile is as follows:

root@super-klb:~/test/transformers-bloom-inference-GPU# cat Dockerfile
FROM nvidia/cuda:11.6.1-devel-ubi8 as base

RUN dnf install -y --disableplugin=subscription-manager make git && dnf clean all --disableplugin=subscription-manager

# taken form pytorch's dockerfile
RUN curl -L -o ./miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ./miniconda.sh && \
    ./miniconda.sh -b -p /opt/conda && \
    rm ./miniconda.sh

ENV PYTHON_VERSION=3.9 \
    PATH=/opt/conda/envs/inference/bin:/opt/conda/bin:${PATH}

# create conda env
RUN conda create -n inference python=${PYTHON_VERSION} pip -y

# change shell to activate env
SHELL ["conda", "run", "-n", "inference", "/bin/bash", "-c"]

FROM base as conda

# update conda
RUN conda update -n base -c defaults conda -y
# cmake
RUN conda install -c anaconda cmake -y

# necessary stuff
RUN pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 \
    transformers==4.26.1 \
    deepspeed==0.7.6 \
    accelerate==0.16.0 \
    gunicorn==20.1.0 \
    flask \
    flask_api \
    fastapi==0.89.1 \
    uvicorn==0.19.0 \
    jinja2==3.1.2 \
    pydantic==1.10.2 \
    huggingface_hub==0.12.1 \
        grpcio-tools==1.50.0 \
    --no-cache-dir

# clean conda env
RUN conda clean -ya

# change this as you like 🤗
ENV TRANSFORMERS_CACHE=/cos/HF_cache \
    HUGGINGFACE_HUB_CACHE=${TRANSFORMERS_CACHE}

FROM conda as app

WORKDIR /src
RUN chmod -R g+w /src

RUN mkdir /.cache && \
    chmod -R g+w /.cache

ENV PORT=5000 \
    UI_PORT=5001
EXPOSE ${PORT}
EXPOSE ${UI_PORT}

#CMD git clone https://github.com/huggingface/transformers-bloom-inference.git && \
#    cd transformers-bloom-inference && \
#    # install grpc and compile protos
#    make gen-proto && \
#    make bloom-560m

I simply comment out the last 5 lines and run them manually inside the container (to avoid cloning the repo again every time I docker exec into the running instance from another terminal).
Specifically, here are my steps to build the image and launch the container:

git clone https://github.com/huggingface/transformers-bloom-inference transformers-bloom-inference-GPU
cd transformers-bloom-inference-GPU
comment out the last 5 lines of Dockerfile as mentioned above
docker build -t transformers-bloom:v1.0 .
docker run --gpus all -it --name="bloom" -v /nfs/users/test:/nfs/users/test -w /nfs/users/test transformers-bloom:v1.0

Then, inside the container, I run make bloom-176b, launch the benchmark, and hit the NotImplementedError: Cannot copy out of meta tensor; no data!

git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference && \
    # install grpc and compile protos
    make gen-proto && \
    make bloom-176b
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
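
One thing I have not verified yet (just a guess prompted by the "Loading 0 checkpoint shards" lines): whether the bloom weight shards were fully downloaded into the cache before the run. Listing the snapshot directory printed in the log should show them:

ls -lh /cos/HF_cache/models--bigscience--bloom/snapshots/ea51bbb9a58423efb336e2d6c900a8b3dc64b2eb/

If that directory only contains the config and tokenizer files and no weight shards, the meta-tensor error above would be expected.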

As a supplement: I can successfully run the benchmark for bloom-3b and get the perf data. First, I add the following target to the Makefile:

bloom-3b:
        make ui

        TOKENIZERS_PARALLELISM=false \
        MODEL_NAME=bigscience/bloom-3b \
        MODEL_CLASS=AutoModelForCausalLM \
        DEPLOYMENT_FRAMEWORK=ds_inference \
        DTYPE=fp16 \
        MAX_INPUT_LENGTH=32 \
        MAX_BATCH_SIZE=4 \
        CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
        gunicorn -t 0 -w 1 -b 127.0.0.1:5000 inference_server.server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s'

Then

make bloom-3b
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom-3b --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

[screenshot of the bloom-3b benchmark results]

@mayank31398
Collaborator

Not sure why 176b is not working. I will try to look into it :)
