cuBLAS error when running the following HuggingFace `accelerate` benchmark code on NVIDIA H100 HGX (CUDA v12.1, cuDNN 8.8.1, pytorch==2.0.0+cu118), within a Jupyter Notebook:

```
!CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python transformers-bloom-inference/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark
```
...or at the CLI:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python transformers-bloom-inference/bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark
```
SOLUTION:
Make the following edits within `transformers-bloom-inference/bloom-inference-scripts/bloom-accelerate-inference.py` to enable mixed precision:
line 10:

Add `BitsAndBytesConfig` to the import call, as follows:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
```
line 56:

Add `quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)` to the statement, as follows:

```python
infer_dtype = args.dtype
if infer_dtype == "int8":
    dtype = torch.int8
    quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

kwargs = dict(
    device_map="auto",
)
```
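As a quick standalone sanity check (an illustrative snippet, not part of the script), the new config can be constructed and inspected on its own. With `device_map="auto"`, any modules that end up offloaded to the CPU are kept in fp32 rather than int8, which is what this flag enables:

```python
from transformers import BitsAndBytesConfig

# Same config as in the line-56 edit above.
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

print(quantization_config.llm_int8_enable_fp32_cpu_offload)  # True
print(quantization_config.load_in_8bit)  # False -- load_in_8bit is passed separately in the line-77 edit
```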
line 77:

Edit `kwargs` to include `quantization_config`, as follows:

```python
if infer_dtype == "int8":
    print_rank0("Using load_in_8bit=True to use quantized model")
    # kwargs["load_in_8bit"] = True
    kwargs = {"load_in_8bit": True, "quantization_config": quantization_config, "device_map": "auto"}
else:
    kwargs["torch_dtype"] = dtype
```
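For context, a minimal sketch of how these kwargs reach the model load (assuming the script's usual `from_pretrained(**kwargs)` pattern; the actual call site in bloom-accelerate-inference.py may differ slightly):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "bigscience/bloom"  # in the script this comes from --name
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
kwargs = {"load_in_8bit": True, "quantization_config": quantization_config, "device_map": "auto"}

# load_in_8bit, the quantization config, and device_map="auto" are all forwarded here,
# so bitsandbytes sees llm_int8_enable_fp32_cpu_offload=True when the model is sharded.
model = AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
```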
Save the updated PY file, then run the `accelerate` inferencing code. The updated PY file runs the benchmarks without errors. Recommend making these, or similar, code changes to the parent repo.