Add Alpa multi-node backend for OPT 175B #48

Open
ccmaymay opened this issue Aug 9, 2022 · 11 comments
Labels
new-model Request to add support for a new model

Comments

@ccmaymay
Collaborator

ccmaymay commented Aug 9, 2022

It looks like FB's license prohibits distribution of the OPT 175B weights, so any OPT 175B implementation is going to require a little extra work.

(good grief)

@ccmaymay ccmaymay added the new-model Request to add support for a new model label Aug 9, 2022
@ccmaymay ccmaymay self-assigned this Aug 17, 2022
@ccmaymay
Collaborator Author

Installed Alpa locally on the COE grid. Getting various errors like this:

(alpa-r5n04) 15:59:41 cmay@r5n04 examples (main) $ python -m alpa.test_install
2022-08-25 16:00:23.951522: W external/org_tensorflow/tensorflow/compiler/xla/service/platform_util.cc:200] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_DEVICE_UNAVAILABLE: CUDA-capable device(s) is/are busy or unavailable
2022-08-25 16:00:25.307147: F external/org_tensorflow/tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INVALID_ARGUMENT: device CUDA:2 not supported by XLA service

I get similar errors when running the Alpa OPT benchmark script described here: https://alpa.ai/tutorials/opt_serving.html#launch-a-web-server-to-serve-the-opt-models

The OPT JAX benchmark script succeeds.

As far as I can tell, the GPUs are available and the Ray worker is aware of them.
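
For reference, this is roughly the sanity check I have been running to see what each layer reports (a sketch; it assumes ray and jax import cleanly in the alpa environment, and the jax.devices() call is exactly what trips the backend initialization above):

import os

import jax
import ray

# Show which GPUs this shell has been restricted to, if any.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

# Ask the running Ray head node what resources it thinks it has.
ray.init(address="auto")
print("Ray cluster resources:", ray.cluster_resources())

# This forces XLA backend initialization, which is where the
# CUDA_ERROR_DEVICE_UNAVAILABLE errors above are raised.
print("JAX devices:", jax.devices())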

@ccmaymay
Collaborator Author

Using the CUDA Docker image with the CUDA forward-compatibility package on BRTX fails because the GPU there is not supported by forward compatibility:

2022-08-31 16:42:03.611628: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2022-08-31 16:42:03.612251: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 450.51.6 does not match DSO version 455.45.1 -- cannot find working devices in this configuration
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
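
For what it's worth, here is a minimal sketch of the check I'd run inside the container to compare what the kernel driver actually supports against what the user-space libraries expect (it assumes pynvml is available, e.g. via pip install pynvml):

import pynvml

pynvml.nvmlInit()
# Kernel-mode driver version (the 450.51.6 reported in the log above).
print("driver version:", pynvml.nvmlSystemGetDriverVersion())
# Highest CUDA driver API version the installed driver supports,
# reported as an integer, e.g. 11000 for CUDA 11.0.
print("max CUDA driver API:", pynvml.nvmlSystemGetCudaDriverVersion())
pynvml.nvmlShutdown()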

@ccmaymay
Collaborator Author

ccmaymay commented Aug 31, 2022

Building Alpa from source with CUDA 11.0 (which is not officially supported) produces this warning and then a core dump due to a bus error:

2022-08-31 17:19:42.554883: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

@ccmaymay
Collaborator Author

Cherry-picking the ptxas binary from the corresponding cuda 11.1 image also produces a bus error.
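
(For context, "cherry-picking" here just means dropping a newer ptxas onto the PATH or pointing XLA at a newer toolkit directory before JAX starts. A rough sketch of the latter, assuming a CUDA 11.1 toolkit were unpacked at /usr/local/cuda-11.1 on the node:)

import os

# Point XLA at a CUDA install whose bin/ptxas is >= 11.1. The path is only an
# example and must be set before jax is first imported.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "") + " --xla_gpu_cuda_data_dir=/usr/local/cuda-11.1"
)

import jax

print(jax.devices())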

@ccmaymay
Collaborator Author

ccmaymay commented Aug 31, 2022

The test_install errors above appear to be coming from JAX: the stack trace ends at a call to the get_backend function exported from this module: https://github.com/google/jax/blob/main/jax/lib/xla_bridge.py

Running the automatic-differentiation part of the JAX quickstart in IPython produces similar errors: https://github.com/google/jax#automatic-differentiation-with-grad

At first I thought the issue was that the GPUs were already in use. GPU reservations made with qlogin don't seem to be honored: other processes sometimes end up using the reserved GPUs. qstat -j JOB_ID appears to show the GPU indices that were requested, but the scheduler doesn't automatically set CUDA_VISIBLE_DEVICES or SGE_HGR_gpu. That said, I have looked for unused GPUs on the node and tried to use those by setting CUDA_VISIBLE_DEVICES myself.

On one attempt, however, the test_install script seemed to get further than usual, so I am not entirely sure this isn't simply a matter of actually getting an unused GPU. nvidia-smi -q also reports the compute mode as "Exclusive Process," which prohibits more than one context per GPU.
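
Below is a rough sketch of the check I've been doing by hand (assuming pynvml is installed): it lists GPUs that currently have no compute processes and reports each GPU's compute mode, so CUDA_VISIBLE_DEVICES can be pointed at a free one before starting Ray:

import pynvml

pynvml.nvmlInit()
free_gpus = []
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    exclusive = (pynvml.nvmlDeviceGetComputeMode(handle)
                 == pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)
    print(f"GPU {i}: {len(procs)} compute process(es), exclusive-process mode: {exclusive}")
    if not procs:
        free_gpus.append(str(i))
pynvml.nvmlShutdown()

# e.g. export this value before ray start / alpa.test_install
print("candidate CUDA_VISIBLE_DEVICES:", ",".join(free_gpus))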

@ccmaymay
Collaborator Author

Here is my approximate setup procedure:

module load cuda11.3/toolkit/11.3.1-1
module load nccl/2.9.9-1_cuda11.3 
module load cudnn/8.2.0.53_cuda11.x
conda create -n alpa-r10n06 python=3.8
conda activate alpa-r10n06
conda install -y pytorch torchvision torchaudio -c pytorch
pip install accelerate 'transformers>=4.20.1'
pip install cupy-cuda113
pip install alpa
pip install jaxlib==0.3.5+cuda113.cudnn820 -f https://alpa-projects.github.io/wheels.html
ray start --head
python -m alpa.test_install
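
After an install like this, a quick consistency check (a sketch; it assumes torch, cupy, jax, and jaxlib all import cleanly) to confirm everything was built against the same CUDA line as the loaded 11.3 modules:

import cupy
import jax
import jaxlib
import torch

# CUDA version PyTorch was built against, e.g. "11.3".
print("torch CUDA:", torch.version.cuda)
# CUDA runtime version cupy is linked against, e.g. 11030 for 11.3.
print("cupy CUDA runtime:", cupy.cuda.runtime.runtimeGetVersion())
# jaxlib version (the wheel installed above is 0.3.5+cuda113.cudnn820).
print("jaxlib version:", jaxlib.__version__)
# Forces backend initialization; should list GPUs rather than fall back to CPU.
print("jax devices:", jax.devices())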

@ccmaymay
Collaborator Author

I overlooked this from the wiki:

To use a GPU in an interactive session, use qrsh with /bin/bash (not qlogin) to get a session.

qrsh -q gpu.q -l num_proc=1,mem_free=10G,h_rt=8:00:00,gpu=1

@ccmaymay
Collaborator Author

Never mind, using qrsh doesn't fix the issue.

@ccmaymay
Collaborator Author

ccmaymay commented Sep 1, 2022

Finally got the test_install script working on an EC2 p2 instance. The driver was misconfigured on the DL AMIs, so I used a base Ubuntu AMI and installed CUDA myself. I still needed to install the legacy 470 driver for the K80 and then use a separate install script for the toolkit (11.3). The test produces warnings like this:

(MeshHostWorker pid=2411) 2022-09-01 23:03:00.750916: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:212] Failed to find best cuBLAS algorithm, GEMM performance might be suboptimal: INTERNAL: All algorithms tried for %cublas-gemm.5 = f32[64,128]{1,0} custom-call(f32[64,128]{1,0} %multiply.147, f32[128,128]{1,0} %param_19), custom_call_target="__cublas$gemm", metadata={op_type="dot_general" op_name="parallelize(train_step_pipeshard_parallel_mesh_0)/dot_general[dimension_numbers=(((1,), (1,)), ((), ())) precision=None preferred_element_type=None]" source_file="/home/ubuntu/anaconda3/lib/python3.9/site-packages/flax/linen/linear.py" source_line=188}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"1\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"lhs_stride\":\"8192\",\"rhs_stride\":\"16384\"}" failed. Falling back to default algorithm.  Per-algorithm errors:

@ccmaymay
Collaborator Author

ccmaymay commented Sep 1, 2022

Using that same EC2 instance, the textgen test succeeded in producing output (generating text completions) with the alpa/opt-125m model on a single-node Ray cluster. It produced the same warnings about picking the best cuBLAS algorithm, and also printed these errors at the end (hopefully just a cleanup issue?):

Exception ignored in: <function RemoteArrayRef.__del__ at 0x7fe4c5c43c10>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/alpa/device_mesh.py", line 1373, in __del__
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/alpa/device_mesh.py", line 1140, in delete_remote_buffers
TypeError: 'NoneType' object is not callable
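
This looks like the usual interpreter-shutdown pattern rather than anything Alpa-specific: by the time the RemoteArrayRef finalizer runs, the module global it calls has already been cleared to None. A self-contained toy illustration of the mechanism (all names here are made up; depending on the Python version and teardown order it may or may not print the exact same message at exit):

# toy_del_at_exit.py -- illustrative only, no alpa involved
def _release(buf_id):
    print("releasing buffer", buf_id)

class RemoteRef:
    def __init__(self, buf_id):
        self.buf_id = buf_id

    def __del__(self):
        # If this runs during interpreter shutdown, the module global
        # _release may already have been set to None, producing
        # "TypeError: 'NoneType' object is not callable", which Python
        # reports as "Exception ignored in: <function RemoteRef.__del__ ...>".
        _release(self.buf_id)

# A reference still alive at interpreter exit is what triggers the message.
leftover = RemoteRef(42)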

@ccmaymay
Collaborator Author

ccmaymay commented Sep 2, 2022

Investigating the cuBLAS algorithm-picker warning from the previous comment, I found the following in the algorithm picker code:

// We expect GemmWithAlgorithm to fail sometimes
// -- in fact, it will fail for all algorithms if
// we're targeting < sm_50

sm_50 refers to compute capability 5.0, and the K80 has compute capability 3.7. So I expect this warning to resolve itself on more recent GPUs.
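
(For reference, a one-line check of the compute capability on a given instance, using the PyTorch that is already in the environment:)

import torch

# Returns (major, minor); a K80 reports (3, 7), i.e. below the sm_50 cutoff.
print(torch.cuda.get_device_capability(0))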

@ccmaymay ccmaymay removed their assignment Sep 8, 2022