Torch.dynamo is not working on H100 due to obsolete triton & pytorch
Steps to reproduce
Easily reproducible on H100 by running 'pytest -k benchmark'
Expected Behavior
Works.
Actual Behavior
Doesn't work. The installed Triton (v2.0.0) is too old to know anything about H100 (sm_90).
Getting the following errors:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
This one can be solved by installing a newer Torch (2.0.1+cu118) from the suggested URL.
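To check up front whether a given PyTorch build covers the GPU, something like the sketch below may help. It assumes that a build is usable when it ships kernels for the same major compute capability as the device, which is a simplification. The helper names (`parse_sm`, `has_kernel_for`, `check_local_gpu`) are made up for illustration; only `torch.cuda.is_available`, `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are real PyTorch calls.

```python
import re


def parse_sm(arch):
    """Turn a string like 'sm_86' into (8, 6); return None for anything else."""
    match = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", arch)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_kernel_for(capability, arch_list):
    """Assumption: kernels built for the same major capability are usable."""
    major, _minor = capability
    for arch in arch_list:
        parsed = parse_sm(arch)
        if parsed is not None and parsed[0] == major:
            return True
    return False


# Arch list copied from the error message quoted in this report.
REPORTED_ARCHS = "sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86".split()


def check_local_gpu():
    """Run the same check on this machine; None if torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return has_kernel_for(
        torch.cuda.get_device_capability(0),  # (9, 0) on an H100
        torch.cuda.get_arch_list(),
    )
```

Calling `has_kernel_for((9, 0), REPORTED_ARCHS)` reproduces the mismatch from the error message without needing a GPU, and `check_local_gpu()` applies the same test to the installed build.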
The second one is a triton issue:
E RuntimeError: CUDA error: no kernel image is available for execution on the device
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
Triton v2.0.0 has a limitation: it only supports compute capabilities below sm_90 (sm_90 itself not included). I could not easily install a newer Triton, since it complains about being incompatible. However, I was able to hack Triton: I got it locally, synced to the v2.0.0 tag and reverted the d54c04ab commit. But I am not sure it is using all SMs correctly on H100 after this surgery.
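If the incompatibility complaint comes from pip's dependency resolver (an assumption, the exact message is not quoted here), it can help to look at which Triton version the installed Torch declares as a requirement. The sketch below reads that from the package metadata using the standard library's `importlib.metadata`; the helper names `pins_for` and `installed_pins` are hypothetical.

```python
import re
from importlib import metadata


def pins_for(requirements, package):
    """Return the requirement strings whose project name equals `package`."""
    wanted = package.lower()
    hits = []
    for req in requirements or []:
        # A requirement string starts with the project name, followed by a
        # version specifier, extras, or an environment marker.
        name = re.split(r"[\s;<>=!~\[(]", req.strip(), maxsplit=1)[0]
        if name.lower() == wanted:
            hits.append(req)
    return hits


def installed_pins(dist, package):
    """Look up what the installed `dist` requires of `package`, if anything."""
    try:
        return pins_for(metadata.requires(dist), package)
    except metadata.PackageNotFoundError:
        return []


if __name__ == "__main__":
    # Prints the Triton requirement declared by the installed torch, if any.
    print(installed_pins("torch", "triton"))
```

An exact `==` pin in the output would explain why a newer Triton cannot be installed next to that Torch without overriding the resolver.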
Your environment
Using Docker:
DOCKER_BUILDKIT=1 docker build -t kernl .
docker run --rm -it --gpus all -v $(pwd):/kernl kernl
Also tried the more recent NVIDIA Docker image (12.2.0-devel-ubuntu22.04), with the same result.
Packages: