
the supported AMDGPU versions are gfx1030gfx1100, may be lost a ',' between the devices "gfx1030,gfx1100" #2524

Open
gitleibin opened this issue May 4, 2024 · 5 comments

Comments

@gitleibin

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

v2.14.0-4248-g3448956e87e 2.14.0.600

Custom code

Yes

OS platform and distribution

No response

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

2024-05-04 09:45:04.334204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2266] Ignoring visible gpu device (device: 0, name: Radeon RX 7900 XTX, pci bus id: 0000:03:00.0) with AMDGPU version : gfx1100. The supported AMDGPU versions are gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942.

Standalone code to reproduce the issue

The supported AMDGPU versions are gfx1030gfx1100,

Relevant log output

>>> import os
>>> from tensorflow.python.client import device_lib
>>> os.environ["TF_CPP_MIN_LOG_LEVEL"]="99"
>>> 
>>> if __name__=="__main__":
...     print(device_lib.list_local_devices())
... 
2024-05-04 09:45:04.333922: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-05-04 09:45:04.334007: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-05-04 09:45:04.334103: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-05-04 09:45:04.334144: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-05-04 09:45:04.334184: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-05-04 09:45:04.334204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2266] Ignoring visible gpu device (device: 0, name: Radeon RX 7900 XTX, pci bus id: 0000:03:00.0) with AMDGPU version : gfx1100. The supported AMDGPU versions are gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942.
2024-05-04 09:45:04.334239: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:756] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-05-04 09:45:04.334253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2266] Ignoring visible gpu device (device: 1, name: AMD Radeon Graphics, pci bus id: 0000:12:00.0) with AMDGPU version : gfx1036. The supported AMDGPU versions are gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942.
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 14288687369984854945
xla_global_id: -1
]
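The garbled "gfx1030gfx1100" in the log is consistent with the version list being joined without a separator between the first two entries. A minimal Python sketch of that failure mode (illustrative only; the actual bug is in TensorFlow's C++ gpu_device.cc, and these function names are hypothetical):

```python
# Hypothetical sketch of the suspected bug: building the "supported
# versions" message drops the separator between the first two entries.
supported = ["gfx1030", "gfx1100", "gfx900", "gfx906", "gfx908",
             "gfx90a", "gfx940", "gfx941", "gfx942"]

def broken_join(versions):
    # Bug: the first two entries are concatenated directly...
    msg = versions[0] + versions[1]
    # ...and only the remaining entries get a ", " separator.
    for v in versions[2:]:
        msg += ", " + v
    return msg

def fixed_join(versions):
    # Correct: separate every entry with ", ".
    return ", ".join(versions)

print(broken_join(supported))
# gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942
print(fixed_join(supported))
# gfx1030, gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942
```

The broken output matches the log line above, which is why the report suggests a ',' was lost between "gfx1030" and "gfx1100".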
@briansp2020

This is already fixed in the code, but they seem to take forever to release the updated binary. If you are OK with building from source, it should work; this script should give you some idea of how to compile it. Also, the latest Docker image has the fix, so if you are OK with using a Docker container, try the ROCm 6.1 images from https://hub.docker.com/r/rocm/tensorflow/tags

@JMaravalhasSilva

This has been an issue for many months now... See #2410. If what @briansp2020 has said about the Docker image being fixed is accurate, it's quite baffling that they didn't bother to update the package on pypi...

Still, if you do not want to use the Docker image, there is an alternative to compiling TensorFlow yourself: you can download nightly wheels from http://ml-ci.amd.com:21096/job/tensorflow/job/release-rocmfork-r214-rocm-enhanced/job/release-build-whl/. This was mentioned by jayfurmanek in #2410, and it worked quite well for me.

@vivaaprimavera

This issue is still present in ROCm 6.1.2:

2024-07-21 22:16:52.680470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2266] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600, pci bus id: 0000:03:00.0) with AMDGPU version : gfx1030. The supported AMDGPU versions are gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942.

TensorFlow ignores the gfx1030 GPU.

tensorflow_rocm-2.14.0.600


Agent 2
  Name:            gfx1030
  Uuid:            GPU-XX
  Marketing Name:  AMD Radeon RX 6600
  Vendor Name:     AMD
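For reference, the gfx name can be pulled out of `rocminfo`-style agent output with a few lines of plain Python (an illustrative parsing sketch, not an official tool; the sample text is the agent block quoted above):

```python
import re

# Sample of the rocminfo agent block quoted above.
rocminfo_output = """\
Agent 2
  Name:            gfx1030
  Uuid:            GPU-XX
  Marketing Name:  AMD Radeon RX 6600
  Vendor Name:     AMD
"""

def gfx_names(text):
    # Collect every "Name: gfxNNNN" field; "Marketing Name" and
    # "Vendor Name" lines do not match because the line must start
    # (after whitespace) with "Name:".
    return re.findall(r"^\s*Name:\s*(gfx\w+)", text, flags=re.MULTILINE)

print(gfx_names(rocminfo_output))  # → ['gfx1030']
```

Comparing this list against the versions in TensorFlow's log message shows whether the card should have been accepted.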

@Eskander

Eskander commented Sep 7, 2024

I think they may have given up on the pypi package, but instructions on this repo were not updated and the change was poorly communicated (no surprises here). According to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/tensorflow-install.html:

As of ROCm 6.1, tensorflow-rocm packages are found at https://repo.radeon.com/rocm/manylinux. Prior to ROCm 6.1, packages were found at https://pypi.org/project/tensorflow-rocm.

@JMaravalhasSilva

JMaravalhasSilva commented Sep 8, 2024

I confirm what @Eskander is saying. They dropped the pypi package, and the pypi page has no mention of that. However, I would currently advise against installing tensorflow-rocm on your system - not because of TensorFlow per se, but because of ROCm itself (assuming you are installing ROCm on your system).

ROCm currently breaks my system. I'm on a fresh install of Ubuntu 24.04.1 with the iGPU disabled in the BIOS (I have a Ryzen 7950X). I attempted to install via amdgpu with DKMS, then uninstalled and reinstalled with --no-dkms, with no luck either way. GNOME implodes the moment you reach the login screen - I could only log in, watch a bunch of glitching, open a terminal, and uninstall.

Additionally, if you actually check their repos, there is currently no TensorFlow build for Python 3.12, which Ubuntu 24.04 now ships by default... So even if you could get ROCm working, tensorflow-rocm for Ubuntu LTS is currently broken, despite AMD claiming support for it...
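A quick sanity check before hunting for a wheel (plain Python, illustrative only; the supported-version set below is an assumption drawn from this thread, where no Python 3.12 build exists yet, not from official AMD documentation):

```python
import sys

# Assumption from this thread: the nightly tensorflow-rocm wheels are
# built for Python 3.9-3.11, with no 3.12 build available.
supported = {(3, 9), (3, 10), (3, 11)}

def wheel_available(version_info=sys.version_info):
    # Compare the interpreter's (major, minor) against the wheel matrix.
    return version_info[:2] in supported

if wheel_available():
    print("Interpreter version looks compatible with the wheels.")
else:
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} "
          "has no matching tensorflow-rocm wheel in this repo.")
```

On Ubuntu 24.04's default Python 3.12 this reports no matching wheel, which is the situation described above.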

Anyways, the Docker version seems to work perfectly fine with my 7900 XTX, so I believe this particular issue has been solved and can now be closed.

Lastly, if you are running Fedora, I hear they are now shipping with ROCm 6 installed by default. You'll still have the Python version issue, but maybe you'll have better luck there.
