tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure) #2374

jin-eld · 2024-01-21T18:49:08Z

Issue type

Build/Install

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

tensorflow-rocm-2.13.0.570

Custom code

Yes

OS platform and distribution

Fedora 39

Mobile device

No response

Python version

3.11.7

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I am trying to install tensorflow-rocm on Fedora 39 and have run into issues. Fedora 39 provides ROCm 5.7 packages in its repositories, so I picked tensorflow-rocm-2.13.0.570, which the compatibility table lists as the version supporting ROCm 5.7.

Since Fedora 39 ships Python 3.12.1, the setup had to be done in a Python 3.11.7 virtual environment. For the record, PyTorch 2.1.2+rocm5.6 works in the same venv and detects the GPU, so I think the ROCm libraries shipped by Fedora are OK.

After pip installing tensorflow_rocm-2.13.0.570 the following happens upon module import:

Python 3.11.7 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/__init__.py", line 36, in <module>
    from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
    self_check.preload_check()
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: librccl.so.1: cannot open shared object file: No such file or directory

There is indeed no librccl.so.1 on the system, and apparently no package provides it.
The tensorflow wheel does ship venvs/311_generic/lib/python3.11/site-packages/tensorflow/include/external/local_config_rocm/rocm/rocm/lib/librccl.so (torch also comes with its own librccl.so), but for some reason tensorflow-rocm tries to load librccl.so.1.
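For anyone who wants to check the same thing on their system, here is a minimal sketch. It asks the dynamic loader whether a given soname can be loaded at all, and lists files matching librccl.so* under a directory tree such as the venv's site-packages. The helper names `can_load` and `find_bundled` are made up for illustration and are not part of TensorFlow or ROCm.

```python
import ctypes
import sys
from pathlib import Path


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can find and load `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False


def find_bundled(root: str, pattern: str = "librccl.so*") -> list[str]:
    """List files under `root` whose name matches `pattern`, sorted."""
    return sorted(str(p) for p in Path(root).rglob(pattern))


if __name__ == "__main__":
    print("librccl.so.1 loadable:", can_load("librccl.so.1"))
    # Pass one or more directories to search, e.g. the venv's site-packages.
    for root in sys.argv[1:]:
        for path in find_bundled(root):
            print(path)
```

The same `can_load` check should work for other sonames mentioned in this report, for example libamdhip64.so.6.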

As a workaround I tried to create a symlink librccl.so.1 -> librccl.so, but this alone did not help. Finally, I also exported that directory via LD_LIBRARY_PATH before starting Python, which allowed me to import the module:

>>> import tensorflow
2024-01-21 19:18:41.965936: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
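The workaround described above can be sketched roughly as follows. This is only an illustration of the two steps (symlink plus LD_LIBRARY_PATH), not a recommended fix; the helper names `ensure_soname_link` and `env_with_lib_dir` are hypothetical, and the library directory has to be supplied by the caller.

```python
import os
from pathlib import Path


def ensure_soname_link(lib_dir: str, target: str = "librccl.so",
                       soname: str = "librccl.so.1") -> Path:
    """Create lib_dir/soname as a relative symlink to `target` if it is missing."""
    link = Path(lib_dir) / soname
    # lexists() is also True for a dangling symlink, so we never overwrite.
    if not os.path.lexists(link):
        os.symlink(target, link)
    return link


def env_with_lib_dir(lib_dir: str, base_env=None) -> dict:
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    old = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + old if old else "")
    return env
```

The dynamic loader reads LD_LIBRARY_PATH when a process starts, so the returned environment has to be handed to a new Python process, for example through the `env` argument of `subprocess.run`, rather than assigned to `os.environ` in an interpreter that is already running.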

However, any TensorFlow operation then fails with a std::bad_alloc error:

>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1535, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc
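Since the traceback ends in TF_ListPhysicalDevices, a smaller probe might be enough to trigger the same failure. The sketch below assumes that calling `tf.config.list_physical_devices()` goes through the same device initialization; I have not verified that it reproduces the error. `probe_devices` is a hypothetical helper name.

```python
def probe_devices() -> str:
    """Try to import TensorFlow and list devices; return a one-line status."""
    try:
        import tensorflow as tf
    except ImportError as exc:
        # Covers the missing librccl.so.1 case shown earlier.
        return f"import failed: {exc}"
    try:
        devices = tf.config.list_physical_devices()
    except MemoryError as exc:
        # std::bad_alloc surfaces in Python as MemoryError.
        return f"device enumeration failed: {exc}"
    return f"devices: {[d.name for d in devices]}"


if __name__ == "__main__":
    print(probe_devices())
```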

I tried downgrading to tensorflow_rocm-2.12.0.560, but it behaves exactly the same as 2.13.

For completeness, I also tried 2.14 as a "negative" test; as expected, it failed to find libamdhip64.so.6, because Fedora 39 does not ship ROCm 6.0.0 yet.

Standalone code to reproduce the issue

# this triggers "ImportError: librccl.so.1: cannot open shared object file: No such file or directory"
>>> import tensorflow

# then, once I create the library symlink and add LD_LIBRARY_PATH
>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1451, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc



Relevant log output

No response


jin-eld changed the title ~~tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to laod wrong librccl.so, std::bad_alloc failure)~~ tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure) Jan 22, 2024
