tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure) #2374

Open
jin-eld opened this issue Jan 21, 2024 · 0 comments

jin-eld commented Jan 21, 2024

Issue type

Build/Install

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

tensorflow-rocm-2.13.0.570

Custom code

Yes

OS platform and distribution

Fedora 39

Mobile device

No response

Python version

3.11.7

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I am trying to install tensorflow-rocm on Fedora 39 and ran into issues. Fedora 39 provides ROCm 5.7 packages in its repo, so I picked tensorflow-rocm-2.13.0.570, which is marked as the version supporting ROCm 5.7 in the compatibility table.

Since Fedora 39 comes with Python 3.12.1, the setup had to be done in a Python 3.11.7 virtual environment. For the record, PyTorch 2.1.2+rocm5.6 works in the same venv and detects the GPU, so I think the ROCm libraries shipped by Fedora are OK.

After pip installing tensorflow_rocm-2.13.0.570 the following happens upon module import:

Python 3.11.7 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/__init__.py", line 36, in <module>
    from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
    self_check.preload_check()
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: librccl.so.1: cannot open shared object file: No such file or directory

There is indeed no librccl.so.1 on the system, and apparently no package provides it.
The tensorflow wheel does ship venvs/311_generic/lib/python3.11/site-packages/tensorflow/include/external/local_config_rocm/rocm/rocm/lib/librccl.so (torch also comes with its own librccl.so, by the way), but for some reason tensorflow-rocm is trying to load librccl.so.1.
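For anyone trying to reproduce this, the search for bundled copies of the library can be scripted. The following is only a minimal sketch: the helper name `find_shared_libs` is hypothetical, and it assumes the wheels were installed into the `purelib` site-packages directory of the active venv.

```python
from pathlib import Path


def find_shared_libs(root, prefix="librccl.so"):
    """Return sorted paths under `root` whose file name starts with `prefix`."""
    root = Path(root)
    # Matches librccl.so as well as versioned names such as librccl.so.1
    return sorted(
        p for p in root.rglob(prefix + "*") if p.is_file() or p.is_symlink()
    )


if __name__ == "__main__":
    import sysconfig

    # Assumption: the packages live in the venv's purelib site-packages directory.
    site_packages = sysconfig.get_paths()["purelib"]
    for path in find_shared_libs(site_packages):
        print(path)
```

Run inside the venv, this lists every librccl.so* file that the installed wheels carry, which makes it easy to see whether any of them is actually named librccl.so.1.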

As a workaround I tried to create a symlink librccl.so.1 -> librccl.so, but this did not help. Finally, I exported the directory via LD_LIBRARY_PATH before starting Python, which allowed me to import the module:

>>> import tensorflow
2024-01-21 19:18:41.965936: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
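The LD_LIBRARY_PATH workaround can also be scripted so that it is easy to repeat. This is a sketch under stated assumptions, not a recommended fix: the helper names `env_with_library_dir` and `try_import_tensorflow` are hypothetical, and `lib_dir` stands for whichever directory holds the librccl.so.1 symlink. The import runs in a fresh interpreter because the dynamic loader reads LD_LIBRARY_PATH when the process starts, so changing `os.environ` inside an already running Python does not affect later library lookups.

```python
import os
import subprocess
import sys


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with `lib_dir` prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env


def try_import_tensorflow(lib_dir):
    """Run `import tensorflow` in a fresh interpreter; return (exit code, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", "import tensorflow"],
        env=env_with_library_dir(lib_dir),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr
```

A non-zero exit code together with the captured stderr shows whether the ImportError is still raised with the given directory on the library path.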

However, attempting to run any TF-related command results in a std::bad_alloc error:

>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1535, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc

I tried downgrading to tensorflow_rocm-2.12.0.560, but it behaves exactly the same as 2.13.

For the sake of completeness, I tried 2.14 as a "negative" test; as expected, it did not find libamdhip64.so.6, because Fedora 39 does not have ROCm 6.0.0 yet.
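Whether the dynamic loader can resolve a given soname can be checked without importing TensorFlow at all. The snippet below is a minimal sketch using `ctypes` from the standard library; the helper name `can_load` is hypothetical, and the two sonames are the ones from the error messages in this report.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open `soname`, else False."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library cannot be found or opened.
        return False
    return True


if __name__ == "__main__":
    # Sonames taken from the ImportError messages in this report.
    for name in ("librccl.so.1", "libamdhip64.so.6"):
        print(name, "loadable" if can_load(name) else "NOT loadable")
```

The check honors LD_LIBRARY_PATH as it was set when the interpreter started, so it can be run with and without the workaround to compare results.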

Standalone code to reproduce the issue

# this triggers "ImportError: librccl.so.1: cannot open shared object file: No such file or directory"
>>> import tensorflow

# then, once I create the library symlink and add LD_LIBRARY_PATH
>>> print(tensorflow.reduce_sum(tensorflow.random.normal([1000, 1000])))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/mnt/ssd/xxx/venvs/311_generic/lib64/python3.11/site-packages/tensorflow/python/eager/context.py", line 1451, in _initialize_physical_devices
    devs = pywrap_tfe.TF_ListPhysicalDevices()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError: std::bad_alloc


Relevant log output

No response
jin-eld changed the title from "tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to laod wrong librccl.so, std::bad_alloc failure)" to "tensorflow-rocm 2.12 and 2.13 fails on Fedora 39 (attempt to load wrong librccl.so, std::bad_alloc failure)" on Jan 22, 2024