Nightly built wheels from CI doesn't work on Navi 31 #2142

Open
evshiron opened this issue Jun 29, 2023 · 10 comments

@evshiron

evshiron commented Jun 29, 2023

Issue Type

Bug

Have you reproduced the bug with TF nightly?

Yes

Source

source

Tensorflow Version

tf_nightly_rocm-2.14.0dev20230628.550-cp310-cp310-manylinux2014_x86_64.whl

Custom Code

Yes

OS Platform and Distribution

Ubuntu 22.04.2

Tested in ROCm 5.5.1 host and ROCm 5.5.0 container.

Mobile device

No response

Python version

3.10

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current Behaviour?

The sample script doesn't work when using tf-nightly-rocm from CI, which should have Navi 31 support out of the box for the time being.

The log for a previously successful run can be found here: ROCm/ROCm#1880 (comment).

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

features = np.random.randn(10000,25)
targets = np.random.randn(10000)

model = tf.keras.Sequential([
     tf.keras.layers.Dense(1)
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.MeanSquaredError())

model.fit(x=features, y=targets)

Relevant log output

2023-06-29 04:53:06.731832: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-06-29 04:53:06.755074: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-29 04:53:07.345334: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.352729: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.352755: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.353192: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.353228: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.353254: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.353322: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.353346: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.353374: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:809] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-06-29 04:53:07.353387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1816] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21420 MB memory:  -> device: 0, name: Radeon RX 7900 XTX, pci bus id: 0000:03:00.0
2023-06-29 04:53:07.479918: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:508] ROCm Fusion is enabled.
[SAME LOG TRIMMED]
2023-06-29 04:53:07.677340: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:30] 'hipModuleLoadData(&module, data)' failed with 'hipErrorNoBinaryForGpu'

2023-06-29 04:53:07.677348: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:30] 'hipModuleGetFunction(&function, module, kernel_name)' failed with 'hipErrorInvalidHandle'

2023-06-29 04:53:07.677353: W tensorflow/core/framework/op_kernel.cc:1816] INTERNAL: 'hipModuleLaunchKernel( function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<hipStream_t>(stream), params, nullptr)' failed with 'hipErrorInvalidHandle'
/usr/local/lib/python3.10/dist-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer GlorotUniform is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
  warnings.warn(
2023-06-29 04:53:07.678085: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:30] 'hipModuleLoadData(&module, data)' failed with 'hipErrorNoBinaryForGpu'

2023-06-29 04:53:07.678093: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:30] 'hipModuleGetFunction(&function, module, kernel_name)' failed with 'hipErrorInvalidHandle'

2023-06-29 04:53:07.678098: W tensorflow/core/framework/op_kernel.cc:1816] INTERNAL: 'hipModuleLaunchKernel( function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<hipStream_t>(stream), params, nullptr)' failed with 'hipErrorInvalidHandle'
Traceback (most recent call last):
  File "/root/test.py", line 14, in <module>
    model.fit(x=features, y=targets)
  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 52, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.InternalError: in user code:

    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1384, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1367, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1348, in run_step  **
        outputs = model.train_step(data)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py", line 1125, in train_step
        y_pred = self(x, training=True)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.10/dist-packages/keras/src/backend.py", line 2102, in random_uniform
        return tf.random.stateless_uniform(

    InternalError: Exception encountered when calling layer 'sequential' (type Sequential).
    
    {{function_node __wrapped__Sub_device_/job:localhost/replica:0/task:0/device:GPU:0}} 'hipModuleLaunchKernel( function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<hipStream_t>(stream), params, nullptr)' failed with 'hipErrorInvalidHandle' [Op:Sub] name: 
    
    Call arguments received by layer 'sequential' (type Sequential):
      • inputs=tf.Tensor(shape=(None, 25), dtype=float32)
      • training=True
      • mask=None
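
For anyone triaging a similar log: `hipErrorNoBinaryForGpu` is the HIP error for "no kernel image is available for this device", and the `hipErrorInvalidHandle` warnings right after it appear to come from using the module that failed to load. Below is a minimal sketch for pulling the failing HIP calls out of a saved log, assuming the log format shown above. The function name `summarize_hip_failures` is hypothetical and is not part of TensorFlow or ROCm.

```python
import re
from collections import Counter

# Matches the pattern seen in the log above:
#   'hipSomeCall(...)' failed with 'hipErrorName'
_HIP_FAILURE = re.compile(r"'(hip\w+)\(.*?\)' failed with '(hipError\w+)'")


def summarize_hip_failures(log_text):
    """Count (HIP call, error name) pairs found in a TensorFlow ROCm log."""
    counts = Counter()
    for line in log_text.splitlines():
        for call, error in _HIP_FAILURE.findall(line):
            counts[(call, error)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "W tf_gpu_runtime_wrappers.cc:30] 'hipModuleLoadData(&module, data)' "
        "failed with 'hipErrorNoBinaryForGpu'\n"
        "W tf_gpu_runtime_wrappers.cc:30] 'hipModuleGetFunction(&function, "
        "module, kernel_name)' failed with 'hipErrorInvalidHandle'\n"
    )
    for (call, error), n in summarize_hip_failures(sample).items():
        print(f"{call}: {error} x{n}")
```

If `hipModuleLoadData` with `hipErrorNoBinaryForGpu` is the first entry, the wheel most likely has no code objects for the GPU's architecture, and the later errors are follow-on failures.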
@briansp2020

Made some progress building TF for the 7900XTX: #2191
I was able to run a simple example, but it leaves my computer in an unstable state, and eventually the GUI crashes.

@evshiron
Author

evshiron commented Aug 13, 2023

We have a thread for it at ROCm/ROCm#1880, and after #2101 TensorFlow should work to some extent. It's even integrated into their CI already. But your PR deals with a missing comma, so I don't know where that comma went.

@briansp2020

It looks like #2101 is already merged. It's weird that the comma is still missing in the main source.

With the comma added, my build is using the 7900XTX. I still see instability here and there. Hopefully, they will release the working code soon.

@briansp2020

Is anyone else having issues building tf from the source? For me, the build usually fails 2 to 3 times, usually after consuming a lot of memory. However, if I run the build script again, eventually it completes and generates the whl file.

I'm now trying to build using a ROCm 5.7 environment and the problem seems to be much worse. I retried at least 10 times and it still fails, usually with the following error. But if I retry, it compiles for a while again and then crashes.

INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /root/tensorflow-upstream/tensorflow/compiler/xla/stream_executor/rocm/BUILD:532:11: Compiling tensorflow/compiler/xla/stream_executor/rocm/rocm_helpers.cu.cc failed: undeclared inclusion(s) in rule '//tensorflow/compiler/xla/stream_executor/rocm:rocm_helpers':
this rule is missing dependency declarations for the following files included by 'tensorflow/compiler/xla/stream_executor/rocm/rocm_helpers.cu.cc':
'/opt/rocm/llvm/lib/clang/17.0.0/include/__clang_hip_runtime_wrapper.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/cuda_wrappers/cmath'
'/opt/rocm/llvm/lib/clang/17.0.0/include/stddef.h'
'/opt/rocm-5.7.0/include/hip/hip_version.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/__clang_hip_libdevice_declares.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/__clang_hip_math.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/cuda_wrappers/algorithm'
'/opt/rocm/llvm/lib/clang/17.0.0/include/cuda_wrappers/new'
'/opt/rocm/llvm/lib/clang/17.0.0/include/limits.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/stdint.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/__clang_hip_stdlib.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/__clang_cuda_math_forward_declares.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/__clang_hip_cmath.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/__clang_cuda_complex_builtins.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/cuda_wrappers/complex'
'/opt/rocm/llvm/lib/clang/17.0.0/include/__stddef_max_align_t.h'
'/opt/rocm/llvm/lib/clang/17.0.0/include/stdarg.h'
'/opt/rocm-5.7.0/include/hip/hip_runtime.h'
'/opt/rocm-5.7.0/include/hip/hip_common.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_runtime.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_common.h'
'/opt/rocm-5.7.0/include/hip/hip_runtime_api.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/host_defines.h'
'/opt/rocm-5.7.0/include/hip/driver_types.h'
'/opt/rocm-5.7.0/include/hip/texture_types.h'
'/opt/rocm-5.7.0/include/hip/channel_descriptor.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_channel_descriptor.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_vector_types.h'
'/opt/rocm-5.7.0/include/hip/surface_types.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_runtime_pt_api.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/hip_ldg.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_atomic.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_device_functions.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/math_fwd.h'
'/opt/rocm-5.7.0/include/hip/hip_vector_types.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/device_library_decls.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_warp_functions.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_unsafe_atomics.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_surface_functions.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/texture_fetch_functions.h'
'/opt/rocm-5.7.0/include/hip/hip_texture_types.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/ockl_image.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/texture_indirect_functions.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_math_functions.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/hip_fp16_math_fwd.h'
'/opt/rocm-5.7.0/include/hip/library_types.h'
'/opt/rocm-5.7.0/include/hip/hip_bfloat16.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_bfloat16.h'
'/opt/rocm-5.7.0/include/hip/hip_fp16.h'
'/opt/rocm-5.7.0/include/hip/amd_detail/amd_hip_fp16.h'
clang++: warning: argument unused during compilation: '-fcuda-flush-denormals-to-zero' [-Wunused-command-line-argument]
Target //tensorflow/tools/pip_package:build_pip_package failed to build

@i-chaochen

i-chaochen commented Sep 18, 2023

> Is anyone else having issues building tf from the source? For me, the build usually fails 2 to 3 times, usually after consuming a lot of memory. However, if I run the build script again, eventually it completes and generates the whl file.
>
> I'm now trying to build using a ROCm 5.7 environment and the problem seems to be much worse. I retried at least 10 times and it still fails, usually with the following error. But if I retry, it compiles for a while again and then crashes.
>
> [SAME BUILD LOG AS ABOVE TRIMMED]

It seems the build couldn't find the ROCm headers at all. I suspect it's related to the LLVM version, though I can't guarantee that.

If you want to try, we recommend building from source with rocm/tensorflow-build:
https://hub.docker.com/repository/docker/rocm/tensorflow-build

@briansp2020

To use ROCm 5.7, I changed the ROCM_INSTALL_DIR in build_rocm_python3 to point to /opt/rocm. Making it point to /opt/rocm-5.7.0 explicitly seems to fix the issue. Not sure why though, since /opt/rocm is a symlink to the exact same location. 🤷‍♂️
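
One possible explanation, offered as a guess rather than a confirmed cause: Bazel may compare the header paths reported by the compiler against the declared include directories as plain strings, so `/opt/rocm/...` and `/opt/rocm-5.7.0/...` would count as different locations even though one is a symlink to the other. The build log above does mix both spellings. Here is a small sketch to check what a symlinked install directory resolves to. The helper name `resolve_install_dir` is hypothetical.

```python
import os


def resolve_install_dir(path):
    """Return (resolved_path, is_symlink) for a ROCm install directory."""
    return os.path.realpath(path), os.path.islink(path)


if __name__ == "__main__":
    # /opt/rocm is the path mentioned above; adjust it for your system.
    resolved, is_link = resolve_install_dir("/opt/rocm")
    print(f"/opt/rocm -> {resolved} (symlink: {is_link})")
```

Passing the resolved path to the build script keeps a single spelling of the directory throughout, which matches the workaround described above.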

@i-chaochen

> To use ROCm 5.7, I changed the ROCM_INSTALL_DIR in build_rocm_python3 to point to /opt/rocm. Making it point to /opt/rocm-5.7.0 explicitly seems to fix the issue. Not sure why though, since /opt/rocm is a symlink to the exact same location. 🤷‍♂️

I'm not sure what your environment is, but could you try it in rocm/tensorflow-build? Thanks

@briansp2020

@i-chaochen
Can you tell me how to build using rocm/tensorflow-build? I usually build using build_rocm_python3 script in tensorflow-upstream repo. It's not clear to me how to build for rocm using rocm/tensorflow-build.

Thanks!

@i-chaochen

i-chaochen commented Sep 18, 2023

> @i-chaochen Can you tell me how to build using rocm/tensorflow-build? I usually build using build_rocm_python3 script in tensorflow-upstream repo. It's not clear to me how to build for rocm using rocm/tensorflow-build.
>
> Thanks!

If you have Docker installed, you can follow the instructions from our repo: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream#tensorflow-rocm-port

$ alias drun='sudo docker run \
      -it \
      --network=host \
      --device=/dev/kfd \
      --device=/dev/dri \
      --ipc=host \
      --shm-size 16G \
      --group-add video \
      --cap-add=SYS_PTRACE \
      --security-opt seccomp=unconfined \
      -v $HOME/dockerx:/dockerx'

$ drun rocm/tensorflow-build:latest-python3.9-rocm5.6.0    # for example I am using rocm5.6.0

Within the launched container, you can git clone our TensorFlow repo, set the ROCm version and directory path in the build_rocm_python3 script to match the container, and build it.

$ git clone https://github.com/ROCmSoftwarePlatform/tensorflow-upstream.git
$ cd tensorflow-upstream/
$ ./build_rocm_python3  

PS: Because I am using rocm5.6.0 as an example, it's the same as in this script. If you're using the rocm5.7.0 Docker container, you need to change it to rocm-5.7.0 as well, e.g., /opt/rocm-5.7.0 in https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/build_rocm_python3#L33
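
The path change described above can also be scripted. This is a minimal sketch, assuming the script sets the directory with a plain `ROCM_INSTALL_DIR=...` assignment. The variable name comes from this thread, but the exact line format is an assumption, so check the linked line first. The function name `set_rocm_install_dir` is hypothetical.

```python
import re

# Assumed form of the line: optional indentation, optional `export`,
# then ROCM_INSTALL_DIR=<anything up to the end of the line>.
_ASSIGNMENT = re.compile(r"^([ \t]*(?:export[ \t]+)?ROCM_INSTALL_DIR=).*$",
                         re.MULTILINE)


def set_rocm_install_dir(script_text, new_dir):
    """Point every plain ROCM_INSTALL_DIR=... assignment at new_dir."""
    updated, count = _ASSIGNMENT.subn(lambda m: m.group(1) + new_dir,
                                      script_text)
    if count == 0:
        raise ValueError("no ROCM_INSTALL_DIR assignment found")
    return updated


if __name__ == "__main__":
    sample = "ROCM_INSTALL_DIR=/opt/rocm-5.6.0\n"
    print(set_rocm_install_dir(sample, "/opt/rocm-5.7.0"), end="")
```

The function raises instead of silently returning the input, so a script with a different layout is noticed rather than built against the wrong ROCm version.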

@johnnynunez

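#!/bin/bash
# bash rather than sh: `source` below is not available in POSIX sh (dash on Ubuntu)

set -ex

# refresh the package index before installing anything
sudo apt-get update && sudo apt-get install -y patchelf openjdk-8-jdk openjdk-8-jre unzip wget git libstdc++-12-dev
python3 -m venv venv

source venv/bin/activate
# upgrade pip inside the venv rather than the system interpreter
python3 -m pip install --upgrade pip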

export WORKDIR=$(pwd)
export PYTHON_BIN_PATH=$WORKDIR/venv/bin/python3
export PYTHON_LIB_PATH=$WORKDIR/venv/lib/python3.11/site-packages
export ROCM_PATH=/opt/rocm-5.7.0
export TF_NEED_ROCM=1
export GPU_DEVICE_TARGETS=gfx1100

# build tensorflow-rocm

if [ -d "tensorflow-upstream" ]; then
    echo "tensorflow-upstream folder exists. Skipping git clone."
else
    git clone --recursive https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
fi
cd tensorflow-upstream

# install bazel in venv
curl -L https://github.com/bazelbuild/bazelisk/releases/download/v1.18.0/bazelisk-linux-amd64 -o $WORKDIR/venv/bin/bazel \
  && chmod +x $WORKDIR/venv/bin/bazel

bazel clean --expunge

# declare build targets
printf '%s\n' ${GPU_DEVICE_TARGETS} | sudo tee -a $ROCM_PATH/bin/target.lst

# https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/243d98f63f538787b438a15ec0e7cc2f5f9c2d10/tensorflow/tools/ci_build/Dockerfile.rocm#L108
sudo touch $ROCM_PATH/.info/version

pip install setuptools wheel numpy packaging requests

bash build_rocm_python3

cd ..
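
A side note on the `tee -a` step: it appends on every run, so rerunning the script leaves repeated `gfx1100` entries in `target.lst`. Whether duplicates cause any harm is not something I have verified. Below is a minimal sketch of an append that skips targets already listed. `add_gpu_target` is a hypothetical helper, and writing the result back under `$ROCM_PATH` would still need root.

```python
def add_gpu_target(existing_text, target):
    """Return target.lst contents with `target` appended unless already listed."""
    listed = {line.strip() for line in existing_text.splitlines()}
    if target in listed:
        return existing_text
    # Make sure the new entry starts on its own line.
    if existing_text and not existing_text.endswith("\n"):
        existing_text += "\n"
    return existing_text + target + "\n"


if __name__ == "__main__":
    contents = add_gpu_target("", "gfx1100")
    contents = add_gpu_target(contents, "gfx1100")  # second run changes nothing
    print(contents, end="")
```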

Secondly, modify the RESOURCE_OPTION line to fit your machine. In my case that is 32 GB of RAM, 16 cores, and 32 threads.

- RESOURCE_OPTION="--local_ram_resources=60000 --local_cpu_resources=35 --jobs=70"
+ RESOURCE_OPTION="--local_ram_resources=28000 --local_cpu_resources=16 --jobs=32"
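
The adjusted values follow a simple pattern: RAM a few GB below the machine total, CPU resources equal to the core count, and jobs equal to the thread count. Here is a minimal sketch that builds the same string from those three inputs. The 4000 MB headroom is an assumption read off this one example (a 32 GB machine counted as 32000 MB, with 28000 MB given to Bazel), not a documented rule, and `resource_option` is a hypothetical helper.

```python
def resource_option(total_ram_mb, cores, threads, headroom_mb=4000):
    """Build a RESOURCE_OPTION string for Bazel from machine specs."""
    # Leave some RAM for the OS and other processes (assumed headroom).
    ram = max(total_ram_mb - headroom_mb, 1)
    return (f"--local_ram_resources={ram} "
            f"--local_cpu_resources={cores} "
            f"--jobs={threads}")


if __name__ == "__main__":
    # 32 GB of RAM, 16 cores, 32 threads, as in the comment above.
    print(resource_option(32000, 16, 32))
```

If the build still dies after consuming a lot of memory, raising `headroom_mb` or lowering `threads` is the obvious thing to try first.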
