Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: Scan test failure in rocThrust caused by changes in LLVM for the ROCm 6.2.1 release #173

Closed
Beanavil opened this issue Oct 1, 2024 · 3 comments

Comments

@Beanavil
Copy link

Beanavil commented Oct 1, 2024

Problem Description

One of rocThrust's tests for the scan algorithm is failing right after the ROCm 6.2.1 release. During debugging, it was noticed that modifying the addition operator of the FixedVector type used in the test like so:

    __host__ __device__
    FixedVector operator+(const FixedVector& bs) const
    {
        FixedVector output;
+     #pragma unroll 1
        for(unsigned int i = 0; i < N; i++)
            output.data[i] = data[i] + bs.data[i];
        return output;
    }

fixes the test.

Further investigation tracked the origin of the issue to this particular commit. Manually building this repo checked out at said commit and compiling with the clang++ executable generated reproduces the issue, while checking out to a previous commit and repeating the same process yields a successful execution.

This test failure is confirmed to happen on Vega20 (gfx906), MI100 (gfx908), V620 (gfx1030), and rx7900xtx (gfx1100).

Operating System

Ubuntu 20.04.6 LTS

CPU

AMD EPYC 7713P 64-Core Processor

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

llvm-project

Steps to Reproduce

  1. Clone this repo
  2. Checkout to commit 6598eee547c545449daaf57323d55c63d7718106
  3. Configure, build and install
cmake -G Ninja -S llvm -B build -DLLVM_INSTALL_UTILS=ON -DCMAKE_INSTALL_PREFIX=/path/to/llvm/install -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_TESTS=False  -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt"
ninja -C build install
  1. Clone rocThrust
  2. Configure & build rocThrust scan test
cmake -G Ninja -D CMAKE_CXX_FLAGS=" -D_GLIBCXX_ASSERTIONS=ON" -D CMAKE_CXX_COMPILER=/path/to/llvm/install/bin/clang++ -D CMAKE_BUILD_TYPE=Release -D BUILD_TEST=ON -D BUILD_EXAMPLES=OFF -D BUILD_BENCHMARKS=OFF -D AMDGPU_TARGETS="gfx900;gfx906;gfx908;gfx1030;gfx90a" -D AMDGPU_TEST_TARGETS="gfx900;gfx906;gfx908;gfx1030;gfx90a" -D RNG_SEED_COUNT=3 -D PRNG_SEEDS="0, 1000" -S ~/rocThrust/ -B ~/rocThrust/build  -DROCPRIM_ROOT=./rocPRIM/ -D DOWNLOAD_ROCPRIM=ON
cmake --build build --target test_thrust_scan
  1. Run test
./build/testing/test_thrust_scan # this should fail
  1. Checkout to commit 164a5a0d9533300baf49c22cf36d1609a8cfe446
  2. Re-configure, build and install
cmake -G Ninja -S llvm -B build -DLLVM_INSTALL_UTILS=ON -DCMAKE_INSTALL_PREFIX=/path/to/llvm/install -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_TESTS=False  -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt"
ninja -C build install
  1. Re-configure & build rocThrust scan test
rm -rf build
cmake -G Ninja -D CMAKE_CXX_FLAGS=" -D_GLIBCXX_ASSERTIONS=ON" -D CMAKE_CXX_COMPILER=/path/to/llvm/install/bin/clang++ -D CMAKE_BUILD_TYPE=Release -D BUILD_TEST=ON -D BUILD_EXAMPLES=OFF -D BUILD_BENCHMARKS=OFF -D AMDGPU_TARGETS="gfx900;gfx906;gfx908;gfx1030;gfx90a" -D AMDGPU_TEST_TARGETS="gfx900;gfx906;gfx908;gfx1030;gfx90a" -D RNG_SEED_COUNT=3 -D PRNG_SEEDS="0, 1000" -S ~/rocThrust/ -B ~/rocThrust/build  -DROCPRIM_ROOT=./rocPRIM/ -D DOWNLOAD_ROCPRIM=ON && 
cmake --build build --target test_thrust_scan
  1. Run test
./build/testing/test_thrust_scan # this should pass

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@lamb-j
Copy link
Collaborator

lamb-j commented Oct 1, 2024

Thanks for the detailed report! @alex-t is taking a look

@alex-t
Copy link

alex-t commented Oct 25, 2024

The change was an attempt to redesign the whole control flow lowering in the AMDGPU backend. Further, I found a simpler and cleaner way to solve the problem that the change originated from.
The culprit commit is going to be reverted and for sure won't go to the next release.
So, it doesn't make sense to investigate the exact reason and make any changes to it now.

@jamesxu2
Copy link

jamesxu2 commented Jan 8, 2025

Thanks for the very clear repro steps @Beanavil and the update from @alex-t . I reran this from ROCm/llvm-project amd-staging and I don't see any test failures, so I'm closing this issue.

@jamesxu2 jamesxu2 closed this as completed Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants