How to execute a VCK5000-based mlir-tutorial? #1838

Open
shibizhao opened this issue Oct 15, 2024 · 9 comments
@shibizhao

Hi,
I am new to mlir-aie. I installed it following the documentation, and I would like to run the mlir_tutorials on a VCK5000.

First, I checked the "install/runtime_lib" directory. It contains only an "x86_64-hsa" subdirectory.

Second, I found that makefile-common targets aarch64 (I guess it is meant for the VCK190), so I would like to know how I should modify makefile-common for the VCK5000.

Thanks.

@shibizhao shibizhao changed the title How to execute a VCK5000-based mlir-tutorials? How to execute a VCK5000-based mlir-tutorial? Oct 15, 2024
@shibizhao
Author

Update: I tried modifying makefile-common as follows:

# Contains common definitions used across the Makefiles of all tutorials.

# MLIR-AIE install directory. If you have sourced utils/env_setup.sh before
# running make, the following should work to find the AIE install directory.
AIE_RUNTIME_LIB ?= $(shell realpath $(dir $(shell which aie-opt))/../runtime_lib)
AIE_INSTALL ?= $(AIE_RUNTIME_LIB)/x86_64-hsa


# VITIS related variables
VITIS_ROOT ?= $(shell realpath $(dir $(shell which vitis))/../)
VITIS_AIETOOLS_DIR ?= ${VITIS_ROOT}/aietools
VITIS_AIE_INCLUDE_DIR ?= ${VITIS_ROOT}/aietools/data/versal_prod/lib
VITIS_AIE2_INCLUDE_DIR ?= ${VITIS_ROOT}/aietools/data/aie_ml/lib


#${VITIS_ROOT}/gnu/aarch64/lin/aarch64-linux/aarch64-xilinx-linux/

# The libstdc++ version that is installed in the sysroot given above. This is
# used for include and library paths. If you built the sysroot with Vitis
# 2022.2 and PetaLinux 2022.2, libstdc++ 11.2.0 will be installed. 
# LIBCXX_VERSION ?= 3.4.30

# The following flags are passed to both AI core and host compilation for
# aiecc.py invocations.
AIECC_FLAGS += --host-target=x86_64-linux-gnu

CHESSCC_FLAGS = -f -p me -P ${VITIS_AIE_INCLUDE_DIR} -I ${VITIS_AIETOOLS_DIR}/include
CHESS_FLAGS = -P ${VITIS_AIE_INCLUDE_DIR}

# The following additional flags are only applied for host code.
AIECC_HOST_FLAGS += \
    -I$(AIE_INSTALL)/test_lib/include \
    -I${AIE_INSTALL}/xaiengine/include \
    --gcc-toolchain=/usr \
    -L$(AIE_INSTALL)/test_lib/lib -ltest_lib 

# Add the
# necessary search paths for the sysroot so clang++ can find the aarch64
# includes and libraries. Some of these shouldn't be necessary, except that
# sysroot shipped with Vitis is slightly broken, so clang can't find things
# automatically just using --gcc-toolchain
AIECC_HOST_FLAGS += \
    -I/usr/include/c++/12 \
    -I/usr/include \
    -I/usr/include/x86_64-linux-gnu/c++/12 \
    -L/usr/lib/x86_64-linux-gnu/ \
    -B/usr/lib/x86_64-linux-gnu/

I then entered the tutorial-1 directory and compiled the ELF:

(sandbox) (mlir-aie-dev) bizhao.shi@server:~/research/compiler/mlir-aie/mlir_tutorials_vck5000/tutorial-1$ make core_1_4.elf
aiecc.py -j4 aie.mlir
Found xchesscc at /rshome/software/Xilinx/Vitis/2023.2/aietools
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:01 0/1 1 Workerwarning: overriding the module target triple with pdarch-unknown-unknown-elf [-Woverride-module]
1 warning generated.
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:01 0/1 1 WorkerWarning in "../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_chess.h", line 647, column 131: ignoring attribute [[deprecated]] on class declaration
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:02 0/1 1 WorkerWarning in "../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_common.h", line 63, column 132: ignoring attribute [[deprecated]] on class declaration
Warning in "../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_common.h", line 96, column 131: ignoring attribute [[deprecated]] on class declaration
 AIE Compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:03 2/2 4 Workers

That seemed to work. But when I tried to build tutorial-1.exe, there was an error:

(sandbox) (mlir-aie-dev) bizhao.shi@server:~/research/compiler/mlir-aie/mlir_tutorials_vck5000/tutorial-1$ make tutorial-1.exe
aiecc.py -j4 --host-target=x86_64-linux-gnu aie.mlir -I/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64-hsa/test_lib/include -I/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64-hsa/xaiengine/include --gcc-toolchain=/usr -L/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64-hsa/test_lib/lib -ltest_lib  -I/usr/include/c++/12 -I/usr/include -I/usr/include/x86_64-linux-gnu/c++/12 -L/usr/lib/x86_64-linux-gnu/ -B/usr/lib/x86_64-linux-gnu/ ./test.cpp -o tutorial-1.exe
Found xchesscc at /rshome/software/Xilinx/Vitis/2023.2/aietools
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:01 0/1 1 Workerwarning: overriding the module target triple with pdarch-unknown-unknown-elf [-Woverride-module]
1 warning generated.
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:01 0/1 1 WorkerWarning in "../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_chess.h", line 647, column 131: ignoring attribute [[deprecated]] on class declaration
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:02 0/1 1 WorkerWarning in "../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_common.h", line 63, column 132: ignoring attribute [[deprecated]] on class declaration
Warning in "../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_common.h", line 96, column 131: ignoring attribute [[deprecated]] on class declaration
 AIE Compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00 0/2 4 Workersclang: error: no such file or directory: '/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64/test_lib/lib/libmemory_allocator_ion.a'
Error encountered while running: clang++ -std=c++17 --target=x86_64-linux-gnu /rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64/test_lib/lib/libmemory_allocator_ion.a -I/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64/xaiengine/include -L/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64/xaiengine/lib -L/rshome/software/Xilinx/Vitis/2023.2/aietools/lib/lnx64.o -Wl,-R/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64/xaiengine/lib -I/rshome/bizhao.shi/research/compiler/mlir-aie/mlir_tutorials_vck5000/tutorial-1/aie.mlir.prj -fuse-ld=lld -lm -lxaiengine -D__AIEARCH__=10 -I/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64-hsa/test_lib/include -I/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64-hsa/xaiengine/include --gcc-toolchain=/usr -L/rshome/bizhao.shi/research/compiler/mlir-aie/install/runtime_lib/x86_64-hsa/test_lib/lib -ltest_lib -I/usr/include/c++/12 -I/usr/include -I/usr/include/x86_64-linux-gnu/c++/12 -L/usr/lib/x86_64-linux-gnu/ -B/usr/lib/x86_64-linux-gnu/ ./test.cpp -o tutorial-1.exe
 AIE Compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00 0/2 4 Workers
make: *** [Makefile:24: tutorial-1.exe] Error 1

However, there is no subdirectory named "x86_64" in runtime_lib, only "x86_64-hsa", so I do not understand where the failing clang++ command gets that path from.
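
A quick way to confirm the mismatch described above is to list which target directories actually exist under runtime_lib and test for the one the failing command expects. The following is only a sketch: `list_runtime_targets` and `has_runtime_target` are hypothetical helper names, not part of mlir-aie, and the runtime_lib location is derived the same way makefile-common derives it.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and not part of mlir-aie.

# Print the name of every target directory found under runtime_lib.
list_runtime_targets() {
  local runtime_lib="$1"
  local d
  for d in "$runtime_lib"/*/; do
    [ -d "$d" ] || continue   # skips the unexpanded glob when nothing matches
    basename "$d"
  done
}

# Succeed only if runtime_lib contains a directory for the given target.
has_runtime_target() {
  local runtime_lib="$1" target="$2"
  [ -d "$runtime_lib/$target" ]
}

# Example usage, assuming aie-opt is on PATH (as makefile-common also assumes):
#   runtime_lib="$(realpath "$(dirname "$(which aie-opt)")/../runtime_lib")"
#   list_runtime_targets "$runtime_lib"
#   has_runtime_target "$runtime_lib" x86_64 || echo "no x86_64 runtime libs"
```

On the install described in this thread, the listing would be expected to show only x86_64-hsa, which matches the missing `x86_64/test_lib/lib/libmemory_allocator_ion.a` path in the error.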

@eddierichter-amd
Collaborator

Thanks for your questions! The mlir-tutorials came out before our VCK5000 platform. Are you able to use the programming examples that we provide? The vector_vector_add example has a Makefile that can build for either Ryzen AI or the VCK5000.

@shibizhao
Author

Thanks for your reply!!!!

I ran "make vck5000" and then "./test.elf", following the README in the vector_vector_add example.
The build succeeded, but the program failed to run, with this output:

(sandbox) (mlir-aie-dev) bizhao.shi@server:~/research/compiler/mlir-aie/programming_examples/basic/vector_vector_add$ make vck5000
mkdir -p build
python3 /rshome/bizhao.shi/research/compiler/mlir-aie/programming_examples/basic/vector_vector_add/aie2.py xcvc1902 6 > build/aie.mlir
aiecc.py \
    --link_against_hsa --host-target=x86_64-amd-linux-gnu build/aie.mlir \
        -I/rshome/bizhao.shi/research/compiler/mlir-aie/programming_examples/basic/vector_vector_add/../../../install/runtime_lib/x86_64-hsa/test_lib/include \
        /rshome/bizhao.shi/research/compiler/mlir-aie/programming_examples/basic/vector_vector_add/test_vck5000.cpp \
        /rshome/bizhao.shi/research/compiler/mlir-aie/programming_examples/basic/vector_vector_add/../../../install/runtime_lib/x86_64-hsa/test_lib/src/test_library.cpp \
        -Wl,--whole-archive -Wl,--no-whole-archive -lstdc++ -ldl -lelf -o test.elf
Found xchesscc at /rshome/software/Xilinx/Vitis/2023.2/aietools
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:01 0/1 1 Workerwarning: overriding the module target triple with pdarch-unknown-unknown-elf [-Woverride-module]
1 warning generated.
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:01 0/1 1 WorkerWarning in "../../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_chess.h", line 647, column 131: ignoring attribute [[deprecated]] on class declaration
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:02 0/1 1 WorkerWarning in "../../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_common.h", line 63, column 132: ignoring attribute [[deprecated]] on class declaration
Warning in "../../../../../../../software/Xilinx/Vitis/2023.2/aietools/data/versal_prod/lib/me_common.h", line 96, column 131: ignoring attribute [[deprecated]] on class declaration
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:03 0/1 1 WorkerWarning in "": (imprecise line-number, the error occurred somewhere in this function): loop with essential overflow in loop count computation (number of iterations exceeds internal maximum) [-Wloop-count-overflow]
 AIE Compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:04 2/2 4 Workers
(sandbox) (mlir-aie-dev) bizhao.shi@server:~/research/compiler/mlir-aie/programming_examples/basic/vector_vector_add$ ./test.elf
hsa_init failed
[ERROR] Error when calling mlir_aie_init_device)

I have sourced the XRT and Vitis 2023.2 setup scripts. My server has three FPGAs:

(sandbox) (mlir-aie-dev) bizhao.shi@server:~/research/compiler/mlir-aie/programming_examples/basic/vector_vector_add$ xbutil validate
ERROR: Please specify a device using --device option
 Available devices:
  [0000:3b:00.1] : xilinx_u200_gen3x16_xdma_base_2
  [0000:86:00.1] : xilinx_u280_gen3x16_xdma_base_1
  [0000:af:00.1] : xilinx_vck5000_gen4x8_qdma_base_2

So I suspect that the default device index used by the "mlir_aie_init_device" function is incorrect for my machine.

I modified test_vck5000.cpp in the vector_vector_add example as follows:

int main(int argc, char *argv[]) {
  uint64_t row = 0;
  uint64_t col = 6;

  std::vector<hsa_queue_t *> queues;
  uint32_t aie_max_queue_size(0);

  aie_libxaie_ctx_t *xaie = mlir_aie_init_libxaie();

  // This is going to initialize HSA, create a queue
  // and get an agent
  // int ret = mlir_aie_init_device(xaie);
  int ret = mlir_aie_init_device(xaie, 2);

  if (ret) {
    std::cout << "[ERROR] Error when calling mlir_aie_init_device)"
              << std::endl;
    return -1;
  }

The error is the same.

@shibizhao
Author

Update:

I have installed the AIR PCIe kernel driver for the VCK5000.

Now, when I run vector_vector_add as root, there is no output at all (it may be stuck in mlir_aie_init_device).

@eddierichter-amd
Collaborator

What PDI do you have loaded on the VCK5000? For mlir-aie we use a PDI image that is different from the standard XRT shell. Instructions for how to obtain and load the image are in the platform repo: https://github.com/Xilinx/ROCm-air-platforms/blob/main/platform/vck5000/README.md.

If you do have the ROCm-air-platforms PDI loaded on the card, can you let me know what you see in dmesg when you insert the driver?
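
For collecting the driver messages asked about above, one option is to save the kernel log to a file and filter it. This is a minimal sketch: `amdair_lines` and `amdair_probe_seen` are hypothetical helper names, and the "VCK5000 device found" text is simply the probe message that appears in the logs quoted further down in this thread, so it may differ between driver versions.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for filtering a saved kernel log.

# Print every line of the saved log that mentions the amdair driver.
amdair_lines() {
  local log_file="$1"
  grep 'amdair' "$log_file"
}

# Succeed if the driver logged that it found the card when it was loaded.
# The message text is taken from the dmesg output quoted in this thread.
amdair_probe_seen() {
  local log_file="$1"
  grep -q 'amdair .*VCK5000 device found' "$log_file"
}

# Example usage (reading the kernel log usually requires root):
#   sudo dmesg > /tmp/dmesg.txt
#   amdair_lines /tmp/dmesg.txt
#   amdair_probe_seen /tmp/dmesg.txt || echo "no amdair probe message found"
```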

@shibizhao
Author

Hi,

Thanks for your reply.

I have programmed the VCK5000 with the platform from ROCm-air-platforms.

The amdair.ko driver has been loaded and is in use by this VCK5000 (according to lspci -vd 10ee:).

The dmesg output is:

[71314.406557] amdair amdair: amdair_open
[71314.406617] amdair 0000:c1:00.0: amdair_ioctl_alloc_device_memory: Created buffer object, handle 0, mmap offset c000000000000000, size 800000
[71314.406634] amdair 0000:c1:00.0: amdair_mmap: offset c000000000000000
[71314.406640] amdair 0000:c1:00.0: amdair_mmap: Mapping BO with handle 0
[71314.407610] amdair 0000:c1:00.0: Assigning doorbell page 1
[71314.407620] amdair amdair: doorbell offset 0, queue offset 4000000800000000, queue_id 1, db_id 0, dev id 0 DRAM heap CPU VA 75274e200000
[71314.407634] amdair 0000:c1:00.0: amdair_mmap: offset 0
[71314.407653] amdair 0000:c1:00.0: amdair_mmap: offset 4000000800000000
[71314.407667] amdair 0000:c1:00.0: amdair_mmap: offset 8000000800000000

@eddierichter-amd
Collaborator

Ah great! So, one thing I am confused about is passing device_id=2. It should only be checking for devices that are under control of the AIR driver so you should pass device_id=0. Do you know how you were able to get past the hsa_init failed issue? That is happening because for some reason hsa_init() (https://github.com/Xilinx/mlir-aie/blob/main/runtime_lib/test_lib/test_library.cpp#L181C26-L181C34) is failing when you initialize the device. I don't see how passing a different device_id impacts that, did something else change in the system or application to get that to work?

One thing I would make sure to do between failed runs is reset the device using this script: https://github.com/Xilinx/ROCm-air-platforms/blob/main/platform/vck5000/utils/reset-vck5000.sh. The script reloads both the firmware running on the ARM and the driver, to make sure that neither is in a bad state. It does NOT require a reboot afterwards.
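
Since device_id only counts devices under the control of the AIR driver, it can help to see which PCI devices are actually bound to it, independently of what xbutil lists. The sketch below reads the standard Linux sysfs driver directory; `bound_devices` is a hypothetical helper, the directory name `amdair` is assumed from the "Kernel driver in use" name that lspci reports, and whether device_id follows this listing order is not something this thread confirms.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper that lists PCI devices bound to a driver.

# Print the PCI address of each device bound to the driver whose sysfs
# directory is given (assumed default: the amdair driver).
bound_devices() {
  local driver_dir="${1:-/sys/bus/pci/drivers/amdair}"
  local entry name
  for entry in "$driver_dir"/*; do
    # Bound devices show up as symlinks named by PCI address (dddd:bb:dd.f);
    # other entries such as bind, unbind and module are skipped.
    [ -L "$entry" ] || continue
    name="$(basename "$entry")"
    case "$name" in
      ????:??:??.?) echo "$name" ;;
    esac
  done
}

# Example usage:
#   bound_devices                 # devices bound to amdair
#   bound_devices | wc -l         # how many there are
```

With a single VCK5000 bound to the driver, only one address would be expected, which is consistent with passing device_id=0.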

@shibizhao
Author

shibizhao commented Oct 23, 2024

Hi,

Thanks for your detailed reply.

I have changed device_id to 0, but vector_vector_add and weather_stencil.exe still do not run to completion: they produce no output and just hang.

I also ran the reset-vck5000.sh script to reset the device, without rebooting.

The dmesg output is:

[  677.316403] amdair amdair: amdair_open
[  677.316420] amdair 0000:af:00.0: amdair_ioctl_alloc_device_memory: Created buffer object, handle 0, mmap offset c000000000000000, size 800000
[  677.316428] amdair 0000:af:00.0: amdair_mmap: offset c000000000000000
[  677.316430] amdair 0000:af:00.0: amdair_mmap: Mapping BO with handle 0
[  677.316901] amdair 0000:af:00.0: Assigning doorbell page 1
[  677.316905] amdair amdair: doorbell offset 0, queue offset 4000000800000000, queue_id 1, db_id 0, dev id 0 DRAM heap CPU VA 7fb637d53000
[  677.316911] amdair 0000:af:00.0: amdair_mmap: offset 0
[  677.316921] amdair 0000:af:00.0: amdair_mmap: offset 4000000800000000
[  677.316990] amdair 0000:af:00.0: amdair_mmap: offset 8000000800000000
[  681.890187] amdair amdair: amdair_release
[  733.332596] amdair 0000:af:00.0: removed
[  733.418778] amdair 0000:af:00.0: VCK5000 device found
[  733.418787] amdair 0000:af:00.0: DRAM BAR 0 0x39be00000000 (0x200000000)
[  733.418794] amdair 0000:af:00.0: BRAM BAR 2 0xffffada025d00000 (0x100000)
[  733.418805] amdair 0000:af:00.0: vck5000_init_queues: Initializing queues, queue map ffffffffffffff81 queue size 7000, queue buf size 7000
[  733.418814] amdair 0000:af:00.0: vck5000_init_doorbells: Initializing doorbells, size 7000, doorbell map ffffffffffffff81, num db pages 7
[  733.419229] amdair 0000:af:00.0: Adding AIE 0
[  776.192108] amdair amdair: amdair_open
[  776.192126] amdair 0000:af:00.0: amdair_ioctl_alloc_device_memory: Created buffer object, handle 0, mmap offset c000000000000000, size 800000
[  776.192134] amdair 0000:af:00.0: amdair_mmap: offset c000000000000000
[  776.192136] amdair 0000:af:00.0: amdair_mmap: Mapping BO with handle 0
[  776.192596] amdair 0000:af:00.0: Assigning doorbell page 1
[  776.192600] amdair amdair: doorbell offset 0, queue offset 4000000800000000, queue_id 1, db_id 0, dev id 0 DRAM heap CPU VA 7f9a169af000
[  776.192605] amdair 0000:af:00.0: amdair_mmap: offset 0
[  776.192615] amdair 0000:af:00.0: amdair_mmap: offset 4000000800000000
[  776.192621] amdair 0000:af:00.0: amdair_mmap: offset 8000000800000000
[  788.442889] amdair amdair: amdair_release
[  812.981650] amdair 0000:af:00.0: removed
[  813.053068] amdair 0000:af:00.0: VCK5000 device found
[  813.053077] amdair 0000:af:00.0: DRAM BAR 0 0x39be00000000 (0x200000000)
[  813.053084] amdair 0000:af:00.0: BRAM BAR 2 0xffffada025600000 (0x100000)
[  813.053095] amdair 0000:af:00.0: vck5000_init_queues: Initializing queues, queue map ffffffffffffff81 queue size 7000, queue buf size 7000
[  813.053104] amdair 0000:af:00.0: vck5000_init_doorbells: Initializing doorbells, size 7000, doorbell map ffffffffffffff81, num db pages 7
[  813.053517] amdair 0000:af:00.0: Adding AIE 0
[  834.834176] amdair amdair: amdair_open
[  834.834202] amdair 0000:af:00.0: amdair_ioctl_alloc_device_memory: Created buffer object, handle 0, mmap offset c000000000000000, size 800000
[  834.834211] amdair 0000:af:00.0: amdair_mmap: offset c000000000000000
[  834.834213] amdair 0000:af:00.0: amdair_mmap: Mapping BO with handle 0
[  834.834758] amdair 0000:af:00.0: Assigning doorbell page 1
[  834.834762] amdair amdair: doorbell offset 0, queue offset 4000000800000000, queue_id 1, db_id 0, dev id 0 DRAM heap CPU VA 7fa9cfa09000
[  834.834769] amdair 0000:af:00.0: amdair_mmap: offset 0
[  834.834801] amdair 0000:af:00.0: amdair_mmap: offset 4000000800000000
[  834.834813] amdair 0000:af:00.0: amdair_mmap: offset 8000000800000000
(sandbox) (mlir-aie-dev) root@server:/rshome/bizhao.shi/research/compiler/mlir-aie/build/ROCm-air-platforms/examples/sparta-weather-stencil# ./weather_stencil.exe
Found an AIE HSA agent
[NO OTHER OUTPUT]

This is the output of the lspci -vd 10ee: command:

af:00.0 Memory controller: Xilinx Corporation Device b034
        Subsystem: Xilinx Corporation Device 0007
        Flags: fast devsel, IRQ 480, NUMA node 1
        Memory at 39be00000000 (64-bit, prefetchable) [size=8G]
        Memory at ee600000 (64-bit, non-prefetchable) [size=1M]
        Capabilities: [40] Power Management version 3
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [1c0] Secondary PCI Express
        Capabilities: [1f0] Virtual Channel
        Kernel driver in use: amdair

It is also worth mentioning that when I cold reboot this server, the VCK5000 returns to the Xilinx QDMA shell.

@eddierichter-amd
Collaborator

Thanks for the detailed response. I think we are getting close. A couple of questions:

  • After you load the PDI, do you perform a warm reboot? It is possible that the commands are not going over PCIe if the link is not initialized.
  • What is the output when you run reset-vck5000.sh? Does it say anything about loading the firmware successfully or unsuccessfully?
  • We have a shell running on the ARM. Are you able to use minicom or screen to connect to the UART connected to the ARM and see if it prints anything? (On my machine the VCK5000 exposes /dev/ttyUSB[0-3], the ARM UART is at /dev/ttyUSB2, and I use a baud rate of 115200.) Since there is a shell running on the ARM, pressing Enter should return a prompt. If you press Enter and nothing comes up, the firmware was either never loaded or has crashed.
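
The UART check in the last question can be sketched as follows. `list_usb_uarts` and `uart_command` are hypothetical helpers; /dev/ttyUSB2 and 115200 are the values quoted above for one particular machine, so the port number on another host is an assumption that has to be verified by trying each candidate.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for finding and opening the ARM UART.

# Print every USB serial port found in the given device directory.
list_usb_uarts() {
  local dev_dir="${1:-/dev}"
  local port
  for port in "$dev_dir"/ttyUSB*; do
    [ -e "$port" ] || continue   # skips the unexpanded glob when none exist
    echo "$port"
  done
}

# Print the screen command line for a port and baud rate (default 115200).
uart_command() {
  local port="$1" baud="${2:-115200}"
  echo "screen $port $baud"
}

# Example usage:
#   list_usb_uarts                     # candidate ports, e.g. /dev/ttyUSB0..3
#   uart_command /dev/ttyUSB2          # prints: screen /dev/ttyUSB2 115200
#   sudo screen /dev/ttyUSB2 115200    # then press Enter to look for a prompt
```

If screen is not installed, `minicom -D /dev/ttyUSB2 -b 115200` opens the same port.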
