
task_rust.sh and task_cpp_unittest.sh fail with updated Docker images when USE_VITIS_AI ON #10696

Closed
leandron opened this issue Mar 21, 2022 · 5 comments · Fixed by #10889

leandron commented Mar 21, 2022

It looks like there is an error in ./tests/scripts/task_rust.sh when running with the updated images that include Python 3.7 / TensorFlow 2.6 / h5py 3.1.0. The error only happens when USE_VITIS_AI is ON in ./tests/scripts/task_config_build_cpu.sh.

This is blocking my testing of the TensorFlow 2.6 images. I tried downgrading h5py and repeating the steps, without success. In any case, TensorFlow 2.6 requires h5py>3, so downgrading is not a viable solution anyway.
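
For reference, a quick way to confirm which TensorFlow/h5py combination is actually installed in the container (a convenience snippet of mine, assuming python3 inside the image is the interpreter the CI scripts use):

```python
# Sanity check of the versions shipped in the image; run inside the container.
import tensorflow as tf
import h5py

print("tensorflow:", tf.__version__)  # expected 2.6.x in the updated image
print("h5py:", h5py.__version__)      # expected 3.1.0
```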

Steps to reproduce:

# on the host
cd <your tvm repo>
rm -rf build
docker pull tlcpackstaging/ci_cpu:20220321-061034-bd684deb5
./docker/bash.sh tlcpackstaging/ci_cpu:20220321-061034-bd684deb5

# the following steps run inside the container
./tests/scripts/task_config_build_cpu.sh build
./tests/scripts/task_build.py
./tests/scripts/task_ci_setup.sh
./tests/scripts/task_rust.sh         # this is the one that crashes

Then you should see an error similar to this:

test device::tests::device ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

free(): invalid pointer
error: test failed, to rerun pass '--lib'

Caused by:
  process didn't exit successfully: `/workspace/rust/target/debug/deps/tvm_sys-0b86e7b6a5ad689c` (signal: 6, SIGABRT: process abort signal)
script returned exit code 101

With USE_VITIS_AI set to OFF in ./tests/scripts/task_config_build_cpu.sh, repeating the same steps does not reproduce the issue.

Full output in Jenkins: https://ci.tlcpack.ai/blue/organizations/jenkins/docker-images-ci%2Fdocker-image-run-tests/detail/docker-image-run-tests/69/pipeline/

cc @jtuyls @anilmartha, can you help me troubleshoot this?


leandron commented Mar 22, 2022

Skipping task_rust.sh, I also see an issue during the C++ tests: same configuration, similar error message: [2022-03-22T10:56:59.483Z] free(): invalid pointer.

Full log at: https://ci.tlcpack.ai/blue/rest/organizations/jenkins/pipelines/docker-images-ci/pipelines/docker-image-run-tests/runs/71/nodes/307/steps/514/log/?start=0

@leandron leandron changed the title ./tests/scripts/task_rust.sh fails with updated Docker images when USE_VITIS_AI ON task_rust.sh and task_cpp_unittest.sh fail with updated Docker images when USE_VITIS_AI ON Mar 22, 2022

jtuyls commented Mar 22, 2022

@leandron I am looking at this. Not sure what is going wrong yet.


masahi commented Mar 29, 2022

The CI update for ci-cpu is apparently blocked by this issue: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/239/pipeline/60

I've hit this error twice today.


jtuyls commented Mar 29, 2022

This seems to happen when TensorFlow is loaded inside pyxir. I don't know the exact cause of this starting with TensorFlow 2.6, but the issue looks very similar to triton-inference-server/server#3777, only with TensorFlow instead of PyTorch.

I have a workaround that loads TensorFlow lazily, only when it is actually needed, instead of eagerly at import time. I am currently verifying it with the ci-cpu Docker image locally and will create a PR when successful.
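
For illustration, a minimal sketch of that lazy-loading pattern (the function name and model path are hypothetical, not pyxir's actual code): the module-level `import tensorflow` is moved into the function that needs it, so merely importing the package no longer pulls TensorFlow and its native libraries into the process.

```python
# Hypothetical sketch of deferring a heavy import.
#
# Before: a module-level `import tensorflow as tf` means any process that
# loads this module also loads TensorFlow's native libraries, even if they
# are never used.
#
# After: TensorFlow is imported only inside the function that needs it.

def load_keras_model(path):
    """Load a Keras model from `path` (hypothetical helper)."""
    import tensorflow as tf  # deferred import: only resolved when called
    return tf.keras.models.load_model(path)
```

Code paths that never call such a helper never load TensorFlow, which should keep it out of the Rust and C++ test processes that hit the abort.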


leandron commented Apr 4, 2022

Based on the results at https://ci.tlcpack.ai/job/docker-images-ci/job/docker-image-run-tests/82/, which include #10858 (up to fcdf463 in the repo), we can now update the Docker images, as the bug reported here is fixed.
