Can it run on a Linux system with A100 GPUs? #18

Open
RZFan525 opened this issue Jul 25, 2024 · 26 comments

Comments

@RZFan525

No description provided.

@xuanlinli17
Collaborator

Yes. Please follow the instructions in the README, including its troubleshooting section. Note that rendering for the drawer tasks will be slow because they use ray tracing.

@RZFan525
Author

RZFan525 commented Jul 26, 2024

Thank you for your reply. However, I encountered the same error as #7. Also, when I try to install vulkan-utils with sudo apt-get install vulkan-utils, I get this error:
The package vulkan-utils could not be located
I don't have any computers with RTX GPUs. How can I run it?

@RZFan525
Author

I have followed https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#vulkan to add the three JSON files, but it still does not work.

@xuanlinli17
Collaborator

xuanlinli17 commented Jul 27, 2024

Did you run sudo apt update first? If vulkan-utils is still not found, it may be because you are on Ubuntu 22.04.

Try sudo apt install vulkan-tools instead.

@RZFan525
Author

Thank you for your reply. I tried it, and vulkan-tools installed successfully. However, the same error still appears.

I also found that vulkaninfo works without /usr/share/vulkan/icd.d/nvidia_icd.json, /usr/share/glvnd/egl_vendor.d/10_nvidia.json, and /etc/vulkan/implicit_layer.d/nvidia_layers.json. But when I follow https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#vulkan and add these three files manually, vulkaninfo fails with the error ERROR_OUT_OF_HOST_MEMORY.
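For anyone comparing setups, a small diagnostic along these lines may help record which of the three files are present and whether vulkaninfo exits cleanly. This is only a sketch: the paths are copied from the comment above, the function name `report_vulkan_setup` is made up for illustration, and a zero exit code from vulkaninfo is assumed to mean "works".

```python
import os
import shutil
import subprocess

# The three files named in this thread (paths copied from the comment above).
NVIDIA_VULKAN_FILES = [
    "/usr/share/vulkan/icd.d/nvidia_icd.json",
    "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
    "/etc/vulkan/implicit_layer.d/nvidia_layers.json",
]


def report_vulkan_setup(paths=NVIDIA_VULKAN_FILES):
    """Return {path: exists} plus a 'vulkaninfo' entry (True/False/None)."""
    report = {path: os.path.isfile(path) for path in paths}
    exe = shutil.which("vulkaninfo")
    if exe is None:
        # vulkaninfo is not on PATH (e.g. vulkan-tools is not installed).
        report["vulkaninfo"] = None
        return report
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=60)
        # Assumption: exit code 0 means vulkaninfo could enumerate devices.
        report["vulkaninfo"] = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        report["vulkaninfo"] = False
    return report


if __name__ == "__main__":
    for key, value in report_vulkan_setup().items():
        print(f"{key}: {value}")
```

Running it once with the files in place and once without them gives two reports that can be pasted side by side into an issue.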

In any case, the following error always appears, whether or not vulkaninfo works.

[2024-07-28 11:58:15.019] [svulkan2] [error] GLFW error: X11: The DISPLAY environment variable is missing
[2024-07-28 11:58:15.019] [svulkan2] [warning] Continue without GLFW.
Traceback (most recent call last):
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv-OpenVLA/test.py", line 4, in <module>
    env = simpler_env.make('google_robot_pick_coke_can')
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv-OpenVLA/simpler_env/__init__.py", line 78, in make
    env = gym.make(env_name, obs_mode="rgbd", **kwargs)
  File "/cpfs01/user/liupengfei/rzfan/miniconda3/envs/simpler_env/lib/python3.10/site-packages/gymnasium/envs/registration.py", line 802, in make
    env = env_creator(**env_spec_kwargs)
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/utils/registration.py", line 92, in make
    env = env_spec.make(**kwargs)
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/utils/registration.py", line 34, in make
    return self.cls(**_kwargs)
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/custom_scenes/grasp_single_in_scene.py", line 630, in __init__
    super().__init__(**kwargs)
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/custom_scenes/grasp_single_in_scene.py", line 540, in __init__
    super().__init__(**kwargs)
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/custom_scenes/grasp_single_in_scene.py", line 64, in __init__
    super().__init__(**kwargs)
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/custom_scenes/base_env.py", line 134, in __init__
    super().__init__(**kwargs)
  File "/cpfs01/user/liupengfei/rzfan/SimplerEnv/ManiSkill2_real2sim/mani_skill2_real2sim/envs/sapien_env.py", line 107, in __init__
    self._renderer = sapien.SapienRenderer(**renderer_kwargs)
RuntimeError: vk::Instance::enumeratePhysicalDevices: ErrorInitializationFailed

@RZFan525
Author

I don't know how to run it :(

I have tried three different servers with A100 GPUs which encounter the same error. :(

@xuanlinli17
Collaborator

xuanlinli17 commented Jul 29, 2024

Are you setting the CUDA devices properly? Also make sure the NVIDIA driver version is at least 535; older drivers might not work.

You can create a fake display like this:

tmux new -s 1
sudo X :0 &
[detach from tmux: ctrl-b, then d]
export DISPLAY=:0
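
The points above (driver version, CUDA device selection, DISPLAY) can be checked in one go with a short script. This is a sketch, not an official tool: it assumes nvidia-smi is on PATH and uses its `--query-gpu=driver_version` option, the 535 threshold is taken from the comment above, and the helper names are hypothetical.

```python
import os
import shutil
import subprocess

# Threshold from the comment above (this sketch treats 535 itself as acceptable).
MIN_DRIVER_MAJOR = 535


def driver_major(version: str) -> int:
    """Extract the major number from a string such as '535.104.05'."""
    return int(version.strip().split(".")[0])


def query_driver_versions():
    """Ask nvidia-smi for the driver version of each GPU; [] if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        result = subprocess.run(
            [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=60,
        )
    except (subprocess.TimeoutExpired, OSError):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print("DISPLAY =", os.environ.get("DISPLAY"))
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    for version in query_driver_versions():
        status = "ok" if driver_major(version) >= MIN_DRIVER_MAJOR else "too old"
        print(f"driver {version}: {status}")
```

If DISPLAY prints as None, the GLFW message from the log earlier in this thread is expected.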

@RZFan525
Author

Thank you for getting back to me.

The servers I used before were running inside Docker. I switched to another server, and now it works.

However, I hit another error, which seems to be caused by the lack of a display:

RuntimeError: Create window failed: context is not created with present support

Do you have any suggestions for how I can observe the environment and the robot's actions as they are executed?

@xuanlinli17
Collaborator

xuanlinli17 commented Jul 30, 2024

Inside Docker, you might want to forward the (fake) display started in the host shell (e.g., with sudo X :0 &) into the Docker container.

However, the SIMPLER environments shouldn't create a window unless you are visualizing robots using the utility scripts.

@RZFan525
Author

I'm new to robotics, so I want to visualize the simulation environment to understand it better. Maybe it would be better to output a video.

@xuanlinli17
Collaborator

The evaluation videos are automatically saved.

@RZFan525
Author

Thank you!

I can run the script scripts/openvla_bridge.sh, but it suddenly reports an error after running for a while.

image

@xuanlinli17
Collaborator

If you consecutively create 2 environments in ipython, does it still report an error?
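
The check suggested above could be sketched roughly as follows. `make_env` stands in for `simpler_env.make` (the call that appears in the traceback earlier in this thread); the helper name `create_consecutively` is made up for illustration and is not part of SimplerEnv.

```python
def create_consecutively(make_env, env_name, count=2):
    """Create `count` environments back to back and return those that succeeded.

    `make_env` is any callable that takes an environment name, for example
    `simpler_env.make` (an assumption based on the traceback in this thread).
    """
    envs = []
    for index in range(count):
        try:
            envs.append(make_env(env_name))
        except Exception as exc:  # report which creation failed, then stop
            print(f"creation #{index + 1} failed: {exc!r}")
            break
    return envs


# Intended usage in ipython, assuming SimplerEnv is installed:
#   import simpler_env
#   envs = create_consecutively(simpler_env.make, "google_robot_pick_coke_can")
#   print(len(envs), "environments created")
```

If the second creation fails while the first succeeds, that points at renderer reuse rather than at a particular episode.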

@RZFan525
Author

Creating 2 environments works, but there is a warning:

[2024-07-31 03:20:06.870] [svulkan2] [warning] A second renderer will share the same internal context with the first one. Arguments passed to constructor will be ignored.

@RZFan525
Author

I don't know why. However, I also tried SimplerEnv-OpenVLA/scripts/openvla_drawer_variant_agg.sh, and it successfully outputs the average success.

image

@RZFan525
Author

RZFan525 commented Aug 1, 2024

I find that the error appears when obj_episode_id is 11 in any script that sets obj-variation-mode to episode.
image
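
One way to confirm that a specific episode id triggers the failure is to run each id in isolation and keep the traceback for the ones that fail. The sketch below assumes a hypothetical `run_episode(episode_id)` callable that wraps whatever the evaluation script does for a single obj_episode_id; it is not an existing SimplerEnv function.

```python
import traceback


def find_failing_episodes(run_episode, episode_ids):
    """Run each episode id separately and collect tracebacks of failures.

    `run_episode` is a hypothetical callable supplied by the caller; it should
    raise an exception when the episode fails.
    """
    failures = {}
    for episode_id in episode_ids:
        try:
            run_episode(episode_id)
        except Exception:
            # Keep the full traceback so it can be pasted as text, not a screenshot.
            failures[episode_id] = traceback.format_exc()
    return failures


# Intended usage, with run_episode defined by the caller:
#   failures = find_failing_episodes(run_episode, range(0, 24))
#   for episode_id, tb in failures.items():
#       print(f"--- obj_episode_id={episode_id} ---\n{tb}")
```

Note that a crash inside native rendering code may terminate the Python process instead of raising an exception; in that case running each episode id in a separate process would be needed.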

@xuanlinli17
Collaborator

That's strange; episode 11 doesn't introduce new objects.

@RZFan525
Author

RZFan525 commented Aug 2, 2024

Could you give me some instructions on how to debug? Thank you very much!!

@xuanlinli17
Collaborator

I actually don't know... and sorry that I don't have much bandwidth at the moment to look closely.

@RZFan525
Author

RZFan525 commented Aug 2, 2024

Ok. Thank you for your reply.

@xuanlinli17
Collaborator

You could also try creating a fake display, either with sudo X :0 &; export DISPLAY=:0 or with xvfb-run -a {script}, to see whether that works.
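
The xvfb-run variant can be wrapped in a small launcher that only adds the prefix when no display is set. This is a sketch under assumptions: the function name `headless_command` is hypothetical, and the script path in the usage part is the one mentioned earlier in this thread.

```python
import os
import shutil


def headless_command(script_argv, display=None, xvfb_path=None):
    """Return the command to run, prefixed with `xvfb-run -a` when needed.

    If a display is already set, or xvfb-run is not available, the original
    command is returned unchanged.
    """
    if display or xvfb_path is None:
        return list(script_argv)
    return [xvfb_path, "-a", *script_argv]


if __name__ == "__main__":
    command = headless_command(
        ["bash", "scripts/openvla_bridge.sh"],
        display=os.environ.get("DISPLAY"),
        xvfb_path=shutil.which("xvfb-run"),
    )
    # Printed rather than executed; pass it to subprocess.run to launch it.
    print(" ".join(command))
```

The `-a` flag asks xvfb-run to pick a free display number automatically, which avoids clashing with an X server already running on :0.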

@RZFan525
Author

RZFan525 commented Aug 3, 2024

Thank you. I tried this command, but it still does not work. The error is the same, and I don't know why.

@COST-97

COST-97 commented Sep 18, 2024

Hello:

> I don't know how to run it :(
>
> I have tried three different servers with A100 GPUs which encounter the same error. :(

Same error in A100 GPU.
"libGLX_nvidia.so.0" does not exist in the A100.

Does anyone have an updated solution?
Thanks a lot!

@xuanlinli17
Collaborator

> Hello:
>
> > I don't know how to run it :(
> > I have tried three different servers with A100 GPUs which encounter the same error. :(
>
> Same error in A100 GPU. "libGLX_nvidia.so.0" does not exist in the A100.
>
> Does anyone have an updated solution? Thanks a lot!

Could you try the troubleshooting section in the README?

@Akila-Ayanthi

Same error in A100 GPU.
"libGLX_nvidia.so.0" does not exist in the A100.

Does anyone have an updated solution?
Thanks a lot!

@yinsong1986

> I find that the error appears when the obj_episode_id is 11 in any scripts that define obj-variation-mode as the episode. image

I had a similar issue, but after I ran sudo apt-get install libglvnd-dev, the error disappeared and everything worked.
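
To see whether the dynamic loader can find the libraries discussed in this thread, something like the following may be useful before and after installing packages. It is a sketch only: "GLX_nvidia" comes from the library name quoted above, while "GLX", "EGL", and "vulkan" are assumed to be relevant and are not confirmed anywhere in this thread.

```python
from ctypes.util import find_library

# "GLX_nvidia" is from this thread; the other names are assumptions.
DEFAULT_LIBRARIES = ("GLX_nvidia", "GLX", "EGL", "vulkan")


def locate_libraries(names=DEFAULT_LIBRARIES):
    """Map each library name to what find_library reports (None if not found)."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in locate_libraries().items():
        print(f"lib{name}: {found if found else 'NOT FOUND'}")
```

Comparing the output before and after a package install shows whether that install changed what the loader can see.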
