Gazebo RTF drops to 0.01 randomly for 10-15 seconds #3363

Open
AlexKaravaev opened this issue Jan 4, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@AlexKaravaev

AlexKaravaev commented Jan 4, 2024

Environment

  • OS Version: Ubuntu 20.04
  • Source or binary build: Tried with both Gazebo 11.4 and master source build

Description

I have been debugging this for quite a while but still haven't found the cause. We have automated tests for the robot, and sometimes the Gazebo real-time factor (RTF) drops to 0.01 (normally it is around 0.8) and stays there for 2-5 minutes before returning to normal. The strange thing is that it happens completely at random. Moreover (take this with a grain of salt), it only happens on our servers and only inside a podman container (podman is similar to docker). Both conditions must hold: it doesn't happen in a container on my personal laptop, and the simulation runs normally when started directly on the server host.

The server and the laptop are very similar: both have a mid-tier NVIDIA graphics card, a good CPU (mine is AMD, the server's is Intel, but I would be very surprised if that mattered), the same amount of RAM (32 GB), and Ubuntu 20.04 installed. The server doesn't have a monitor attached, though.

I ran the profiler, and it shows that SensorLoop sleeps a lot and that the World Update step takes too much time. I couldn't get any more information out of the profiler, so I am attaching the screenshots.

[Screenshots: 2024-01-04_15-16_1, 2024-01-04_15-16]

I would appreciate any thoughts or tips on how to debug this.

Steps to reproduce

Unfortunately, I don't know :( If I knew how to reproduce it, that would probably solve the problem. I also can't share the source code because of an NDA, but I am willing to assist.

Output

There is no output from Gazebo when it happens.

@AlexKaravaev AlexKaravaev added the bug label Jan 4, 2024
@AlexKaravaev
Author

Can I somehow check what exactly callback1 means? Another suspicious thing: as I understand it, if the Sleeping time in SensorManager drops by 5x, then WorldUpdate should drop by the same factor (because sensors are updated at the world rate?). What I see instead is that WorldUpdate drops by 5x while Sleeping drops by more like 100x.

@AlexKaravaev AlexKaravaev changed the title from "Gazebo is completely frozen randomly" to "Gazebo RTF drops to 0.01 randomly for 10-15 seconds" Jan 5, 2024
@traversaro
Collaborator

traversaro commented Jan 8, 2024

Do you have any custom plugins running in your simulation? Perhaps something is going on in the callbacks registered by those plugins. callback1 corresponds to the point where gazebo-classic's event system calls a callback:

IGN_PROFILE_BEGIN("callback1");

One thing you could do is add profiler calls to the plugins you use, or use another kind of profiler. For example, if you are on non-virtualized Linux amd64, you can use Intel VTune or Magic Trace to get more information than gazebo-classic's profiler provides.

As a general comment, I would suggest that any Gazebo Classic user migrate to gz-sim, but I guess that is not trivial in your case.

@AlexKaravaev
Author

@traversaro thanks for the answer.

We have a lot of custom plugins, which is also why we can't migrate to gz-sim quickly. I tried Magic Trace, but the problem is that its buffer covers only a couple of ms at most, so I can't tell from the trace why the simulation freezes.

I tried adding profiling to all of our plugins, but I ran into 2 problems:

  1. It seems like ODE physics disappeared after that; no idea why.
  2. I had no luck getting more info. The only thing I noticed is that signals are much more frequent when the freezing happens, but I don't know how to tell where they come from. The bottom screenshot shows the profile of these frequent signal calls.

[Screenshots: 2024-01-08_17-13_1, 2024-01-08_17-13]

@AlexKaravaev
Author

Here is the Magic Trace output, in case it helps:
image

@KostyaYamshanov

Hello! I recently encountered a similar issue. In my case, the problem was in the callback function: its operations took too long. When refactoring my code, I left only the copying of data from the Gazebo topic in the callback and moved the complex operations into a separate thread. I hope this helps you.
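
A minimal sketch of that pattern, using only the C++ standard library. `LatestMessageWorker`, `Post`, and the `process` function are hypothetical names introduced for illustration and are not Gazebo APIs. The sketch also assumes that only the most recent message matters, so an unprocessed message is overwritten by a newer one, and that the message type is default-constructible and copyable.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not a Gazebo API): the topic callback only stores a
// copy of the latest message; a worker thread runs the expensive processing.
template <typename Msg>
class LatestMessageWorker
{
public:
  explicit LatestMessageWorker(std::function<void(const Msg &)> process)
    : process_(std::move(process)), worker_([this] { Run(); })
  {
  }

  ~LatestMessageWorker()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();  // any pending message is processed before the thread exits
  }

  // Call this from the topic callback: copy the message and return at once.
  void Post(const Msg &msg)
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      latest_ = msg;  // overwrites a message that was not processed yet
      has_latest_ = true;
    }
    cv_.notify_one();
  }

private:
  void Run()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;)
    {
      cv_.wait(lock, [this] { return stop_ || has_latest_; });
      if (!has_latest_)
        return;  // stop requested and nothing left to process
      Msg msg = std::move(latest_);
      has_latest_ = false;
      lock.unlock();
      process_(msg);  // slow work runs outside the lock and off the callback
      lock.lock();
    }
  }

  std::function<void(const Msg &)> process_;
  std::mutex mutex_;
  std::condition_variable cv_;
  Msg latest_{};
  bool has_latest_ = false;
  bool stop_ = false;
  std::thread worker_;  // declared last so all other members exist before Run() starts
};
```

With this in place, the subscriber callback reduces to a single `Post(msg)` call, so the thread that delivers the callback is held only for the duration of a copy. If every message must be processed rather than just the newest one, the single slot would need to be replaced by a queue.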
