Orion-LD (1.4.0) crashes with double free detected in tcache 2 #1499
"tcache 2" ? |
This is very difficult to reproduce. The double free crash always comes together with the "Falling back to malloc for counters" message from mongoc.
It looks like a race condition under "high" load.
Currently we have reduced our setup from three Orion-LDs to one, with no subscriptions. That one Orion receives roughly 2000 messages per second and keeps dying with this double free error. The same happens with three Orions, or with six.
Are you using C++ and DynamicArray somewhere?
Not a fan of C++. I'm a C guy ... I'm replacing most of the old C++ code of Orion with pure C in Orion-LD. Now, if you could start the broker inside GDB or Valgrind, we'd get a stack trace.
ok, we have a chance to get the core files. We also started a pod with a GDB version of Orion-LD. But we have no debug info for GDB. How can we generate more information for you? I'm not a C guy. ;-) Orion: 1.4.0-PRE-1425-GDB
ok, getting closer. Now, when you go from RELEASE (-O2) compilation to DEBUG (-g) compilation, the executable changes. Quite a lot.
now with 1.5.0-PRE-1551-debug:
ok!
I'll look into that. See if I can find the error.
Can you run multiple Orion-LD processes and point them at a MongoDB replica set?
ok, I can do that.
We have now significantly reduced the number of entities and are currently looking at around 2000 vehicles. However, this can only be a temporary workaround, not a solution.
Yeah, of course not. That's not the solution.
Any ideas on how we can help with this issue? We see the same crashes with 1.6.0-PRE-1561.
Yeah, so sorry, this issue is long overdue. That said, I took a quick look this morning and came up with a "desperate" attempt.
PR #1548 has been merged into develop and the Docker image should reach Docker Hub in 15-20 minutes. It's usually that fast. Now, this is not to say that I've found the issue and fixed it.
:-) We will test it as soon as it is available.
up and running :-), we will check it |
So, this "fix" is about MHD (the 3rd-party library for HTTP) calling the "request completed" callback more than once for one and the same connection. That would be one way for the error to appear. I kind of doubt it, though. Another possibility is that ...
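(A sketch of what guarding against such a double invocation could look like. Only the callback signature is libmicrohttpd's real one; ConnectionInfo and the cleanup details are illustrative, not Orion-LD's actual code:)

```c
#include <stdlib.h>
#include <microhttpd.h>

typedef struct ConnectionInfo   /* hypothetical per-connection state */
{
  char* payload;
} ConnectionInfo;

/* Registered with MHD via MHD_OPTION_NOTIFY_COMPLETED */
static void requestCompleted(void*                           cls,
                             struct MHD_Connection*          connection,
                             void**                          con_cls,
                             enum MHD_RequestTerminationCode toe)
{
  ConnectionInfo* ciP = (ConnectionInfo*) *con_cls;

  (void) cls;
  (void) connection;
  (void) toe;

  if (ciP == NULL)       /* already cleaned up: a second invocation becomes a no-op */
    return;

  free(ciP->payload);
  free(ciP);
  *con_cls = NULL;       /* mark the connection as cleaned up */
}
```

Zeroing *con_cls after cleanup would turn any repeated invocation into a no-op instead of a second free() of the same pointer, which is exactly the failure mode in the crash message.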
One more fix for #1499 - protected mongocConnectionGet with a semaphore
So, I merged that second fix as well: a semaphore around getting the connection to mongo. I'm far from convinced that any of this will fix the issue you're seeing, but ... it might ;)
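(The shape of that second fix, going by the PR title. This is a sketch under that assumption, not the actual Orion-LD code:)

```c
#include <semaphore.h>
#include <mongoc/mongoc.h>

static sem_t            mongocConnectionSem;   /* initialized once: sem_init(&mongocConnectionSem, 0, 1) */
static mongoc_client_t* mongocClient = NULL;

mongoc_client_t* mongocConnectionGet(const char* uri)
{
  sem_wait(&mongocConnectionSem);              /* one thread at a time: mongoc_client_t is not thread-safe */

  if (mongocClient == NULL)
    mongocClient = mongoc_client_new(uri);

  sem_post(&mongocConnectionSem);
  return mongocClient;
}
```

If two threads used to race inside the getter, serializing it prevents, for example, a half-initialized client being handed out twice and later destroyed twice, which is one plausible path to a double free.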
crash with 1574, now testing 1575 |
A first bit of cautious feedback after roughly two and a half hours:
ok, after a week or so we might assume the semaphore fixed the problem. While looking at this yesterday, I found an alternative to using that "thread unsafe" mongoc_client_t.
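(The documented thread-safe alternative in the mongoc driver is mongoc_client_pool_t: the pool itself is thread-safe, and each thread borrows its own client. Whether Orion-LD adopts exactly this shape is an assumption; a minimal sketch:)

```c
#include <mongoc/mongoc.h>

static mongoc_client_pool_t* pool = NULL;

void mongoPoolInit(const char* uriString)
{
  mongoc_uri_t* uri;

  mongoc_init();                          /* driver-wide initialization, once per process */
  uri  = mongoc_uri_new(uriString);
  pool = mongoc_client_pool_new(uri);     /* the pool, unlike a single client, is thread-safe */
  mongoc_uri_destroy(uri);
}

void workerThreadBody(void)
{
  mongoc_client_t* client = mongoc_client_pool_pop(pool);   /* borrow a client for this thread */

  /* ... run queries with 'client' ... */

  mongoc_client_pool_push(pool, client);                    /* return it; never free it yourself */
}
```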
So, still no more crashes of that type? I'll update #1441 right now.
Since your semaphore fix, we only see the FD_SETSIZE error. :-)
Now, with this piece of information it seems almost clear to me that you suffer from a lack of resources. My suggestion, then, is that you add resources to both the broker and to mongo, and up the max FD size to a million.
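(Two separate knobs matter here, and the distinction is worth noting; this is my reading, not something confirmed in the thread. The per-process file-descriptor limit, RLIMIT_NOFILE, can be raised with ulimit or setrlimit(). FD_SETSIZE, on the other hand, is a compile-time cap, typically 1024, on how high a descriptor select()'s fd_set can hold, so raising the process limit does not help code paths that still use select(). A sketch for checking and raising the process limit:)

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;

  getrlimit(RLIMIT_NOFILE, &rl);
  printf("soft=%llu hard=%llu\n",
         (unsigned long long) rl.rlim_cur,
         (unsigned long long) rl.rlim_max);

  rl.rlim_cur = rl.rlim_max;              /* raise the soft limit up to the hard limit */
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
    perror("setrlimit");

  return 0;
}
```

In Kubernetes the effective limit usually comes from the container runtime's default ulimits, so it may need raising there rather than inside the broker.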
I don't think the broker is overloaded. It consumes about 15% of a CPU core and about 100 MB of RAM. To address the FD problem we started six Orions. MongoDB has no resource issues either: it consumes 50% of a CPU core and about 2 GB of RAM, and there are plenty of cores and RAM available. Also, the semaphore fix is not the starting point of the FD_SETSIZE error; we had that error before.
MongoDB has about 2 × 300 open connections.
So, this comment wasn't really correct then :)
Yes, it is the last error left. :-)
Now, if mongo has 600 connections but you see FDs over 1565, we need to find those other almost 1000 FDs ... That said, I'd up the FD MAX to a LARGE number and hope it's just normal.
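(One quick way to account for the missing descriptors, using standard Linux procfs rather than anything Orion-specific: list /proc/&lt;pid&gt;/fd and readlink each entry to see what it points at. A small sketch doing this for the current process:)

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  DIR*           d = opendir("/proc/self/fd");
  struct dirent* e;
  int            count = 0;

  if (d == NULL)
    return 1;

  while ((e = readdir(d)) != NULL)
  {
    char    path[PATH_MAX];
    char    target[PATH_MAX];
    ssize_t n;

    if (e->d_name[0] == '.')
      continue;

    snprintf(path, sizeof(path), "/proc/self/fd/%s", e->d_name);
    n = readlink(path, target, sizeof(target) - 1);
    if (n >= 0)
    {
      target[n] = '\0';
      printf("fd %s -> %s\n", e->d_name, target);  /* e.g. "socket:[12345]" or a file path */
    }
    ++count;
  }
  closedir(d);
  printf("total: %d (includes the fd used to read the directory)\n", count);
  return 0;
}
```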
Question no 1 is:
We just get tons of this error within a second, and after that a crash (i.e., a restart)
Yeah, we need more input. I'll try to find something more to "rule out". Not easy, but we'll get there.
We finally solved the base image issue (not 100%, but almost). That one uses the newer version 1.24.2 of mongoc, which actually supports MongoDB 7.0. It would also be nice to test with a less recent release of the mongo server: 4.4, 5.0, 6.0 ...
We have a cluster with MongoDB 6 and one with MongoDB 7. Both are now running with 1.6.0-PRE-1587. We will see. :-)
So, not looking good then :(
Different crash then?
Same error: FD_SETSIZE
I did a quick search on the error (should have done that ages ago ...).
Yes, this can be closed.
Orion-LD v1.4.0
Kubernetes v1.26.9