Orion-LD (1.4.0) crashes with double free detected in tcache 2 #1499
"tcache 2" ? |
This is very difficult to reproduce. The double free crash always comes together with the "Falling back to malloc for counters" message from mongoc.
It looks like a race condition under "high" load.
Currently we have reduced our setup from three Orion-LDs to one, with no subscriptions. That one Orion receives roughly 2000 messages per second and keeps dying with this double free error. The same happens with three Orions, or with six.
Are you using C++ and DynamicArray somewhere?
Not a fan of C++. I'm a C guy ... I'm replacing most of the old C++ code of Orion with pure C in Orion-LD. Now, if you could start the broker inside GDB or Valgrind, we'd get a stack trace.
ok, we have a chance to get the core files. We also started a pod with a GDB version of Orion-LD. But we have no debug info for GDB. How can we generate more information for you? I'm not a C guy. ;-) Orion: 1.4.0-PRE-1425-GDB
ok, getting closer. Now, when you go from RELEASE (-O2) compilation to DEBUG (-g) compilation, the executable changes. Quite a lot.
now with 1.5.0-PRE-1551-debug:
ok!
I'll look into that. See if I can find the error.
Can you run multiple Orion-LD processes and point them at a MongoDB replica set?
ok, I can do that.
We have now significantly reduced the number of entities and are currently looking at around 2000 vehicles. However, this can only be a temporary workaround, not a solution.
Yeah, of course not. That's not the solution.
Any ideas on how we can help with this issue? We see the same crashes with 1.6.0-PRE-1561.
Yeah, so sorry, this issue is long overdue. That said, I took a quick look this morning and came up with a "desperate" attempt.
PR #1548 has been merged into develop and the Docker image should reach Docker Hub in 15-20 minutes. It's usually that fast. Now, this is not to say that I've found the issue and fixed it.
:-) We will test it as soon as it is available.
up and running :-), we will check it |
So, this "fix" is about MHD (the 3rd-party library for HTTP) calling the "request completed" callback more than once for one and the same connection. That would be one way for the error to appear. I kind of doubt it, though. Another possibility is that ...
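(A sketch of what guarding against such a double invocation could look like. Only the callback signature is libmicrohttpd's real one; ConnectionInfo and the cleanup details are illustrative, not Orion-LD's actual code:)

```c
#include <stdlib.h>
#include <microhttpd.h>

typedef struct ConnectionInfo   /* hypothetical per-connection state */
{
  char* payload;
} ConnectionInfo;

/* Registered with MHD via MHD_OPTION_NOTIFY_COMPLETED */
static void requestCompleted(void*                           cls,
                             struct MHD_Connection*          connection,
                             void**                          con_cls,
                             enum MHD_RequestTerminationCode toe)
{
  ConnectionInfo* ciP = (ConnectionInfo*) *con_cls;

  (void) cls;
  (void) connection;
  (void) toe;

  if (ciP == NULL)       /* already cleaned up: a second invocation becomes a no-op */
    return;

  free(ciP->payload);
  free(ciP);
  *con_cls = NULL;       /* mark the connection as cleaned up */
}
```

Zeroing *con_cls after cleanup would turn any repeated invocation into a no-op instead of a second free() of the same pointer, which is exactly the failure mode in the crash message.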
One more fix for #1499 - protected mongocConnectionGet with a semaphore
So, I merged that second fix as well: a semaphore around getting the connection to mongo. I'm far from convinced that any of this will fix the issue you're seeing, but ... it might ;)
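(The shape of that second fix, going by the PR title. This is a sketch under that assumption, not the actual Orion-LD code:)

```c
#include <semaphore.h>
#include <mongoc/mongoc.h>

static sem_t            mongocConnectionSem;   /* initialized once: sem_init(&mongocConnectionSem, 0, 1) */
static mongoc_client_t* mongocClient = NULL;

mongoc_client_t* mongocConnectionGet(const char* uri)
{
  sem_wait(&mongocConnectionSem);              /* one thread at a time: mongoc_client_t is not thread-safe */

  if (mongocClient == NULL)
    mongocClient = mongoc_client_new(uri);

  sem_post(&mongocConnectionSem);
  return mongocClient;
}
```

If two threads used to race inside the getter, serializing it prevents, for example, a half-initialized client being handed out twice and later destroyed twice, which is one plausible path to a double free.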
crash with 1574, now testing 1575 |
A first bit of cautious feedback after roughly two and a half hours:
ok, after a week or so we might assume the semaphore fixed the problem. While looking at this yesterday, I found an alternative to using that "thread unsafe" mongoc_client_t.
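(The documented thread-safe alternative in the mongoc driver is mongoc_client_pool_t: the pool itself is thread-safe, and each thread borrows its own client. Whether Orion-LD adopts exactly this shape is an assumption; a minimal sketch:)

```c
#include <mongoc/mongoc.h>

static mongoc_client_pool_t* pool = NULL;

void mongoPoolInit(const char* uriString)
{
  mongoc_uri_t* uri;

  mongoc_init();                          /* driver-wide initialization, once per process */
  uri  = mongoc_uri_new(uriString);
  pool = mongoc_client_pool_new(uri);     /* the pool, unlike a single client, is thread-safe */
  mongoc_uri_destroy(uri);
}

void workerThreadBody(void)
{
  mongoc_client_t* client = mongoc_client_pool_pop(pool);   /* borrow a client for this thread */

  /* ... run queries with 'client' ... */

  mongoc_client_pool_push(pool, client);                    /* return it; never free it yourself */
}
```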
So, still no more crashes of that type? I'll update #1441 right now.
Since your semaphore fix, we only see the FD_SETSIZE error. :-)
Now, with this piece of information it seems almost clear to me that you suffer from a lack of resources. My suggestion, then, is that you add resources to both the broker and to mongo, and up the max FD size to a million.
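(Two separate knobs matter here, and the distinction is worth noting; this is my reading, not something confirmed in the thread. The per-process file-descriptor limit, RLIMIT_NOFILE, can be raised with ulimit or setrlimit(). FD_SETSIZE, on the other hand, is a compile-time cap, typically 1024, on how high a descriptor select()'s fd_set can hold, so raising the process limit does not help code paths that still use select(). A sketch for checking and raising the process limit:)

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;

  getrlimit(RLIMIT_NOFILE, &rl);
  printf("soft=%llu hard=%llu\n",
         (unsigned long long) rl.rlim_cur,
         (unsigned long long) rl.rlim_max);

  rl.rlim_cur = rl.rlim_max;              /* raise the soft limit up to the hard limit */
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
    perror("setrlimit");

  return 0;
}
```

In Kubernetes the effective limit usually comes from the container runtime's default ulimits, so it may need raising there rather than inside the broker.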
I don't think the broker is overloaded. It consumes about 15% of a CPU core and about 100 MB of RAM. To address the FD problem we started six Orions. MongoDB has no resource issues either: it consumes 50% of a CPU core and about 2 GB of RAM, and there are plenty of cores and RAM available. Also, the semaphore fix is not the starting point of the FD_SETSIZE error; we had that error before.
MongoDB has about 2 × 300 open connections.
So, this comment wasn't really correct then :)
Yes, it is the last error left. :-)
Now, if mongo has 600 connections but you see FDs over 1565, we need to find those other almost 1000 FDs ... That said, I'd up the FD MAX to a LARGE number and hope it's just normal.
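(One quick way to account for the missing descriptors, using standard Linux procfs rather than anything Orion-specific: list /proc/&lt;pid&gt;/fd and readlink each entry to see what it points at. A small sketch doing this for the current process:)

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  DIR*           d = opendir("/proc/self/fd");
  struct dirent* e;
  int            count = 0;

  if (d == NULL)
    return 1;

  while ((e = readdir(d)) != NULL)
  {
    char    path[PATH_MAX];
    char    target[PATH_MAX];
    ssize_t n;

    if (e->d_name[0] == '.')
      continue;

    snprintf(path, sizeof(path), "/proc/self/fd/%s", e->d_name);
    n = readlink(path, target, sizeof(target) - 1);
    if (n >= 0)
    {
      target[n] = '\0';
      printf("fd %s -> %s\n", e->d_name, target);  /* e.g. "socket:[12345]" or a file path */
    }
    ++count;
  }
  closedir(d);
  printf("total: %d (includes the fd used to read the directory)\n", count);
  return 0;
}
```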
Question no 1 is:
We just get tons of this error within a second, and after that a crash (i.e., a restart)
Yeah, we need more input. I'll try to find something more to "rule out". Not easy, but we'll get there.
We finally solved the base image issue (not 100%, but almost). That one uses the newer version 1.24.2 of mongoc, which actually supports MongoDB 7.0. It would also be nice to test with a less recent release of the mongo server: 4.4, 5.0, 6.0 ...
We have a cluster with MongoDB 6 and one with MongoDB 7. Both are now running with 1.6.0-PRE-1587. We will see. :-)
So, not looking good then :(
Different crash then?
Same error: FD_SETSIZE
I did a quick search on the error (should have done that ages ago ...).
Yes, this can be closed.
Orion-LD v1.4.0
Kubernetes v1.26.9