
Classic queue can run into an exception after upgrade to RabbitMQ 3.13.7 (or 4.0.2) #12367

Open
netrmqdev opened this issue Sep 24, 2024 · 44 comments

@netrmqdev

netrmqdev commented Sep 24, 2024

Describe the bug

After upgrading the RMQ server from version 3.12.14 to 3.13.6 or 3.13.7, we started randomly getting "Restarting crashed queue" errors for some of our queues. Those queues can have a throughput of 300 events/s, with big messages. We also tested version 4.0.1 and spotted the same issue there. Every time we roll back to 3.12.14, the issue no longer happens.

Our topology is quite simple: we have a single RabbitMQ server running on k8s with the Docker management image. The clients publishing to and consuming from the affected queues are written in .NET and use the official .NET RabbitMQ client.

Here are some stack traces from when this happens:

3.13.7

2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>              [{rabbit_msg_store,reader_pread_parse,1,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                 [{file,"rabbit_msg_store.erl"},{line,696}]},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>               {rabbit_msg_store,reader_pread,2,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                 [{file,"rabbit_msg_store.erl"},{line,688}]},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>               {rabbit_msg_store,read_from_disk,2,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                 [{file,"rabbit_msg_store.erl"},{line,648}]},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>               {rabbit_msg_store,client_read3,2,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                 [{file,"rabbit_msg_store.erl"},{line,640}]},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>               {rabbit_variable_queue,with_msg_store_state,3,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                      [{file,"rabbit_variable_queue.erl"},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                       {line,1348}]},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>               {rabbit_variable_queue,read_msg,5,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                      [{file,"rabbit_variable_queue.erl"},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                       {line,1583}]},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>               {rabbit_variable_queue,fetch,2,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                      [{file,"rabbit_variable_queue.erl"},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                       {line,616}]},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>               {rabbit_priority_queue,fetch,2,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                      [{file,"rabbit_priority_queue.erl"},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                       {line,302}]}]}
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>     offender: [{pid,<0.585.0>},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                {id,rabbit_amqqueue},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                {mfargs,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                    {rabbit_prequeue,start_link,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                        [{amqqueue,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                             {resource,<<"/">>,queue,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                                 <<"QUEUE_NAME">>},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                             true,false,none,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                             [{<<"x-queue-mode">>,longstr,<<"lazy">>}],
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                             <0.867.0>,[],[],[],undefined,undefined,[],
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                             undefined,live,0,[],<<"/">>,
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                             #{user => <<"rmq-internal">>},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                             rabbit_classic_queue,#{}},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                         recovery,<0.583.0>]}},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                {restart_type,transient},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                {significant,true},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                {shutdown,600000},
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0>                {child_type,worker}]
2024-09-21 00:31:38.734074+00:00 [error] <0.584.0> 
2024-09-21 00:31:38.852784+00:00 [error] <0.217004.0> Restarting crashed queue 'QUEUE_NAME' in vhost '/'

4.0.1

2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>     supervisor: {<0.551.0>,rabbit_amqqueue_sup}
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>     errorContext: child_terminated
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>     reason: {function_clause,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                 [{rabbit_variable_queue,d,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{delta,1053550,3357582,0,4411131}],
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_variable_queue.erl"},{line,988}]},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                  {rabbit_variable_queue,maybe_deltas_to_betas,4,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_variable_queue.erl"},{line,2385}]},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                  {rabbit_variable_queue,fetch_from_q3,1,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_variable_queue.erl"},{line,2240}]},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                  {rabbit_variable_queue,queue_out,1,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_variable_queue.erl"},{line,1262}]},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                  {rabbit_variable_queue,collect_by_predicate,3,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_variable_queue.erl"},{line,1559}]},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                  {rabbit_variable_queue,remove_by_predicate,2,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_variable_queue.erl"},{line,1497}]},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                  {rabbit_variable_queue,dropwhile,2,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_variable_queue.erl"},{line,558}]},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                  {rabbit_priority_queue,dropwhile,2,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                      [{file,"rabbit_priority_queue.erl"},{line,246}]}]}
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>     offender: [{pid,<0.552.0>},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                {id,rabbit_amqqueue},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                {mfargs,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                    {rabbit_amqqueue_process,start_link,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                        [{amqqueue,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                             {resource,<<"/">>,queue,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                                 <<"QUEUE_NAME_2">>},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                             true,false,none,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                             [{<<"x-queue-mode">>,longstr,<<"lazy">>}],
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                             <0.1275.0>,[],[],[],undefined,undefined,[],[],
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                             stopped,0,[],<<"/">>,
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                             #{user => <<"USER_NAME">>},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                             rabbit_classic_queue,#{}},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                         <0.550.0>]}},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                {restart_type,transient},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                {significant,true},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                {shutdown,600000},
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0>                {child_type,worker}]
2024-09-23 10:30:56.018962+00:00 [error] <0.551.0> 
2024-09-23 10:30:56.020462+00:00 [error] <0.6365.0> Restarting crashed queue 'QUEUE_NAME_2' in vhost '/'.

Those issues seem very similar to what is described in #10902.
However, that issue is supposed to have been fixed since 3.13.2.

Reproduction steps

Cannot reproduce in a test environment.

Expected behavior

No queue crashing

Additional context

No response

@netrmqdev netrmqdev added the bug label Sep 24, 2024
@michaelklishin
Member

@netrmqdev how large are those "big messages"?

@lhoguin
Contributor

lhoguin commented Sep 24, 2024

Hello! The first log is probably missing a few lines at the beginning, the line just above what you copied is important.

@michaelklishin michaelklishin changed the title "Restarting crashed queue" after upgrade to RabbitMQ 3.13.xx (still happening in 4.0.1) Classic queue can run into an exception after upgrade to RabbitMQ 3.13.xx (still happening in 4.0.1) Sep 24, 2024
@michaelklishin
Member

Another question is: how long should this "300 messages/second, large messages" workload run before this issue can be hit?

Last time @lhoguin and @mkuratczyk had to investigate something similar it took a few weeks to find a reasonably reliable way to reproduce. So any details would be appreciated.

@netrmqdev
Author

@netrmqdev how large are those "big messages"?

The size varies a lot between a few KB to a few MB, but most messages are less than 1MB.

Another question is: how long should this "300 messages/second, large messages" workload run before this issue can be hit?

Last time @lhoguin and @mkuratczyk had to investigate something similar it took a few weeks to find a reasonably reliable way to reproduce. So any details would be appreciated.

I wish I had identified a reproducible pattern, but:

  • I've seen the crashes anywhere from 10 minutes to several days after starting the RabbitMQ server
  • in many cases it happens at low throughput (5-10-15 messages/s), but the traffic on our environments varies a lot and queues have crashed under all kinds of traffic. I'm not really sure the throughput is linked to the issue here.

Hello! The first log is probably missing a few lines at the beginning, the line just above what you copied is important.

I've found a similar occurrence for which I have access to more logs than for the first one (the ingestion of the logs into Grafana makes things a bit messy in this case):

2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>     trap_exit: true
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>     status: running
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>     heap_size: 66222786
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>     stack_size: 28
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>     reductions: 1274977716
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0>   neighbours:
2024-09-18 22:52:26.280015+00:00 [error] <0.585.0> 
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>     supervisor: {<0.584.0>,rabbit_amqqueue_sup}
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>     errorContext: child_terminated
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>     reason: {function_clause,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                 [{rabbit_msg_store,reader_pread_parse,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [[eof]],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_msg_store.erl"},{line,692}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                  {rabbit_msg_store,reader_pread,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_msg_store.erl"},{line,688}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                  {rabbit_msg_store,read_from_disk,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_msg_store.erl"},{line,648}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                  {rabbit_msg_store,client_read3,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_msg_store.erl"},{line,640}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                  {rabbit_variable_queue,with_msg_store_state,3,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_variable_queue.erl"},{line,1348}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                  {rabbit_variable_queue,read_msg,5,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_variable_queue.erl"},{line,1583}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                  {rabbit_variable_queue,fetch,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_variable_queue.erl"},{line,616}]},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                  {rabbit_priority_queue,fetch,2,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                      [{file,"rabbit_priority_queue.erl"},{line,302}]}]}
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>     offender: [{pid,<0.585.0>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                {id,rabbit_amqqueue},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                {mfargs,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                    {rabbit_prequeue,start_link,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                        [{amqqueue,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                             {resource,<<"/">>,queue,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                                 <<"QUEUE_NAME">>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                             true,false,none,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                             [{<<"x-queue-mode">>,longstr,<<"lazy">>}],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                             <0.878.0>,[],[],[],undefined,undefined,[],
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                             undefined,live,0,[],<<"/">>,
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                             #{user => <<"rmq-internal">>},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                             rabbit_classic_queue,#{}},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                         recovery,<0.583.0>]}},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                {restart_type,transient},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                {significant,true},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                {shutdown,600000},
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0>                {child_type,worker}]
2024-09-18 22:52:26.316467+00:00 [error] <0.584.0> 
2024-09-18 22:52:26.317526+00:00 [error] <0.93868.0> Restarting crashed queue 'QUEUE_NAME' in vhost '/'

@mkuratczyk
Contributor

@netrmqdev there is no backlog of messages from old versions, right? if you are saying it can happen a few days after an upgrade, I assume there are no messages older than that, correct? I just want to rule out anything upgrade related.

Side note: this queue is configured as lazy. That's not a thing since 3.12:
https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#classic-queues-changes-to-the-lazy-mode

@gomoripeti
Contributor

we are also investigating a similar (possibly the same) case: two single-node clusters were upgraded from 3.12.14 to 3.13.7.
They have similar topology and traffic patterns: a lot of vhosts, each with one queue getting most of the traffic. The messages are ~20KB, mixed with some ~200KB ones. The payloads are binary (so they can contain any byte, including 255). The crashing queues are long: 4,000-10,000 messages. The persistent msg store of the vhost holds many GBs of data (probably dominated by these single long queues). The msgs are published to the queues via a direct exchange (no fanout).

We couldn't reproduce the issue in a test environment so we don't know yet what attributes of the usage are relevant.

The queues crash when trying to fetch a message from the shared msg store, but the location does not match the content of the rdq file:

** Reason for termination ==
** {function_clause,
       [{rabbit_msg_store,reader_pread_parse,
            [[eof]],
            [{file,"rabbit_msg_store.erl"},{line,692}]},
        {rabbit_msg_store,reader_pread,2,
            [{file,"rabbit_msg_store.erl"},{line,688}]},
        {rabbit_msg_store,read_from_disk,2,
            [{file,"rabbit_msg_store.erl"},{line,648}]},
        {rabbit_msg_store,client_read3,2,
            [{file,"rabbit_msg_store.erl"},{line,640}]},
        {rabbit_variable_queue,with_msg_store_state,3,
            [{file,"rabbit_variable_queue.erl"},{line,1348}]},
        {rabbit_variable_queue,read_msg,5,
            [{file,"rabbit_variable_queue.erl"},{line,1583}]},
        {rabbit_variable_queue,fetch,2,
            [{file,"rabbit_variable_queue.erl"},{line,616}]},
        {rabbit_priority_queue,fetch,2,
            [{file,"rabbit_priority_queue.erl"},{line,302}]}]}

or

** Reason for termination ==
** {{badmatch,<<51,3,137,141,241,159,1,151,204,12,68,12,119,111,64,110,26,150,
...
                201,118,180,244,157,24,27,183,26,150,192,28,133,226,173,104>>},
    [{rabbit_msg_store,reader_pread_parse,1,
                       [{file,"rabbit_msg_store.erl"},{line,696}]},
     {rabbit_msg_store,reader_pread,2,
                       [{file,"rabbit_msg_store.erl"},{line,688}]},
     {rabbit_msg_store,read_from_disk,2,
                       [{file,"rabbit_msg_store.erl"},{line,648}]},
     {rabbit_msg_store,client_read3,2,
                       [{file,"rabbit_msg_store.erl"},{line,640}]},
     {rabbit_variable_queue,with_msg_store_state,3,
                            [{file,"rabbit_variable_queue.erl"},{line,1348}]},
     {rabbit_variable_queue,read_msg,5,
                            [{file,"rabbit_variable_queue.erl"},{line,1583}]},
     {rabbit_variable_queue,fetch,2,
                            [{file,"rabbit_variable_queue.erl"},{line,616}]},
     {rabbit_priority_queue,fetch,2,
                            [{file,"rabbit_priority_queue.erl"},{line,302}]}]}
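Both crash shapes come from the same read path: the queue follows the msg store index entry ({msg_location, MsgId, RefCount, File, Offset, TotalSize}) to a pread on the rdq file, then checks what it actually got. Here is a minimal Python sketch of that check, assuming a simplified entry layout of <<Size:64, MsgId:16/binary, Body/binary, 255>> (an approximation for illustration, not the exact on-disk format):

```python
import os
import struct

MSG_ID_BYTES = 16  # rabbit_msg_store message ids are 16-byte binaries

def read_from_store(path, offset, total_size, expected_msg_id):
    """Sketch of the failing read path: pread at the index-recorded
    offset/size and verify the entry holds the expected message."""
    with open(path, "rb") as f:
        buf = os.pread(f.fileno(), total_size, offset)
    if len(buf) < total_size:
        # The [[eof]] function_clause case: the index points at (or
        # past) a region the file no longer covers.
        raise EOFError("entry truncated: index points past end of file")
    size = struct.unpack_from(">Q", buf, 0)[0]
    msg_id = buf[8:8 + MSG_ID_BYTES]
    if msg_id != expected_msg_id:
        # The badmatch case: there are bytes at the offset, but they
        # belong to a different (e.g. moved or overwritten) message.
        raise ValueError("message id mismatch at recorded offset")
    return buf[8 + MSG_ID_BYTES:8 + size]
```

In this model, both crashes say the same thing: the index entry and the file contents disagree; they just disagree in different ways (no bytes vs. the wrong bytes).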

I suspect that compaction somehow corrupts the rdq file or gets the rabbit_msg_store_ets_index out of sync.

Compaction should work by moving still-referenced messages from the end of the file towards the beginning.
However, I could not find the problematic msg Id anywhere in the rdq file, so it was somehow lost from the file.
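To make the suspicion concrete, compaction can be modelled as rebuilding the index while sliding surviving entries toward offset 0. The following toy Python model is purely illustrative (not RabbitMQ's actual implementation); it shows the invariant that has to hold: every live message must end up in the rebuilt index at the offset its bytes were actually moved to, otherwise a later lookup lands on zeros or on a different message.

```python
def compact(entries, live):
    """Toy rdq compaction: keep only still-referenced entries, sliding
    them toward the start of the file, and rebuild the index.

    entries: list of (msg_id, size) in current file order
    live:    set of msg_ids still referenced by some queue
    Returns (new_index, new_length) where new_index maps msg_id to its
    new offset.  Illustrative only -- not RabbitMQ code.
    """
    new_index = {}
    offset = 0
    for msg_id, size in entries:
        if msg_id in live:
            new_index[msg_id] = offset
            offset += size
    return new_index, offset
```

A still-referenced message missing from the rebuilt file, or present at an offset other than the one its index entry records, is exactly the observed symptom.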

We could not rule out 100% that there is some data from before the upgrade, but some crashes happened more than a week after the upgrade.

@netrmqdev
Author

@netrmqdev there is no backlog of messages from old versions, right? if you are saying it can happen a few days after an upgrade, I assume there are no messages older than that, correct? I just want to rule out anything upgrade related.

No, we flush everything before updating RMQ.
The only exception: when I tested the 3.13.7 -> 4.0.1 migration there were some older messages, as I did not want to lose data, but that's the only case.

Side note: this queue is configured as lazy. That's not a thing since 3.12: https://www.rabbitmq.com/blog/2023/05/17/rabbitmq-3.12-performance-improvements#classic-queues-changes-to-the-lazy-mode

Indeed, I noticed that while investigating, and I plan to clean that up very soon!

@lhoguin
Contributor

lhoguin commented Sep 24, 2024

When the queue restarts it should tell you how many persistent messages were dropped. Better yet, killing the node and restarting it will log a few more interesting things, as it forces a full node recovery and may drop messages from both the queue and the message store, if they're truly gone. It would be great to get the logs for a node restarted like this; that would help confirm whether rabbit ate the messages or whether the message store index or the queue is out of sync.

@lhoguin
Contributor

lhoguin commented Sep 24, 2024

Has any of these environments run out of disk at some point?

@netrmqdev
Author

netrmqdev commented Sep 24, 2024

Has any of these environments run out of disk at some point?

No they have a lot of free space

When the queue restarts it should tell you how many persistent messages were dropped. Better yet, killing the node and restarting it will log a few more interesting things, as it forces a full node recovery and may drop messages from both the queue and the message store, if they're truly gone. It would be great to get the logs for a node restarted like this; that would help confirm whether rabbit ate the messages or whether the message store index or the queue is out of sync.

I believe the dropped-messages log on queue restart only exists in 4.0.x, so I only have these three logs to share (all from the same server):

2024-09-23 10:30:56.020462+00:00 [error] <0.6365.0> Restarting crashed queue 'QUEUE_NAME_1' in vhost '/'.
(...)
2024-09-23 10:33:15.393172+00:00 [warning] <0.6365.0> Queue QUEUE_NAME_1 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-23 13:37:26.358731+00:00 [error] <0.88507.0> Restarting crashed queue 'QUEUE_NAME_2' in vhost '/'.
2024-09-23 13:37:28.046531+00:00 [warning] <0.88507.0> Queue QUEUE_NAME_2 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-24 14:53:39.553861+00:00 [error] <0.269219.0> Restarting crashed queue 'QUEUE_NAME_2' in vhost '/'.
2024-09-24 14:53:42.868098+00:00 [warning] <0.269219.0> Queue QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown

@lhoguin
Contributor

lhoguin commented Sep 25, 2024

Thanks. No, the log message exists before 4.0, but only for CQv2; I suppose you were using CQv1.

The first two crashes don't drop any messages.

QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown

This one tells us that the queue failed to find 48 of its messages in the shared message store (more precisely in its index).

Triggering a full restart by killing the node (I insist on killing the node, brutally shutting down the container or the node's process, as a normal restart won't do) will provide us with additional information in the logs to determine where those messages are.

@gomoripeti
Contributor

There are two types of cases I investigated.

In the first, the crash was actively happening; in the second, I was just inspecting a backup of the data dir (without any logs or in-memory info available).

First cases

In the first cases I was able to trace the msg store index lookups with:

recon_trace:calls([{rabbit_msg_store, index_lookup_positive_ref_count, return_trace}], 10, [{pid, amqqueue:get_pid(element(2, rabbit_amqqueue:lookup(rabbit_misc:r(<<"VHOST">>,queue,<<"CRASHING_QUEUE">>))))}]).

The MsgId was always found in the index with a ref count of 1, e.g.:

rabbit_msg_store:index_lookup_positive_ref_count/2 --> {msg_location,<<42,0,200,124,102,232,152,34,183,160,227,152,159,25,95,231>>,
    1,33915,6222253,378631}

The crash report contains the queue state; from the q3 field I can recover the SeqId<->MsgId mappings (and infer the publish order), e.g.:

[{msg_status,77555,
  <<42,0,200,124,102,232,152,34,183,160,227,152,159,25,95,231>>,
  undefined,true,true,rabbit_msg_store,true,
  msg_store,
  {message_properties,undefined,false,377945}},
 {msg_status,77556,
  <<130,242,201,36,170,75,117,91,53,82,42,188,0,214,50,151>>,
  undefined,true,undefined,rabbit_msg_store,true,
  msg_store,
  {message_properties,undefined,false,18329}},
 {msg_status,77557,
  <<198,29,81,216,243,153,73,243,243,79,123,100,248,82,90,220>>,
  undefined,true,undefined,rabbit_msg_store,true,
  msg_store,
  {message_properties,undefined,false,18329}},
...

I processed 33915.rdq and associated its entries with queue sequence ids. The first message, 77555, is missing, but all the other messages from the head (q3) of the crashed queue are present.

33915.rdq content

The format is {Offset, Size, MsgId, SeqId}, or {Offset, Size, zeros} for zero-filled holes:

{0,281655,<<15,136,53,26,128,81,192,115,6,50,41,229,191,220,218,113>>,77598},
{281655,39229,<<41,60,124,86,181,148,210,178,9,186,28,255,134,99,233,150>>,77597},
{320884,131501,<<213,122,164,148,74,244,226,34,32,134,96,169,188,46,206,77>>,77596},
{452385,19021,<<167,197,103,118,214,105,204,87,245,252,196,61,106,114,130,139>>,77595},
{471406,25069,<<111,35,18,140,162,139,236,187,82,57,163,75,35,149,8,106>>,77594},
{496475,19021,<<93,160,151,55,225,61,226,74,61,26,177,199,240,178,79,240>>,77593},
{515496,15981,<<140,128,59,121,58,157,30,56,96,186,38,240,220,39,83,141>>,77592},
{531477,15101,<<143,144,32,118,80,167,104,244,230,186,77,181,188,157,117,125>>,77591},
{546578,39245,<<105,17,248,251,123,88,134,242,105,42,109,195,182,29,37,65>>,77590},
{585823,429964,<<179,55,227,126,4,3,85,64,0,59,149,56,18,114,235,49>>,77589},
{1015787,19021,<<147,123,187,27,7,24,134,121,138,6,141,202,183,52,182,62>>,77588},
{1034808,9933,<<41,132,140,249,145,109,251,160,62,125,2,238,2,79,204,207>>,77587},
{1044741,19005,<<153,62,107,62,155,113,167,241,7,73,193,234,236,30,33,183>>,77599},
{1063746,19005,<<165,40,148,32,57,16,242,59,210,71,18,102,62,89,51,209>>,77600},
{1082751,15101,<<106,126,121,8,191,167,18,23,155,55,63,75,80,100,55,31>>,77601},
{1097852,19005,<<90,128,67,235,135,235,209,215,23,249,142,23,175,16,249,78>>,77586},
{1116857,19021,<<235,227,93,47,32,176,85,123,112,212,98,100,178,233,14,10>>,77602},
{1135878,19021,<<21,23,77,157,215,42,33,243,101,195,47,249,10,141,64,149>>,77603},
{1154899,12957,<<25,16,251,192,116,38,25,73,77,121,205,252,144,18,95,192>>,77585},
{1167856,12957,<<35,131,17,134,207,157,48,49,241,196,3,224,180,15,231,134>>,77604},
{1180813,15981,<<52,219,231,45,62,86,167,184,115,208,172,18,182,83,224,196>>,77605},
{1196794,15981,<<21,204,21,209,208,79,89,7,102,28,185,195,197,246,143,22>>,77606},
{1212775,15981,<<31,160,139,232,218,86,4,254,197,218,37,178,118,28,197,85>>,77607},
{1228756,39229,<<247,19,233,251,178,216,13,157,6,215,234,27,220,188,173,77>>,77608},
{1267985,25069,<<109,18,158,146,183,121,217,196,107,197,116,137,3,197,117,206>>,77609},
{1293054,19005,<<232,11,115,53,197,240,156,177,84,105,54,60,57,95,212,181>>,77584},
{1312059,35517,<<92,181,210,222,239,115,50,57,153,111,247,108,199,149,105,217>>,77610},
{1347576,25053,<<161,208,148,111,3,38,172,216,188,98,69,9,145,153,170,141>>,77583},
{1372629,12957,<<166,116,195,132,38,158,134,14,25,125,54,209,141,30,11,179>>,77582},
{1385586,15101,<<180,224,229,246,220,224,228,177,223,37,145,127,174,156,202,228>>,77611},
{1400687,22029,<<217,46,191,202,135,212,33,127,64,65,72,166,253,57,145,39>>,77612},
{1422716,373383,<<7,205,44,193,200,20,44,180,18,8,26,173,185,61,138,56>>,77613},
{1796099,19005,<<75,34,208,84,201,186,238,210,29,178,209,102,191,68,233,209>>,77581},
{1815104,15101,<<227,139,224,220,59,240,175,45,147,18,44,36,128,234,69,138>>,77580},
{1830205,39229,<<121,149,66,78,151,174,252,97,241,73,70,67,36,144,45,92>>,77579},
{1869434,22029,<<60,148,227,129,111,25,25,211,136,220,4,13,102,231,186,34>>,77578},
{1891463,25053,<<41,25,20,41,107,31,27,234,180,72,231,140,231,197,29,222>>,77577},
{1916516,188226,<<105,88,108,13,238,100,57,34,112,153,204,248,168,226,131,9>>,77576},
{2104742,19005,<<226,254,138,133,55,136,177,166,25,193,160,245,158,253,245,218>>,77575},
{2123747,22029,<<24,201,18,119,244,65,10,139,159,190,174,164,130,152,62,134>>,77574},
{2145776,202322,<<15,4,233,55,44,212,200,14,157,81,243,165,223,87,252,102>>,77573},
{2348098,12957,<<155,32,213,210,137,243,111,184,146,85,26,146,177,133,59,171>>,77572},
{2361055,22029,<<105,203,138,248,239,176,15,33,44,213,228,150,236,136,24,252>>,77571},
{2383084,22045,<<193,54,69,189,42,165,238,121,56,118,84,19,35,208,33,57>>,77570},
{2405129,15101,<<147,59,236,173,183,48,62,21,249,64,17,174,18,211,186,106>>,77569},
{2420230,39229,<<196,11,133,67,181,170,108,32,106,123,253,3,78,112,47,187>>,77568},
{2459459,341063,<<153,96,201,214,229,11,136,202,174,112,120,64,223,209,145,35>>,77567},
{2800522,19005,<<96,236,151,145,94,238,243,65,62,104,246,45,66,27,113,114>>,77566},
{2819527,41085,<<235,214,98,8,252,2,204,171,94,132,97,188,96,134,5,108>>,77565},
{2860612,15981,<<0,35,51,81,45,94,32,134,252,83,253,41,23,212,174,163>>,77564},
{2876593,12957,<<9,223,69,251,191,203,87,225,126,186,6,72,220,135,175,221>>,77563},
{2889550,12957,<<89,27,228,152,35,223,65,203,55,16,177,150,75,123,240,138>>,77562},
{2902507,22045,<<123,122,9,191,241,59,209,185,35,76,9,133,134,123,133,217>>,77561},
{2924552,19005,<<185,205,169,78,127,94,33,111,119,101,67,179,193,113,81,169>>,77560},
{2943557,15101,<<72,60,100,69,67,33,69,253,187,115,31,33,207,217,0,117>>,77559},
{2958658,15981,<<193,48,7,100,105,167,32,245,154,142,124,187,122,196,165,83>>,77558},
{2974639,19005,<<198,29,81,216,243,153,73,243,243,79,123,100,248,82,90,220>>,77557},
{2993644,19005,<<130,242,201,36,170,75,117,91,53,82,42,188,0,214,50,151>>,77556},
{3012649,19005,<<199,248,153,18,29,228,175,11,182,209,63,158,15,236,181,166>>,77614},
{3031654,15981,<<244,151,151,42,77,170,29,18,66,70,168,180,165,107,84,3>>,77615},
{3047635,15981,<<131,168,111,204,242,17,52,87,92,141,254,179,240,130,216,72>>,77616},
{3063616,25069,<<84,199,58,217,248,200,28,34,8,74,196,106,5,228,156,15>>,77617},
{3088685,15981,<<132,2,95,133,137,172,7,224,140,72,155,239,37,128,159,111>>,77618},
{3104666,39245,<<236,239,131,72,69,39,135,200,96,167,24,167,81,0,193,139>>,77619},
{3143911,19021,<<125,153,5,11,76,62,220,39,29,62,29,172,227,244,146,254>>,77620},
{3162932,12957,<<100,212,81,36,154,30,222,118,121,234,76,136,44,152,192,124>>,77621},
{3175889,19021,<<31,238,224,32,74,156,108,212,102,4,172,168,134,2,61,96>>,77622},
{3194910,19005,<<102,222,148,111,129,243,104,216,36,152,60,219,102,39,66,133>>,77623},
{3213915,427340,<<138,116,71,21,200,203,57,75,174,236,123,206,236,105,172,193>>,77624},
{3641255,2_341_404,zeros},
{5982659,427340,{dup,<<138,116,71,21,200,203,57,75,174,236,123,206,236,105,172,193>>},77624}, <- offset 6_222_253 of missing msg 77555
{6409999,19005,{dup,<<102,222,148,111,129,243,104,216,36,152,60,219,102,39,66,133>>},77623},
{6429004,19021,{dup,<<31,238,224,32,74,156,108,212,102,4,172,168,134,2,61,96>>},77622},
{6448025,12957,{dup,<<100,212,81,36,154,30,222,118,121,234,76,136,44,152,192,124>>},77621},
{6460982,19021,{dup,<<125,153,5,11,76,62,220,39,29,62,29,172,227,244,146,254>>},77620},
{6480003,39245,{dup,<<236,239,131,72,69,39,135,200,96,167,24,167,81,0,193,139>>},77619},
{6519248,15981,{dup,<<132,2,95,133,137,172,7,224,140,72,155,239,37,128,159,111>>},77618},
{6535229,25069,{dup,<<84,199,58,217,248,200,28,34,8,74,196,106,5,228,156,15>>},77617},
{6560298,15981,{dup,<<131,168,111,204,242,17,52,87,92,141,254,179,240,130,216,72>>},77616},
{6576279,15981,{dup,<<244,151,151,42,77,170,29,18,66,70,168,180,165,107,84,3>>},77615},
{6592260,19005,{dup,<<199,248,153,18,29,228,175,11,182,209,63,158,15,236,181,166>>},77614},
{6611265,127981,zeros},
{6739246,19005,{dup,<<130,242,201,36,170,75,117,91,53,82,42,188,0,214,50,151>>},77556},
{6758251,19765,zeros},
{6778016,19005,{dup,<<198,29,81,216,243,153,73,243,243,79,123,100,248,82,90,220>>},77557},
{6797021,28987,zeros},
{6826008,15981,{dup,<<193,48,7,100,105,167,32,245,154,142,124,187,122,196,165,83>>},77558},
{6841989,247910,zeros},
{7089899,15101,{dup,<<72,60,100,69,67,33,69,253,187,115,31,33,207,217,0,117>>},77559},
{7105000,155194,zeros},
{7260194,19005,{dup,<<185,205,169,78,127,94,33,111,119,101,67,179,193,113,81,169>>},77560},
{7279199,22045,{dup,<<123,122,9,191,241,59,209,185,35,76,9,133,134,123,133,217>>},77561},
{7301244,119317,zeros},
{7420561,12957,{dup,<<89,27,228,152,35,223,65,203,55,16,177,150,75,123,240,138>>},77562},
{7433518,143402,zeros},
{7576920,12957,{dup,<<9,223,69,251,191,203,87,225,126,186,6,72,220,135,175,221>>},77563},
{7589877,177484,zeros},
{7767361,15981,{dup,<<0,35,51,81,45,94,32,134,252,83,253,41,23,212,174,163>>},77564},
{7783342,136064,zeros},
{7919406,41085,{dup,<<235,214,98,8,252,2,204,171,94,132,97,188,96,134,5,108>>},77565},
{7960491,110533,zeros},
{8071024,19005,{dup,<<96,236,151,145,94,238,243,65,62,104,246,45,66,27,113,114>>},77566},
{8090029,18101,zeros},
{8108130,341063,{dup,<<153,96,201,214,229,11,136,202,174,112,120,64,223,209,145,35>>},77567},
{8449193,158127,zeros},
{8607320,39229,{dup,<<196,11,133,67,181,170,108,32,106,123,253,3,78,112,47,187>>},77568},
{8646549,69493,zeros},
{8716042,15101,{dup,<<147,59,236,173,183,48,62,21,249,64,17,174,18,211,186,106>>},77569},
{8731143,138640,zeros},
{8869783,22045,{dup,<<193,54,69,189,42,165,238,121,56,118,84,19,35,208,33,57>>},77570},
{8891828,22029,{dup,<<105,203,138,248,239,176,15,33,44,213,228,150,236,136,24,252>>},77571},
{8913857,288746,zeros},
{9202603,12957,{dup,<<155,32,213,210,137,243,111,184,146,85,26,146,177,133,59,171>>},77572},
{9215560,162241,zeros},
{9377801,202322,{dup,<<15,4,233,55,44,212,200,14,157,81,243,165,223,87,252,102>>},77573},
{9580123,8459,zeros},
{9588582,22029,{dup,<<24,201,18,119,244,65,10,139,159,190,174,164,130,152,62,134>>},77574},
{9610611,19005,{dup,<<226,254,138,133,55,136,177,166,25,193,160,245,158,253,245,218>>},77575},
{9629616,282472,zeros},
{9912088,188226,{dup,<<105,88,108,13,238,100,57,34,112,153,204,248,168,226,131,9>>},77576},
{10100314,18101,zeros},
{10118415,25053,{dup,<<41,25,20,41,107,31,27,234,180,72,231,140,231,197,29,222>>},77577},
{10143468,158127,zeros},
{10301595,22029,{dup,<<60,148,227,129,111,25,25,211,136,220,4,13,102,231,186,34>>},77578},
{10323624,19013,zeros},
{10342637,39229,{dup,<<121,149,66,78,151,174,252,97,241,73,70,67,36,144,45,92>>},77579},
{10381866,15101,{dup,<<227,139,224,220,59,240,175,45,147,18,44,36,128,234,69,138>>},77580},
{10396967,19005,{dup,<<75,34,208,84,201,186,238,210,29,178,209,102,191,68,233,209>>},77581},
{10415972,373383,{dup,<<7,205,44,193,200,20,44,180,18,8,26,173,185,61,138,56>>},77613},
{10789355,22029,{dup,<<217,46,191,202,135,212,33,127,64,65,72,166,253,57,145,39>>},77612},
{10811384,15101,{dup,<<180,224,229,246,220,224,228,177,223,37,145,127,174,156,202,228>>},77611},
{10826485,10195,zeros},
{10836680,12957,{dup,<<166,116,195,132,38,158,134,14,25,125,54,209,141,30,11,179>>},77582},
{10849637,25053,{dup,<<161,208,148,111,3,38,172,216,188,98,69,9,145,153,170,141>>},77583},
{10874690,35517,{dup,<<92,181,210,222,239,115,50,57,153,111,247,108,199,149,105,217>>},77610},
{10910207,11501,zeros},
{10921708,19005,{dup,<<232,11,115,53,197,240,156,177,84,105,54,60,57,95,212,181>>},77584},
{10940713,25069,{dup,<<109,18,158,146,183,121,217,196,107,197,116,137,3,197,117,206>>},77609},
{10965782,39229,{dup,<<247,19,233,251,178,216,13,157,6,215,234,27,220,188,173,77>>},77608},
{11005011,15981,{dup,<<31,160,139,232,218,86,4,254,197,218,37,178,118,28,197,85>>},77607},
{11020992,15981,{dup,<<21,204,21,209,208,79,89,7,102,28,185,195,197,246,143,22>>},77606},
{11036973,15981,{dup,<<52,219,231,45,62,86,167,184,115,208,172,18,182,83,224,196>>},77605},
{11052954,12957,{dup,<<35,131,17,134,207,157,48,49,241,196,3,224,180,15,231,134>>},77604},
{11065911,3431,zeros},
{11069342,12957,{dup,<<25,16,251,192,116,38,25,73,77,121,205,252,144,18,95,192>>},77585},
{11082299,19021,{dup,<<21,23,77,157,215,42,33,243,101,195,47,249,10,141,64,149>>},77603},
{11101320,19021,{dup,<<235,227,93,47,32,176,85,123,112,212,98,100,178,233,14,10>>},77602},
{11120341,5003,zeros},
{11125344,19005,{dup,<<90,128,67,235,135,235,209,215,23,249,142,23,175,16,249,78>>},77586},
{11144349,15101,{dup,<<106,126,121,8,191,167,18,23,155,55,63,75,80,100,55,31>>},77601},
{11159450,19005,{dup,<<165,40,148,32,57,16,242,59,210,71,18,102,62,89,51,209>>},77600},
{11178455,19005,{dup,<<153,62,107,62,155,113,167,241,7,73,193,234,236,30,33,183>>},77599},
{11197460,93865,zeros},
{11291325,9933,{dup,<<41,132,140,249,145,109,251,160,62,125,2,238,2,79,204,207>>},77587},
{11301258,26117,zeros},
{11327375,19021,{dup,<<147,123,187,27,7,24,134,121,138,6,141,202,183,52,182,62>>},77588},
{11346396,73957,zeros},
{11420353,429964,{dup,<<179,55,227,126,4,3,85,64,0,59,149,56,18,114,235,49>>},77589},
{11850317,138362,zeros},
{11988679,39245,{dup,<<105,17,248,251,123,88,134,242,105,42,109,195,182,29,37,65>>},77590},
{12027924,69493,zeros},
{12097417,15101,{dup,<<143,144,32,118,80,167,104,244,230,186,77,181,188,157,117,125>>},77591},
{12112518,15981,{dup,<<140,128,59,121,58,157,30,56,96,186,38,240,220,39,83,141>>},77592},
{12128499,276582,zeros},
{12405081,19021,{dup,<<93,160,151,55,225,61,226,74,61,26,177,199,240,178,79,240>>},77593},
{12424102,62726,zeros},
{12486828,25069,{dup,<<111,35,18,140,162,139,236,187,82,57,163,75,35,149,8,106>>},77594},
{12511897,66917,zeros},
{12578814,19021,{dup,<<167,197,103,118,214,105,204,87,245,252,196,61,106,114,130,139>>},77595},
{12597835,40992,zeros},
{12638827,131501,{dup,<<213,122,164,148,74,244,226,34,32,134,96,169,188,46,206,77>>},77596},
{12770328,107131,zeros},
{12877459,39229,{dup,<<41,60,124,86,181,148,210,178,9,186,28,255,134,99,233,150>>},77597},
{12916688,90096,zeros},
{13006784,281655,{dup,<<15,136,53,26,128,81,192,115,6,50,41,229,191,220,218,113>>},77598},

The badmatch crash luckily includes a huge binary, in which I managed to identify messages 77623-77614. Because there are no zeros between them and they form a continuous reversed block, I suspect they were written in one go during a compaction, potentially overwriting the missing message 77555.

Second case

In the second case I only had the queue index (CQv1) and the rdq files on disk.

In one example, the first index segment file starts with the following entries (format: {RelSeqId, MsgId, Size}):

{137,del_or_ack},
{137,del_or_ack},
{137,<<29,116,45,4,62,102,142,6,42,153,250,128,55,253,101,164>>,50457},

{138,del_or_ack},
{138,<<146,223,62,74,141,134,33,54,203,94,55,94,185,164,20,44>>,204_137},

{139,del_or_ack},
{139,<<26,133,18,250,208,12,72,50,90,199,205,23,190,128,98,196>>,95033},
...

The first message (137) is acked, so it should have a ref count of 0 in the msg store.
The second message (138) must be the one that causes the crash.
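
Reading the segment this way can be sketched in a few lines of Python (a hypothetical model, not RabbitMQ code: I assume each seq id carries one publish record plus up to two del_or_ack records, for deliver and ack; msg ids are abbreviated):

```python
from collections import Counter

# Hypothetical model of the CQv1 segment above: one publish record
# (seq, msg_id, size) per seq id, plus up to two "del_or_ack" records.
# A seq id with both del_or_ack records is fully acked; anything else
# must still be readable from the shared message store.
records = [
    (137, "del_or_ack"), (137, "del_or_ack"),
    (137, "29,116,...", 50457),    # publish; msg id abbreviated
    (138, "del_or_ack"),
    (138, "146,223,...", 204137),  # publish
    (139, "del_or_ack"),
    (139, "26,133,...", 95033),    # publish
]

acks = Counter(seq for seq, tag, *_ in records if tag == "del_or_ack")
published = {seq for seq, tag, *_ in records if tag != "del_or_ack"}
live = sorted(seq for seq in published if acks[seq] < 2)
print(live)  # [138, 139]
```

So 137 can legitimately have ref count 0 in the store, while 138 and 139 must still resolve there; the missing 138 is what triggers the crash.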

rdq content

The format is {Offset, Size, MsgId, RelSeqId}, or {Offset, Size, zeros} for zero-holes:

{0,406636,<<107,201,64,142,168,145,210,148,90,206,85,18,161,19,35,0>>,146},
{406636,5037,<<248,82,68,196,81,78,105,7,244,78,202,246,61,193,105,255>>,145},
{411673,5693,zeros},
{417366,141170,<<64,3,75,172,94,232,137,37,88,125,224,15,72,113,204,245>>,154},
{558536,41213,<<211,32,156,253,161,218,191,143,24,126,171,163,6,149,105,42>>,144},
{599749,38061,<<42,61,20,7,164,238,70,167,49,7,21,156,242,95,135,134>>,143},
{637810,27741,<<160,231,235,43,68,242,155,152,28,250,2,50,58,224,20,36>>,142},
{665551,46173,<<203,122,57,233,195,167,126,165,164,106,114,54,133,179,25,99>>,141},
{711724,87261,<<135,145,222,45,78,30,125,36,115,223,20,219,252,22,96,47>>,147},
{798985,46781,<<0,100,99,53,55,153,6,145,10,20,169,9,8,221,71,76>>,140},
{845766,95709,<<26,133,18,250,208,12,72,50,90,199,205,23,190,128,98,196>>,139},
{941475,80557,<<214,6,70,254,89,207,50,10,165,217,82,37,252,224,87,147>>,148},
{1022032,13325,<<4,129,51,209,34,155,37,199,24,123,49,227,28,151,157,44>>,149},
{1035357,374007,<<15,85,24,102,244,136,110,116,85,155,77,116,112,11,149,125>>,150},
{1409364,41247,zeros},
{1450611,210610,<<18,238,179,233,75,215,182,95,183,55,225,204,102,249,59,57>>,153},
{1661221,51133,zeros},
{1712354,32045,<<130,218,241,202,163,3,154,159,38,176,131,252,220,188,1,78>>,151},
{1744399,107037,<<4,132,31,162,252,213,132,84,194,170,173,16,62,135,213,101>>,152},
{1851436,367305,zeros},
{2218741,107037,{dup,<<4,132,31,162,252,213,132,84,194,170,173,16,62,135,213,101>>},152},
{2325778,32045,{dup,<<130,218,241,202,163,3,154,159,38,176,131,252,220,188,1,78>>},151},
{2357823,10_527_218,zeros},
{12885041,51133,<<29,116,45,4,62,102,142,6,42,153,250,128,55,253,101,164>>,137},
{12936174,374007,{dup,<<15,85,24,102,244,136,110,116,85,155,77,116,112,11,149,125>>},150},
{13310181,13325,{dup,<<4,129,51,209,34,155,37,199,24,123,49,227,28,151,157,44>>},149},
{13323506,80557,{dup,<<214,6,70,254,89,207,50,10,165,217,82,37,252,224,87,147>>},148},
{13404063,52758,zeros},
{13456821,95709,{dup,<<26,133,18,250,208,12,72,50,90,199,205,23,190,128,98,196>>},139},
{13552530,72049,zeros},
{13624579,46781,{dup,<<0,100,99,53,55,153,6,145,10,20,169,9,8,221,71,76>>},140},
{13671360,87261,{dup,<<135,145,222,45,78,30,125,36,115,223,20,219,252,22,96,47>>},147},
{13758621,32654,zeros},
{13791275,46173,{dup,<<203,122,57,233,195,167,126,165,164,106,114,54,133,179,25,99>>},141},
{13837448,27741,{dup,<<160,231,235,43,68,242,155,152,28,250,2,50,58,224,20,36>>},142},
{13865189,108437,zeros},
{13973626,38061,{dup,<<42,61,20,7,164,238,70,167,49,7,21,156,242,95,135,134>>},143},
{14011687,160570,zeros},
{14172257,41213,{dup,<<211,32,156,253,161,218,191,143,24,126,171,163,6,149,105,42>>},144},
{14213470,134316,zeros},
{14347786,5037,{dup,<<248,82,68,196,81,78,105,7,244,78,202,246,61,193,105,255>>},145},
{14352823,79909,zeros},
{14432732,406636,{dup,<<107,201,64,142,168,145,210,148,90,206,85,18,161,19,35,0>>},146},

I speculate that the last compaction may have been triggered by the ack of 137; the file timestamps support this as well. Both 137 and 139 are present in the rdq file (but 137 is not moved towards the beginning). There is an interesting 10MB zero-hole in the middle (right before 137) which is beyond the latest truncation line, so it must be a result of a previous compaction. From the order of SeqIds I haven't untangled which compaction might have moved which message where, or what the original position of the missing message could have been.
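
For reference, dumps in the shape above can be produced by a scanner along these lines. This is only a simplified sketch of what rabbit_msg_store:scan_file_for_valid_messages/1 does, assuming the entry layout <<Size:64, MsgId:16/binary, Body/binary, 255>> where Size covers the 16-byte msg id plus the body; the helper names are mine, not RabbitMQ's:

```python
import struct

def scan_rdq(data: bytes):
    """Simplified sketch: at each offset, try to match
    <<Size:64, MsgIdAndBody:Size/binary, 255, _/binary>>; collect entries
    as (offset, size, msg_id) and zero-byte gaps as (offset, length, "zeros"),
    like the dumps above."""
    out, pos, zeros = [], 0, None

    def flush(upto):
        nonlocal zeros
        if zeros is not None:
            out.append((zeros, upto - zeros, "zeros"))
            zeros = None

    while pos < len(data):
        size = struct.unpack_from(">Q", data, pos)[0] if len(data) - pos >= 9 else 0
        end = pos + 8 + size
        if size >= 16 and end < len(data) and data[end] == 0xFF:
            flush(pos)
            out.append((pos, size, data[pos + 8:pos + 24]))
            pos = end + 1                  # jump past the 255 end marker
        else:
            if data[pos] == 0 and zeros is None:
                zeros = pos                # start of a zero run
            elif data[pos] != 0:
                flush(pos)
            pos += 1
    flush(len(data))
    return out

def entry(msg_id: bytes, body: bytes) -> bytes:
    return struct.pack(">Q", 16 + len(body)) + msg_id + body + b"\xff"

# Two entries separated by a 10-byte zero-hole:
image = entry(b"A" * 16, b"aaaa") + b"\x00" * 10 + entry(b"B" * 16, b"bbb")
```

Running `scan_rdq(image)` yields one tuple per entry plus one for the hole, mirroring the dump format.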

A theory

Theoretically, it is possible that rabbit_msg_store:scan_file_for_valid_messages/1 misinterprets the content of an rdq file, causing the compaction to lose and overwrite a live message. This can happen if a zero-hole is followed by a message and the payload happens to contain a byte 255 at the appropriate position.

MsgId = binary:copy(<<"1">>, 16).
MC = 
#mc{protocol = mc_amqpl,
    data = #content{class_id = 60,properties = none,
                    properties_bin = <<48,0,0,0,0,0,2>>,
                    protocol = rabbit_framing_amqp_0_9_1,
                    payload_fragments_rev = [binary:copy(<<255>>, 26000)]},
    annotations = #{id => MsgId,
                    p => 0,
                    x => <<>>,
                    rk => [<<"queue1">>],
                    rts => 1726812956209}}.
BinMsg = term_to_binary(MC).
BinMsgSize = byte_size(BinMsg).
Entry = <<(BinMsgSize + 16):(8*8), MsgId:16/binary, BinMsg/binary, 255>>.
%% This is how `scan_file_for_valid_messages` matches on messages
<<Size2:(8*8), MsgId2:16/binary, Msg2:Size2/binary, 255, _/binary>> = <<0, Entry/binary>>.
MsgId2 == MsgId.

The payload size of 26000 bytes is selected so that the entry size (<<0,0,0,0,0,102,68>> in binary) can, with an extra leading zero, be interpreted as a size of 102 bytes. In the term_to_binary format of a #mc{}, the payload starts after 85 bytes (plus the 16-byte message id = 101).

This is just speculation and I don't know how likely it is to hit this coincidence.
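
The same coincidence can be reproduced outside Erlang. Below is a Python sketch under the same assumed entry layout (<<Size:64, MsgIdAndBody:Size/binary, 255>>, with Size covering the 16-byte msg id plus body); for simplicity the whole payload is 0xFF bytes, so the end marker is guaranteed to line up, unlike the real #mc{} case where the 255 must fall exactly 101 bytes in:

```python
import struct

MSG_ID = b"1" * 16
BODY = b"\xff" * 26164                  # all-0xFF payload for simplicity
SIZE = 16 + len(BODY)                   # 26180 == <<0,0,0,0,0,0,102,68>>

entry = struct.pack(">Q", SIZE) + MSG_ID + BODY + b"\xff"
stream = b"\x00" + entry                # one stray zero from a zero-hole

def try_parse(data):
    """Match <<Size:64, MsgIdAndBody:Size/binary, 255, _/binary>> at offset 0."""
    (size,) = struct.unpack_from(">Q", data)
    end = 8 + size
    if end < len(data) and data[end] == 0xFF:
        return size, data[8:24]          # (size, 16-byte msg id)
    return None

real = try_parse(entry)                  # (26180, b"1111111111111111")
bogus = try_parse(stream)                # (102, b"D111111111111111") -- wrong!
```

With the stray zero in front, the scanner accepts a bogus 102-byte entry with a corrupted msg id and then resumes scanning in the middle of the real message, so the real entry is never recognized, which is exactly the failure mode the theory needs.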

@netrmqdev

> Thanks. No, the log message exists before 4.0, but only for CQv2; I suppose you were using CQv1.

Oh yes indeed, all our queues are in CQv1.

> The first two crashes don't drop any message.
>
> > QUEUE_NAME_2 in vhost / dropped 48/0/0 persistent messages and 0 transient messages after unclean shutdown
>
> This one tells us that the queue failed to find 48 of its messages in the shared message store (more precisely, in its index).
>
> Triggering a full restart by killing the node (I insist on killing the node, brutally shutting down the container or the node's process, as a normal restart won't do) will provide us with additional information in the logs to determine where those messages are.

I tried that, but I'm not sure it brings a lot more information, although we can see the 2 affected queues dropping messages on restart:

2024-09-25 10:16:25.931677+00:00 [warning] <0.581.0> Queue AFFECTED_QUEUE_1 in vhost / dropped 2/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:17:41.034955+00:00 [warning] <0.548.0> Queue AFFECTED_QUEUE_2 in vhost / dropped 4/0/0 persistent messages and 0 transient messages after unclean shutdown
Full log
2024-09-25 10:14:32.972368+00:00 [notice] <0.64.0> SIGTERM received - shutting down
2024-09-25 10:14:32.972368+00:00 [notice] <0.64.0> 
2024-09-25 10:14:32.976459+00:00 [warning] <0.715.0> HTTP listener registry could not find context rabbitmq_prometheus_tls
2024-09-25 10:14:33.988757+00:00 [warning] <0.715.0> HTTP listener registry could not find context rabbitmq_management_tls
2024-09-25 10:14:34.010114+00:00 [info] <0.978.0> stopped TLS (SSL) listener on [::]:5671
2024-09-25 10:14:34.011933+00:00 [info] <0.869.0> stopped TCP listener on [::]:5672
2024-09-25 10:14:34.013862+00:00 [error] <0.937.0> Error on AMQP connection <0.937.0> (192.168.100.21:49848 -> 192.168.100.6:5672, vhost: '/', user: 'CLIENT_USER', state: running), channel 0:
2024-09-25 10:14:34.013862+00:00 [error] <0.937.0>  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
(...)
2024-09-25 10:14:34.604394+00:00 [info] <0.484.0> Virtual host '/' is stopping
2024-09-25 10:14:34.604593+00:00 [info] <0.320635.0> Closing all connections in vhost '/' on node 'USER@rabbitmq' because the vhost is stopping
2024-09-25 10:14:34.633023+00:00 [info] <0.498.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2024-09-25 10:14:35.220420+00:00 [info] <0.498.0> Message store for directory '/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2024-09-25 10:14:35.220727+00:00 [info] <0.494.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2024-09-25 10:14:35.230686+00:00 [info] <0.494.0> Message store for directory '/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2024-09-25 10:14:35.279493+00:00 [notice] <0.86.0>     alarm_handler: {clear,system_memory_high_watermark}
=INFO REPORT==== 25-Sep-2024::10:14:37.538698 ===
  alarm_handler: {set,{system_memory_high_watermark,[]}}
2024-09-25 10:14:40.344574+00:00 [notice] <0.44.0> Application syslog exited with reason: stopped
2024-09-25 10:14:40.350694+00:00 [notice] <0.254.0> Logging: switching to configured handler(s); following messages may not be visible in this log output
2024-09-25 10:14:40.358986+00:00 [notice] <0.254.0> Logging: configured log handlers are now ACTIVE
2024-09-25 10:14:40.372587+00:00 [info] <0.254.0> ra: starting system quorum_queues
2024-09-25 10:14:40.373522+00:00 [info] <0.254.0> starting Ra system: quorum_queues in directory: /var/lib/rabbitmq/mnesia/USER@rabbitmq/quorum/USER@rabbitmq
2024-09-25 10:14:40.439201+00:00 [info] <0.271.0> ra system 'quorum_queues' running pre init for 0 registered servers
2024-09-25 10:14:40.449471+00:00 [info] <0.272.0> ra: meta data store initialised for system quorum_queues. 0 record(s) recovered
2024-09-25 10:14:40.463966+00:00 [notice] <0.277.0> WAL: ra_log_wal init, open tbls: ra_log_open_mem_tables, closed tbls: ra_log_closed_mem_tables
2024-09-25 10:14:40.490898+00:00 [info] <0.279.0> ra_system_recover: ra system 'quorum_queues' server recovery strategy rabbit_quorum_queue:system_recover
2024-09-25 10:14:40.491070+00:00 [info] <0.279.0> [rabbit_quorum_queue:system_recover/1] rabbit not booted, skipping queue recovery
2024-09-25 10:14:40.491830+00:00 [info] <0.254.0> ra: starting system coordination
2024-09-25 10:14:40.492566+00:00 [info] <0.254.0> starting Ra system: coordination in directory: /var/lib/rabbitmq/mnesia/USER@rabbitmq/coordination/USER@rabbitmq
2024-09-25 10:14:40.493973+00:00 [info] <0.285.0> ra system 'coordination' running pre init for 1 registered servers
2024-09-25 10:14:40.508340+00:00 [info] <0.286.0> ra: meta data store initialised for system coordination. 1 record(s) recovered
2024-09-25 10:14:40.508630+00:00 [notice] <0.291.0> WAL: ra_coordination_log_wal init, open tbls: ra_coordination_log_open_mem_tables, closed tbls: ra_coordination_log_closed_mem_tables
2024-09-25 10:14:40.517090+00:00 [info] <0.254.0> ra: starting system coordination
2024-09-25 10:14:40.517816+00:00 [info] <0.254.0> starting Ra system: coordination in directory: /var/lib/rabbitmq/mnesia/USER@rabbitmq/coordination/USER@rabbitmq
2024-09-25 10:14:40.995221+00:00 [notice] <0.296.0> RabbitMQ metadata store: candidate -> leader in term: 4 machine version: 1
2024-09-25 10:14:41.255730+00:00 [info] <0.254.0> 
2024-09-25 10:14:41.255730+00:00 [info] <0.254.0>  Starting RabbitMQ 4.0.0+2.g6963f8b on Erlang 26.2.5.3 [jit]
2024-09-25 10:14:41.255730+00:00 [info] <0.254.0>  Copyright (c) 2007-2024 Broadcom Inc and/or its subsidiaries
2024-09-25 10:14:41.255730+00:00 [info] <0.254.0>  Licensed under the MPL 2.0. Website: https://rabbitmq.com

##  ##      RabbitMQ 4.0.0+2.g6963f8b
##  ##
##########  Copyright (c) 2007-2024 Broadcom Inc and/or its subsidiaries
######  ##
##########  Licensed under the MPL 2.0. Website: https://rabbitmq.com

Erlang:      26.2.5.3 [jit]
TLS Library: OpenSSL - OpenSSL 3.3.2 3 Sep 2024
Release series support status: see https://www.rabbitmq.com/release-information

Doc guides:  https://www.rabbitmq.com/docs
Support:     https://www.rabbitmq.com/docs/contact
Tutorials:   https://www.rabbitmq.com/tutorials
Monitoring:  https://www.rabbitmq.com/docs/monitoring
Upgrading:   https://www.rabbitmq.com/docs/upgrade

Logs: /var/log/rabbitmq/[email protected]
      <stdout>

Config file(s): /etc/rabbitmq/rabbitmq.conf
                /etc/rabbitmq/conf.d/10-defaults.conf

Starting broker...2024-09-25 10:14:41.257577+00:00 [info] <0.254.0> 
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>  node           : USER@rabbitmq
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>  home dir       : /var/lib/rabbitmq
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>  config file(s) : /etc/rabbitmq/rabbitmq.conf
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>                 : /etc/rabbitmq/conf.d/10-defaults.conf
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>  cookie hash    : HASH
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>  log(s)         : /var/log/rabbitmq/[email protected]
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>                 : <stdout>
2024-09-25 10:14:41.257577+00:00 [info] <0.254.0>  data dir       : /var/lib/rabbitmq/mnesia/USER@rabbitmq
2024-09-25 10:14:41.670013+00:00 [info] <0.254.0> Running boot step pre_boot defined by app rabbit
2024-09-25 10:14:41.670137+00:00 [info] <0.254.0> Running boot step rabbit_global_counters defined by app rabbit
2024-09-25 10:14:41.670835+00:00 [info] <0.254.0> Running boot step rabbit_osiris_metrics defined by app rabbit
2024-09-25 10:14:41.671239+00:00 [info] <0.254.0> Running boot step rabbit_core_metrics defined by app rabbit
2024-09-25 10:14:41.673157+00:00 [info] <0.254.0> Running boot step rabbit_alarm defined by app rabbit
2024-09-25 10:14:41.686595+00:00 [info] <0.406.0> Memory high watermark set to 57869 MiB (60680875622 bytes) of 64299 MiB (67423195136 bytes) total
2024-09-25 10:14:41.690230+00:00 [info] <0.408.0> Enabling free disk space monitoring (disk free space: 244394713088, total memory: 67423195136)
2024-09-25 10:14:41.690340+00:00 [info] <0.408.0> Disk free limit set to 50MB
2024-09-25 10:14:41.692268+00:00 [info] <0.254.0> Running boot step code_server_cache defined by app rabbit
2024-09-25 10:14:41.692527+00:00 [info] <0.254.0> Running boot step file_handle_cache defined by app rabbit
2024-09-25 10:14:41.699863+00:00 [info] <0.411.0> Limiting to approx 1048479 file handles (943629 sockets)
2024-09-25 10:14:41.700086+00:00 [info] <0.412.0> FHC read buffering: OFF
2024-09-25 10:14:41.700234+00:00 [info] <0.412.0> FHC write buffering: ON
2024-09-25 10:14:41.701765+00:00 [info] <0.254.0> Running boot step worker_pool defined by app rabbit
2024-09-25 10:14:41.701962+00:00 [info] <0.339.0> Will use 16 processes for default worker pool
2024-09-25 10:14:41.702074+00:00 [info] <0.339.0> Starting worker pool 'worker_pool' with 16 processes in it
2024-09-25 10:14:41.702915+00:00 [info] <0.254.0> Running boot step database defined by app rabbit
2024-09-25 10:14:41.703582+00:00 [info] <0.254.0> Peer discovery: configured backend: rabbit_peer_discovery_classic_config
2024-09-25 10:14:41.705432+00:00 [info] <0.254.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2024-09-25 10:14:41.711920+00:00 [info] <0.254.0> Successfully synced tables from a peer
2024-09-25 10:14:41.727975+00:00 [info] <0.254.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2024-09-25 10:14:41.728235+00:00 [info] <0.254.0> Successfully synced tables from a peer
2024-09-25 10:14:41.728643+00:00 [info] <0.254.0> Running boot step tracking_metadata_store defined by app rabbit
2024-09-25 10:14:41.728830+00:00 [info] <0.438.0> Setting up a table for connection tracking on this node: tracked_connection
2024-09-25 10:14:41.728983+00:00 [info] <0.438.0> Setting up a table for per-vhost connection counting on this node: tracked_connection_per_vhost
2024-09-25 10:14:41.729126+00:00 [info] <0.438.0> Setting up a table for per-user connection counting on this node: tracked_connection_per_user
2024-09-25 10:14:41.729220+00:00 [info] <0.438.0> Setting up a table for channel tracking on this node: tracked_channel
2024-09-25 10:14:41.729356+00:00 [info] <0.438.0> Setting up a table for channel tracking on this node: tracked_channel_per_user
2024-09-25 10:14:41.729561+00:00 [info] <0.254.0> Running boot step networking_metadata_store defined by app rabbit
2024-09-25 10:14:41.729914+00:00 [info] <0.254.0> Running boot step feature_flags defined by app rabbit
2024-09-25 10:14:41.730199+00:00 [info] <0.254.0> Running boot step codec_correctness_check defined by app rabbit
2024-09-25 10:14:41.730419+00:00 [info] <0.254.0> Running boot step external_infrastructure defined by app rabbit
2024-09-25 10:14:41.730532+00:00 [info] <0.254.0> Running boot step rabbit_event defined by app rabbit
2024-09-25 10:14:41.730778+00:00 [info] <0.254.0> Running boot step rabbit_registry defined by app rabbit
2024-09-25 10:14:41.731048+00:00 [info] <0.254.0> Running boot step rabbit_auth_mechanism_amqplain defined by app rabbit
2024-09-25 10:14:41.731369+00:00 [info] <0.254.0> Running boot step rabbit_auth_mechanism_anonymous defined by app rabbit
2024-09-25 10:14:41.731796+00:00 [info] <0.254.0> Running boot step rabbit_auth_mechanism_cr_demo defined by app rabbit
2024-09-25 10:14:41.732114+00:00 [info] <0.254.0> Running boot step rabbit_auth_mechanism_plain defined by app rabbit
2024-09-25 10:14:41.732285+00:00 [info] <0.254.0> Running boot step rabbit_exchange_type_direct defined by app rabbit
2024-09-25 10:14:41.732466+00:00 [info] <0.254.0> Running boot step rabbit_exchange_type_fanout defined by app rabbit
2024-09-25 10:14:41.732652+00:00 [info] <0.254.0> Running boot step rabbit_exchange_type_headers defined by app rabbit
2024-09-25 10:14:41.732900+00:00 [info] <0.254.0> Running boot step rabbit_exchange_type_local_random defined by app rabbit
2024-09-25 10:14:41.733384+00:00 [info] <0.254.0> Running boot step rabbit_exchange_type_topic defined by app rabbit
2024-09-25 10:14:41.733602+00:00 [info] <0.254.0> Running boot step rabbit_priority_queue defined by app rabbit
2024-09-25 10:14:41.733726+00:00 [info] <0.254.0> Priority queues enabled, real BQ is rabbit_variable_queue
2024-09-25 10:14:41.733965+00:00 [info] <0.254.0> Running boot step kernel_ready defined by app rabbit
2024-09-25 10:14:41.734029+00:00 [info] <0.254.0> Running boot step pg_local_amqp_connection defined by app rabbit
2024-09-25 10:14:41.742742+00:00 [info] <0.254.0> Running boot step pg_local_amqp_session defined by app rabbit
2024-09-25 10:14:41.743194+00:00 [info] <0.254.0> Running boot step rabbit_sysmon_minder defined by app rabbit
2024-09-25 10:14:41.743761+00:00 [info] <0.254.0> Running boot step rabbit_epmd_monitor defined by app rabbit
2024-09-25 10:14:41.744814+00:00 [info] <0.448.0> epmd monitor knows us, inter-node communication (distribution) port: 25672
2024-09-25 10:14:41.745069+00:00 [info] <0.254.0> Running boot step guid_generator defined by app rabbit
2024-09-25 10:14:41.752456+00:00 [info] <0.254.0> Running boot step rabbit_node_monitor defined by app rabbit
2024-09-25 10:14:41.752984+00:00 [info] <0.452.0> Starting rabbit_node_monitor (in ignore mode)
2024-09-25 10:14:41.753303+00:00 [info] <0.254.0> Running boot step delegate_sup defined by app rabbit
2024-09-25 10:14:41.754039+00:00 [info] <0.254.0> Running boot step rabbit_fifo_dlx_sup defined by app rabbit
2024-09-25 10:14:41.754307+00:00 [info] <0.254.0> Running boot step core_initialized defined by app rabbit
2024-09-25 10:14:41.754408+00:00 [info] <0.254.0> Running boot step rabbit_channel_tracking_handler defined by app rabbit
2024-09-25 10:14:41.754667+00:00 [info] <0.254.0> Running boot step rabbit_classic_queue defined by app rabbit
2024-09-25 10:14:41.755279+00:00 [info] <0.254.0> Running boot step rabbit_connection_tracking_handler defined by app rabbit
2024-09-25 10:14:41.755683+00:00 [info] <0.254.0> Running boot step rabbit_definitions_hashing defined by app rabbit
2024-09-25 10:14:41.756029+00:00 [info] <0.254.0> Running boot step rabbit_exchange_parameters defined by app rabbit
2024-09-25 10:14:41.770350+00:00 [info] <0.254.0> Running boot step rabbit_policies defined by app rabbit
2024-09-25 10:14:41.770968+00:00 [info] <0.254.0> Running boot step rabbit_policy defined by app rabbit
2024-09-25 10:14:41.771294+00:00 [info] <0.254.0> Running boot step rabbit_quorum_memory_manager defined by app rabbit
2024-09-25 10:14:41.771471+00:00 [info] <0.254.0> Running boot step rabbit_quorum_queue defined by app rabbit
2024-09-25 10:14:41.772310+00:00 [info] <0.254.0> Running boot step rabbit_stream_coordinator defined by app rabbit
2024-09-25 10:14:41.772690+00:00 [info] <0.254.0> Running boot step rabbit_vhost_limit defined by app rabbit
2024-09-25 10:14:41.773025+00:00 [info] <0.254.0> Running boot step rabbit_mgmt_reset_handler defined by app rabbitmq_management
2024-09-25 10:14:41.773294+00:00 [info] <0.254.0> Running boot step rabbit_mgmt_db_handler defined by app rabbitmq_management_agent
2024-09-25 10:14:41.773473+00:00 [info] <0.254.0> Management plugin: using rates mode 'basic'
2024-09-25 10:14:41.774030+00:00 [info] <0.254.0> Running boot step recovery defined by app rabbit
2024-09-25 10:14:41.802472+00:00 [info] <0.484.0> Making sure data directory '/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2024-09-25 10:14:41.821139+00:00 [info] <0.484.0> Starting message stores for vhost '/'
2024-09-25 10:14:41.826830+00:00 [info] <0.484.0> Started message store of type transient for vhost '/'
2024-09-25 10:14:41.837636+00:00 [warning] <0.498.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent" : recovery terms differ from present
2024-09-25 10:14:41.837636+00:00 [warning] <0.498.0> rebuilding indices from scratch
2024-09-25 10:16:22.565186+00:00 [info] <0.484.0> Started message store of type persistent for vhost '/'
2024-09-25 10:16:22.573534+00:00 [warning] <0.560.0> Queue QUEUE_1 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.573719+00:00 [warning] <0.551.0> Queue QUEUE_2 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.573730+00:00 [warning] <0.530.0> Queue QUEUE_3 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.579813+00:00 [warning] <0.569.0> Queue QUEUE_4 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.579854+00:00 [warning] <0.587.0> Queue QUEUE_5 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.579968+00:00 [warning] <0.590.0> Queue QUEUE_6 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.579960+00:00 [warning] <0.563.0> Queue QUEUE_7 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.580742+00:00 [warning] <0.542.0> Queue QUEUE_8 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.580705+00:00 [warning] <0.536.0> Queue QUEUE_9 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.580797+00:00 [warning] <0.554.0> Queue QUEUE_10 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.580984+00:00 [warning] <0.557.0> Queue QUEUE_11 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.580842+00:00 [warning] <0.584.0> Queue QUEUE_12 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.580964+00:00 [warning] <0.575.0> Queue QUEUE_13 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.580950+00:00 [warning] <0.533.0> Queue QUEUE_14 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.581669+00:00 [warning] <0.572.0> Queue QUEUE_15 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.581889+00:00 [warning] <0.545.0> Queue QUEUE_16 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.581997+00:00 [warning] <0.539.0> Queue QUEUE_17 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.582254+00:00 [warning] <0.578.0> Queue QUEUE_18 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.582898+00:00 [warning] <0.527.0> Queue QUEUE_19 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:22.582956+00:00 [warning] <0.566.0> Queue QUEUE_20 in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:16:25.931677+00:00 [warning] <0.581.0> Queue AFFECTED_QUEUE_1 in vhost / dropped 2/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:17:41.034955+00:00 [warning] <0.548.0> Queue AFFECTED_QUEUE_2 in vhost / dropped 4/0/0 persistent messages and 0 transient messages after unclean shutdown
2024-09-25 10:17:41.127729+00:00 [info] <0.484.0> Recovering 22 queues of type rabbit_classic_queue took 179323ms
2024-09-25 10:17:41.127895+00:00 [info] <0.484.0> Recovering 0 queues of type rabbit_quorum_queue took 0ms
2024-09-25 10:17:41.128107+00:00 [info] <0.484.0> Recovering 0 queues of type rabbit_stream_queue took 0ms
2024-09-25 10:17:41.177029+00:00 [info] <0.254.0> Running boot step empty_db_check defined by app rabbit
2024-09-25 10:17:41.177229+00:00 [info] <0.254.0> Will not seed default virtual host and user: have definitions to load...
2024-09-25 10:17:41.177583+00:00 [info] <0.254.0> Running boot step rabbit_observer_cli defined by app rabbit
2024-09-25 10:17:41.177928+00:00 [info] <0.254.0> Running boot step rabbit_core_metrics_gc defined by app rabbit
2024-09-25 10:17:41.178569+00:00 [info] <0.254.0> Running boot step background_gc defined by app rabbit
2024-09-25 10:17:41.178930+00:00 [info] <0.254.0> Running boot step routing_ready defined by app rabbit
2024-09-25 10:17:41.179014+00:00 [info] <0.254.0> Running boot step pre_flight defined by app rabbit
2024-09-25 10:17:41.179184+00:00 [info] <0.254.0> Running boot step notify_cluster defined by app rabbit
2024-09-25 10:17:41.179448+00:00 [info] <0.254.0> Running boot step networking defined by app rabbit
2024-09-25 10:17:41.179776+00:00 [info] <0.254.0> Running boot step rabbit_quorum_queue_periodic_membership_reconciliation defined by app rabbit
2024-09-25 10:17:41.180197+00:00 [info] <0.254.0> Running boot step definition_import_worker_pool defined by app rabbit
2024-09-25 10:17:41.180494+00:00 [info] <0.339.0> Starting worker pool 'definition_import_pool' with 16 processes in it
2024-09-25 10:17:41.181565+00:00 [info] <0.254.0> Running boot step cluster_name defined by app rabbit
2024-09-25 10:17:41.181809+00:00 [info] <0.254.0> Running boot step virtual_host_reconciliation defined by app rabbit
2024-09-25 10:17:41.182421+00:00 [info] <0.254.0> Running boot step direct_client defined by app rabbit
2024-09-25 10:17:41.182657+00:00 [info] <0.254.0> Running boot step rabbit_management_load_definitions defined by app rabbitmq_management
2024-09-25 10:17:41.183460+00:00 [info] <0.678.0> Resetting node maintenance status
2024-09-25 10:17:41.492172+00:00 [warning] <0.714.0> Deprecated features: `management_metrics_collection`: Feature `management_metrics_collection` is deprecated.
2024-09-25 10:17:41.492172+00:00 [warning] <0.714.0> By default, this feature can still be used for now.
2024-09-25 10:17:41.492172+00:00 [warning] <0.714.0> Its use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.
2024-09-25 10:17:41.492172+00:00 [warning] <0.714.0> To continue using this feature when it is not permitted by default, set the following parameter in your configuration:
2024-09-25 10:17:41.492172+00:00 [warning] <0.714.0>     "deprecated_features.permit.management_metrics_collection = true"
2024-09-25 10:17:41.492172+00:00 [warning] <0.714.0> To test RabbitMQ as if the feature was removed, set this in your configuration:
2024-09-25 10:17:41.492172+00:00 [warning] <0.714.0>     "deprecated_features.permit.management_metrics_collection = false"
2024-09-25 10:17:41.571355+00:00 [info] <0.751.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2024-09-25 10:17:41.571722+00:00 [info] <0.779.0> Statistics database started.
2024-09-25 10:17:41.571922+00:00 [info] <0.778.0> Starting worker pool 'management_worker_pool' with 3 processes in it
2024-09-25 10:17:41.601022+00:00 [info] <0.793.0> Prometheus metrics: HTTP (non-TLS) listener started on port 15692
2024-09-25 10:17:41.601748+00:00 [info] <0.678.0> Applying definitions from regular file at /etc/rabbitmq/definitions.json
2024-09-25 10:17:41.603207+00:00 [info] <0.678.0> Applying definitions from file at '/etc/rabbitmq/definitions.json'
2024-09-25 10:17:41.603321+00:00 [info] <0.678.0> Asked to import definitions. Acting user: rmq-internal
2024-09-25 10:17:41.610657+00:00 [info] <0.678.0> Importing concurrently 16 users...
2024-09-25 10:17:41.620585+00:00 [info] <0.678.0> Importing concurrently 1 vhosts...
2024-09-25 10:17:41.627732+00:00 [info] <0.678.0> Importing concurrently 16 permissions...
2024-09-25 10:17:41.634987+00:00 [info] <0.676.0> Successfully set permissions for user 'USER_CLIENT' in virtual host '/' to '.*', '.*', '.*'
2024-09-25 10:17:41.642804+00:00 [info] <0.678.0> Importing concurrently 10 exchanges...
2024-09-25 10:17:41.643316+00:00 [info] <0.678.0> Importing sequentially 1 global runtime parameters...
2024-09-25 10:17:41.653060+00:00 [info] <0.678.0> Importing concurrently 20 queues...
2024-09-25 10:17:41.655141+00:00 [warning] <0.661.0> Deprecated features: `transient_nonexcl_queues`: Feature `transient_nonexcl_queues` is deprecated.
2024-09-25 10:17:41.655141+00:00 [warning] <0.661.0> By default, this feature can still be used for now.
2024-09-25 10:17:41.655141+00:00 [warning] <0.661.0> Its use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.
2024-09-25 10:17:41.655141+00:00 [warning] <0.661.0> To continue using this feature when it is not permitted by default, set the following parameter in your configuration:
2024-09-25 10:17:41.655141+00:00 [warning] <0.661.0>     "deprecated_features.permit.transient_nonexcl_queues = true"
2024-09-25 10:17:41.655141+00:00 [warning] <0.661.0> To test RabbitMQ as if the feature was removed, set this in your configuration:
2024-09-25 10:17:41.655141+00:00 [warning] <0.661.0>     "deprecated_features.permit.transient_nonexcl_queues = false"
2024-09-25 10:17:41.657217+00:00 [info] <0.678.0> Importing concurrently 27 bindings...
2024-09-25 10:17:41.662350+00:00 [info] <0.678.0> Ready to start client connection listeners
2024-09-25 10:17:41.665544+00:00 [info] <0.861.0> started TCP listener on [::]:5672
2024-09-25 10:17:41.772098+00:00 [info] <0.881.0> started TLS (SSL) listener on [::]:5671
completed with 4 plugins.
2024-09-25 10:17:41.889731+00:00 [info] <0.678.0> Server startup complete; 4 plugins started.
2024-09-25 10:17:41.889731+00:00 [info] <0.678.0>  * rabbitmq_prometheus
2024-09-25 10:17:41.889731+00:00 [info] <0.678.0>  * rabbitmq_management
2024-09-25 10:17:41.889731+00:00 [info] <0.678.0>  * rabbitmq_management_agent
2024-09-25 10:17:41.889731+00:00 [info] <0.678.0>  * rabbitmq_web_dispatch
2024-09-25 10:17:41.987176+00:00 [info] <0.9.0> Time to start RabbitMQ: 184584 ms

@lhoguin commented Sep 26, 2024

Thanks all.

I think the working theory is correct. The odds of this happening are very low, but across the many RabbitMQ installations out there it is bound to happen regularly. I will think about a way to efficiently make sure that a message found during scanning is a real message; a naive approach, which e.g. @gomoripeti could try, would be to check that the remainder here can be decoded with binary_to_term.
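To illustrate the naive approach, here is a minimal sketch. The entry layout is an assumed, simplified stand-in for the rdq format (an 8-byte big-endian length covering a 16-byte message id plus payload, then a 0xFF terminator), and the `validate` callback plays the role of attempting `binary_to_term` on the remainder; none of the names below come from the RabbitMQ code base.

```python
import struct

# Hypothetical, simplified rdq-style layout (illustrative only, not the
# real on-disk format): 8-byte big-endian length of (id + payload),
# a 16-byte message id, the payload, then a 0xFF terminator byte.

def scan_naive(buf: bytes, validate) -> list:
    """Scan for candidate entries; keep only those whose payload passes
    `validate` (the analogue of trying binary_to_term on the remainder)."""
    found, pos = [], 0
    while pos + 9 <= len(buf):
        size = struct.unpack_from(">Q", buf, pos)[0]
        end = pos + 8 + size  # offset of the expected 0xFF terminator
        if size >= 16 and end < len(buf) and buf[end] == 0xFF:
            msg_id = buf[pos + 8:pos + 24]
            payload = buf[pos + 24:end]
            if validate(payload):
                found.append((msg_id, pos))
                pos = end + 1  # accept and skip the whole entry
                continue
        pos += 1  # no (valid) entry here; try the next byte
    return found
```

The point of the sketch is that a candidate whose remainder fails to decode is rejected and the scan resumes one byte later, instead of committing to a misinterpreted frame.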

@gomoripeti

Sure, if you mean that I should create a patch with the naive approach. (So far I couldn't trigger a compaction in a system test environment with naive approaches, so first I need to figure that out in order to verify the patch. Is there anything more to it than waiting 15 seconds after more than half of the messages in an rdq file, which is not the last one, are consumed?)

Since a message cannot be larger than 512 MB, its size fits in 4 bytes; what if the first byte of the size field were set to 255 (as a beginning-of-message marker), or some other magic value were put in the first few bytes? Would that help at all?

My concern is that calling binary_to_term is fine at startup during recovery, but it might create too much garbage or be too resource-intensive during normal operation. What if there were a "secondary index", another ETS table keeping track of "rdq file" -> "msg id" mappings, so that it could be cross-checked or serve as an input for the scanning? (There used to be a query that returned all the msg ids for a given file, but it used the index table and was inefficient.)
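A toy sketch of that secondary-index idea follows. All names here are hypothetical; in RabbitMQ this would presumably be another ETS table maintained by the message store on write and remove.

```python
from collections import defaultdict

# Hypothetical "secondary index": an extra table mapping rdq file number
# -> set of msg ids it contains, maintained on write/remove, so a scan
# can cross-check candidates without decoding their payloads.

class FileIndex:
    def __init__(self):
        self._ids_by_file = defaultdict(set)

    def on_write(self, file_num: int, msg_id: bytes) -> None:
        self._ids_by_file[file_num].add(msg_id)

    def on_remove(self, file_num: int, msg_id: bytes) -> None:
        self._ids_by_file[file_num].discard(msg_id)

    def is_real(self, file_num: int, msg_id: bytes) -> bool:
        # A scanned candidate is kept only if its id is known for this file.
        return msg_id in self._ids_by_file[file_num]
```

Scanning could then drop any candidate whose id is unknown for the file, at the cost of keeping this extra table consistent with the store.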

@lhoguin commented Sep 26, 2024

Careful: the naive approach is expensive, so only do it if you have the right test files.

Yes, 15 s since the last file access and more than half of the data gone.

It might be easier to trigger by sending messages whose payload is basically nothing but 255 bytes.

Ideally we could just make a better format, but it's a bit complex. This is best done later if/when we want to stop having the index fully in memory.

Right now my approach to fixing this is: when we encounter a potential message, we still check whether there is another potential message starting within the next 8 bytes. Then, when we cross-reference with the index, the invalid messages get dropped and only the valid ones are kept, including the message we currently lose. The cross-reference is strict (it includes size and offset), so I don't expect any errors after that.
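The idea can be sketched as follows, under an assumed simplified layout (an 8-byte big-endian length covering a 16-byte id plus payload, then a 0xFF terminator; not the exact rdq format): the scanner reports every plausible interpretation, including overlapping ones, and a strict cross-reference against the index (matching id, size, and offset) keeps only the real messages.

```python
import struct

# Simplified stand-in for the rdq entry layout (illustrative only):
# 8-byte big-endian length of (id + payload), 16-byte id, payload, 0xFF.

def scan_candidates(buf: bytes):
    """Yield every plausible (msg_id, size, offset), advancing one byte
    at a time so overlapping interpretations are all reported."""
    for pos in range(len(buf) - 8):
        size = struct.unpack_from(">Q", buf, pos)[0]
        end = pos + 8 + size  # offset of the expected 0xFF terminator
        if 16 <= size and end < len(buf) and buf[end] == 0xFF:
            yield buf[pos + 8:pos + 24], size, pos

def keep_real(candidates, index):
    """`index` maps msg_id -> (offset, size); the strict cross-reference
    drops any candidate that does not match exactly."""
    return [(i, s, o) for i, s, o in candidates if index.get(i) == (o, s)]
```

A bogus candidate whose bytes happen to look like an entry will not have a matching (offset, size) pair in the index, so the filter discards it while the genuine message survives.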

@gomoripeti

we still check if there's a potential message in the next 8 bytes.

Aha, so scanning could return overlapping messages, but cross-checking would filter out the misinterpreted ones.

@lhoguin commented Sep 26, 2024

Right.

Got a test case:

diff --git a/deps/rabbit/test/backing_queue_SUITE.erl b/deps/rabbit/test/backing_queue_SUITE.erl
index 10129201b9..8b718ec81f 100644
--- a/deps/rabbit/test/backing_queue_SUITE.erl
+++ b/deps/rabbit/test/backing_queue_SUITE.erl
@@ -513,8 +513,7 @@ msg_store_file_scan(Config) ->
       ?MODULE, msg_store_file_scan1, [Config]).
 
 msg_store_file_scan1(Config) ->
-    Scan = fun (Blocks) ->
-        Expected = gen_result(Blocks),
+    ScanExp = fun (Blocks, Expected) ->
         Path = gen_msg_file(Config, Blocks),
         Result = rabbit_msg_store:scan_file_for_valid_messages(Path),
         case Result of
@@ -522,6 +521,9 @@ msg_store_file_scan1(Config) ->
             _ -> {expected, Expected, got, Result}
         end
     end,
+    Scan = fun (Blocks) ->
+        ScanExp(Blocks, gen_result(Blocks))
+    end,
     %% Empty files.
     ok = Scan([]),
     ok = Scan([{pad, 1024}]),
@@ -629,6 +631,10 @@ msg_store_file_scan1(Config) ->
     %% Messages with no content.
     ok = Scan([{bin, <<0:64, "deadbeefdeadbeef", 255>>}]),
     ok = Scan([{msg, gen_id(), <<>>}]),
+    %% Sneaky messages.
+    ok = ScanExp(
+        [{bin, <<0, 0:48, 17, 17, "idididididididid", 255, 0:5909/unit:8, 255>>}],
+        {ok, [{<<"idididididididid">>, 5920, 1}], 5921}),
     %% All good!!
     passed.
 

This fails with:

backing_queue_SUITE > backing_queue_tests > msg_store_file_scan
    #1. {error,
            {{exception,
                 {badmatch,
                     {expected,
                         {ok,[{<<"idididididididid">>,5920,1}],5921},
                         got,
                         {ok,[{<<17,105,100,105,100,105,100,105,100,105,100,
                                 105,100,105,100,105>>,
                               26,0}],
                             26}}},

So the first attempt, which reads <<0:56,17>> as the size field, is accepted as the message, and it ends up eating the real message. Now to write the fix I mentioned.

@lhoguin commented Sep 26, 2024

Actually, we should probably do the index_lookup immediately, while scanning. We are going to do it anyway in both locations where that function is used, so we may as well look it up now.

The reason is that even if we try the next 8 bytes, I don't know how much data to skip after doing that; whereas if we look up immediately while scanning, I know whether the message is real or not.

I think what needs to happen is that the fun from the two lists:foldl calls after scanning must instead be called during the scan, every time we find a potential message. These funs don't need to change much; they just need to tell us whether the message is to be kept or not.
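Sketched in simplified form (same assumed toy layout as before: 8-byte big-endian length of id plus payload, 16-byte id, 0xFF terminator; names hypothetical), calling the keep/drop decision during the scan means a rejected candidate only advances the scan by one byte, so a false header can no longer swallow the real message behind it.

```python
import struct

# Illustrative only: the per-message fun that used to run in a
# lists:foldl after scanning is instead invoked as soon as a candidate
# is found, and its answer decides whether we skip the whole entry or
# just advance one byte.

def scan_with_check(buf: bytes, keep) -> list:
    found, pos = [], 0
    while pos + 9 <= len(buf):
        size = struct.unpack_from(">Q", buf, pos)[0]
        end = pos + 8 + size  # offset of the expected 0xFF terminator
        if 16 <= size and end < len(buf) and buf[end] == 0xFF:
            msg_id = buf[pos + 8:pos + 24]
            if keep(msg_id, size, pos):   # immediate index lookup
                found.append((msg_id, size, pos))
                pos = end + 1             # real message: skip it whole
                continue
        pos += 1  # not in the index: try the very next byte
    return found
```

For example, 8 leading garbage bytes can form a fake header whose extent ends exactly on a real message's terminator; because the fake id fails the index lookup, the scanner advances byte by byte and still finds the real message, whereas skipping the fake entry's full extent would have consumed it.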

@gomoripeti

How does this work in the case of node recovery? Is the index already populated with msg ids from the queue indexes before the rdq files are read?

@lhoguin commented Sep 26, 2024

Yes.

@lhoguin commented Sep 26, 2024

I have opened #12392 with a patch. I have only tested it via unit tests, so it will require more extensive testing.

I had to disable file scanning when we delete files, for now. Checking against the index in that case is too costly, because when a message isn't in the index we try the next byte rather than skipping the whole message. And deleted files are not supposed to have anything in the index (or in the file); this was basically just an assert, so I am considering simply removing it.


@michaelklishin

@netrmqdev we can provide you with a one-off build to test #12392. What kind of package do you use/need? Would you be able to test such a one-off build at all? (It's fine if you wouldn't, of course.)

@netrmqdev commented Sep 27, 2024

@netrmqdev we can provide you a one-off build to test #12392. What kind of package do you use/need? Would you be able to test such a one-off build at all (it's fine if you wouldn't, of course)

I can deploy a special build on some of the environments where the issue happened, yes. This should give a good idea of whether the patch fixes things.

We deploy a custom Docker image based on the rabbitmq management one (with some additional certificates, configuration and definitions):

FROM rabbitmq:3.12.14-management
(...)

If you can provide a tagged rabbitmq management image (based on 4.0.x, I imagine), that should be fairly easy for me.

@michaelklishin commented Sep 27, 2024

@netrmqdev this (x86-64) tag should include #12392 (acb04e9): https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/sha-acb04e929f9b1030dd077c62991147876b08fff5/images/sha256-353eb18c3b4d98fd393eacaf98234a90bfc582f0c6d1f38232f8a4317b0cf4be?context=explore

It is technically 4.1.0-based but there are no meaningful changes compared to 4.0.2 (except for this PR in the image, that is).

Please give it a shot and let us know how it goes.

@netrmqdev commented Sep 27, 2024

Thanks @michaelklishin, yes, a 4.1-based image is OK.
I will try to deploy the image on Monday and keep you updated on the results.

@netrmqdev commented Sep 30, 2024

I deployed the image this morning on the same environment as the earlier 4.0.x logs, and it did not go well: the RMQ server fully crashed soon after starting.
The stack trace is:

2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>     [{rabbit_msg_store,scan_file_for_valid_messages,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                        [{file,"rabbit_msg_store.erl"},{line,1487}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store,scan_and_vacuum_message_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                        [{file,"rabbit_msg_store.erl"},{line,2108}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store,compact_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                        [{file,"rabbit_msg_store.erl"},{line,1943}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store_gc,attempt_action,3,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                           [{file,"rabbit_msg_store_gc.erl"},{line,110}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store_gc,handle_cast,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                           [{file,"rabbit_msg_store_gc.erl"},{line,68}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1056}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}
The "full" log (messages truncated) of the crash:
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> ** Reason for termination ==
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> ** {noproc,{gen_server2,call,[<0.520.0>,stop,infinity]}}
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> ** In 'terminate' callback with reason ==
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> ** {{case_clause,[{msg_location,<<...>>,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                                 2,0,16595041,123102},
(...)                
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                   {msg_location,<<...>>,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                                 2,0,20157,5435}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>     [{rabbit_msg_store,scan_file_for_valid_messages,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                        [{file,"rabbit_msg_store.erl"},{line,1487}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store,scan_and_vacuum_message_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                        [{file,"rabbit_msg_store.erl"},{line,2108}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store,compact_file,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                        [{file,"rabbit_msg_store.erl"},{line,1943}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store_gc,attempt_action,3,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                           [{file,"rabbit_msg_store_gc.erl"},{line,110}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {rabbit_msg_store_gc,handle_cast,2,
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>                           [{file,"rabbit_msg_store_gc.erl"},{line,68}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1056}]},
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0>      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,251}]}]}
2024-09-30 09:48:52.318310+00:00 [error] <0.517.0> 
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>   crasher:
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     initial call: rabbit_msg_store:init/1
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     pid: <0.517.0>
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     registered_name: []
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     exception exit: {noproc,{gen_server2,call,[<0.520.0>,stop,infinity]}}
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1161)
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     ancestors: [<0.502.0>,<0.501.0>,rabbit_vhost_sup_sup,rabbit_sup,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   <0.254.0>]
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     message_queue_len: 4
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     messages: [{'$gen_cast',
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                       {write,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           <<...>>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           11668,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           <<...>>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           flow}},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {'$gen_cast',
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                       {write,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           <<...>>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           7404,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           <<...>>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           flow}},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {'$gen_cast',
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                       {write,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           <<...>>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           11668,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           <<...>>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           flow}},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {'$gen_cast',
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                       {remove,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           <<...>>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                           [{7404,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                             <<...>>}]}}]
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     links: [<0.502.0>]
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     dictionary: [{rand_seed,{#{max => 288230376151711743,type => exsplus,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                                 next => #Fun<rand.5.65977474>,
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                                 jump => #Fun<rand.3.65977474>},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                               [225333775202711325|206966345018381886]}},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {{credit_to,<0.530.0>},505},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {{credit_to,<0.554.0>},505},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {{credit_to,<0.890.0>},664},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {{credit_to,<0.566.0>},40},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {credit_blocked,[]},
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>                   {{credit_to,<0.569.0>},143}]
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     trap_exit: true
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     status: running
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     heap_size: 833026
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     stack_size: 28
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>     reductions: 3618356
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0>   neighbours:
2024-09-30 09:48:52.386690+00:00 [error] <0.517.0> 
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>     supervisor: {<0.502.0>,rabbit_vhost_sup}
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>     errorContext: child_terminated
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>     reason: {noproc,{gen_server2,call,[<0.520.0>,stop,infinity]}}
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>     offender: [{pid,<0.517.0>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                {id,msg_store_persistent},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                {mfargs,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                    {rabbit_msg_store,start_link,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                        [<<"/">>,msg_store_persistent,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                         "/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L",
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                         [],
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                         {#Fun<rabbit_classic_queue_index_v2.9.80357470>,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                          {start,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                              [{resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,<<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>}]}}]}},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                {restart_type,transient},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                {significant,false},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                {shutdown,600000},
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0>                {child_type,worker}]
2024-09-30 09:48:52.389184+00:00 [error] <0.502.0> 
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>     supervisor: {<0.502.0>,rabbit_vhost_sup}
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>     errorContext: shutdown
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>     reason: reached_max_restart_intensity
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>     offender: [{pid,<0.517.0>},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                {id,msg_store_persistent},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                {mfargs,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                    {rabbit_msg_store,start_link,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                        [<<"/">>,msg_store_persistent,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                         "/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L",
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                         [],
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                         {#Fun<rabbit_classic_queue_index_v2.9.80357470>,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                          {start,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                              [{resource,<<"/">>,queue,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>},
[the same {resource,<<"/">>,queue,<<"QUEUE_NAME">>} entry repeated many more times]
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                               {resource,<<"/">>,queue,
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                                   <<"QUEUE_NAME">>}]}}]}},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                {restart_type,transient},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                {significant,false},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                {shutdown,600000},
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0>                {child_type,worker}]
2024-09-30 09:48:52.391846+00:00 [error] <0.502.0> 
2024-09-30 09:48:52.430038+00:00 [info] <0.513.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2024-09-30 09:48:52.438946+00:00 [info] <0.513.0> Message store for directory '/var/lib/rabbitmq/mnesia/USER@rabbitmq/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>     supervisor: {<0.501.0>,rabbit_vhost_sup_wrapper}
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>     errorContext: child_terminated
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>     reason: shutdown
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>     offender: [{pid,<0.502.0>},
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>                {id,rabbit_vhost_sup},
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>                {mfargs,{rabbit_vhost_sup_wrapper,start_vhost_sup,[<<"/">>]}},
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>                {restart_type,permanent},
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>                {significant,false},
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>                {shutdown,infinity},
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0>                {child_type,supervisor}]
2024-09-30 09:48:52.454958+00:00 [error] <0.501.0> 
2024-09-30 09:48:52.455646+00:00 [info] <0.503.0> Virtual host '/' is stopping
2024-09-30 09:48:52.456043+00:00 [info] <0.1672.0> Closing all connections in vhost '/' on node 'USER@rabbitmq' because the vhost is stopping

@lhoguin
Contributor

lhoguin commented Sep 30, 2024

Thanks! I have indeed missed something.

@lhoguin
Contributor

lhoguin commented Oct 1, 2024

@netrmqdev
Author

> Please try the most up to date image which should be https://hub.docker.com/layers/pivotalrabbitmq/rabbitmq/loic-fix-cq-scan/images/sha256-ab439bb28985d6699f8b5a7c6ca91ca0d183202521f808b888c874b41146c02e?context=explore

Thanks a lot, I deployed it this morning on the first env and so far so good 👍
If it's still good tomorrow, I will deploy it on more environments and let it run for a few days.

@michaelklishin michaelklishin changed the title from "Classic queue can run into an exception after upgrade to RabbitMQ 3.13.xx (still happening in 4.0.1)" to "Classic queue can run into an exception after upgrade to RabbitMQ 3.13.x (still happening in 4.0.1)" on Oct 1, 2024
@kaistierl

kaistierl commented Oct 2, 2024

Hi all! We're observing the same issue. We went from 3.10.0 on CentOS 7 to 3.13.2 on Debian 12 last Thursday and have since had these crashes every night in our canary environment, where we mirror live traffic of our application and test new releases before staging them to the live environment. Last night I tried out 4.0.2 and the crashes were still happening, although RabbitMQ managed to automatically restart the affected queues, which didn't work with 3.13.2. I have now gone back to 3.12.14 and will again observe for a night or two.

Long story short: if someone can provide a DEB package with the attempted fix, I can do some real-world testing of it in our canary environment afterwards. Since we have no real users on there, it would be quite easy for me to do.

@lhoguin
Contributor

lhoguin commented Oct 2, 2024

I'm sure @michaelklishin can help.

@michaelklishin
Member

michaelklishin commented Oct 2, 2024

@kaistierl I will build a 64-bit Debian package (off of main plus this PR) later today.

@michaelklishin
Member

@kaistierl would an ARM64 build do? Our x86-64 environment that I used to produce one-off builds needs to be re-created on different infrastructure for reasons outside of my control, and it will take some time.

But I can produce a local ARM64 Debian package.

@kaistierl

kaistierl commented Oct 4, 2024

@michaelklishin unfortunately not - I would need a x86-64 package.

...by the way: After downgrading to 3.12.14 everything is running stable again for me, so my observations match what was reported by others.

@michaelklishin
Member

Here is a generic binary build of #12392, which is platform-independent: https://github.com/rabbitmq/rabbitmq-server/actions/runs/11120630513/artifacts/1999351253.

I cannot promise any ETA for one-off x86-64 builds.

@michaelklishin
Member

Note that both 3.12 and 3.13 are out of community support, so when a new patch release is out you will have to move to 4.0.x; a 3.13.x release will be available later, and to paying customers only.

@michaelklishin
Member

@netrmqdev any updates from your clusters?

@michaelklishin michaelklishin changed the title from "Classic queue can run into an exception after upgrade to RabbitMQ 3.13.x (still happening in 4.0.1)" to "Classic queue can run into an exception after upgrade to RabbitMQ 3.13.7 (or 4.0.2)" on Oct 4, 2024
@netrmqdev
Author

> @netrmqdev any updates from your clusters?

For reasons unrelated to RabbitMQ, I was unable to do as much testing as I wanted on my envs. However, I now have about 4 of them running the patch for ~12-24 hours each, with no issues so far. They will keep running over the weekend, so on Monday I can give more feedback 👍

@gomoripeti
Contributor

@kaistierl as the patch only modifies one file, you can extract `rabbit_msg_store.beam` from the generic build Michael provided and copy it into a vanilla 4.0.2 install (under `plugins/rabbit-4.0.2/ebin/` in the installation directory).
I'm mentioning this because testing by multiple parties would be highly appreciated (and we cannot reproduce this easily with real traffic).
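For illustration, a minimal sketch of that one-module swap. All paths here are hypothetical mocks created under temporary directories so the commands run as-is; on a real node the source would be the unpacked patched generic build, the destination your actual 4.0.2 install, and you would stop the node before copying and restart it afterwards.

```shell
set -eu

# Mock the two trees so this sketch is runnable; replace with real paths.
SRC=$(mktemp -d)/rabbitmq_server-4.0.2   # unpacked patched generic build (mock)
DST=$(mktemp -d)/rabbitmq-install        # existing 4.0.2 install (mock)
mkdir -p "$SRC/plugins/rabbit-4.0.2/ebin" "$DST/plugins/rabbit-4.0.2/ebin"
echo patched > "$SRC/plugins/rabbit-4.0.2/ebin/rabbit_msg_store.beam"
echo stock   > "$DST/plugins/rabbit-4.0.2/ebin/rabbit_msg_store.beam"

# The actual patch step: overwrite the stock module with the patched one.
# (On a real node: rabbitmqctl stop first, restart the node afterwards.)
cp "$SRC/plugins/rabbit-4.0.2/ebin/rabbit_msg_store.beam" \
   "$DST/plugins/rabbit-4.0.2/ebin/rabbit_msg_store.beam"

cat "$DST/plugins/rabbit-4.0.2/ebin/rabbit_msg_store.beam"
```

Since both builds are 4.0.2 plus a single-file change, replacing just that `.beam` is equivalent to running the full patched build.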

@gomoripeti
Contributor

@michaelklishin I think a 3.13 release is necessary, otherwise there is no upgrade path from 3.12 to 4.0 for affected users of the open source version.
(There are other commits on 3.13.x waiting for a release, and #12417 also suggested a final release of 3.13, but the strongest argument is to have this patch available.)

@michaelklishin
Member

@gomoripeti there is no 3.12 to 4.0 upgrade path. This issue does not affect the absolute majority of users, and a brief upgrade to 3.13.7 will be enough in any case.

There will be some v3.13.x backports and we are discussing what to do about release notes but there won't be any new releases unless we discover something truly serious (such as a security vulnerability or a true upgrade blocker).

@gomoripeti
Contributor

By upgrade path I meant being able to jump from 3.12 -> 3.13.N -> 4.0, where 3.13.N is a version that does not exhibit this bug.

I was happy to see that the core team took this issue seriously (the shared message store is not able to read its own file format and loses messages), so I assumed this is an important fix to backport. I'm not sure it is fully convincing for a user to upgrade to 4.0 via only a short stay on a 3.13 that might corrupt their messages. And since 4.0 is a major release with breaking changes, users will take longer to feel ready to upgrade to it than to a fixed 3.13.
But I understand the current guideline is that only the latest minor version has community support and only that one gets open-source patch releases. I think it would be good to have a short overlap, or at least to release a final patch version of the previous minor series just before releases of the following minor series begin.
