Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mosquitto bridge connection deadlock waiting for message from Cumulocity IoT #3141

Open
reubenmiller opened this issue Sep 26, 2024 · 2 comments
Labels
bug Something isn't working theme:mqtt Theme: mqtt and mosquitto related topics

Comments

@reubenmiller
Copy link
Contributor

reubenmiller commented Sep 26, 2024

Describe the bug

When using the mosquitto persistence (e.g. the mosquitto.db file), the mosquitto bridge function stops working and no messages are being received or published to the cloud.

Symptoms

  • mosquitto health topic (te/device/main/service/mosquitto-c8y-bridge/status/health) toggles periodically between 0 and 1
  • tedge-mapper-c8y is not able to get a valid token (publishing to c8y/s/uat goes unanswered on c8y/s/dat)
  • Cumulocity operation sent to the device are marked with the "delivery" fragment indicating that the message has been received by the broker, however the message is not published by the bridge to the local broker
  • Restarting the tedge-mapper-c8y service does not have any effect
  • Restarting the mosquitto does not have any effect

Below shows the pattern where the bridge health is toggling and the requests by the mapper to get a new Cumulocity JWT.

$ tedge mqtt sub '#'
[c8y/s/uat] 
[c8y/s/uat] 
[c8y/s/uat] 
[c8y/s/uat] 
[c8y/s/uat] 
[c8y/s/uat] 
[te/device/main/service/mosquitto-c8y-bridge/status/health] 0
[c8y/s/us] 102,rmi_cb001:device:main:service:mosquitto-c8y-bridge,service,mosquitto-c8y-bridge,down
[te/device/main/service/mosquitto-c8y-bridge/status/health] 1
[c8y/s/us] 102,rmi_cb001:device:main:service:mosquitto-c8y-bridge,service,mosquitto-c8y-bridge,up
[c8y/s/uat] 
[c8y/s/uat] 
[c8y/s/uat] 
[c8y/s/uat] 
[te/device/main/service/mosquitto-c8y-bridge/status/health] 0
[c8y/s/us] 102,rmi_cb001:device:main:service:mosquitto-c8y-bridge,service,mosquitto-c8y-bridge,down
[te/device/main/service/mosquitto-c8y-bridge/status/health] 1
[c8y/s/us] 102,rmi_cb001:device:main:service:mosquitto-c8y-bridge,service,mosquitto-c8y-bridge,up
[c8y/s/uat] 
[c8y/s/uat] 

Workaround

The cloud connection

  1. Disconnect tedge (which stops mosquitto and tedge-mapper-c8y)

    tedge disconnect c8y
  2. Remove the mosquitto.db file

    rm /var/lib/mosquitto/mosquitto.db
  3. Connect tedge (which starts mosquitto and tedge-mapper-c8y)

    tedge connect c8y

To Reproduce

It is currently unknown how to reproduce the problem. It might be possible to use the attached mosquitto.db to reproduce the mosquitto bridge deadlock.

This procedure is not verified but you could try:

  1. Stop mosquitto

    systemctl stop mosquitto
  2. Copy the mosquitto.db.tgz file, and decompress it to /var/lib/mosquitto/mosquitto.db

  3. Change the permissions of the file

    mkdir -p /var/lib/mosquitto/
    chmod 755 /var/lib/mosquitto/
    chown mosquitto:mosquitto /var/lib/mosquitto/
    
    chmod 600 /var/lib/mosquitto/mosquitto.db
    chown mosquitto:mosquitto /var/lib/mosquitto/mosquitto.db
  4. Enable the mosquitto persistence (assuming you haven't already this setting to mosquitto)

    persistence true
    persistence_location /var/lib/mosquitto/
    
  5. Start mosquitto

    systemctl start mosquitto
  6. Monitor the local MQTT broker looking at the mosquitto health etc.

    tedge mqtt sub '#'

Expected behavior

Screenshots

Environment (please complete the following information):

Property Value
OS [incl. version] Alpine Linux v3.18
Hardware [incl. revision] docker
System-Architecture Linux 9bba511280c7 6.8.0-39-generic #39-Ubuntu SMP PREEMPT_DYNAMIC Sat Jul 6 02:50:39 UTC 2024 aarch64 Linux
thin-edge.io version tedge 1.3.0
mosquitto version 2.0.18

Additional context

Logs from mosquitto and tedge-mapper-c8y

tedge-mapper-c8y | 2024-09-26T15:01:38.444085498Z  INFO mqtt_channel::connection: MQTT connection established    
mosquitto | 1727362898: osujqlfosa 1 c8y/s/dat
tedge-mapper-c8y | 2024-09-26T15:01:38.468649108Z  INFO c8y_api::http_proxy: JWT token requested
tedge-mapper-c8y | 2024-09-26T15:01:58.471328117Z  INFO c8y_api::http_proxy: JWT token requested
tedge-mapper-c8y | 2024-09-26T15:02:18.474034123Z  INFO c8y_api::http_proxy: JWT token requested
tedge-mapper-c8y | 2024-09-26T15:02:38.476886867Z ERROR c8y_api::http_proxy: Fail to retrieve JWT token after 3 attempts
tedge-mapper-c8y | 2024-09-26T15:02:38.477218784Z  INFO mqtt_channel::connection: MQTT connection closed    
mosquitto | 1727362958: Client osujqlfosa disconnected.
tedge-mapper-c8y | 2024-09-26T15:02:38.477961577Z ERROR c8y_http_proxy::actor: An error occurred while retrieving internal Id, operation will retry in 20 seconds
tedge-mapper-c8y |  Error: CustomError("JWT token not available")    
tedge-mapper-c8y | 2024-09-26T15:02:58.479105094Z  INFO mqtt_channel::connection: MQTT connecting to broker: host=127.0.0.1:1883, session_name=None    
mosquitto | 1727362978: New connection from 127.0.0.1:53426 on port 1883.
mosquitto | 1727362978: New client connected from 127.0.0.1:53426 as ueidzucazb (p2, c1, k60).
tedge-mapper-c8y | 2024-09-26T15:02:58.47954322Z  INFO mqtt_channel::connection: MQTT connection established    
mosquitto | 1727362978: ueidzucazb 1 c8y/s/dat
tedge-mapper-c8y | 2024-09-26T15:02:58.50466554Z  INFO c8y_api::http_proxy: JWT token requested
tedge-mapper-c8y | 2024-09-26T15:03:18.507521871Z  INFO c8y_api::http_proxy: JWT token requested
mosquitto | 1727363006: Connecting bridge edge_to_c8y (thin-edge-io.eu-latest.cumulocity.com:8883)
tedge-mapper-c8y | 2024-09-26T15:03:38.509243964Z  INFO c8y_api::http_proxy: JWT token requested
tedge-mapper-c8y | 2024-09-26T15:03:58.513521985Z ERROR c8y_api::http_proxy: Fail to retrieve JWT token after 3 attempts
tedge-mapper-c8y | 2024-09-26T15:03:58.514573903Z ERROR c8y_http_proxy::actor: An error occurred while retrieving internal Id, operation will retry in 20 seconds
tedge-mapper-c8y |  Error: CustomError("JWT token not available")    
tedge-mapper-c8y | 2024-09-26T15:03:58.514723736Z  INFO mqtt_channel::connection: MQTT connection closed    
mosquitto | 1727363038: Client ueidzucazb disconnected.

Output from mosquitto_db_dump tool

Using the mosquitto_dump_db (built from https://github.com/eclipse/mosquitto/tree/master/apps/db_dump), the following shows the output:

% ./mosquitto_db_dump --stats mosquitto.db.bak
DB_CHUNK_CFG:        1
DB_CHUNK_MSG_STORE:  132
DB_CHUNK_CLIENT_MSG: 112
DB_CHUNK_RETAIN:     24
DB_CHUNK_SUB:        38
DB_CHUNK_CLIENT:     4
% ./mosquitto_db_dump --client-stats mosquitto.db.bak
SC: 0     SS: 0       MC: 108     MS: 33219   Cumulocity
SC: 28    SS: 1536    MC: 3       MS: 459     tedge-mapper-c8y
SC: 1     SS: 89      MC: 1       MS: 136     last_will_c8y_mapper
SC: 9     SS: 567     MC: 0       MS: 0       tedge-agent#te/device/main//

The following one-liner looks at the MQTT client ID of the Cumulocity IoT bridge, and gathers some statistics on

The bridge queue (e.g. how many are outbound and inbound) can be calculated using:

./mosquitto_db_dump mosquitto.db.bak | grep "Client ID: Cumulocity" -A 7 -B 2 | grep Direction | sort | uniq -c
   1 	Direction: 0
 107 	Direction: 1

The above shows that there is one inbound message (assuming that Direction: 0 means inbound, and 1 outbound). Below shows the meta information of the pending inbound message:

DB_CHUNK_CLIENT_MSG:
	Length: 26
	Client ID: Cumulocity
	Store ID: 9241
	MID: 1
	QoS: 2
	Retain: 0
	Direction: 0
	State: 7
	Dup: 0

The message states can also be aggregated to see how many message are in which state:

./mosquitto_db_dump mosquitto.db.bak | grep "Client ID: Cumulocity" -A 7 -B 2 | grep State | sort | uniq -c
  87 	State: 11
  20 	State: 3
   1 	State: 7
@reubenmiller reubenmiller added the bug Something isn't working label Sep 26, 2024
@reubenmiller
Copy link
Contributor Author

Current theory

After some research, the following mosquitto issue #1214 was found which says:

Running without --stats, and just looking at the client messages, I see there are a few messages with state 7, which is "wait for pubrel", then the rest of the 219k messages are in state 11, which is "queued". It seems what has happened here is that some part of the communication has got confused and is waiting for messages that will presumably never come. Because they don't come and messages must be delivered in order, the queue will never empty. Now this shouldn't happen of course, so I'm going to try to figure out how it happened and how to make it unstick. I think this may be a problem unique to bridges.

The description seems to match the symptoms and the collected evidence.

Likely root cause

  1. The bridge is waiting for a response (PUBREL) to be received and the server never sends it.
  2. The mosquitto bridge then continues waiting for the message and does not send any MQTT PING or MQTT PING acknowledgements in the meantime which results in the server disconnecting the client.
  3. The mosquitto bridge then retries the connection but does not seem to remove the PUBREL from the internal mosquitto.db which results in a connect/disconnect cycle.

@reubenmiller reubenmiller added the theme:mqtt Theme: mqtt and mosquitto related topics label Sep 26, 2024
@reubenmiller
Copy link
Contributor Author

FYI: The mosquitto message states are visible in the source code: lib/mosquitto_internal.h#L82

enum mosquitto_msg_direction {
	mosq_md_in = 0,
	mosq_md_out = 1
};

enum mosquitto_msg_state {
	mosq_ms_invalid = 0,
	mosq_ms_publish_qos0 = 1,
	mosq_ms_publish_qos1 = 2,
	mosq_ms_wait_for_puback = 3,
	mosq_ms_publish_qos2 = 4,
	mosq_ms_wait_for_pubrec = 5,
	mosq_ms_resend_pubrel = 6,
	mosq_ms_wait_for_pubrel = 7,
	mosq_ms_resend_pubcomp = 8,
	mosq_ms_wait_for_pubcomp = 9,
	mosq_ms_send_pubrec = 10,
	mosq_ms_queued = 11
};

enum mosquitto_client_state {
	mosq_cs_new = 0,
	mosq_cs_connected = 1,
	mosq_cs_disconnecting = 2,
	mosq_cs_active = 3,
	mosq_cs_connect_pending = 4,
	mosq_cs_connect_srv = 5,
	mosq_cs_disconnect_ws = 6,
	mosq_cs_disconnected = 7,
	mosq_cs_socks5_new = 8,
	mosq_cs_socks5_start = 9,
	mosq_cs_socks5_request = 10,
	mosq_cs_socks5_reply = 11,
	mosq_cs_socks5_auth_ok = 12,
	mosq_cs_socks5_userpass_reply = 13,
	mosq_cs_socks5_send_userpass = 14,
	mosq_cs_expiring = 15,
	mosq_cs_duplicate = 17, /* client that has been taken over by another with the same id */
	mosq_cs_disconnect_with_will = 18,
	mosq_cs_disused = 19, /* client that has been added to the disused list to be freed */
	mosq_cs_authenticating = 20, /* Client has sent CONNECT but is still undergoing extended authentication */
	mosq_cs_reauthenticating = 21, /* Client is undergoing reauthentication and shouldn't do anything else until complete */
};

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working theme:mqtt Theme: mqtt and mosquitto related topics
Projects
None yet
Development

No branches or pull requests

1 participant