unacknowledged local command requests are difficult to clear when MQTT broker persistence is not configured #2862

Open
reubenmiller opened this issue May 8, 2024 · 1 comment
Labels
bug Something isn't working theme:mqtt Theme: mqtt and mosquitto related topics

Comments

@reubenmiller
Contributor

Describe the bug

If the local MQTT broker is not configured for persistence and the user creates local commands (i.e. commands not issued from the cloud), then after the MQTT broker is restarted the tedge-agent rejects any new operation request of the same type while an uncleared/unacknowledged operation still exists. However, since the local MQTT broker does not persist messages, it is difficult for the user to see which operations still need to be cleared before the tedge-agent will process future requests.

The only options to see which command topics are in progress are either to inspect the tedge-agent logs and look for the following output:

2024-05-08T06:24:35.332521Z ERROR tedge_agent::tedge_operation_converter::actor: software_list operation request cannot be processed: Two concurrent requests are under execution on the same topic: te/device/main///cmd/software_list/local-123

Or to inspect the internal state file (/opt/homebrew/etc/tedge/.agent/workflows), which the tedge-agent uses to track the state (which is NOT RECOMMENDED, of course).

To Reproduce

The situation can be reproduced using the following steps:

  1. Start mosquitto (without persistence configured)

  2. Start tedge-agent

  3. Create a local software_list operation

    tedge mqtt pub -r 'te/device/main///cmd/software_list/local-1234' '{"status":"init"}'

  4. Restart mosquitto

  5. Subscribe to the commands

    tedge mqtt sub 'te/device/main///cmd/+/+'

    No results should be shown (as mosquitto has not been configured to persist messages)

  6. Try submitting a new command request (with a different command id)

    tedge mqtt pub -r 'te/device/main///cmd/software_list/local-2222' '{"status":"init"}'

    The operation will not be processed and an error message will appear in the tedge-agent logs similar to:

    2024-05-08T06:24:35.332521Z ERROR tedge_agent::tedge_operation_converter::actor: software_list operation request cannot be processed: Two concurrent requests are under execution on the same topic: te/device/main///cmd/software_list/local-1234
    

Expected behavior

The expected behaviour is not yet defined, but could be defined by answering the following question:

How does a user know which operations need to be cleared so that the tedge-agent will continue to process new requests?

Screenshots

Environment (please complete the following information):

  • OS [incl. version]: Any
  • Hardware [incl. revision]: Any
  • System-Architecture [e.g. result of "uname -a"]: Any
  • thin-edge.io version [e.g. 0.1.0]: 1.0.2~273+gd05e9b6

Additional context

In the above situation, how does the user know there are unacknowledged command requests?

The only chance the user has to clear the operation is to inspect the tedge-agent logs and view the topic name.

2024-05-08T06:24:35.30569Z  INFO tedge_agent::tedge_operation_converter::actor: Waiting failed restart operation to be cleared
2024-05-08T06:24:35.321863Z  INFO tedge_agent::tedge_operation_converter::actor: Waiting successful software_list operation to be cleared
2024-05-08T06:24:35.327175Z  INFO tedge_agent::tedge_operation_converter::actor: Waiting successful software_list operation to be cleared
2024-05-08T06:24:35.332453Z  INFO tedge_agent::tedge_operation_converter::actor: Waiting successful software_list operation to be cleared
2024-05-08T06:24:35.332521Z ERROR tedge_agent::tedge_operation_converter::actor: software_list operation request cannot be processed: Two concurrent requests are under execution on the same topic: te/device/main///cmd/software_list/local-123
2024-05-08T06:24:35.364497Z ERROR tedge_agent::tedge_operation_converter::actor: software_list operation request cannot be processed: Two concurrent requests are under execution on the same topic: te/device/main///cmd/software_list/local-1234
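
As a stopgap, the blocking topics can be pulled out of the agent logs with a small script. The following is a minimal sketch, assuming the error wording shown in the log lines above (it may differ in other tedge-agent versions); the function name `blocking_topics` is hypothetical and not part of thin-edge.io.

```python
from __future__ import annotations

import re

# Matches the conflict error quoted in this issue. The wording is copied from
# the log output above and is an assumption for any other tedge-agent version.
CONFLICT_RE = re.compile(
    r"Two concurrent requests are under execution on the same topic: (\S+)"
)


def blocking_topics(log_text: str) -> list[str]:
    """Return the distinct command topics named in conflict errors,
    in order of first appearance."""
    seen: list[str] = []
    for match in CONFLICT_RE.finditer(log_text):
        topic = match.group(1)
        if topic not in seen:
            seen.append(topic)
    return seen
```

Feeding it the captured log text (for example the contents of a saved log file) returns the topics that are still blocking new requests of the same type.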

Inspecting the internal file that tedge-agent uses to persist the workflow state also yields the list of commands which are known to the agent but not to the user.

file: /opt/homebrew/etc/tedge/.agent/workflows

{
    "version": "V1",
    "commands": {
        "te/device/main///cmd/restart/local-1234": {
            "unix_timestamp": 1712050067,
            "status": "failed",
            "payload": {
                "reason": "Fail to trigger a restart: Command returned non 0 exit code: Command { std: \"/usr/bin/sudo\" \"sync\", kill_on_drop: false }",
                "status": "failed"
            }
        },
        "te/device/main///cmd/software_list/local-1234": {
            "unix_timestamp": 1709384026,
            "status": "successful",
            "payload": {
                "currentSoftwareList": [
                    {
                        "modules": [],
                        "type": ""
                    }
                ],
                "status": "successful"
            }
        },
        "te/device/main///cmd/software_list/local-123": {
            "unix_timestamp": 1715148092,
            "status": "successful",
            "payload": {
                "currentSoftwareList": [
                    {
                        "modules": [
                            {
                                "name": "tedge",
                                "version": "1.0.2-rc273+gd05e9b6"
                            }
                        ],
                        "type": "brew"
                    }
                ],
                "status": "successful"
            }
        },
        "te/device/main///cmd/software_list/local-222": {
            "unix_timestamp": 1709384026,
            "status": "successful",
            "payload": {
                "currentSoftwareList": [
                    {
                        "modules": [],
                        "type": ""
                    }
                ],
                "status": "successful"
            }
        }
    }
}
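
To illustrate what a user currently has to do by hand, here is a minimal sketch that lists the command topics recorded in a state file of the shape shown above and prints a candidate command for clearing each one. It assumes the "V1" layout from this issue (an internal format that may change, and reading it is not recommended). It also assumes that a command is cleared by publishing an empty retained message on its topic, which this issue does not state explicitly. The function names are hypothetical.

```python
import json
import shlex


def pending_commands(state_text):
    """Return (topic, status) pairs from the agent state JSON.

    Assumes the layout shown in this issue: a top-level "commands"
    object keyed by command topic, each entry carrying a "status".
    """
    state = json.loads(state_text)
    commands = state.get("commands", {})
    return [(topic, entry.get("status", "unknown")) for topic, entry in commands.items()]


def clear_command_lines(state_text):
    """Build one shell command per recorded topic.

    Assumption: publishing an empty retained message clears the command.
    The lines are only returned as strings, nothing is executed.
    """
    return [
        f"tedge mqtt pub -r {shlex.quote(topic)} ''"
        for topic, _status in pending_commands(state_text)
    ]
```

The generated lines are meant to be reviewed before being run, since clearing a command that is still being processed may not be what the user wants.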
@reubenmiller reubenmiller added bug Something isn't working theme:mqtt Theme: mqtt and mosquitto related topics labels May 8, 2024
@didier-wenzek
Contributor

How does a user know which operations need to be cleared so that the tedge-agent will continue to process new requests?

Since #3149, the agent re-publishes on start all the pending operations persisted in /etc/tedge/.agent/workflows.
