Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make transition event topic reliable to avoid topic lost #1171

Open
wants to merge 3 commits into
base: rolling
Choose a base branch
from

Conversation

kjjpc
Copy link

@kjjpc kjjpc commented Jul 19, 2024

This PR is related to #1166.
Lifecycle action of launch_ros occasionally fails to transit because of topic event lost.
This PR make transition event topic reliable.
After the marge of this PR, I will make a PR for launch_ros.

@kjjpc kjjpc force-pushed the make-transition-event-reliable branch from c26adc8 to 614c989 Compare July 19, 2024 04:10
@@ -108,6 +108,9 @@ rcl_lifecycle_com_interface_publisher_init(

// initialize publisher
rcl_publisher_options_t publisher_options = rcl_publisher_get_default_options();
publisher_options.qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rmw_qos_profile_default already sets this RMW_QOS_POLICY_RELIABILITY_RELIABLE, so no need to overwrite this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've removed the line.

Comment on lines +112 to +113
publisher_options.qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
publisher_options.qos.depth = 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
publisher_options.qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
publisher_options.qos.depth = 1;
// transition event topic needs to be latched for the subscription joins later.
publisher_options.qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
publisher_options.qos.depth = 1;

changing publisher's QoS to transient local should not break any downstream subscriptions, https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Quality-of-Service-Settings.html#id6.

IMO this transition event topic needs to be latched to deliver the latest state for the subscribers that join later. probably adding comments why we use transient local with depth 1 would be useful.

besides, Managed Node Design also says the following.

A topic should be provided to broadcast the new life cycle state when it changes. This topic must be latched.

@kjjpc kjjpc force-pushed the make-transition-event-reliable branch from 1f7f713 to aa503cd Compare July 22, 2024 05:26
kjjpc and others added 3 commits July 22, 2024 14:26
@mjcarroll
Copy link
Member

@kjjpc Thanks for reporting. We discussed this at the recent PMC maintainers' meeting and came to the following conclusion.

In this case, changing the durability of the transition event could have far reaching impacts on the system that are undesirable. The main concern is that if you had a late-joining lifecycle node, that it could be transitioned inadvertently by the latched topic. This would be pretty un-intuitive and could cause issues in larger systems.

The real issue here stems from the fact that we are triggering the transition only when the process becomes available, rather than waiting for the node to be in the unconfigured state.

I think that the real solution here would be detect when the node is started in the unconfigured state in your launch file, like:

    emit_active = RegisterEventHandler(
        launch_ros.event_handlers.OnStateTransition(
            target_lifecycle_node=lc_node,
            goal_state='unconfigured',
            ...
        ))

The other alternative would be to add on OnNodeReady event to launch_ros so that we are more intelligently waiting for the node to start.

@fujitatomoya
Copy link
Collaborator

@mjcarroll

i might be mistaken, but i was thinking opposite...

The main concern is that if you had a late-joining lifecycle node, that it could be transitioned inadvertently by the latched topic. This would be pretty un-intuitive and could cause issues in larger systems.

either that is lifecycle node or just a subscription, delivering the latest state change event via latched topic to the subscription that does specify the QoS with durability by the user application makes sense?
this is something user application wants to do that, so is specified by user.
if they do not want to receive the latest state change event via latched topic, they just do not need to specify the QoS durability on the subscription side? without this change, system cannot provide the choice for the user application.
besides, this does not break the communication between them, https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Quality-of-Service-Settings.html#qos-compatibilities

@mjcarroll
Copy link
Member

either that is lifecycle node or just a subscription, delivering the latest state change event via latched topic to the subscription that does specify the QoS with durability by the user application makes sense?

Hmm, maybe my understanding of the issue here was wrong. I thought we were making the subscription transient_local, which would cause the issues I described.

@kjjpc
Copy link
Author

kjjpc commented Jul 29, 2024

@mjcarroll
Thanks for the discussion. The issue is that the transition event topic is volatile in the current implementation.
launch_ros occasionally fails to receive the change event topic because the topic is not latched.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants