Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DASH HA #3

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

DASH HA #3

wants to merge 1 commit into from

Conversation

mukeshmv
Copy link
Owner

@mukeshmv mukeshmv commented Mar 5, 2024

No description provided.

* @flags CREATE_AND_SET
* @default 100
*/
SAI_DASH_HA_ATTR_VIP_CONVERGENCE_TIMEOUT,
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this SWITCHOVER_TIMEOUT.
add a usage comment to indicate planned switchover

* @flags CREATE_ONLY
* @default empty
*/
SAI_DASH_DEVICE_ATTR_TRUSTED_VNI_LIST = SAI_DASH_DEVICE_ATTR_START,
Copy link

@r12f r12f Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have the VNI APIs in DASH, so we can use that to move forward without introducing the new one.

if any other things required on DPU level, they could go to switch extensions.

/** End of custom range base */
SAI_DASH_HA_ATTR_CUSTOM_RANGE_END

} sai_dash_ha_attr_t;
Copy link

@r12f r12f Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the HA set, since it defines the DPU pair, so we could rename this to ha set, which will better align with the HLD.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One mismatch is that here the HA object has multiple VIP ID support for the same DPU pair.

* @flags CREATE_AND_SET
* @default empty
*/
SAI_DASH_HA_ATTR_VIP_INFO_LIST,
Copy link

@r12f r12f Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using a list with structs is not going to be DASH compatible, because we won't be able to model this in p4 - the action parameter doesn't take struct as input, hence both bmv2 and API generation will fail.

* @flags CREATE_AND_SET
* @default NULL
*/
SAI_SWITCH_ATTR_DASH_HA_FSM_EVENT_NOTIFY,
Copy link

@r12f r12f Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since DASH is still SAI extension, it is better to move this into the switch extension header we have in DASH.

SAI_DASH_HA_STAT_HEARTBEAT_MSG_RECEIVED,

/** DASH HA stat flow sync requests sent */
SAI_DASH_HA_STAT_FLOW_SYNC_REQ_SENT,
Copy link

@r12f r12f Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is better to track flow sync req on ENI level, otherwise it won't be useful for live site debugging, especially post mortem analysis for historical issues (a specific VM was having problem)

SAI_DASH_HA_STAT_HEARTBEAT_MSG_SENT,

/** DASH HA stat heartbeat messages received */
SAI_DASH_HA_STAT_HEARTBEAT_MSG_RECEIVED,
Copy link

@r12f r12f Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for data path probe, we will need record the failed count. either for debugging or alerting, the failed count is usually more accurate and intuitive.

(sent can be more or less than received to due to timing issues, which is painful for logging metrics - is 95% success rate a problem? or what does 105% success rate mean...)

/** DASH HA stat flow sync acknowledgements sent */
SAI_DASH_HA_STAT_FLOW_SYNC_ACK_RECEIVED,

} sai_dash_ha_stat_t;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please feel free to take a look of the counter design, which have the counters defined here included with many extra ones: https://github.com/r12f/DASH/blob/user/r12f/ha-session-api/documentation/high-avail/ha-api-hld.md#46-counters

* @flags CREATE_ONLY
* @default 0
*/
SAI_ENI_ATTR_HA_VIP_ID,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will be better to avoid using VIP as the id to control the HA, since it creates a strong binding between VIP and HA scope.

* @type bool
* @flags MANDATORY_ON_CREATE | CREATE_ONLY
*/
SAI_ENI_ATTR_HA_OWNER,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name "HA owner" is a bit confusing, as it makes people think it is active side. maybe we could come up with a better name.

Copy link
Owner Author

@mukeshmv mukeshmv Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like FLOW_MASTER or CONFIG_OWNER

/** HA Stop followed by HA Start API needs to be invoked */
SAI_DASH_HA_FSM_EVENT_STOP_START_REQUIRED,

} sai_dash_ha_fsm_event_t;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be a missing feature:

for every HA state change, we need it to be reported all the way back to SDN controller, and this event is missing here. this event is critical for debugging and monitoring, so it is better to add it.

sai_get_dash_ha_stats_ext_fn get_dash_ha_stats_ext;
sai_clear_dash_ha_stats_fn clear_dash_ha_stats;
sai_clear_dash_ha_all_stats_fn clear_dash_ha_all_stats;
sai_start_dash_ha_fsm_fn start_dash_ha_fsm;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these APIs will be non-standard APIs in SAI API, and it will be better to avoid these changes. otherwise, to support it there will be a lot of additional changes in sairedis, vendor sai, client/server sai, syncd and so on.

and as an example: start_dash_ha_fsm can be an sai set API call to update the oper_role or other attribuet on ha fsm.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for stop, activate, switchover.

for doing flow resimulation, we will provide the flow resimulation api in separate dedicated doc, so we don't need it here.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These triggers are per vip-id in the API model.

*/
typedef enum _sai_dash_ha_oper_role_t
{
SAI_DASH_HA_OPER_ROLE_NONE,
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments to describe each role

@mukeshmv mukeshmv force-pushed the master branch 3 times, most recently from 63f8d9c to 0773867 Compare July 24, 2024 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants