Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix xcvrd to support 400G ZR optic #293

Merged
merged 9 commits into from
Oct 20, 2022
Merged

Conversation

abohanyang
Copy link
Contributor

@abohanyang abohanyang commented Sep 16, 2022

Fix Xcvrd code to support 400G ZR optic

Description

While bringing the 400G ZR module, I found an issue where some 400G ZR optics may take more than 60 secs to finish up the Datapath initialization. This is a problem because the xcvrd uses a hardcoded 60 secs Datapath initialization timeout for all CMIS modules and apparently, this 60 secs timeout is not sufficient for the 400G ZR module.
The CMIS spec has Datapath init/de-init timeout values defined in Byte 144. We can use these values from the EEPROM instead of presenting any hardcoded values.

Motivation and Context

This is needed to bring up 400G ZR module.

How Has This Been Tested?

These changes are tested against arista_7800r3a_36d2_lc and arista_7280cr3_32p4 platforms.

@abohanyang abohanyang marked this pull request as ready for review September 16, 2022 06:19
@abohanyang
Copy link
Contributor Author

@arlakshm

@prgeor
Copy link
Collaborator

prgeor commented Sep 16, 2022

@abohanyang can you please check if on_port_update_event() is triggered even after this PR #285. If so please share the syslog file.

@@ -923,6 +923,7 @@ class CmisManagerTask:

CMIS_MAX_RETRIES = 3
CMIS_DEF_EXPIRED = 60 # seconds, default expiration time
CMIS_DEF_EXPIRED_ZR = 200 # seconds, expiration time for ZR module
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we read the module advertised datapath init time and use it so that we have it working for both ZR and non-ZR without hardcoding this here?

Copy link

@shyam77git shyam77git Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid hard-coding of timer values is tracked via following git issues:
#290
#294

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To read CMIS dp state duration, I create another PR in sonic-platform-common (sonic-net/sonic-platform-common#312).

Copy link

@shyam77git shyam77git Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abohanyang - How is #312 (above) different from #290 ?
Looks same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @shyam77git , it's to address the same issue. but some of my code changes (sonic-net/sonic-platform-common#312 ) are in a different repo. I think we need a separate PR for that part.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @abohanyang
Is #312 sufficient to take care of what's mentioned in #290 or anything more required?
Note that #294 to take care for xcvrd side changes.

Copy link
Contributor Author

@abohanyang abohanyang Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. 312 contains the implementation of API, get_datapath_init_duration. I just commit more changes in this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. 312 contains the implementation of API, get_datapath_init_duration. I just commit more changes in this PR.

Thanks for the clarification.
This PR now is going to take care of git issue #293 - I'm reviewing this further

sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
@@ -1516,7 +1539,7 @@ def task_worker(self):
# D.1.3 Software Configuration and Initialization
api.set_datapath_init(host_lanes)
# TODO: Use fine grained timeout when the CMIS memory map is available
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED)
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.get_cmis_expired(api))
Copy link

@shyam77git shyam77git Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abohanyang : get_cmis_expired() - is this a temporary function defined to validate this fix - #293?

practically, get_datapath_init_duration() and get_datapath_deinit_duration() to be invoked instead.
Would you be taking care of this as part of this PR? or shall we take care of this in xcvrd.py via git issue #294 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I just commit more changes in this PR to call get_datapath_init_duration and get the expiration time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I just commit more changes in this PR to call get_datapath_init_duration and get the expiration time.

OK.
@abohanyang : can you please attach UT cases, logs for this case (400G ZR optics FSM working via xcrd.py as orchestrator with these timeouts all the way to interface in 'operational up' state)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prgeor
prgeor previously approved these changes Sep 30, 2022
sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
1. Clean up comments.
2. Revert port breakout fix and corresponding test change.
3. Check deinit duration for deinit case
4. Check datapath init pending
sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
Comment on lines 1201 to 1226
def check_datapath_init_pending(self, api, channel):
"""
Check if the CMIS datapath init is pending

Args:
api:
XcvrApi object
channel:
Integer, a bitmask of the lanes on the host side
e.g. 0x5 for lane 0 and lane 2.

Returns:
Boolean, true if any lanes are pending datapath init, otherwise false
"""
pending = False
dpinit_pending_dict = api.get_dpinit_pending()
for lane in range(self.CMIS_NUM_CHANNELS):
if ((1 << lane) & channel) == 0:
continue
key = "DPInitPending{}".format(lane + 1)
if dpinit_pending_dict[key]:
pending = True
break

return pending

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should check if DpInitPending is set for ALL host lanes and return TRUE. Return False if any of the lanes has DpInitPending=0

image

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought and checking , please look further into this.
should this be interpreted as: as long as one of the lanes is found in DPInitPending state, outcome is pending (i.e. true). Caller of this function on getting 'true', would perform force_cmis_reinit() instead of setting state to CMIS_STATE_DP_INIT (& proceeding with normal workflow)

@@ -1494,8 +1542,11 @@ def task_worker(self):
self.force_cmis_reinit(lport, retries + 1)
continue

# TODO: Use fine grained time when the CMIS memory map is available
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED)
if self.check_datapath_init_pending(api, host_lanes):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if any of the lanes has DPInitPending=0 i.e if this fn call return False then we should retry re-init

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the code based on our discussion.

2. check DpInitPending on module which supports CMIS 5.0
3. Specify sec in return value
@shyam77git
Copy link

@abohanyang
could you update on the plan to merge this changeset/PR?
Awaiting on this for last 1-2 weeks. This is required to be part of repository for proper validation of 400G ZR optics (instead of adding and trying with workarounds etc.)

could you update on UT/IT? all done or anything still pending etc.? If so, whats the ETA.
Also, please do validate on 400G non-ZR optics to ensure they are not impacted with this changeset.

FYI: For 400G ZR optics, you may get in touch with 'Mihir Patel (MSFT)' as he validated 400G ZR optics with this changeset on his setup

@abohanyang
Copy link
Contributor Author

Hi @shyam77git - The code is pending review. I am waiting for @prgeor comments/approval on my latest commit.

@prgeor
Copy link
Collaborator

prgeor commented Oct 18, 2022

Fix Xcvrd code to support 400G DR and 400G ZR optics

Description

  1. Increase the initialization timeout value for the 400GBASE-ZR module as 400GBASE-ZR may take more than 60 sec to come up. Based on the spec, ZR may take up to 200 sec to become ready.
  2. on_port_update_event is triggered when 400G optics enter the ready state. To avoid the infinite initialization loop, in on_port_update_event, call force_cmis_reinit only when port configuration changes.
  3. on_port_config_change takes three arguments, not two. This issue is found when the port breakout mode changes.

Motivation and Context

  • Item 1 and Item 2 are needed to bring up 400G ZR/DR optics.
  • Item 3 fixes a potential issue during dynamic breakout mode change.

How Has This Been Tested?

  • These changes are tested against arista_7800r3a_36d2_lc and arista_7280cr3_32p4 platforms.

@abohanyang can you check the description. I think you are addressing only #1

@@ -1458,7 +1490,7 @@ def task_worker(self):
# TODO: Make sure this doesn't impact other datapaths
api.set_lpmode(False)
self.port_dict[lport]['cmis_state'] = self.CMIS_STATE_AP_CONF
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED)
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.get_cmis_dp_deinit_duration_secs(api))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you log the deinit duration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I just update the code and the summary of this PR.

@@ -1515,8 +1553,7 @@ def task_worker(self):

# D.1.3 Software Configuration and Initialization
api.set_datapath_init(host_lanes)
# TODO: Use fine grained timeout when the CMIS memory map is available
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.CMIS_DEF_EXPIRED)
self.port_dict[lport]['cmis_expired'] = now + datetime.timedelta(seconds=self.get_cmis_dp_init_duration_secs(api))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please log the dp_init pending duration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oct 19 19:10:24.470226 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=INSERTED, retries=0
Oct 19 19:10:24.584837 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: force Datapath reinit
Oct 19 19:10:25.589603 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_DEINIT, retries=0
Oct 19 19:10:26.621621 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256 DpDeinit duration 1.0 secs
Oct 19 19:10:27.627613 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=AP_CONFIGURED, retries=0
Oct 19 19:10:28.711402 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_INIT, retries=0
Oct 19 19:10:28.729570 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256 DpInit duration 300.0 secs
Oct 19 19:10:29.735655 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_TXON, retries=0
Oct 19 19:11:26.497042 cmp206-4 NOTICE pmon#xcvrd[38]: message repeated 56 times: [ CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_TXON, retries=0]
Oct 19 19:11:26.497149 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: Turning ON tx power
Oct 19 19:11:27.502474 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_ACTIVATION, retries=0
Oct 19 19:11:48.910446 cmp206-4 NOTICE pmon#xcvrd[38]: message repeated 21 times: [ CMIS: Ethernet256: 400G, lanemask=0xff, state=DP_ACTIVATION, retries=0]
Oct 19 19:11:48.910446 cmp206-4 NOTICE pmon#xcvrd[38]: CMIS: Ethernet256: READY

@abohanyang abohanyang changed the title Fix xcvrd to support 400G ZR/DR optics Fix xcvrd to support 400G ZR optic Oct 19, 2022
@prgeor prgeor merged commit 4ea12cf into sonic-net:master Oct 20, 2022
@prgeor
Copy link
Collaborator

prgeor commented Oct 20, 2022

@yxieca can you help cherry-pick to 202205?

yxieca pushed a commit that referenced this pull request Oct 25, 2022
* Fix xcvrd to support 400G ZR/DR optics

* Fix xcvrd to support 400G ZR/DR optics

* Revert changes in DomInfoUpdateTask::on_port_config_change

* Fix xcvrd test after modifying on_port_config_change

* Call get_datapath_init_duration to get the init expiration time.

* Address comments

1. Clean up comments.
2. Revert port breakout fix and corresponding test change.
3. Check deinit duration for deinit case
4. Check datapath init pending

* 1. Revert changes in on_port_update_event
2. check DpInitPending on module which supports CMIS 5.0
3. Specify sec in return value

* Add code coverage

* Add log for DpInit/DpDeinit duration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants