skip generating auto_restart.conf for syncd #52

Open
wants to merge 2 commits into
base: 202205

Conversation


@yanjundeng commented Mar 31, 2023

[Issue Desc]
When a critical process inside the syncd container is killed, it triggers a restart of the syncd/swss and other containers.
During this process, we see the following ERR messages in syslog:

Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE = false
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE = false
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_LAG_ID = oid:0x200000000000007
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_PORT_ID = oid:0x10000000000000f
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- processSingleVid: failed to create object SAI_OBJECT_TYPE_LAG_MEMBER: SAI_STATUS_OBJECT_IN_USE

Besides the messages above, there are other syncd-related ERR messages as well.

[Root Cause Analysis]
After a critical process is killed, for example “syncd” inside the syncd container, the supervisor monitor “supervisor-proc-exit-listener” receives the exit event.
The monitor then sends SIGTERM to its parent to stop all other processes inside the syncd container.
When syncd is stopped, the tun interfaces “EthernetXX” on the host go down and are removed. We see many logs like the following during this period:

Mar 21 04:48:32.976780 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet232 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:183 master:0 type:tun
Mar 21 04:48:32.981504 dut-400g NOTICE swss#portsyncd: :- onMsg: Publish Ethernet232(ok:down) to state db
Mar 21 04:48:32.981504 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:Ethernet232 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:183 master:0 type:tun
Mar 21 04:48:32.981661 dut-400g NOTICE swss#portsyncd: :- onMsg: Delete Ethernet232(ok) from state db
Mar 21 04:48:33.032850 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet216 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:182 master:0 type:tun
Mar 21 04:48:33.035246 dut-400g NOTICE swss#portsyncd: :- onMsg: Publish Ethernet216(ok:down) to state db
Mar 21 04:48:33.035347 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:Ethernet216 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:182 master:0 type:tun
Mar 21 04:48:33.035572 dut-400g NOTICE swss#portsyncd: :- onMsg: Delete Ethernet216(ok) from state db

Once the syncd container is stopped, systemd runs the syncd stop script specified by the “ExecStop” setting to do some post-stop handling.
Later, the swss service begins stopping itself via “docker-wait-any”, since its peer syncd has exited.
At the same time, syncd is auto-restarted by systemd because its auto-restart setting is TRUE; however, it must acquire the lock “/tmp/swss-syncd-lock” before initializing.

After swss stops all processes inside its container, it releases “/tmp/swss-syncd-lock” and then stops its dependent and peer services in turn.
At that point syncd acquires the lock and begins initializing. During this initialization we hit the syncd ERR messages reported in this Jira.
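For illustration only, the sequencing around the lock can be pictured as a simple flock: whoever holds “/tmp/swss-syncd-lock” blocks the other side. The sketch below is Python rather than the actual shell service scripts, and the helper name is hypothetical; it only shows why the restarted syncd starts initializing the instant swss releases the lock, while ASIC_DB may still be stale.

```python
import fcntl

LOCK_FILE = "/tmp/swss-syncd-lock"  # lock file named in the analysis above


def run_with_swss_syncd_lock(action):
    """Hypothetical helper: wait for the shared swss/syncd lock, then run."""
    with open(LOCK_FILE, "w") as lock_fd:
        # Blocks while the swss stop path still holds the lock; as soon as
        # swss releases it, the restarted syncd proceeds with initialization,
        # even though ASIC_DB may still contain stale entries at that moment.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        try:
            action()
        finally:
            fcntl.flock(lock_fd, fcntl.LOCK_UN)
```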

After a while, the swss stop script stops its peer service, the syncd service.
After all of its dependent and peer services are stopped, swss is auto-restarted by systemd.
Once the swss container is up, it starts its dependent and peer services, including the syncd service, and the system recovers to normal.

According to the above analysis, these syncd ERR messages occur during the first syncd auto-restart.
For that restart the DB is not flushed, in particular ASIC_DB. Stopping syncd brings the tun interfaces down and removes them, which affects STATE_DB and ASIC_DB.
This stale ASIC_DB then disturbs the initialization of syncd.

[Solution]
Syncd should avoid consuming this intermediate, incorrect DB state.
There are two possible solutions for this issue:

1. When stopping swss, flush ASIC_DB for a non-warm stop.
2. Don't let syncd auto-restart by itself; let its peer swss watch and restart it (same as 202012).
As to option 1, we can flush ASIC_DB when executing STOP for the swss service if it's not a WARM case. After flushing ASIC_DB, syncd can initialize from a clean DB.
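As a rough sketch of option 1 (not the actual swss stop script, which is a shell script), the flush could look like the following; the function name and the is_warm_stop flag are placeholders, and sonic-db-cli is used as the standard wrapper for issuing Redis commands:

```python
import subprocess


def flush_asic_db_if_cold_stop(is_warm_stop):
    """Hypothetical hook for the swss STOP path: flush ASIC_DB on a
    non-warm stop so the restarted syncd initializes from a clean DB."""
    if is_warm_stop:
        # Warm restart depends on the preserved ASIC_DB contents; never flush.
        return
    subprocess.run(["sonic-db-cli", "ASIC_DB", "FLUSHDB"], check=True)
```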
As to option 2, matching the 202012 branch behavior, we should remove the auto-restart config for the syncd service, i.e. the following file:
/etc/systemd/system/syncd.service.d/auto_restart.conf

This auto_restart config file is generated by “src/sonic-host-services/scripts/hostcfgd”; the behavior was introduced by
[hostcfgd] Initialize Restart= in feature's systemd config by the v… · sonic-net/sonic-buildimage@8a76cdc (github.com)

The 202012 branch does not include this change.
We can simply skip “syncd” in “hostcfgd” when generating “auto_restart.conf”, as sketched below.
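A minimal sketch of the idea follows. The helper name, the skip list, and the rendered drop-in contents are assumptions for illustration; the actual hostcfgd code in the 202205 branch is structured differently, and this PR carries the real change.

```python
import os

# Hypothetical constant: features whose restart is owned by a peer service.
# syncd is restarted by swss, so systemd must not get a Restart= drop-in for it.
AUTO_RESTART_SKIP_LIST = ["syncd"]


def write_auto_restart_conf(feature_name, auto_restart_enabled):
    """Render /etc/systemd/system/<feature>.service.d/auto_restart.conf,
    skipping features whose restart is driven by a peer service."""
    if feature_name in AUTO_RESTART_SKIP_LIST:
        return  # no drop-in for syncd: swss watches and restarts it instead

    drop_in_dir = "/etc/systemd/system/{}.service.d".format(feature_name)
    os.makedirs(drop_in_dir, exist_ok=True)
    restart_value = "always" if auto_restart_enabled else "no"
    with open(os.path.join(drop_in_dir, "auto_restart.conf"), "w") as f:
        f.write("[Service]\nRestart={}\n".format(restart_value))
```

With syncd skipped, /etc/systemd/system/syncd.service.d/auto_restart.conf is never generated, so systemd never auto-restarts syncd on its own and swss remains the only entity that restarts it.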

Since the status of the syncd service is monitored by the swss service, we can always let swss take charge of restarting syncd.

We choose solution 2 here.

[Similar issue]
Marvell reported a similar issue, which should have the same root cause:
sonic-net/SONiC#1248

Our fix should resolve this kind of issue as well.

@kevinskwang

Same as #51, please check the error at the SAI/SDK level first.
