[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time #2307

Merged
merged 3 commits into from
Dec 14, 2016

Conversation

whowutwut
Member

@whowutwut whowutwut commented Dec 13, 2016

Attempt to resolve #2327

This pull request fixes a customer-reported issue for CORAL where certain nodes take 15+ minutes to complete hardware discovery.

The following log message points us to the 15 minute time difference inside the Genesis Kernel.

[Tue Nov 29 11:02:05 2016]cp: cannot stat '/usr/share/zoneinfo/posix/America/Los_Angeles': No such file or directory
[Tue Nov 29 11:02:07 2016]<166>Nov 29 11:02:06 xcat.genesis.doxcat: Acquired IPv4 address on enP5p7s0f1
[Tue Nov 29 11:02:07 2016]192.168.64.174/24
[Tue Nov 29 11:18:03 2016]ppc64   <-- 15 minutes later
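
The size of the gap between the last two console timestamps above can be checked with a little arithmetic. This is only a minimal sketch; hms_to_seconds is a hypothetical helper name and is not part of xCAT, and it assumes both timestamps fall on the same day.

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
# hms_to_seconds is a hypothetical helper, not part of xCAT.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Timestamps copied from the console log above (same day).
start=$(hms_to_seconds "11:02:07")
end=$(hms_to_seconds "11:18:03")
gap=$((end - start))
echo "gap: ${gap}s (about $((gap / 60)) minutes)"
```

The gap works out to 956 seconds, which matches the roughly 15 minute stall reported by the customer.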

On initial investigation, the loop that checks the ntpq -c rv offset seemed the most likely cause of the long delay. However, we were unable to easily re-create it. After further investigation, it turns out that most of the time we start ntpd and immediately call ntpq, which returns an offset of 0. My guess is that the ntp daemon is not quite ready yet, so we skip over the while loop.

If we inject a sleep 15 after starting ntpd, the offset becomes very large and ntp does not sync again for 15 minutes, which causes the 15 minute delay that we have seen.
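
The timing issue described above can be illustrated with a small sketch. It is not the actual doxcat code. It assumes that ntpq -c rv prints a comma-separated variable list containing an offset=<milliseconds> field; the two sample strings are made up for illustration (the second reuses the -359.118775 sec offset from this pull request, converted to milliseconds), the function names are hypothetical, and the 1000 limit mirrors the "> 1000" mentioned in the commit message of this pull request.

```shell
# Extract the offset value from "ntpq -c rv"-style output on stdin.
# Hypothetical helper; the output format is an assumption.
extract_offset() {
    sed -n 's/.*offset=\(-\{0,1\}[0-9.]*\).*/\1/p' | head -n 1
}

# Succeed (exit 0) when the absolute offset is greater than the limit.
offset_too_large() {
    awk -v off="$1" -v limit="$2" \
        'BEGIN { if (off < 0) off = -off; exit !(off > limit) }'
}

# Made-up samples: right after ntpd starts, and some seconds later.
early='associd=0 status=c016 leap_alarm, sync_unspec, offset=0.000, frequency=0.000'
late='associd=0 status=c016 leap_alarm, sync_unspec, offset=-359118.775, frequency=0.000'

for sample in "$early" "$late"; do
    off=$(printf '%s\n' "$sample" | extract_offset)
    if offset_too_large "$off" 1000; then
        echo "offset=$off: would enter the wait loop"
    else
        echo "offset=$off: wait loop skipped"
    fi
done
```

With the early sample the check passes straight through, which is the common case described above; with the late sample the check would start waiting for ntpd to converge.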

Since genesis is not intended to be long running, I don't think it's necessary to start ntpd. Instead, we can just use ntpdate to force a clock sync, if the server is configured on the xCAT MN.
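
The proposed behavior can be sketched roughly as follows. This is not the code from this pull request; sync_clock_once and the NTPDATE variable are hypothetical names, and how the server address reaches genesis is deliberately left out.

```shell
# Hypothetical one-shot clock sync; not the actual doxcat change.
# NTPDATE can be overridden (for example with "echo") for a dry run.
NTPDATE=${NTPDATE:-ntpdate}

sync_clock_once() {
    server=$1
    if [ -z "$server" ]; then
        echo "no ntp server configured, skipping clock sync"
        return 0
    fi
    # -b forces the clock to be stepped rather than slewed.
    if ! $NTPDATE -b "$server"; then
        # A failed sync should not block discovery, so only report it.
        echo "ntpdate could not sync with $server, continuing anyway"
    fi
    return 0
}

# Example call (address taken from the testing results):
#   sync_clock_once 192.168.3.25
```

The point of the design is that a single ntpdate call either steps the clock or fails within a few seconds, so discovery never waits on a daemon to converge.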

Testing results:

  • If no server is configured on the xCAT MN, genesis will not be able to sync the clocks
[xCAT Genesis running on frame23cn18 /]# time ntpdate 192.168.3.25
13 Dec 10:39:37 ntpdate[1789]: no server suitable for synchronization found
 
real	0m8.102s
user	0m0.001s
sys	0m0.001s
  • If an external ntp server is configured....
[root@fs4 ~]# chdef -t site extntpservers="10.0.0.101" 
1 object definitions have been created or modified.
[root@fs4 ~]# makentp
configuring management node: fs4.
[root@fs4 ~]# ps -ef | grep ntp
ntp       44966      1  0 10:35 ?        00:00:00 /usr/sbin/ntpd -u ntp:ntp -g
root      44992  44823  0 10:35 pts/7    00:00:00 grep --color=auto ntp
[xCAT Genesis running on frame23cn18 /]# time ntpdate 192.168.3.25
13 Dec 10:35:54 ntpdate[1796]: step time server 192.168.3.25 offset -359.118775 sec

real	0m0.000s
user	0m0.001s
sys	0m0.002s
  • If ntp is configured on the MN...
# tabdump site | grep ntp
"ntpservers","<xcatmaster>",,

[root@fs4 ~]# date
Tue Dec 13 10:37:05 PST 2016

Force a date change on the MN:

[root@fs4 ~]# date 12121000
Mon Dec 12 10:00:00 PST 2016

Sync in genesis:

[xCAT Genesis running on frame23cn18 /]# time ntpdate -b 192.168.3.25
12 Dec 10:01:52 ntpdate[1809]: step time server 192.168.3.25 offset -88661.293557 sec

real	0m0.000s
user	0m0.001s
sys	0m0.002s

Discovery Testing output

After re-running a small provisioning test case, with ntp running on the MN node, we see the following messages:

[root@fs4 consoles]# grep ntpdate node-8335-g*
node-8335-gtb-100470a:[Tue Dec 13 10:55:44 2016]13 Dec 10:55:44 ntpdate[1425]: step time server 192.168.3.25 offset 0.841350 sec
node-8335-gtb-2109e4a:[Tue Dec 13 10:56:12 2016]13 Dec 10:56:12 ntpdate[1691]: step time server 192.168.3.25 offset -357.007997 sec
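
When many nodes are involved, the per-node offsets can be pulled out of the grep output with a short filter. This is a minimal sketch that assumes the console log lines have exactly the shape shown above; summarize_offsets is a hypothetical helper name.

```shell
# Print "node offset_seconds" for each ntpdate line in grep-style output.
# summarize_offsets is a hypothetical helper, not part of xCAT.
summarize_offsets() {
    sed -n 's/^\([^:]*\):.* offset \(-\{0,1\}[0-9.]*\) sec.*$/\1 \2/p'
}

# The two lines below are copied from the grep output above.
summarize_offsets <<'EOF'
node-8335-gtb-100470a:[Tue Dec 13 10:55:44 2016]13 Dec 10:55:44 ntpdate[1425]: step time server 192.168.3.25 offset 0.841350 sec
node-8335-gtb-2109e4a:[Tue Dec 13 10:56:12 2016]13 Dec 10:56:12 ntpdate[1691]: step time server 192.168.3.25 offset -357.007997 sec
EOF
```

In practice the input would come from the same grep command, for example grep ntpdate node-8335-g* | summarize_offsets.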

…no need to start ntp daemon, simply attempt to sync with ntpdate.

There's an issue in this code where we query ntpq for the offset too quickly after starting the ntp daemon. This causes the offset to always be 0, so we continue. We have seen situations where ntpd is ready before the first check and the offset becomes > 1000. It then takes 15 minutes before the 2nd attempt, which delays the hardware discovery process.
@whowutwut whowutwut added this to the 2.13.1 milestone Dec 13, 2016
@whowutwut whowutwut requested review from zet809 and cxhong December 13, 2016 18:43
@whowutwut whowutwut changed the title from "Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time [ DO NOT MERGE ]" to "[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time [ DO NOT MERGE ]" Dec 13, 2016
@whowutwut whowutwut changed the title from "[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time [ DO NOT MERGE ]" to "[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time" Dec 13, 2016
@zet809

zet809 commented Dec 14, 2016

@whowutwut Looks good to me, I will merge it. It seems we need to rebuild genesis-base for both ppc64 and x86_64.

@zet809 zet809 merged commit 676ae64 into xcat2:master Dec 14, 2016
@zet809

zet809 commented Dec 14, 2016

Opened bug xcat2/xcat-dep#16 for xcat-dep to track the build of xcat-genesis-base.

@zet809

zet809 commented Dec 14, 2016

@whowutwut Hi, Victor, I reverted the merge in another pull request; we need to discuss the ntp problem in the interlock meeting tomorrow.

@whowutwut
Member Author

@zet809 can you let me know a short reason why it was reverted so I can look at fixing it today?

@daniceexi
Contributor

@whowutwut I suppose the reason is that ntpdate has been deprecated and ntpd should be used instead. The current code uses ntpd. According to @neo954, we still have not figured out why the current code does not work.
@zet809 please correct me.

@whowutwut
Member Author

OK, fair enough. I'll create another pull request for this. This is actually better because we do not need to re-upload the genesis base.

Successfully merging this pull request may close these issues.

[CUSTOMER] - ntp code in doxcat causing discovery delays up to 15 minutes
3 participants