[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time #2307

whowutwut · 2016-12-13T18:43:13Z

Attempt to resolve #2327

This pull request fixes a customer-reported issue for CORAL where certain nodes take 15+ minutes to complete hardware discovery.

The following log messages show the 15-minute gap inside the Genesis kernel:

[Tue Nov 29 11:02:05 2016]cp: cannot stat '/usr/share/zoneinfo/posix/America/Los_Angeles': No such file or directory
[Tue Nov 29 11:02:07 2016]<166>Nov 29 11:02:06 xcat.genesis.doxcat: Acquired IPv4 address on enP5p7s0f1
[Tue Nov 29 11:02:07 2016]192.168.64.174/24
[Tue Nov 29 11:18:03 2016]ppc64   <-- 15 minutes later

On initial investigation, the loop that checks the offset reported by ntpq -c rv seemed the most likely cause of the long delay. However, we were unable to re-create it easily. Further investigation showed that, most of the time, we start ntpd and immediately call ntpq, which returns an offset of 0. My guess is that the ntp daemon is not quite ready yet, so we skip over the while loop.

If we inject a sleep 15 after starting ntpd, the offset becomes very large and ntp does not sync again for 15 minutes, which causes the 15-minute delay we have seen.
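The timing problem described above can be sketched roughly as follows. This is an illustration only, not the actual doxcat code: the helper names `parse_offset` and `offset_exceeds` are hypothetical, the input line is a made-up fragment shaped like `ntpq -c rv` output, and the 1000 threshold is assumed from the "> 1000" mentioned in the commit message.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from doxcat.

# Read `ntpq -c rv` style text on stdin and print the value after "offset=".
parse_offset() {
    sed -n 's/.*offset=\(-\{0,1\}[0-9.]*\).*/\1/p' | head -n 1
}

# Exit 0 when the absolute value of offset ($1) is greater than threshold ($2).
offset_exceeds() {
    awk -v o="$1" -v t="$2" 'BEGIN {
        o = o + 0; t = t + 0          # force numeric comparison
        if (o < 0) o = -o
        exit !(o > t)
    }'
}

# An offset of 0 read right after starting ntpd does not exceed the threshold,
# so a wait loop guarded by this check would be skipped.
early="$(printf 'mintc=3, offset=0.000, frequency=0.000,\n' | parse_offset)"
if offset_exceeds "$early" 1000; then
    echo "would wait for sync"
else
    echo "loop skipped"
fi
```

With a large offset, such as one read a few seconds after ntpd has started, the same check succeeds and the loop would be entered.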

Since genesis is not intended to be long running, I don't think it's necessary to start ntpd. Instead, we can just use ntpdate to force a clock sync, if the server is configured on the xCAT MN.
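A minimal sketch of the one-shot approach, assuming the server address is already known to the script. The function name `sync_clock_once` and the `NTPDATE` override variable are hypothetical and exist only for this example; the real change lives in the genesis scripts.

```shell
#!/bin/sh
# Hypothetical sketch; NTPDATE and sync_clock_once are illustrative names.
NTPDATE="${NTPDATE:-ntpdate}"

# Attempt a single clock sync against the given server and never block discovery.
sync_clock_once() {
    server="$1"
    if [ -z "$server" ]; then
        echo "no NTP server configured, skipping clock sync"
        return 0
    fi
    # -b steps the clock immediately instead of slewing it gradually.
    "$NTPDATE" -b "$server" || echo "ntpdate could not sync with $server, continuing"
}
```

Calling `sync_clock_once 192.168.3.25` makes one attempt and returns whether or not the server answers, so there is no wait loop that could stall discovery.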

Testing results:

If no NTP server is configured on the xCAT MN, genesis will not be able to sync the clock:

[xCAT Genesis running on frame23cn18 /]# time ntpdate 192.168.3.25
13 Dec 10:39:37 ntpdate[1789]: no server suitable for synchronization found
 
real	0m8.102s
user	0m0.001s
sys	0m0.001s

If an external NTP server is configured:

[root@fs4 ~]# chdef -t site extntpservers="10.0.0.101" 
1 object definitions have been created or modified.
[root@fs4 ~]# makentp
configuring management node: fs4.
[root@fs4 ~]# ps -ef | grep ntp
ntp       44966      1  0 10:35 ?        00:00:00 /usr/sbin/ntpd -u ntp:ntp -g
root      44992  44823  0 10:35 pts/7    00:00:00 grep --color=auto ntp

[xCAT Genesis running on frame23cn18 /]# time ntpdate 192.168.3.25
13 Dec 10:35:54 ntpdate[1796]: step time server 192.168.3.25 offset -359.118775 sec

real	0m0.000s
user	0m0.001s
sys	0m0.002s

If ntp is configured on the MN:

# tabdump site | grep ntp
"ntpservers","<xcatmaster>",,

[root@fs4 ~]# date
Tue Dec 13 10:37:05 PST 2016

Force a date change on the MN:

[root@fs4 ~]# date 12121000
Mon Dec 12 10:00:00 PST 2016

Sync in genesis:

[xCAT Genesis running on frame23cn18 /]# time ntpdate -b 192.168.3.25
12 Dec 10:01:52 ntpdate[1809]: step time server 192.168.3.25 offset -88661.293557 sec

real	0m0.000s
user	0m0.001s
sys	0m0.002s

Discovery Testing output

After re-running a small provisioning test case, with ntp running on the MN node, we see the following messages:

[root@fs4 consoles]# grep ntpdate node-8335-g*
node-8335-gtb-100470a:[Tue Dec 13 10:55:44 2016]13 Dec 10:55:44 ntpdate[1425]: step time server 192.168.3.25 offset 0.841350 sec
node-8335-gtb-2109e4a:[Tue Dec 13 10:56:12 2016]13 Dec 10:56:12 ntpdate[1691]: step time server 192.168.3.25 offset -357.007997 sec
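To compare offsets across many console logs, the grep output above could be reduced to one "node offset" pair per line. This is a rough sketch; `summarize_offsets` is a hypothetical helper name, and it assumes the `file:` prefix that grep prints when it is given more than one file.

```shell
#!/bin/sh
# Hypothetical helper: read `grep ntpdate <files>` output on stdin and
# print "<node> <offset>" for every ntpdate line that reports an offset.
summarize_offsets() {
    awk '/ntpdate\[[0-9]+\]:/ && / offset / {
        node = $0
        sub(/:.*/, "", node)              # keep the text before the first colon
        for (i = 1; i < NF; i++)
            if ($i == "offset") print node, $(i + 1)
    }'
}
```

For example, `grep ntpdate node-8335-g* | summarize_offsets` would list each node with the offset that ntpdate corrected, in seconds.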

…no need to start ntp daemon, simply attempt to sync with ntpdate. There's an issue in this code where we query ntpq for the offset too quickly after starting the ntp daemon. This causes the offset to always be 0 and we continue. We have seen situations where ntpd starts before the first check and the offset becomes > 1000. Then it takes 15 minutes before the 2nd attempt, which delays the hardware discovery process.

zet809 · 2016-12-14T03:07:30Z

@whowutwut Looks good to me, I will merge it. It seems we need to rebuild genesis-base for both ppc64 and x86_64.

zet809 · 2016-12-14T03:10:33Z

Opened bug xcat2/xcat-dep#16 for xcat-dep to track the build of xcat-genesis-base.

zet809 · 2016-12-14T09:37:18Z

@whowutwut Hi Victor, I reverted the merge in another pull request; we need to discuss the ntp problem in the interlock meeting tomorrow.

whowutwut · 2016-12-14T12:07:07Z

@zet809 can you let me know a short reason why it was reverted so I can look at fixing it today?

daniceexi · 2016-12-14T13:24:01Z

@whowutwut I suppose the reason is that 'ntpdate has been deprecated, and ntpd should be used to replace it'. The current code uses ntpd. According to @neo954, we have still not figured out why the current code does not work.
@zet809 please correct me.

whowutwut · 2016-12-14T14:20:18Z

OK, fair enough. I'll create another pull request for this. This is actually better because we do not need to re-upload the genesis base.

whowutwut added 3 commits December 13, 2016 11:15

Adding ntpdate to the list of commands to put into xCAT-genesis-base

23179ae

Use the -b option to force the time change

482f6b7

whowutwut added this to the 2.13.1 milestone Dec 13, 2016

whowutwut requested review from zet809 and cxhong December 13, 2016 18:43

whowutwut changed the title ~~Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time [ DO NOT MERGE ]~~ [CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time [ DO NOT MERGE ] Dec 13, 2016

whowutwut added component:coral component:discovery labels Dec 13, 2016

whowutwut changed the title ~~[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time [ DO NOT MERGE ]~~ [CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time Dec 13, 2016

whowutwut mentioned this pull request Dec 13, 2016

Clean up the errors in doxcat and add messages for tracing issues [ DO NOT MERGE ] #2278

Closed

zet809 merged commit 676ae64 into xcat2:master Dec 14, 2016

zet809 mentioned this pull request Dec 14, 2016

Revert "[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time" #2313

Merged

whowutwut mentioned this pull request Dec 14, 2016

[CUSTOMER] Attempt to force sync time with ntpd instead of starting the ntp daemon to reduce discovery time #2314

Closed

whowutwut deleted the ntpd_debug branch December 14, 2016 19:47
