[CUSTOMER] Use ntpdate in genesis kernel instead of starting the ntp daemon to reduce discovery time #2307
…no need to start the ntp daemon; simply attempt to sync with ntpdate. There is an issue in this code where we query ntpq for the offset too quickly after starting the ntp daemon. This causes the offset to always read 0, so we continue. We have seen situations where ntpd starts before the first check and the offset becomes > 1000. It then takes 15 minutes before the second attempt, which delays the hardware discovery process.
@whowutwut Looks good to me, I will merge it. It seems we need to rebuild genesis-base for both ppc64 and x86_64.
Opened bug xcat2/xcat-dep#16 for xcat-dep to track the building of xcat-genesis-base.
@whowutwut Hi Victor, I reverted the merge in another pull request; we need to discuss the ntp problem in the interlock meeting tomorrow.
@zet809 can you give me a short reason why it was reverted so I can look at fixing it today?
@whowutwut I suppose the reason is that 'ntpdate has been deprecated, and ntpd should be used to replace it'. The current code uses ntpd. According to @neo954, we still have not figured out why the current code does not work.
OK, fair enough. I'll create another pull request for this. This is actually better because we do not need to re-upload the genesis base.
Attempt to resolve #2327
This pull request fixes a customer reported issue for CORAL where certain nodes take 15+ minutes to do hardware discovery.
The following log message points us to the 15 minute time difference inside the Genesis Kernel.
On initial investigation, the ntpq -c rv offset call seemed the most likely place to cause a long loop. However, we were unable to easily re-create it. After further investigation, it turns out that most of the time we start ntpd and immediately call the ntpq command, which returns an offset of 0. My guess is that the ntp daemon is not quite ready yet, so we skip over the while loop.
If we inject a sleep 15 after starting ntpd, the offset becomes very large and ntp does not sync again for 15 minutes, causing the 15 minute delay that we have seen.
Since genesis is not intended to be long running, I don't think it's necessary to start ntpd; instead, just use ntpdate to force a clock sync if the server is configured on the xCAT MN.
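The reasoning above can be sketched in shell. This is a minimal illustration, not the actual genesis script: the function names parse_offset, offset_ok, and sync_once are hypothetical, the 1000 ms limit is taken from the "> 1000" figure mentioned in the conversation, and the sketch assumes ntpq prints a line of the form offset=<milliseconds>. It shows why an empty or not-yet-ready offset should not be treated as "in sync", and what a one-shot ntpdate sync might look like.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from xCAT.

# Pull the value out of a line such as "offset=-0.125".
# Assumption: ntpq reports the offset as a single "offset=<ms>" line.
parse_offset() {
    printf '%s\n' "$1" | awk -F= '/offset=/ { print $2 }'
}

# Succeed only when an offset was actually reported and |offset| < limit (ms).
# An empty value means the daemon gave no answer yet, so it is NOT treated
# as "in sync".
offset_ok() {
    [ -n "$1" ] || return 1
    awk -v o="$1" -v lim="$2" 'BEGIN { if (o < 0) o = -o; exit !(o < lim) }'
}

# One-shot clock sync. $1 is the NTP server (for example the xCAT MN);
# when no server is configured there is nothing to do.
sync_once() {
    [ -n "$1" ] || return 0
    ntpdate "$1"
}

# Example usage (commented out; requires ntpq/ntpdate and a reachable server):
#   off=$(parse_offset "$(ntpq -c 'rv 0 offset')")
#   offset_ok "$off" 1000 || sync_once "$NTP_SERVER"   # NTP_SERVER is assumed
```

The design point is that a single forced sync avoids depending on how quickly the daemon becomes ready, which is the timing problem described above.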
Testing results:
force change the date
Sync in genesis:
Discovery Testing output
After re-running a small provisioning test case, with ntp running on the MN node, we see the following messages: