
ZBUG-469: Apply patch to zmcontrol for VM resets caused by LDAP server hangs. #35

Open
wants to merge 1 commit into base: develop
Conversation

synacor-bsinger

Copied from https://bugzilla.zimbra.com/show_bug.cgi?id=107769

When the LDAP master server or an LDAP replica server hangs (rather than crashes), the current cluster monitoring on any host that refers to that LDAP server keeps connecting to the hung server.

The reason is that the TCP connection is still established at the OS level even though slapd is hung, so the monitoring connection blocks until ldap_read_timeout expires. During our system test, this caused every virtual machine that referred to the hung LDAP server to be reset.

The following change to zmcontrol resolves this issue. The code was provided by our partner, who deployed it on a large ISP system that has just entered production service. Please review it and consider it for a later release.

To reproduce the issue, send SIGSTOP to the slapd process that the mailbox (MBS) servers refer to. The same situation can also occur when certain kinds of network problems happen.
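
A minimal sketch of that reproduction step, assuming a single slapd process and that pgrep is available on the LDAP host (this script is illustrative only, not part of the patch):

```perl
#!/usr/bin/perl
# Illustrative only: simulate a hung (not crashed) slapd by pausing it
# with SIGSTOP. Run on the LDAP server that the mailbox servers point at;
# resume the daemon afterwards with SIGCONT.
use strict;
use warnings;

chomp(my $pid = qx(pgrep -x slapd));    # assumes a single slapd process
die "slapd does not appear to be running\n" unless $pid;

kill 'STOP', $pid or die "failed to signal slapd ($pid): $!\n";
print "slapd ($pid) paused; TCP connects still succeed but reads will hang.\n";
print "Resume it later with: kill -CONT $pid\n";
```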

@CLAassistant commented Jul 18, 2018

CLA assistant check
All committers have signed the CLA.

@gordyt (Contributor) left a comment

LGTM

@fatkudu (Contributor) left a comment

LGTM

@plobbes (Contributor) commented Jul 19, 2018

I can't help but feel that the troubleshooting for this (as described) is incomplete, and that we're possibly addressing only a portion of the problem, or maybe even just a symptom of it, while in the process slightly changing the semantics of zmcontrol (which may or may not be OK). In theory, based on the original problem description, a customer on vSphere using libexec/vmware-heartbeat to monitor basic host health found all of their VMs restarted because the monitoring failed to handle an LDAP master being down.

Looking at vmware-heartbeat we can see that a call to isZcsHealth() leads to a zmcontrol status call and unless zmcontrol exits with a zero (success) return code, isZcsHealthy will return false. So, the "fix" here is attempting to make zmcontrol more likely to return success - reasonable enough, but I want to talk through this...
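
For context, here is a rough sketch of that check pattern; the path and function name are illustrative rather than copied from vmware-heartbeat:

```perl
use strict;
use warnings;

# Health is determined solely by the exit code of `zmcontrol status`:
# zero means healthy, anything else (including a timeout or a failure to
# run the command at all) means unhealthy.
sub is_zcs_healthy {
    my $status = system('/opt/zimbra/bin/zmcontrol', 'status');
    return 0 if $status == -1;      # zmcontrol could not be executed
    return ($status >> 8) == 0;     # healthy only on a zero exit code
}

print is_zcs_healthy() ? "healthy\n" : "unhealthy\n";
```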

What we don't know for sure is why zmcontrol status did not return success. There are a few variables potentially at play here; see https://wiki.zimbra.com/wiki/VMware_HA_script_in_Zimbra_Collaboration for details.

Consider this: if the vSphere HA failure interval were set to 30 secs, a slow zmcontrol status could cause a restart regardless of LDAP state. zmcontrol has its own internal timer of 180 secs, and in addition the Net::LDAP new call uses a timeout of 30 secs, so a slow LDAP server could also lead to problems.

The proposed change here does a few things:

  1. Uses $ldap_url instead of $ldap_master_url to get the list of LDAP servers to talk to
  2. Lowers the timeout from 30 secs to 5 secs
  3. Fails to take advantage of the built-in support for multiple LDAP hosts within Net::LDAP

I think we need to address issue #1 above: is it OK to avoid the $ldap_master_url list? Was it used on purpose to ensure we're dealing with the most up-to-date LDAP server (in case LDAP replication is slow)? Perhaps this is a moot point now that we tend towards multi-master replication, with the expectation that the masters are very close to in sync and that $ldap_master_url and $ldap_url are effectively identical. One option would be to use the combination of the two (with any duplicates removed).

Perhaps the most important change being proposed is #2 above, but we need to address #3 to do this fix properly.
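
To make that concrete, here is a rough sketch (not the actual patch) of how the check could combine the two URL lists and lean on Net::LDAP's multi-host support with the shorter timeout; the hard-coded URLs are placeholders, since in zmcontrol these values would come from Zimbra localconfig:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Net::LDAP;

# Placeholder values; zmcontrol would read these from localconfig.
my $ldap_master_url = 'ldap://ldap-master.example.com:389';
my $ldap_url        = 'ldap://ldap-replica.example.com:389 ldap://ldap-master.example.com:389';

# Combine both lists and drop duplicates (addresses #1).
my %seen;
my @ldap_hosts = grep { !$seen{$_}++ } split /\s+/, "$ldap_master_url $ldap_url";

# Net::LDAP->new accepts an array reference of hosts and tries each in
# order until one connects (addresses #3); the shorter connect timeout is
# proposed change #2.
my $ldap = Net::LDAP->new(\@ldap_hosts, timeout => 5);

if (defined $ldap) {
    $ldap->unbind;
    exit 0;    # at least one LDAP server answered within the timeout
}
exit 1;        # none reachable; zmcontrol status should report failure
```

One thing to keep in mind: as I read the docs, the connect timeout applies per host attempt, so with several unresponsive servers the worst case is roughly hosts × timeout, which still needs to fit comfortably inside the vSphere HA failure interval.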

@plobbes (Contributor) left a comment

Left a long comment with my concerns.

See https://metacpan.org/pod/distribution/perl-ldap/lib/Net/LDAP.pod#CONSTRUCTOR for description of HOST usage.
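
For reference, a minimal example of the multi-host constructor form described there (host names are placeholders):

```perl
use Net::LDAP;

# HOST may be a reference to an array of hosts; each entry is tried in
# order until a connection is made, and new() returns undef (with the
# error in $@) if none succeed.
my $ldap = Net::LDAP->new(
    [ 'ldap1.example.com', 'ldap2.example.com' ],
    timeout => 5,
) or die "could not connect to any LDAP host: $@\n";
```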

@k-kato self-requested a review October 23, 2018 07:41
@Prashantsurana (Contributor) commented
Hello @bsinger, can you please address the review comments?
