
[Flaky Test]: Integration tests keep running forever until manually cancelled #4475

Closed
rdner opened this issue Mar 23, 2024 · 14 comments
Labels: flaky-test (Unstable or unreliable test cases.) · Team:Elastic-Agent (Label for the Agent team)

rdner (Member) commented Mar 23, 2024

Need to carefully inspect the logs, improve logging if necessary, and find the root cause.
Could be related to #4356

This problem started to appear after #4461 got merged. I reverted the change in #4474

rdner added the Team:Elastic-Agent and flaky-test labels Mar 23, 2024
rdner self-assigned this Mar 23, 2024
elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz (Member) commented Mar 25, 2024

In the first 15-hour log and the most recent log (#4475 (comment)), we never see the SLES runner get past this point:

>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /opt/buildkite-agent/builds/bk-agent-prod-gcp-1711324950210306583/elastic/elastic-agent/.integration-cache/id_rsa [email protected]`
>>> (linux-amd64-sles-15-rpm) Connected over SSH
>>> (linux-amd64-sles-15-rpm) Preparing instance
>>> (linux-amd64-sles-15-rpm) install devel_basis

What is interesting is that this operation takes a context that should have timed out:

// Prepare configures the host for running the test
func (SlesRunner) Prepare(ctx context.Context, sshClient SSHClient, logger Logger, arch string, goVersion string) error {
	logger.Logf("install devel_basis")
	stdOut, errOut, err := sshClient.ExecWithRetry(ctx, "sudo", []string{"zypper", "install", "-y ", "-t", "pattern", "devel_basis"}, 15*time.Second)
	if err != nil {
		return fmt.Errorf("failed to run zypper install devel_basis: %w (stdout: %s, stderr: %s)", err, stdOut, errOut)
	}

Following the context, the call stack is:

if !instance.Prepared {
	// prepare the host to run the tests
	logger.Logf("Preparing instance")
	err = batch.OS.Runner.Prepare(ctx, client, logger, batch.OS.Arch, r.cfg.GOVersion)
	if err != nil {
		logger.Logf("Failed to prepare instance: %s", err)
		return OSRunnerResult{}, fmt.Errorf("failed to prepare instance %s: %w", instance.Name, err)
	}

result, err := r.runInstance(ctx, sshAuth, logger, repoArchive, batch, i)

results, err := r.runInstances(ctx, sshAuth, repoArchive, instances)

results, err := r.Run(ctx)

elastic-agent/magefile.go, lines 1562 to 1576 at d558694:

// Test runs integration tests on remote hosts
func (Integration) Test(ctx context.Context) error {
	return integRunner(ctx, false, "")
}

// Matrix runs integration tests on a matrix of all supported remote hosts
func (Integration) Matrix(ctx context.Context) error {
	return integRunner(ctx, true, "")
}

// Single runs single integration test on remote host
func (Integration) Single(ctx context.Context, testName string) error {
	return integRunner(ctx, false, testName)
}

So the context we are using is the default one mage provides, which only has a timeout if mage was run with -t per https://magefile.org/targets/:

A default context is passed into any target with a context argument. This context will have a timeout if mage was run with -t, and thus will cancel the running targets and dependencies at that time.

We don't use mage -t in CI (nor, apparently, do we have a timeout on the integration test buildkite step at all):

AGENT_VERSION="${OVERRIDE_TEST_AGENT_VERSION}" TEST_INTEG_CLEAN_ON_EXIT=true STACK_PROVISIONER="$STACK_PROVISIONER" SNAPSHOT=true mage $MAGE_TARGET $MAGE_SUBTARGET

So that's why this is hanging. It doesn't tell us why it's stuck, but it does look like we are blocked running `sudo zypper install -y -t pattern devel_basis`.
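
For reference, a minimal sketch of what a top-level deadline could look like in the quoted magefile targets (assuming the Integration type and integRunner from the snippet above; the two-hour value is illustrative, and #4478 is the actual fix):

```go
// Test runs integration tests on remote hosts, now bounded by an overall
// deadline so a wedged SSH session cannot hang the build forever.
func (Integration) Test(ctx context.Context) error {
	// Illustrative value; pick whatever the slowest legitimate run needs.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Hour)
	defer cancel()
	return integRunner(ctx, false, "")
}
```

Running mage with -t would bound the default context the same way, per the mage docs quoted above.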

@cmacknz
Copy link
Member

cmacknz commented Mar 25, 2024

zypper does have a --non-interactive option we aren't using, which might be contributing to this. I'm not sure why this wouldn't time out on every test run otherwise, though.

https://doc.opensuse.org/documentation/leap/reference/html/book-reference/cha-sw-cl.html#sec-zypper-script

By default, Zypper asks for a confirmation before installing or removing a selected package, or when a problem occurs. You can override this behavior using the --non-interactive option. This option must be given before the actual command (install, remove, and patch), as can be seen in the following:

It looks like -y should be similar:

-y, --no-confirm
Don’t require user interaction. It’s recommended to use the --non-interactive global option instead. Global options are passed before the command (zypper --non-interactive COMMAND ...). Unlike the no-confirm command option, the global option can be used together with any zypper command.

There's also a --verbose option we could turn on to try to get more details on why it might be stuck.
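
Applied to the Prepare snippet quoted earlier, the suggestion would look roughly like the sketch below; note the global options go before the install command. This is only a sketch, not a merged change:

```go
// Global options (--non-interactive, --verbose) must precede the command,
// per the zypper docs quoted above.
stdOut, errOut, err := sshClient.ExecWithRetry(ctx, "sudo",
	[]string{"zypper", "--non-interactive", "--verbose", "install", "-t", "pattern", "devel_basis"},
	15*time.Second)
if err != nil {
	return fmt.Errorf("failed to run zypper install devel_basis: %w (stdout: %s, stderr: %s)", err, stdOut, errOut)
}
```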

leehinman (Contributor) commented
I'm looking at it, and I'm pretty sure one of the HTTP connections that zypper uses to download packages & repo updates has broken and we aren't detecting it. Hopefully I can get either a TCP or HTTP keepalive set.

But this does bring up a design question. We probably shouldn't be going out to the Internet to update our Linux boxes every time we run an integration test. One possible solution is to run local deb & rpm repos that we update periodically, while all the integration tests pull from those local ones. Another solution is to push all this traffic through some kind of caching proxy.

cmacknz (Member) commented Mar 25, 2024

Small PR to try to prevent the tests from hanging indefinitely when this happens by always specifying a top-level context timeout: #4478

rdner assigned leehinman and unassigned rdner Mar 25, 2024
rdner (Member, Author) commented Mar 25, 2024

I'm re-assigning this to @leehinman since he's already looking into this and the new SLES runner introduced in #4461 most likely caused this.

leehinman (Contributor) commented
Finally got to reproduce this behavior. When I ssh'd onto the SLES host, the SSH connection was still there, but the shell for the connection wasn't. And when I killed the ssh process on the SLES host, the integration framework didn't detect that the TCP connection was gone.

>>> Waiting for cloud stack 8.14.0-SNAPSHOT to be ready [stack_id: 8140-SNAPSHOT, deployment_id: 2f6022c1423c4eff803a497b3d7df08b]
>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /Users/hinman/src/elastic-agent/.integration-cache/id_rsa [email protected]`
>>> (linux-amd64-sles-15-rpm) Connected over SSH
>>> (linux-amd64-sles-15-rpm) Preparing instance
>>> (linux-amd64-sles-15-rpm) install devel_basis

So I think this is our SSH implementation not detecting that the connection is broken; it is clearly gone from the OS point of view on both ends of the connection.

leehinman (Contributor) commented
golang SSH doesn't do keepalive by default.

golang/go#21478
golang/go#19338

I'm going to try some of the workarounds in the above issues and see if that helps.
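
The usual workaround from those issues looks something like the sketch below (startKeepalive is a hypothetical helper, not the framework's actual code): a goroutine that periodically sends a keepalive request over the x/crypto/ssh connection and gives up when the server stops answering.

```go
package main

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// startKeepalive is a hypothetical helper: it periodically sends an SSH-level
// keepalive request and reports an error when the server stops answering, so
// a dead connection surfaces as a failure instead of a silent hang.
func startKeepalive(client *ssh.Client, interval time.Duration, stop <-chan struct{}) <-chan error {
	errc := make(chan error, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// "keepalive@openssh.com" is the conventional request name;
				// wantReply=true forces a full round trip to the server.
				if _, _, err := client.SendRequest("keepalive@openssh.com", true, nil); err != nil {
					errc <- err
					return
				}
			case <-stop:
				return
			}
		}
	}()
	return errc
}
```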

rdner (Member, Author) commented Mar 25, 2024

golang SSH doesn't do keepalive by default.

@leehinman There were some related suggestions here too #4410 (comment)

Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.

leehinman (Contributor) commented
Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.

golang ssh doesn't have that; that is what those issues point to. :-) I have something that is sending ssh keepalives now, and it looks good. I'm going to let it run overnight to see how it holds up.

leehinman (Contributor) commented
Just to share some info: sometimes SSH was hitting i/o timeouts, and adding keepalives and reconnects helped with that.

But there is still an issue where the ogc SLES image sometimes comes up without any rpm repositories defined. This means the zypper command fails, and because we blindly retry over and over until the overall context times out, we appear to "hang" at the zypper install step. A bounded retry would surface the real error instead; see the sketch below.
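
Something like this (execBounded is a hypothetical sketch; the real ExecWithRetry internals aren't shown in this thread):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// execBounded retries at most maxAttempts times, so a host with no rpm
// repositories fails fast with the real zypper error instead of retrying
// until the outer context is cancelled.
func execBounded(ctx context.Context, run func(context.Context) error, interval time.Duration, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = run(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(interval):
			// wait before the next attempt
		case <-ctx.Done():
			return fmt.Errorf("cancelled on attempt %d: %w (last error: %v)", attempt, ctx.Err(), err)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}
```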

leehinman (Contributor) commented
#4498 is the second try; it includes the additional logging and TCP keepalives for SSH.
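
For the TCP side, a minimal sketch of enabling keepalives on the dial before the SSH handshake (dialWithKeepalive is a hypothetical helper; #4498 contains the actual change):

```go
package main

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialWithKeepalive is a hypothetical helper: it dials with OS-level TCP
// keepalives enabled, so the kernel eventually notices a dead peer, then
// runs the SSH handshake over that connection.
func dialWithKeepalive(addr string, config *ssh.ClientConfig) (*ssh.Client, error) {
	dialer := net.Dialer{
		Timeout:   config.Timeout,
		KeepAlive: 30 * time.Second, // probe interval; illustrative value
	}
	conn, err := dialer.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
	if err != nil {
		conn.Close()
		return nil, err
	}
	return ssh.NewClient(c, chans, reqs), nil
}
```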

rdner (Member, Author) commented Apr 5, 2024

Fixed by #4498

rdner closed this as completed Apr 5, 2024