
[Flaky Test]: Integration tests keep running forever until manually cancelled #4475

Closed
rdner opened this issue Mar 23, 2024 · 14 comments
Labels: flaky-test (Unstable or unreliable test cases.) · Team:Elastic-Agent (Label for the Agent team)

rdner (Member) commented Mar 23, 2024

Need to carefully inspect the logs, improve logging if necessary, and find the root cause.
Could be related to #4356

This problem started to appear after #4461 got merged. I reverted the change in #4474

rdner added the Team:Elastic-Agent and flaky-test labels Mar 23, 2024
rdner self-assigned this Mar 23, 2024
elasticmachine (Contributor) commented

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz (Member) commented Mar 25, 2024

In the first 15-hour log and the most recent log (#4475 (comment)), we never see the SLES runner get past this point:

>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /opt/buildkite-agent/builds/bk-agent-prod-gcp-1711324950210306583/elastic/elastic-agent/.integration-cache/id_rsa [email protected]`
>>> (linux-amd64-sles-15-rpm) Connected over SSH
>>> (linux-amd64-sles-15-rpm) Preparing instance
>>> (linux-amd64-sles-15-rpm) install devel_basis

What is interesting is that this operation takes a context that should have timed out:

// Prepare configures the host for running the test
func (SlesRunner) Prepare(ctx context.Context, sshClient SSHClient, logger Logger, arch string, goVersion string) error {
	logger.Logf("install devel_basis")
	stdOut, errOut, err := sshClient.ExecWithRetry(ctx, "sudo", []string{"zypper", "install", "-y ", "-t", "pattern", "devel_basis"}, 15*time.Second)
	if err != nil {
		return fmt.Errorf("failed to run zypper install devel_basis: %w (stdout: %s, stderr: %s)", err, stdOut, errOut)
	}

Following the context, the call stack is:

if !instance.Prepared {
	// prepare the host to run the tests
	logger.Logf("Preparing instance")
	err = batch.OS.Runner.Prepare(ctx, client, logger, batch.OS.Arch, r.cfg.GOVersion)
	if err != nil {
		logger.Logf("Failed to prepare instance: %s", err)
		return OSRunnerResult{}, fmt.Errorf("failed to prepare instance %s: %w", instance.Name, err)
	}

result, err := r.runInstance(ctx, sshAuth, logger, repoArchive, batch, i)

results, err := r.runInstances(ctx, sshAuth, repoArchive, instances)

results, err := r.Run(ctx)

elastic-agent/magefile.go, lines 1562 to 1576 at d558694:

// Test runs integration tests on remote hosts
func (Integration) Test(ctx context.Context) error {
	return integRunner(ctx, false, "")
}

// Matrix runs integration tests on a matrix of all supported remote hosts
func (Integration) Matrix(ctx context.Context) error {
	return integRunner(ctx, true, "")
}

// Single runs single integration test on remote host
func (Integration) Single(ctx context.Context, testName string) error {
	return integRunner(ctx, false, testName)
}

So the context we are using is the default one mage provides, which only has a timeout if mage was run with -t per https://magefile.org/targets/:

A default context is passed into any target with a context argument. This context will have a timeout if mage was run with -t, and thus will cancel the running targets and dependencies at that time.

We don't use mage -t in CI (nor, apparently, do we have a timeout on the integration test buildkite step at all):

AGENT_VERSION="${OVERRIDE_TEST_AGENT_VERSION}" TEST_INTEG_CLEAN_ON_EXIT=true STACK_PROVISIONER="$STACK_PROVISIONER" SNAPSHOT=true mage $MAGE_TARGET $MAGE_SUBTARGET

So that's why this is hanging. It doesn't tell us why it's stuck, but it does look like we are blocked running `sudo zypper install -y -t pattern devel_basis`.
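
For reference, a minimal sketch of what a top-level deadline could look like in the quoted magefile targets (assuming the Integration type and integRunner from the snippet above; the two-hour value is illustrative, and #4478 is the actual fix):

```go
// Test runs integration tests on remote hosts, now bounded by an overall
// deadline so a wedged SSH session cannot hang the build forever.
func (Integration) Test(ctx context.Context) error {
	// Illustrative value; pick whatever the slowest legitimate run needs.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Hour)
	defer cancel()
	return integRunner(ctx, false, "")
}
```

Running mage with -t would bound the default context the same way, per the mage docs quoted above.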

@cmacknz
Copy link
Member

cmacknz commented Mar 25, 2024

zypper does have a --non-interactive option we aren't using, which might be contributing to this. I'm not sure why this wouldn't time out on every test run otherwise, though.

https://doc.opensuse.org/documentation/leap/reference/html/book-reference/cha-sw-cl.html#sec-zypper-script

By default, Zypper asks for a confirmation before installing or removing a selected package, or when a problem occurs. You can override this behavior using the --non-interactive option. This option must be given before the actual command (install, remove, and patch), as can be seen in the following:

It looks like -y should be similar:

-y, --no-confirm
Don’t require user interaction. It’s recommended to use the --non-interactive global option instead. Global options are passed before the command (zypper --non-interactive COMMAND ...). Unlike the no-confirm command option, the global option can be used together with any zypper command.

There's also a --verbose option we could turn on to try to get more details on why it might be stuck.
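
Applied to the Prepare snippet quoted earlier, the suggestion would look roughly like the sketch below; note the global options go before the install command. This is only a sketch, not a merged change:

```go
// Global options (--non-interactive, --verbose) must precede the command,
// per the zypper docs quoted above.
stdOut, errOut, err := sshClient.ExecWithRetry(ctx, "sudo",
	[]string{"zypper", "--non-interactive", "--verbose", "install", "-t", "pattern", "devel_basis"},
	15*time.Second)
if err != nil {
	return fmt.Errorf("failed to run zypper install devel_basis: %w (stdout: %s, stderr: %s)", err, stdOut, errOut)
}
```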

leehinman (Contributor) commented
I'm looking at it, and I'm pretty sure one of the HTTP connections that zypper uses to download packages & repo updates has broken and we aren't detecting it. Hopefully I can get either a TCP or HTTP keepalive set.

But this does bring up a design question. We probably shouldn't be going out to the Internet to update our Linux boxes every time we run an integration test. One possible solution is to run local deb & rpm repos that we update periodically, while all the integration tests pull from those local ones. Another solution is to push all this traffic through some kind of caching proxy.

cmacknz (Member) commented Mar 25, 2024

Small PR to try to prevent the tests from hanging indefinitely when this happens by always specifying a top-level context timeout: #4478

rdner assigned leehinman and unassigned rdner Mar 25, 2024
rdner (Member, Author) commented Mar 25, 2024

I'm re-assigning this to @leehinman since he's already looking into this and the new SLES runner introduced in #4461 most likely caused this.

leehinman (Contributor) commented
Finally got to reproduce this behavior. When I ssh'd onto the SLES host, the SSH connection was still there, but the shell for the connection wasn't. And when I killed the ssh process on the SLES host, the integration framework didn't detect that the TCP connection was gone.

>>> Waiting for cloud stack 8.14.0-SNAPSHOT to be ready [stack_id: 8140-SNAPSHOT, deployment_id: 2f6022c1423c4eff803a497b3d7df08b]
>>> (linux-amd64-sles-15-rpm) Starting SSH; connect with `ssh -i /Users/hinman/src/elastic-agent/.integration-cache/id_rsa [email protected]`
>>> (linux-amd64-sles-15-rpm) Connected over SSH
>>> (linux-amd64-sles-15-rpm) Preparing instance
>>> (linux-amd64-sles-15-rpm) install devel_basis

So I think this is our SSH implementation not detecting that the connection is broken; it is clearly gone from the OS point of view on both ends of the connection.

leehinman (Contributor) commented
golang SSH doesn't do keepalive by default.

golang/go#21478
golang/go#19338

I'm going to try some of the workarounds in the above issues and see if that helps.
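
The usual workaround from those issues looks something like the sketch below (startKeepalive is a hypothetical helper, not the framework's actual code): a goroutine that periodically sends a keepalive request over the x/crypto/ssh connection and gives up when the server stops answering.

```go
package main

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// startKeepalive is a hypothetical helper: it periodically sends an SSH-level
// keepalive request and reports an error when the server stops answering, so
// a dead connection surfaces as a failure instead of a silent hang.
func startKeepalive(client *ssh.Client, interval time.Duration, stop <-chan struct{}) <-chan error {
	errc := make(chan error, 1)
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// "keepalive@openssh.com" is the conventional request name;
				// wantReply=true forces a full round trip to the server.
				if _, _, err := client.SendRequest("keepalive@openssh.com", true, nil); err != nil {
					errc <- err
					return
				}
			case <-stop:
				return
			}
		}
	}()
	return errc
}
```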

rdner (Member, Author) commented Mar 25, 2024

golang SSH doesn't do keepalive by default.

@leehinman There were some related suggestions here too #4410 (comment)

Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.

leehinman (Contributor) commented
Stuff like turning on multiplexing or tuning ServerAliveInterval on the client side might improve things substantially.

golang ssh doesn't have that; that is what those issues point to. :-) I have something that is sending ssh keepalives now, and it looks good. I'm going to let it run overnight to see how it holds up.

leehinman (Contributor) commented
Just to share some info: sometimes SSH was hitting i/o timeouts, and adding keepalives and reconnects helped with that.

But there is still an issue where the ogc SLES image sometimes comes up without any rpm repositories defined. This means the zypper command fails, and because we blindly retry over and over until the overall context times out, we appear to "hang" at the zypper install step. A bounded retry would surface the real error instead; see the sketch below.
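
Something like this (execBounded is a hypothetical sketch; the real ExecWithRetry internals aren't shown in this thread):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// execBounded retries at most maxAttempts times, so a host with no rpm
// repositories fails fast with the real zypper error instead of retrying
// until the outer context is cancelled.
func execBounded(ctx context.Context, run func(context.Context) error, interval time.Duration, maxAttempts int) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = run(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(interval):
			// wait before the next attempt
		case <-ctx.Done():
			return fmt.Errorf("cancelled on attempt %d: %w (last error: %v)", attempt, ctx.Err(), err)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}
```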

leehinman (Contributor) commented
#4498 is the second try; it includes the additional logging and TCP keepalives for SSH.
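
For the TCP side, a minimal sketch of enabling keepalives on the dial before the SSH handshake (dialWithKeepalive is a hypothetical helper; #4498 contains the actual change):

```go
package main

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialWithKeepalive is a hypothetical helper: it dials with OS-level TCP
// keepalives enabled, so the kernel eventually notices a dead peer, then
// runs the SSH handshake over that connection.
func dialWithKeepalive(addr string, config *ssh.ClientConfig) (*ssh.Client, error) {
	dialer := net.Dialer{
		Timeout:   config.Timeout,
		KeepAlive: 30 * time.Second, // probe interval; illustrative value
	}
	conn, err := dialer.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
	if err != nil {
		conn.Close()
		return nil, err
	}
	return ssh.NewClient(c, chans, reqs), nil
}
```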

rdner (Member, Author) commented Apr 5, 2024

Fixed by #4498

rdner closed this as completed Apr 5, 2024