[Flaky Test]: Integration tests keep running forever until manually cancelled #4475
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Another build got stuck (> 10h): https://buildkite.com/elastic/elastic-agent/builds/7946#018e72ea-209d-411e-8881-e83fe768fdc9
In the first 15-hour log and the most recent log (#4475 (comment)) we never see the SLES runner get past this point:
What is interesting is that this operation has a timeout on its context that should have fired: elastic-agent/pkg/testing/runner/sles.go lines 20 to 26 in d558694
Following the context here, the call stack is:
- elastic-agent/pkg/testing/runner/runner.go lines 347 to 355 in d558694
- elastic-agent/pkg/testing/runner/runner.go line 309 in d558694
- elastic-agent/pkg/testing/runner/runner.go line 251 in d558694
- line 2125 in d558694
- lines 1562 to 1576 in d558694
So the context we are using is the default one mage provides, which only has a timeout if mage was run with the `-t` flag. We don't use `-t`, so that context never expires.
So that's why this runs forever instead of timing out. It doesn't tell us why it's hanging, but it does look like we are getting stuck trying to do
It looks like
There's also a
I'm looking at it, and I'm pretty sure one of the http connections that zypper uses to download packages & repo updates has been broken and we aren't detecting it. Hopefully I can get either a TCP or HTTP keepalive set. But this does bring up a design question: we probably shouldn't be going out to the Internet to update our Linux boxes every time we run an integration test. One possible solution is to run local deb & rpm repos that we update periodically, while all the integration tests pull from those local ones. Another solution is to push all this traffic through some kind of caching proxy.
Small PR to try to prevent the tests from hanging indefinitely when this happens by always specifying a top-level context timeout: #4478
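For context, a top-level timeout guard in a mage target can look roughly like the sketch below. The target name `Integration`, the helper `runIntegrationTests`, and the 2-hour limit are illustrative assumptions, not the values used in #4478:

```go
//go:build mage

package main

import (
	"context"
	"fmt"
	"time"
)

// Integration runs the integration test suite with a hard upper bound, so a
// broken SSH session can no longer keep the build alive indefinitely.
// (Illustrative sketch only; not the code from #4478.)
func Integration(ctx context.Context) error {
	// mage only attaches a deadline to ctx when invoked with -t,
	// so add one unconditionally here.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Hour)
	defer cancel()

	if err := runIntegrationTests(ctx); err != nil {
		return fmt.Errorf("integration tests failed: %w", err)
	}
	return nil
}

// runIntegrationTests stands in for the real test-runner invocation.
func runIntegrationTests(ctx context.Context) error {
	select {
	case <-time.After(time.Second): // pretend the suite ran
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```

With a deadline attached unconditionally, a stuck SSH session fails the build instead of keeping it alive for 10+ hours.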
I'm re-assigning this to @leehinman since he's already looking into this and the new SLES runner introduced in #4461 most likely caused this.
Finally got to reproduce this behavior, and when I ssh'd onto the SLES host, the SSH connection was still there, but the shell for the connection wasn't. And when I killed the ssh process on the SLES host, the integration framework didn't detect that the TCP connection was gone.
So I think this is our SSH implementation not detecting that the connection is broken; it is clearly gone from the OS point of view on both ends of the connection.
golang SSH doesn't do keepalives by default: golang/go#21478. I'm going to try some of the workarounds in the above issues and see if that helps.
@leehinman There were some related suggestions here too: #4410 (comment)
golang ssh doesn't have that; that is what those issues point to. :-) I have something that is sending ssh keepalives now, and it looks good. I'm going to let it run overnight to see how it holds up.
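For reference, the common workaround from golang/go#21478 is to periodically send an SSH-level keepalive request and treat a failed request as a dead connection. A minimal sketch, assuming a hypothetical helper (the package and function names are not from the actual change):

```go
package sshutil

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// keepAlive sends an SSH-level keepalive request on a fixed interval and
// reports on errc as soon as one fails, so callers can tear down and
// reconnect instead of blocking forever on a dead connection.
func keepAlive(client *ssh.Client, interval time.Duration, done <-chan struct{}, errc chan<- error) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-done:
			return
		case <-t.C:
			// "keepalive@openssh.com" is the conventional request name;
			// wantReply=true forces the server to answer, which is what
			// actually detects a broken connection.
			if _, _, err := client.SendRequest("keepalive@openssh.com", true, nil); err != nil {
				errc <- err
				return
			}
		}
	}
}
```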
Just to share some info: sometimes ssh was having i/o timeouts, and adding keepalives and reconnects helped with that. But there is still an issue where sometimes the ogc SLES image comes up without any rpm repositories defined. This means that the
#4498 has the second try, with the additional logging and TCP keepalives for SSH.
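TCP keepalives for SSH are typically enabled on the dialer before the SSH handshake. A minimal sketch, assuming a hypothetical `dialWithKeepAlive` helper and a 30-second probe interval (both illustrative, not taken from #4498):

```go
package sshutil

import (
	"net"
	"time"

	"golang.org/x/crypto/ssh"
)

// dialWithKeepAlive opens a TCP connection with OS-level keepalives enabled
// and runs the SSH handshake on top of it, so a silently dropped connection
// is eventually surfaced as an error instead of hanging reads forever.
func dialWithKeepAlive(addr string, config *ssh.ClientConfig) (*ssh.Client, error) {
	d := net.Dialer{
		Timeout:   config.Timeout,
		KeepAlive: 30 * time.Second, // kernel sends keepalive probes on idle connections
	}
	conn, err := d.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	c, chans, reqs, err := ssh.NewClientConn(conn, addr, config)
	if err != nil {
		conn.Close()
		return nil, err
	}
	return ssh.NewClient(c, chans, reqs), nil
}
```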
Fixed by #4498
Need to carefully inspect the logs, improve logging, if necessary, and find the root cause.
Could be related to #4356
This problem started to appear after #4461 got merged. I reverted the change in #4474.