Add retries to all network communication when running integration tests #4794
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Both of those being error
We should add retries for this one.
Spot checking some of the deployments that hit the For example the one from https://buildkite.com/elastic/elastic-agent/builds/7654#018e1f49-c387-4973-b8a0-dd10dba598f2 |
Actually, those deployments are all terminated, so it's hard to say whether they were actually unhealthy; the activity feed only shows a shutdown. Looking at the most recent one in https://buildkite.com/elastic/elastic-agent/builds/8252#018ec342-292a-4a31-9ae4-c360c3c54166, it fails with an EOF error and then the deployment is terminated (I assume by our cleanup logic, but there's no log for this).
From the admin console:
This one could perhaps have been helped by a retry.
Looking at https://github.com/elastic/elastic-agent/blob/main/pkg/testing/ess/deployment.go#L250, it does seem like it needs some improvements. A few issues I see are:
Fixes for the comment I just made: #4798
This PR will take care of two of three problems mentioned in this issue's description:
Still left to do will be the third problem:
We found that t2 ARM instances on GCP have network connectivity issues. We're making some efforts to move away from those instances, including trying to temporarily move to AWS instances or to get into GCP's private preview for Axion instances. We will deprioritize this issue until we've learned more from these efforts.
@rdner From your recent CI reports, I believe we're no longer facing the ARM SSH issues. Would you mind confirming? If that's true, I think we can now close this issue.
@rdner is on PTO until Wednesday.
I thought it was the other way around: #4794 (comment)
Ah, in that case, never mind then; we should leave this issue open.
The only issue left here is to move to the new ARM instances and re-enable the ARM tests disabled in #4852. This work is being tracked in a private issue, so I'm going to close this issue here.
Our integration tests sometimes hit errors like so:
Error calling deployment retrieval API (quite often)
We should add a short timeout and retries.
Examples:
Network issues (SSH failure, TCP i/o timeout)
Should add retries.
Failed to check for ... to be ready
Should add retries.
Examples:
Describe the enhancement:
The integration test framework should be more resilient to such errors and retry where possible, so that tests do not fail when they hit an issue similar to those listed above.
Describe a specific use case for the enhancement or feature:
Our integration tests are unstable due to the complexity of the VM orchestration and the high number of points of failure.
What is the definition of done?
All network communication (SSH, API calls, etc.) has error handling with retries and backoff.