
Add retries to all network communication when running integration tests #4794

Closed
rdner opened this issue May 22, 2024 · 12 comments
Labels
enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@rdner
Member

rdner commented May 22, 2024

Our integration tests sometimes hit errors like these:

1. Error calling deployment retrieval API (quite often). We should add a short timeout and retries.

   Example:

   Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: error calling deployment retrieval API: Get "https://cloud.elastic.co/api/v1/deployments/REDACTED": context deadline exceeded

2. Network issues (SSH failure, TCP i/o timeout). Should add retries.

   Examples:

   Error: error running test: failed to connect to instance ogc-windows-amd64-2022-default-ff13: dial tcp REDACTED: i/o timeout

   (linux-arm64-ubuntu-2204-fleet-airgapped) Failed to execute tests on instance: error running sudo tests: failed to fetched test output at $HOME/agent/build/TEST-go-remote-linux-arm64-ubuntu-2204-fleet-airgapped-sudo.integration.out

3. Failed to check for ... to be ready. Should add retries.

   Example:

   Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: context deadline exceeded

Describe the enhancement:

The integration test framework should be more resilient to such errors and retry where possible, so the tests do not fail when they hit issues similar to those listed above.

Describe a specific use case for the enhancement or feature:

Our integration tests are unstable due to the complexity of the VM orchestration and the high number of points of failure.

What is the definition of done?

All network communication (SSH, API calls, etc.) has error handling with retries and backoff.
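For illustration, a minimal sketch of what such a retry-with-backoff helper could look like in Go (the helper name, attempt counts, and timeouts are assumptions, not existing framework code):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// retryWithBackoff retries op with exponential backoff until it succeeds,
// the attempt budget is exhausted, or the parent context is cancelled.
// The name and signature are illustrative only.
func retryWithBackoff(ctx context.Context, attempts int, initial time.Duration, op func(context.Context) error) error {
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		// Give each attempt its own short timeout so one hung call
		// cannot consume the whole budget.
		attemptCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		err = op(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("aborted after %d attempts: %w", i+1, err)
		case <-time.After(delay):
			delay *= 2 // exponential backoff between attempts
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Example: wrap the deployment retrieval call so transient failures
	// (timeouts, EOF, 5xx) are retried instead of failing the test run.
	err := retryWithBackoff(context.Background(), 5, 2*time.Second, func(ctx context.Context) error {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet,
			"https://cloud.elastic.co/api/v1/deployments/REDACTED", nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return fmt.Errorf("deployment API returned %s", resp.Status)
		}
		return nil
	})
	fmt.Println("result:", err)
}
```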

@rdner rdner added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 22, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse
Contributor

Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: error calling deployment retrieval API: Get "https://cloud.elastic.co/api/v1/deployments/REDACTED": context deadline exceeded
Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: context deadline exceeded

Both of those are context deadline exceeded errors, which suggests the deployment never became ready within the 10-minute window that the testing framework waits. There are already retries in this path and it keeps checking. A context deadline exceeded means the cloud didn't get the deployment ready within the 10-minute window, so in this case the issue seems to be on the cloud side.

Error: error running test: failed to connect to instance ogc-windows-amd64-2022-default-ff13: dial tcp REDACTED: i/o timeout

For this one we should add retries.
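A rough sketch of what retrying that connection could look like, assuming a plain TCP dial to the instance (function name, address, and retry counts are placeholders):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialWithRetry keeps re-dialing the instance until a connection succeeds
// or the attempts run out. Names and constants here are illustrative only.
func dialWithRetry(addr string, attempts int) (net.Conn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		conn, err := net.DialTimeout("tcp", addr, 30*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		// i/o timeouts on freshly provisioned VMs are often transient,
		// so wait a bit and try again instead of failing the whole run.
		time.Sleep(10 * time.Second)
	}
	return nil, fmt.Errorf("failed to connect to %s after %d attempts: %w", addr, attempts, lastErr)
}

func main() {
	conn, err := dialWithRetry("198.51.100.10:22", 6) // placeholder address
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```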

@blakerouse blakerouse removed their assignment May 22, 2024
@cmacknz
Member

cmacknz commented May 22, 2024

Spot checking some of the deployments that hit the context deadline exceeded error, the first two show as unhealthy and then terminated in the admin console.

For example the one from https://buildkite.com/elastic/elastic-agent/builds/7654#018e1f49-c387-4973-b8a0-dd10dba598f2

@cmacknz
Member

cmacknz commented May 22, 2024

Actually, those deployments are all terminated; it's hard to say whether they were actually unhealthy, since the activity feed only shows a shutdown.

Looking at the most recent one in https://buildkite.com/elastic/elastic-agent/builds/8252#018ec342-292a-4a31-9ae4-c360c3c54166 it actually fails with an EOF error and then the deployment is terminated (I assume by our cleanup logic but there's no log for this).

2024-04-09 14:59:16 UTC | >>> Waiting for cloud stack 8.14.0-SNAPSHOT to be ready [stack_id: 8140-SNAPSHOT, deployment_id: 4ba5eafb84834a56bd7e3afbcc4ab49b]
2024-04-09 15:02:31 UTC | >>> (linux-arm64-ubuntu-2204-fleet) Failed for instance linux-arm64-ubuntu-2204-fleet (@ 34.70.250.63): failed to check for cloud 8.14.0-SNAPSHOT [stack_id: 8140-SNAPSHOT, deployment_id: 4ba5eafb84834a56bd7e3afbcc4ab49b] to be ready: error parsing deployment retrieval API response: EOF

From the admin console:

2024-04-09T15:05:48.610Z, took 6 milliseconds
Completed step stop-instances

This one could have been helped by a retry perhaps.

@blakerouse
Contributor

Looking at https://github.com/elastic/elastic-agent/blob/main/pkg/testing/ess/deployment.go#L250, it does seem like it needs some improvements. A few issues I see are:

  1. The context cleanup is not correct: each tick creates a new context, but it is only cleaned up at the end of the wait, not on each tick.
  2. An error from the API causes the function to return that error immediately. That should be changed so the error is not fatal and the check simply retries.
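A rough sketch of how the wait loop could address both points (function and type names are illustrative, not the actual code in deployment.go):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitUntilReady polls the deployment status until it is ready, the overall
// deadline passes, or the parent context is cancelled. Names are illustrative.
func waitUntilReady(ctx context.Context, deploymentID string, check func(context.Context, string) (bool, error)) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		// Point 1: scope each status check to its own context and cancel
		// it as soon as the attempt finishes, not at the end of the wait.
		ready, err := func() (bool, error) {
			tickCtx, cancel := context.WithTimeout(ctx, time.Minute)
			defer cancel()
			return check(tickCtx, deploymentID)
		}()
		if err != nil {
			// Point 2: an API error (e.g. EOF, transient 5xx) should not
			// abort the wait; log it and try again on the next tick.
			fmt.Printf("deployment %s not ready yet: %v\n", deploymentID, err)
		} else if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("deployment %s never became ready: %w", deploymentID, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// Dummy check that reports ready on the first call, for illustration.
	err := waitUntilReady(ctx, "REDACTED", func(context.Context, string) (bool, error) {
		return true, nil
	})
	fmt.Println("wait result:", err)
}
```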

@blakerouse
Contributor

Fixes for the comment I just made - #4798

@ycombinator
Contributor

Fixes for the comment I just made - #4798

This PR will take care of two of three problems mentioned in this issue's description:

Error calling deployment retrieval API (quite often)
Failed to check for ... to be ready

Still left to do will be the third problem:

Network issues (SSH failure, TCP i/o timeout)

@ycombinator
Contributor

Still left to do will be the third problem:

Network issues (SSH failure, TCP i/o timeout)

We found that t2 ARM instances on GCP have network connectivity issues. We're making some efforts to move away from those instances, including temporarily moving to AWS instances or trying to get into GCP's private preview for Axion instances. We'll deprioritize this issue until we've learned more from these efforts.

@ycombinator
Contributor

ycombinator commented Jun 21, 2024

Still left to do will be the third problem:

Network issues (SSH failure, TCP i/o timeout)

We found that t2 ARM instances on GCP have network connectivity issues. We're making some efforts to move away from those instances, including temporarily moving to AWS instances or trying to get into GCP's private preview for Axion instances. We'll deprioritize this issue until we've learned more from these efforts.

@rdner From your recent CI reports, I believe we're no longer facing the ARM SSH issues. Would you mind confirming? If that's true, I think we can now close this issue.

@pierrehilbert
Contributor

@rdner is on PTO until Wednesday.
If I remember correctly we only fixed the SSH issue but not the others.
Also, we didn't really fix it but rather hid it, as we are waiting to get access to the new ARM instances to be able to really fix it.

@ycombinator
Contributor

If I remember correctly we only fixed the SSH issue but not the others.

I thought it was the other way around: #4794 (comment)

Also, we didn't really fix it but rather hid it, as we are waiting to get access to the new ARM instances to be able to really fix it.

Ah, in that case, never mind then — we should leave this issue open.

@ycombinator
Contributor

The only issue left here is to move to the new ARM instances and re-enable the ARM tests disabled in #4852. This work is being tracked in a private issue so I'm going to close this issue here.
