
Add retries to all network communication when running integration tests #4794

Closed
rdner opened this issue May 22, 2024 · 12 comments
Labels
enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@rdner
Member

rdner commented May 22, 2024

Our integration tests sometimes hit errors like these:

1. Error calling deployment retrieval API (quite often). We should add a short timeout and retries.

   Example:

   Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: error calling deployment retrieval API: Get "https://cloud.elastic.co/api/v1/deployments/REDACTED": context deadline exceeded

2. Network issues (SSH failure, TCP i/o timeout). Should add retries.

   Examples:

   Error: error running test: failed to connect to instance ogc-windows-amd64-2022-default-ff13: dial tcp REDACTED: i/o timeout

   (linux-arm64-ubuntu-2204-fleet-airgapped) Failed to execute tests on instance: error running sudo tests: failed to fetched test output at $HOME/agent/build/TEST-go-remote-linux-arm64-ubuntu-2204-fleet-airgapped-sudo.integration.out

3. Failed to check for ... to be ready. Should add retries.

   Example:

   Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: context deadline exceeded

Describe the enhancement:

The integration test framework should be more resilient to such errors and retry where possible, so the tests do not fail when they hit issues similar to those listed above.

Describe a specific use case for the enhancement or feature:

Our integration tests are unstable due to the complexity of the VM orchestration and the high number of points of failure.

What is the definition of done?

All network communication (SSH, API calls, etc.) has error handling with retries and backoff.
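For illustration, a minimal sketch of what such a retry-with-backoff helper could look like in Go (the helper name, attempt counts, and timeouts are assumptions, not existing framework code):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// retryWithBackoff retries op with exponential backoff until it succeeds,
// the attempt budget is exhausted, or the parent context is cancelled.
// The name and signature are illustrative only.
func retryWithBackoff(ctx context.Context, attempts int, initial time.Duration, op func(context.Context) error) error {
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		// Give each attempt its own short timeout so one hung call
		// cannot consume the whole budget.
		attemptCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
		err = op(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("aborted after %d attempts: %w", i+1, err)
		case <-time.After(delay):
			delay *= 2 // exponential backoff between attempts
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Example: wrap the deployment retrieval call so transient failures
	// (timeouts, EOF, 5xx) are retried instead of failing the test run.
	err := retryWithBackoff(context.Background(), 5, 2*time.Second, func(ctx context.Context) error {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet,
			"https://cloud.elastic.co/api/v1/deployments/REDACTED", nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return fmt.Errorf("deployment API returned %s", resp.Status)
		}
		return nil
	})
	fmt.Println("result:", err)
}
```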

@rdner rdner added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels May 22, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse
Contributor

Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: error calling deployment retrieval API: Get "https://cloud.elastic.co/api/v1/deployments/REDACTED": context deadline exceeded
Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: REDACTED] to be ready: context deadline exceeded

Both of those are context deadline exceeded errors, which suggests the deployment never became ready within the 10-minute window that the testing framework waits. There are already retries in this path and it keeps checking. A context deadline exceeded means the cloud didn't get the deployment ready within the 10-minute window, so in this case the issue seems to be on the cloud side.

Error: error running test: failed to connect to instance ogc-windows-amd64-2022-default-ff13: dial tcp REDACTED: i/o timeout

For this one we should add retries.
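A rough sketch of what retrying that connection could look like, assuming a plain TCP dial to the instance (function name, address, and retry counts are placeholders):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialWithRetry keeps re-dialing the instance until a connection succeeds
// or the attempts run out. Names and constants here are illustrative only.
func dialWithRetry(addr string, attempts int) (net.Conn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		conn, err := net.DialTimeout("tcp", addr, 30*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		// i/o timeouts on freshly provisioned VMs are often transient,
		// so wait a bit and try again instead of failing the whole run.
		time.Sleep(10 * time.Second)
	}
	return nil, fmt.Errorf("failed to connect to %s after %d attempts: %w", addr, attempts, lastErr)
}

func main() {
	conn, err := dialWithRetry("198.51.100.10:22", 6) // placeholder address
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```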

@blakerouse blakerouse removed their assignment May 22, 2024
@cmacknz
Member

cmacknz commented May 22, 2024

Spot checking some of the deployments that hit the context deadline exceeded error, the first two show as unhealthy and then terminated in the admin console.

For example the one from https://buildkite.com/elastic/elastic-agent/builds/7654#018e1f49-c387-4973-b8a0-dd10dba598f2

@cmacknz
Member

cmacknz commented May 22, 2024

Actually, those deployments are all terminated; it's hard to say whether they were actually unhealthy, since the activity feed only shows a shutdown.

Looking at the most recent one in https://buildkite.com/elastic/elastic-agent/builds/8252#018ec342-292a-4a31-9ae4-c360c3c54166 it actually fails with an EOF error and then the deployment is terminated (I assume by our cleanup logic but there's no log for this).

2024-04-09 14:59:16 UTC | >>> Waiting for cloud stack 8.14.0-SNAPSHOT to be ready [stack_id: 8140-SNAPSHOT, deployment_id: 4ba5eafb84834a56bd7e3afbcc4ab49b]
2024-04-09 15:02:31 UTC | >>> (linux-arm64-ubuntu-2204-fleet) Failed for instance linux-arm64-ubuntu-2204-fleet (@ 34.70.250.63): failed to check for cloud 8.14.0-SNAPSHOT [stack_id: 8140-SNAPSHOT, deployment_id: 4ba5eafb84834a56bd7e3afbcc4ab49b] to be ready: error parsing deployment retrieval API response: EOF

From the admin console:

2024-04-09T15:05:48.610Z, took 6 milliseconds
Completed step stop-instances

This one could have been helped by a retry perhaps.

@blakerouse
Contributor

Looking at https://github.com/elastic/elastic-agent/blob/main/pkg/testing/ess/deployment.go#L250, it does seem like it needs some improvements. A few issues I see are:

  1. The context cleanup is not correct: each tick creates a new context, but it is only cleaned up at the end of the wait, not on each tick.
  2. An error from the API causes the function to return that error immediately. That should be changed so the error is not fatal and the check simply retries.
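A rough sketch of how the wait loop could address both points (function and type names are illustrative, not the actual code in deployment.go):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitUntilReady polls the deployment status until it is ready, the overall
// deadline passes, or the parent context is cancelled. Names are illustrative.
func waitUntilReady(ctx context.Context, deploymentID string, check func(context.Context, string) (bool, error)) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		// Point 1: scope each status check to its own context and cancel
		// it as soon as the attempt finishes, not at the end of the wait.
		ready, err := func() (bool, error) {
			tickCtx, cancel := context.WithTimeout(ctx, time.Minute)
			defer cancel()
			return check(tickCtx, deploymentID)
		}()
		if err != nil {
			// Point 2: an API error (e.g. EOF, transient 5xx) should not
			// abort the wait; log it and try again on the next tick.
			fmt.Printf("deployment %s not ready yet: %v\n", deploymentID, err)
		} else if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("deployment %s never became ready: %w", deploymentID, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// Dummy check that reports ready on the first call, for illustration.
	err := waitUntilReady(ctx, "REDACTED", func(context.Context, string) (bool, error) {
		return true, nil
	})
	fmt.Println("wait result:", err)
}
```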

@blakerouse
Contributor

Fixes for the comment I just made - #4798

@ycombinator
Contributor

Fixes for the comment I just made - #4798

This PR will take care of two of three problems mentioned in this issue's description:

Error calling deployment retrieval API (quite often)
Failed to check for ... to be ready

Still left to do will be the third problem:

Network issues (SSH failure, TCP i/o timeout)

@ycombinator
Contributor

Still left to do will be the third problem:

Network issues (SSH failure, TCP i/o timeout)

We found that t2 ARM instances on GCP have network connectivity issues. We're making some efforts to move away from those instances, including temporarily moving to AWS instances or trying to get into GCP's private preview for Axion instances. We'll deprioritize this issue until we've learned more from these efforts.

@ycombinator
Contributor

ycombinator commented Jun 21, 2024

Still left to do will be the third problem:

Network issues (SSH failure, TCP i/o timeout)

We found that t2 ARM instances on GCP have network connectivity issues. We're making some efforts to move away from those instances, including temporarily moving to AWS instances or trying to get into GCP's private preview for Axion instances. We'll deprioritize this issue until we've learned more from these efforts.

@rdner From your recent CI reports, I believe we're no longer facing the ARM SSH issues. Would you mind confirming? If that's true, I think we can now close this issue.

@pierrehilbert
Contributor

@rdner is on PTO until Wednesday.
If I remember correctly we only fixed the SSH issue but not the others.
Also, we didn't really fix it but rather hid it, as we are waiting to get access to the new ARM instances to be able to really fix it.

@ycombinator
Contributor

If I remember correctly we only fixed the SSH issue but not the others.

I thought it was the other way around: #4794 (comment)

Also, we didn't really fix it but rather hid it, as we are waiting to get access to the new ARM instances to be able to really fix it.

Ah, in that case, never mind then — we should leave this issue open.

@ycombinator
Contributor

The only issue left here is to move to the new ARM instances and re-enable the ARM tests disabled in #4852. This work is being tracked in a private issue so I'm going to close this issue here.
