[Flaky Test] VM orchestration is unstable in integration tests #4356

Closed
rdner opened this issue Mar 5, 2024 · 11 comments
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@rdner
Member

rdner commented Mar 5, 2024

The failures can be categorized into the following groups:

Firewall resource not found or already exists (quite often) (should be fixed by #4740)

This has been reported in the OGC repository: adam-stokes/ogc#28

libcloud.common.google.ResourceNotFoundError: {'message': "The resource 'projects/elastic-platform-ingest/global/firewalls/linux-amd64-ubuntu-2204-upgrade' was not found", 'domain': 'global', 'reason': 'notFound'}

libcloud.common.google.ResourceExistsError: {'message': "The resource 'projects/elastic-platform-ingest/zones/us-central1-a/instances/ogc-linux-amd64-ubuntu-2204-fleet-airgapped-2315' already exists", 'domain': 'global', 'reason': 'alreadyExists'}

Examples:

I believe it might be some kind of race condition; we should investigate further.
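If the cause is indeed a race between two jobs creating or deleting the same resource, one common mitigation is to make both operations idempotent. The sketch below is only an illustration under that assumption, not what OGC or #4740 actually does. `AlreadyExists`, `NotFound`, `ensure_created` and `ensure_deleted` are hypothetical names, and the exception classes stand in for the provider's `alreadyExists` and `notFound` errors.

```python
class AlreadyExists(Exception):
    """Hypothetical stand-in for an 'alreadyExists' provider error."""


class NotFound(Exception):
    """Hypothetical stand-in for a 'notFound' provider error."""


def ensure_created(create, name):
    """Create a resource, treating 'already exists' as success.

    Returns True if this call created the resource, False if it was already there.
    """
    try:
        create(name)
        return True
    except AlreadyExists:
        return False


def ensure_deleted(delete, name):
    """Delete a resource, treating 'not found' as success.

    Returns True if this call deleted the resource, False if it was already gone.
    """
    try:
        delete(name)
        return True
    except NotFound:
        return False
```

This only hides the symptom when another job legitimately owns the resource; it does not explain why two jobs end up with the same name.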

Networking issues

Tracked by #4794

Permission errors (serverless)

Error: error running clean: got unexpected response code [403] from deployment shutdown API: {
   "errors": [
       {
           "message": "To access the resource [u:/deployments/cc41c0a61a474f3aa6d890df111925d5], the user must have the required authorization.",
           "code": "root.permission_denied"
       }
   ]
}

Examples:

SQL error

sqlite3.OperationalError: no such table: layouts

Examples:

GCP just fails with 500 (rare)

libcloud.common.google.GoogleBaseError: {'message': "Internal error. Please try again or contact Google Support. (Code: '')", 'domain': 'global', 'reason': 'backendError'}

Examples:

Job did not complete in 180 seconds

libcloud.common.types.LibcloudError: <LibcloudError in None 'Job did not complete in 180 seconds'>

Examples:
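The groups above could be applied to build logs automatically. A minimal sketch, assuming the sample messages quoted in this description are representative: the category names and regular expressions are my own guesses derived from those samples, and the networking group is left out because no sample message is quoted for it here.

```python
import re

# Hypothetical category names and patterns, derived only from the sample
# messages quoted in this issue description.
CATEGORIES = [
    ("resource not found or already exists",
     re.compile(r"ResourceNotFoundError|ResourceExistsError")),
    ("permission error (serverless)",
     re.compile(r"unexpected response code \[403\]")),
    ("SQL error",
     re.compile(r"sqlite3\.OperationalError")),
    ("GCP internal error",
     re.compile(r"'reason': 'backendError'")),
    ("job timeout",
     re.compile(r"Job did not complete in \d+ seconds")),
]


def categorize(line: str) -> str:
    """Return the name of the first matching category, or "uncategorized"."""
    for name, pattern in CATEGORIES:
        if pattern.search(line):
            return name
    return "uncategorized"
```

Anything that comes back as "uncategorized" would be a candidate for a new group.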

@rdner rdner added Team:Elastic-Agent Label for the Agent team flaky-test Unstable or unreliable test cases. labels Mar 5, 2024
@rdner rdner self-assigned this Mar 5, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@blakerouse
Contributor

Another OGC failure https://buildkite.com/elastic/elastic-agent/builds/7651#018e1f0f-c712-4721-baf7-f13f8ba8477e

Error: error running test: failed to prepare instance ogc-windows-amd64-2022-fleet-e3f0: failed to install curl: could not run "choco install -y curl" though SSH: Process exited with status 1 (stdout: , stderr: 'choco' is not recognized as an internal or external command, operable program or batch file.

Not OGC's fault, that is the integration testing framework preparing the instance. OGC doesn't do that.

@blakerouse
Contributor

And another one https://buildkite.com/elastic/elastic-agent/builds/7654#018e1f49-c387-4973-b8a0-dd10dba598f2

Error: error running test: failed to check for cloud 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: <REDACTED>] to be ready: error calling deployment retrieval API: Get "https://cloud.elastic.co/api/v1/deployments/<REDACTED>": context deadline exceeded

Not OGC, OGC doesn't create or prepare any stack.

@blakerouse
Contributor

Another OGC-related failure https://buildkite.com/elastic/elastic-agent/builds/7744#018e36cf-934b-4a2f-aaca-de800860be5e

Failed to execute tests on instance: error running sudo tests: failed to fetched test output at %home%\agent\build\TEST-go-remote-windows-amd64-2022-upgrade-sudo.integration.out

Not OGC, OGC doesn't run the tests or fetch the results.

@blakerouse
Contributor

Just to be clear, OGC only creates the instance with the cloud provider, nothing else. Everything else is done by the integration testing framework and is our code.

@rdner
Member Author

rdner commented Mar 13, 2024

@blakerouse would "VM orchestration" be a better term? I will rename this issue then.

@rdner rdner changed the title [Flaky Test] OGC is unstable for integration tests [Flaky Test] VM orchestration is unstable in integration tests Mar 13, 2024
@rdner
Member Author

rdner commented Apr 4, 2024

When it comes to OGC failures, sometimes we have something like this:

https://buildkite.com/elastic/elastic-agent/builds/8091#018ea9bd-ee66-4a10-b1bc-b8f8030d80bc

libcloud.common.google.GoogleBaseError: {'message': "Internal error. Please try again or contact Google Support. (Code: '')", 'domain': 'global', 'reason': 'backendError'}

Not sure we can do anything about it.
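The only thing I can think of is what the message itself suggests ("Please try again"): retrying the provisioning call with a backoff. A rough sketch, assuming the error is transient and the operation is safe to repeat: `TransientBackendError` and `retry_with_backoff` are hypothetical names, and the exception is a stand-in for the provider error rather than a real libcloud class.

```python
import random
import time


class TransientBackendError(Exception):
    """Hypothetical stand-in for a provider error such as 'backendError'."""


def retry_with_backoff(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation() and retry on TransientBackendError.

    The base wait doubles after each failed attempt (1s, 2s, 4s, ...) and up
    to 50% random jitter is added on top. The error from the final attempt is
    re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientBackendError:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists only so the waits can be replaced in tests; whether a retry actually helps depends on how long the GCP outage lasts.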

@pierrehilbert
Contributor

For this one, yeah not sure we can do anything.

@rdner
Member Author

rdner commented Apr 4, 2024

I updated the description to organize known failures by categories and clean up my comments on this issue.

@rdner rdner removed their assignment Apr 29, 2024
@belimawr
Contributor

belimawr commented May 8, 2024

I believe this is a new VM orchestration issue:

Error: error running test: failed to connect to instance ogc-linux-amd64-ubuntu-2204-default-f3e5: error NewClientConn for ssh to "34.41.144.218:22" :ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

https://buildkite.com/elastic/elastic-agent/builds/8793#018f58b8-f8c9-4c73-9b40-6f2da4a73974

It is from a backport PR, #4709. I'll try re-running it.

@rdner
Member Author

rdner commented May 22, 2024

I moved all the failures that we can actually recover from to #4794

Since we have not had new errors for a while now and there is nothing new to report here, I'm closing this issue in favor of the new one.


5 participants