Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to split integration tests into different groups #3544

Merged
merged 19 commits into from
Nov 29, 2023

Conversation

blakerouse
Copy link
Contributor

@blakerouse blakerouse commented Oct 5, 2023

What does this PR do?

Adds a required field Group to define.Requirements. Adds ability to select a specific group of tests with TEST_GROUPS environment variable. Groups batches by Group name splitting the testing load across multiple instances.

Why is it important?

Allows tests to be grouped to different instances. Allowing more tests to run concurrently to speed up the time it takes for integration tests to run. Allows the selection of running only a group of tests.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

Run mage integration:test see that multiple instances are spawned instead of only one per platform OS/architecture.

Select a specific group with TEST_GROUPS="default" mage integration:test.

@blakerouse blakerouse added Team:Elastic-Agent Label for the Agent team skip-changelog backport-v8.10.0 Automated backport with mergify backport-v8.11.0 Automated backport with mergify labels Oct 5, 2023
@blakerouse blakerouse requested a review from a team as a code owner October 5, 2023 15:56
@blakerouse blakerouse self-assigned this Oct 5, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 5, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-10-10T14:19:40.599+0000

  • Duration: 28 min 9 sec

Test stats 🧪

Test Results
Failed 0
Passed 6525
Skipped 59
Total 6584

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 5, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.795% (82/83) 👍
Files 67.114% (200/298) 👍
Classes 65.766% (365/555) 👍
Methods 53.009% (1154/2177) 👎 -0.024
Lines 38.598% (13174/34131) 👍 0.008
Conditionals 100.0% (0/0) 💚

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this and it the changes are surprisingly simple. Thanks!

I think the hardest thing here will be making sure that it is well understood and used properly. At minimum we will need some explanation of this in https://github.com/elastic/elastic-agent/blob/main/docs/test-framework-dev-guide.md as well.

I have two thoughts to try to help this:

  1. Consider making a ShardID mandatory and predefine a default shard ID for all tests to reuse. This is more likely to work properly if someone copy/pastes from an existing test IMO, because people will see there is always a shard ID and that some tests use a default. Without this it is harder to notice that ShardID is sometimes undefined, and you aren't always going to be lead to the ShardID concept depending on which test you start from.

  2. We should call this something other than ShardID. Giving framework users control over which tests run where makes sense, but the concept of sharding doesn't have a default definition for test frameworks. This also avoids requiring users to understand what sharding means in general. As for what else to call it, maybe TestMachineGroup to try to communicate that all tests in the same group will run on the same VM.

@blakerouse
Copy link
Contributor Author

@cmacknz I took your feedback and change it to Group as in its a group of tests. I made Group a required field in define.Requirements and then divided the tests up into groups. I also added TEST_GROUPS environment variable to allow the selection of only running a group of tests. I added to the integration framework developer guide as well, explaining groups and how to select specific groups with TEST_GROUPS.

@blakerouse blakerouse changed the title Add ability to shard integration tests across multiple hosts Add ability to split integration tests into different groups Oct 6, 2023
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, a bunch of typos and some small clarifications to make before approval though.

docs/test-framework-dev-guide.md Outdated Show resolved Hide resolved
pkg/testing/define/batch.go Outdated Show resolved Hide resolved
pkg/testing/define/requirements.go Outdated Show resolved Hide resolved
pkg/testing/define/requirements.go Outdated Show resolved Hide resolved
pkg/testing/define/requirements.go Outdated Show resolved Hide resolved
docs/test-framework-dev-guide.md Outdated Show resolved Hide resolved
testing/integration/fqdn_test.go Outdated Show resolved Hide resolved
testing/integration/proxy_url_test.go Outdated Show resolved Hide resolved
testing/integration/upgrade_broken_package_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@michalpristas michalpristas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside from typos craig pointed out.

@blakerouse
Copy link
Contributor Author

We had some really old firewall rules that where created by OGC around in GCP. This was causing this to fail because it now creates more instances than it used to. OGC creates a firewall rule per instance and this was causing the quota to hit over 500 because there was around 400 old firewall rules around. I have cleaned those up, with hopes it allows the next run to pass.

@blakerouse
Copy link
Contributor Author

buildkite test this

@blakerouse
Copy link
Contributor Author

buildkite test this

@blakerouse
Copy link
Contributor Author

buildkite test this

@elastic-sonarqube
Copy link

SonarQube Quality Gate

Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@mergify
Copy link
Contributor

mergify bot commented Oct 24, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b shard-integration-tests upstream/shard-integration-tests
git merge upstream/main
git push upstream shard-integration-tests

@pierrehilbert pierrehilbert marked this pull request as draft November 15, 2023 12:39
@pierrehilbert
Copy link
Contributor

Bringing it back to draft as it has been open for a long time without any movement.

@blakerouse blakerouse removed backport-v8.10.0 Automated backport with mergify backport-v8.11.0 Automated backport with mergify labels Nov 27, 2023
Copy link
Contributor

mergify bot commented Nov 27, 2023

This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@blakerouse blakerouse marked this pull request as ready for review November 27, 2023 18:51
@blakerouse
Copy link
Contributor Author

blakerouse commented Nov 27, 2023

I can see that the runner got stuck similar to the issue we have been seeing where it just stays running forever. With the groups it makes it easier to narrow it down to which group of tests is getting stuck:

I went through the logs and split out the required information to see what ran or still is running. Manually correlating each line to its runner. The breakdown is below and you can tell that windows-amd64-2022-default is stuck on Running non-sudo tests...

>>> (linux-arm64-ubuntu-2204-endpoint) Connected over SSH
>>> (linux-arm64-ubuntu-2204-endpoint) Running sudo tests...
>>> (linux-arm64-ubuntu-2204-endpoint) Test output (sudo) (stdout): >> go test: remote-linux-arm64-ubuntu-2204-endpoint-sudo.integration Test Passed

>>> (linux-arm64-ubuntu-2204-default) Connected over SSH
>>> (linux-arm64-ubuntu-2204-default) Running non-sudo tests...
>>> (linux-arm64-ubuntu-2204-default) Test output (non-sudo) (stdout): >> go test: remote-linux-arm64-ubuntu-2204-default.integration Test Passed
>>> (linux-arm64-ubuntu-2204-default) Running sudo tests...
>>> (linux-arm64-ubuntu-2204-default) Test output (sudo) (stdout): >> go test: remote-linux-arm64-ubuntu-2204-default-sudo.integration Test Passed

>>> (linux-arm64-ubuntu-2204-upgrade) Connected over SSH
>>> (linux-arm64-ubuntu-2204-upgrade) Running sudo tests...
>>> (linux-arm64-ubuntu-2204-upgrade) Test output (sudo) (stdout): >> go test: remote-linux-arm64-ubuntu-2204-upgrade-sudo.integration Test Passed

>>> (linux-arm64-ubuntu-2204-fleet-airgapped) Connected over SSH
>>> (linux-arm64-ubuntu-2204-fleet-airgapped) Running sudo tests...
>>> (linux-arm64-ubuntu-2204-fleet-airgapped) Test output (sudo) (stdout): >> go test: remote-linux-arm64-ubuntu-2204-fleet-airgapped-sudo.integration Test Passed

>>> (linux-arm64-ubuntu-2204-fleet) Connected over SSH
>>> (linux-arm64-ubuntu-2204-fleet) Running sudo tests...
>>> (linux-arm64-ubuntu-2204-fleet) Test output (sudo) (stdout): >> go test: remote-linux-arm64-ubuntu-2204-fleet-sudo.integration Test Passed

>>> (linux-amd64-ubuntu-2204-endpoint) Connected over SSH
>>> (linux-amd64-ubuntu-2204-endpoint) Running sudo tests...
>>> (linux-amd64-ubuntu-2204-endpoint) Test output (sudo) (stdout): >> go test: remote-linux-amd64-ubuntu-2204-endpoint-sudo.integration Test Passed

>>> (linux-amd64-ubuntu-2204-fleet) Connected over SSH
>>> (linux-amd64-ubuntu-2204-fleet) Running sudo tests...
>>> (linux-amd64-ubuntu-2204-fleet) Test output (sudo) (stdout): >> go test: remote-linux-amd64-ubuntu-2204-fleet-sudo.integration Test Passed

>>> (linux-amd64-ubuntu-2204-fleet-airgapped) Connected over SSH
>>> (linux-amd64-ubuntu-2204-fleet-airgapped) Running sudo tests...
>>> (linux-amd64-ubuntu-2204-fleet-airgapped) Test output (sudo) (stdout): >> go test: remote-linux-amd64-ubuntu-2204-fleet-airgapped-sudo.integration Test Passed

>>> (linux-amd64-ubuntu-2204-upgrade) Connected over SSH
>>> (linux-amd64-ubuntu-2204-upgrade) Running sudo tests...
>>> (linux-amd64-ubuntu-2204-upgrade) Test output (sudo) (stdout): >> go test: remote-linux-amd64-ubuntu-2204-upgrade-sudo.integration Test Passed

>>> (linux-amd64-ubuntu-2204-default) Connected over SSH
>>> (linux-amd64-ubuntu-2204-default) Running non-sudo tests...
>>> (linux-amd64-ubuntu-2204-default) Test output (non-sudo) (stdout): >> go test: remote-linux-amd64-ubuntu-2204-default.integration Test Passed
>>> (linux-amd64-ubuntu-2204-default) Running sudo tests...
>>> (linux-amd64-ubuntu-2204-default) Test output (sudo) (stdout): >> go test: remote-linux-amd64-ubuntu-2204-default-sudo.integration Test Passed

>>> (windows-amd64-2022-endpoint) Running sudo tests...
>>> (windows-amd64-2022-endpoint) Test output (sudo) (stdout): >> go test: remote-windows-amd64-2022-endpoint-sudo.integration Test Passed

>>> (windows-amd64-2022-upgrade) Running sudo tests...
>>> (windows-amd64-2022-upgrade) Test output (sudo) (stdout): >> go test: remote-windows-amd64-2022-upgrade-sudo.integration Test Passed

>>> (windows-amd64-2022-fleet) Running sudo tests...
>>> (windows-amd64-2022-fleet) Test output (sudo) (stdout): >> go test: remote-windows-amd64-2022-fleet-sudo.integration Test Passed

>>> (windows-amd64-2022-default) Connected over SSH
>>> (windows-amd64-2022-default) Running non-sudo tests...

Last line from windows-amd64-2022-default is:

>>> (windows-amd64-2022-default) Test output (non-sudo) (stdout): -test.shuffle 1701112357723785000

Copy link

SonarQube Quality Gate

Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@blakerouse blakerouse merged commit 8a8abd0 into elastic:main Nov 29, 2023
9 checks passed
@blakerouse blakerouse deleted the shard-integration-tests branch November 29, 2023 04:30
@blakerouse blakerouse added backport-v8.11.0 Automated backport with mergify and removed backport-skip labels Nov 29, 2023
mergify bot pushed a commit that referenced this pull request Nov 29, 2023
* Add ability to shard integration tests across muliple hosts.

* Change to Group.

* Add to dev guide.

* Fix magefile.

* Apply suggestions from code review

Co-authored-by: Craig MacKenzie <[email protected]>

* Add define.Default constant.

* Remove isolate.

* Fix tests.

* Update groups.

* Adjust groups.

* Adjust some groups, add debugging SSH key.

* Only windows default group with all logs.

* Remove windows specific.

* Fix add_cloud_metadata allowed errors.

* Another allowed error.

* Yet another error.

---------

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit 8a8abd0)

# Conflicts:
#	magefile.go
#	pkg/testing/runner/config.go
#	pkg/testing/runner/runner.go
#	testing/integration/beats_serverless_test.go
#	testing/integration/install_unprivileged_test.go
#	testing/integration/logs_ingestion_test.go
#	testing/integration/upgrade_fleet_test.go
blakerouse added a commit that referenced this pull request Dec 2, 2023
…fferent groups (#3841)

* Add ability to split integration tests into different groups (#3544)

* Add ability to shard integration tests across muliple hosts.

* Change to Group.

* Add to dev guide.

* Fix magefile.

* Apply suggestions from code review

Co-authored-by: Craig MacKenzie <[email protected]>

* Add define.Default constant.

* Remove isolate.

* Fix tests.

* Update groups.

* Adjust groups.

* Adjust some groups, add debugging SSH key.

* Only windows default group with all logs.

* Remove windows specific.

* Fix add_cloud_metadata allowed errors.

* Another allowed error.

* Yet another error.

---------

Co-authored-by: Craig MacKenzie <[email protected]>
(cherry picked from commit 8a8abd0)

# Conflicts:
#	magefile.go
#	pkg/testing/runner/config.go
#	pkg/testing/runner/runner.go
#	testing/integration/beats_serverless_test.go
#	testing/integration/install_unprivileged_test.go
#	testing/integration/logs_ingestion_test.go
#	testing/integration/upgrade_fleet_test.go

* Fix conflicts.

---------

Co-authored-by: Blake Rouse <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.11.0 Automated backport with mergify skip-changelog Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants