Description of changes
This PR adds a new integration test for AWS EFA.
This is similar to the existing EKS daemonset test suites, such as GPU and Neuron. One addition is an `initContainer` in the agent pod, which creates an empty directory and then builds the EFA data structure with a dummy device (`device1`). These newly created files and directories are mounted into the agent container, where the EFA receiver expects them. Below is the structure that the EFA receiver expects:
Additionally, the init container runs `chmod` on `infiniband/`, as the receiver expects the folder to be writable by the root user only.
Pod- and container-level metrics are NOT tested with this new test, since it mocks the EFA device. The missing piece for testing pod/container-level EFA metrics is `podresourcestore`, which is responsible for fetching allocated devices (e.g. GPU or EFA) and mapping them to their owning pod/container. The challenge is setting up an actual workload that uses EFA through NCCL or MPI (link). Such workloads usually require P instance types (p3dn, p4d or p5), which are usually not available (?) given the limited capacity in the main AZs used by the integ tests. This would be a problem, since it would make the test flaky.
`podresourcestore` functionality is tested by the Neuron integ test, so we are not completely ignoring that portion. We could improve this test with an actual EFA workload as a future improvement.
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Integ test run in the main agent repo: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/11894849476/job/33145075141
Local test: