Add integration test for EFA #431

Open
wants to merge 1 commit into
base: main

movence
Contributor

@movence movence commented Nov 18, 2024

Description of changes

This PR adds a new integration test for AWS EFA.

This is similar to the existing EKS daemonset test suites, such as GPU and Neuron. One addition is an initContainer in the agent pod that creates an empty directory and then builds the EFA data structure with a dummy device (device1). The newly created files and directories are mounted into the agent container, where the EFA receiver expects them. Below is the structure that the EFA receiver expects:

/sys/class/infiniband/<device name>
└── ports
    └── 1
        └── hw_counters
            ├── rdma_read_bytes
            ├── rdma_write_bytes
            ├── rdma_write_recv_bytes
            ├── rx_bytes
            ├── rx_drops
            └── tx_bytes

Additionally, the init container runs chmod on infiniband/ because the receiver expects the folder to be writable only by the root user.

Pod- and container-level metrics are NOT tested by this new test since it mocks the EFA device. The missing piece for testing pod/container-level EFA metrics is podresourcestore, which is responsible for fetching allocated devices (e.g. GPU or EFA) and mapping them to their owning pod/container. The challenge is setting up an actual workload that utilizes EFA using NCCL or MPI (link). Such workloads usually require P instance types (p3dn, p4d or p5), which are usually not available (?) given the limited capacity in the main AZs used by the integ tests. This would make the test flaky. podresourcestore functionality is tested by the Neuron integ test, so we are not completely ignoring that portion. We could improve this test with an actual EFA workload in the future.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Integ test run in the main agent repo: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/11894849476/job/33145075141

Local test:

null_resource.validator (local-exec): 2024/11/17 23:33:12 >>>>>>>>>>>>>><<<<<<<<<<<<<<
null_resource.validator (local-exec): 2024/11/17 23:33:12 >>>>>>>>>>>>>>Successful<<<<<<<<<<<<<<
null_resource.validator (local-exec): 2024/11/17 23:33:12 ==============EKS_EFA==============
null_resource.validator (local-exec): 2024/11/17 23:33:12 ==============Successful==============
null_resource.validator (local-exec): ClusterName                       Successful
null_resource.validator (local-exec): node_efa_rx_bytes                 Successful
null_resource.validator (local-exec): node_efa_tx_bytes                 Successful
null_resource.validator (local-exec): node_efa_rx_dropped               Successful
null_resource.validator (local-exec): node_efa_rdma_read_bytes          Successful
null_resource.validator (local-exec): node_efa_rdma_write_bytes         Successful
null_resource.validator (local-exec): node_efa_rdma_write_recv_bytes    Successful
null_resource.validator (local-exec): ClusterName-InstanceId-NodeName   Successful
null_resource.validator (local-exec): node_efa_rx_bytes                 Successful
null_resource.validator (local-exec): node_efa_tx_bytes                 Successful
null_resource.validator (local-exec): node_efa_rx_dropped               Successful
null_resource.validator (local-exec): node_efa_rdma_read_bytes          Successful
null_resource.validator (local-exec): node_efa_rdma_write_bytes         Successful
null_resource.validator (local-exec): node_efa_rdma_write_recv_bytes    Successful
null_resource.validator (local-exec): emf-logs                          Successful
null_resource.validator (local-exec): 2024/11/17 23:33:12 ==============================
null_resource.validator (local-exec): 2024/11/17 23:33:12 >>>>>>>>>>>>>>><<<<<<<<<<<<<<<
null_resource.validator (local-exec): >>>> Finished EFA Container Insights TestSuite
null_resource.validator (local-exec): --- PASS: TestEfaSuite (183.20s)
null_resource.validator (local-exec):     --- PASS: TestEfaSuite/TestAllInSuite (183.20s)
null_resource.validator (local-exec): PASS
null_resource.validator (local-exec): ok  	github.com/aws/amazon-cloudwatch-agent-test/test/efa	56s

@movence movence requested a review from a team as a code owner November 18, 2024 15:10
Contributor

@Paramadon Paramadon left a comment

LGTM!
