Add GPU unit test #456

Merged 1 commit into aws:main on Jul 10, 2024

Conversation

weicongw (Contributor) commented Jul 9, 2024

Issue #, if available:

Description of changes:

  • Add GPU unit tests. The tests include the following:

    • test_sysinfo.sh :: Validate basic system configuration by comparing it against stored config data

      • test_numa_topo_topo :: check the CPU/NUMA topology
      • test_nvidia_gpu_count :: fail if any GPU is broken or not visible
      • test_nvidia_fabric_status :: fail if the fabric manager is not active
      • test_nvidia_smi_topo :: fail if the nvidia-smi topology differs from the expected one
      • test_nvidia_persistence_status :: validate the persistence mode state
      • test_nvidia_gpu_unused :: check that no other processes are using the GPUs; a failure signals a system misconfiguration
    • 10_test_basic_cuda.sh :: Execute trivial CUDA binaries; fail if the CUDA subsystem is not healthy.
      Uses the demo-suite binaries (https://docs.nvidia.com/cuda/demo-suite/index.html) and DCGM Diagnostics (https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#run-levels-and-tests).
      If this test suite fails, it is a sign that the CUDA subsystem is not usable at all, usually as a side effect of system misconfiguration (the driver or fabric manager is not loaded). A rough sketch of these checks follows the list.

      • test_01_device_query
      • test_02_vector_add
      • test_03_bandwidth
      • test_04_bus_grind
      • test_05_dcgm_diagnostics
  • Reuse the NCCL test Dockerfile for the unit tests
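
A rough sketch of the kind of checks 10_test_basic_cuda.sh performs (the demo-suite install path, the run_check helper, and the exact binary flags below are illustrative assumptions, not the actual test code):

#!/usr/bin/env bash
# Illustrative only: run the CUDA demo-suite binaries and a quick DCGM diagnostic,
# emitting TAP-style "ok"/"not ok" lines like the example output below.

DEMO_SUITE_DIR="${DEMO_SUITE_DIR:-/usr/local/cuda/extras/demo_suite}"   # assumed install location
failures=0

run_check() {
    local name="$1"; shift
    if "$@" > /dev/null 2>&1; then
        echo "ok - ${name}"
    else
        echo "not ok - ${name}"
        failures=$((failures + 1))
    fi
}

run_check test_01_device_query     "${DEMO_SUITE_DIR}/deviceQuery"
run_check test_02_vector_add       "${DEMO_SUITE_DIR}/vectorAdd"
run_check test_03_bandwidth        "${DEMO_SUITE_DIR}/bandwidthTest"
run_check test_04_bus_grind        "${DEMO_SUITE_DIR}/busGrind"
run_check test_05_dcgm_diagnostics dcgmi diag -r 1                      # DCGM run level 1 = quick tests

exit "${failures}"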

Test example:

Example of successful test execution:

ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok - test_05_dcgm_diagnostics
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh
ok - test_numa_topo_topo
ok - test_nvidia_gpu_count
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok - test_nvidia_persistence_status
ok - test_nvidia_smi_topo

Example of failed test execution (when the GPU count doesn't match our config data):

# Running tests in gpu_unit_tests/tests/test_basic.sh
ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok - test_05_dcgm_diagnostics
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh

ok - test_numa_topo_topo
not ok - test_nvidia_gpu_count
# Unexpected gpu count
#  test data value diff:
# --- test_sysinfo.sh.data/p3.2xlarge/gpu_count.txt     2024-07-09 01:28:17.000000000 +0000
# +++ /tmp/test_sysinfo.sh.actual-data.4MA/gpu_count.txt        2024-07-09 01:29:37.278476754 +0000
# @@ -1,2 +1,2 @@
#  name, index, pci.bus_id
# -Tesla A100-SXM2-16GB, 0, 00000000:00:1E.0
# +Tesla V100-SXM2-16GB, 0, 00000000:00:1E.0
# common.sh:32:_assert_data()
# common.sh:37:assert_data()
# test_sysinfo.sh:39:test_nvidia_gpu_count()
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok - test_nvidia_persistence_status
ok - test_nvidia_smi_topo
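
The stack trace above points at assert_data() in common.sh, which compares freshly captured output against data stored per instance type. A minimal sketch of that pattern (the helper signature, data paths, and the INSTANCE_TYPE variable are assumptions for illustration, not the actual implementation):

# Illustrative only: capture the live GPU inventory and diff it against the stored
# expected data for the current instance type, failing the test when they differ.
assert_data() {
    local expected="$1"   # e.g. test_sysinfo.sh.data/p3.2xlarge/gpu_count.txt
    local actual="$2"     # freshly captured output file
    if ! diff -u "${expected}" "${actual}"; then
        echo "# Unexpected test data value, see diff above"
        return 1
    fi
}

test_nvidia_gpu_count() {
    # This query produces the "name, index, pci.bus_id" CSV seen in the diff above.
    nvidia-smi --query-gpu=name,index,pci.bus_id --format=csv > /tmp/gpu_count.txt
    assert_data "test_sysinfo.sh.data/${INSTANCE_TYPE}/gpu_count.txt" /tmp/gpu_count.txt
}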

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

weicongw marked this pull request as ready for review July 9, 2024 01:47
cartermckinnon (Member) left a comment

This looks fine overall

spec:
  containers:
    - name: unit-test-container
      image: public.ecr.aws/o5d5x8n6/weicongw:nvidia
cartermckinnon (Member):

How will we swap this out?

weicongw (Contributor, Author):

This was for my dev test, and I've changed it to use template variables. The image comes from e2e2/test/images/nvidia/Dockerfile, and it's almost the same as our previous NCCL test image.
I am also thinking of using the same docker image for all NVIDIA tests and putting the Dockerfile in e2e2/test/images. If you agree with this approach, I'll remove e2e2/test/images/Dockerfile.aws-efa-nccl-tests in the next PR.

cartermckinnon (Member):

Yeah, that makes sense to me; we should be able to get by with a single image. Let's do that in a followup.

cartermckinnon merged commit 5ae025d into aws:main on Jul 10, 2024
5 checks passed