Add GPU unit test #456

Merged 1 commit into aws:main on Jul 10, 2024

Conversation

weicongw (Contributor) commented Jul 9, 2024

Issue #, if available:

Description of changes:

  • Add GPU unit tests. The tests include the following:

    • test_sysinfo.sh :: Validate basic system configuration by comparing it against stored config data

      • test_numa_topo_topo :: check the CPU/NUMA topology
      • test_nvidia_gpu_count :: fail if any GPU is broken or not visible
      • test_nvidia_fabric_status :: fail if the fabric manager is not active
      • test_nvidia_smi_topo :: fail if the nvidia-smi topology differs from the expected one
      • test_nvidia_persistence_status :: validate the persistence mode state
      • test_nvidia_gpu_unused :: check that no other processes are using the GPUs; a failure signals a system misconfiguration
    • 10_test_basic_cuda.sh :: Execute trivial CUDA binaries; fail if the CUDA subsystem is not healthy.
      Uses the demo-suite binaries (https://docs.nvidia.com/cuda/demo-suite/index.html) and DCGM Diagnostics (https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#run-levels-and-tests).
      If this test suite fails, it is a sign that the CUDA subsystem is not usable at all, usually as a side effect of system misconfiguration (the driver or fabric manager is not loaded). A rough sketch of these checks follows the list.

      • test_01_device_query
      • test_02_vector_add
      • test_03_bandwidth
      • test_04_bus_grind
      • test_05_dcgm_diagnostics
  • Reuse the NCCL test Dockerfile for the unit tests
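
A rough sketch of the kind of checks 10_test_basic_cuda.sh performs (the demo-suite install path, the run_check helper, and the exact binary flags below are illustrative assumptions, not the actual test code):

#!/usr/bin/env bash
# Illustrative only: run the CUDA demo-suite binaries and a quick DCGM diagnostic,
# emitting TAP-style "ok"/"not ok" lines like the example output below.

DEMO_SUITE_DIR="${DEMO_SUITE_DIR:-/usr/local/cuda/extras/demo_suite}"   # assumed install location
failures=0

run_check() {
    local name="$1"; shift
    if "$@" > /dev/null 2>&1; then
        echo "ok - ${name}"
    else
        echo "not ok - ${name}"
        failures=$((failures + 1))
    fi
}

run_check test_01_device_query     "${DEMO_SUITE_DIR}/deviceQuery"
run_check test_02_vector_add       "${DEMO_SUITE_DIR}/vectorAdd"
run_check test_03_bandwidth        "${DEMO_SUITE_DIR}/bandwidthTest"
run_check test_04_bus_grind        "${DEMO_SUITE_DIR}/busGrind"
run_check test_05_dcgm_diagnostics dcgmi diag -r 1                      # DCGM run level 1 = quick tests

exit "${failures}"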

Test example:

Example of successful test execution:

ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok - test_05_dcgm_diagnostics
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh
ok - test_numa_topo_topo
ok - test_nvidia_gpu_count
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok - test_nvidia_persistence_status
ok - test_nvidia_smi_topo

Example of failed test execution (when the GPU count doesn't match our config data):

# Running tests in gpu_unit_tests/tests/test_basic.sh
ok - test_01_device_query
ok - test_02_vector_add
ok - test_03_bandwidth
ok - test_04_bus_grind
ok - test_05_dcgm_diagnostics
# Running tests in gpu_unit_tests/tests/test_sysinfo.sh

ok - test_numa_topo_topo
not ok - test_nvidia_gpu_count
# Unexpected gpu count
#  test data value diff:
# --- test_sysinfo.sh.data/p3.2xlarge/gpu_count.txt     2024-07-09 01:28:17.000000000 +0000
# +++ /tmp/test_sysinfo.sh.actual-data.4MA/gpu_count.txt        2024-07-09 01:29:37.278476754 +0000
# @@ -1,2 +1,2 @@
#  name, index, pci.bus_id
# -Tesla A100-SXM2-16GB, 0, 00000000:00:1E.0
# +Tesla V100-SXM2-16GB, 0, 00000000:00:1E.0
# common.sh:32:_assert_data()
# common.sh:37:assert_data()
# test_sysinfo.sh:39:test_nvidia_gpu_count()
ok - test_nvidia_gpu_throttled
ok - test_nvidia_gpu_unused
ok - test_nvidia_persistence_status
ok - test_nvidia_smi_topo
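
The stack trace above points at assert_data() in common.sh, which compares freshly captured output against data stored per instance type. A minimal sketch of that pattern (the helper signature, data paths, and the INSTANCE_TYPE variable are assumptions for illustration, not the actual implementation):

# Illustrative only: capture the live GPU inventory and diff it against the stored
# expected data for the current instance type, failing the test when they differ.
assert_data() {
    local expected="$1"   # e.g. test_sysinfo.sh.data/p3.2xlarge/gpu_count.txt
    local actual="$2"     # freshly captured output file
    if ! diff -u "${expected}" "${actual}"; then
        echo "# Unexpected test data value, see diff above"
        return 1
    fi
}

test_nvidia_gpu_count() {
    # This query produces the "name, index, pci.bus_id" CSV seen in the diff above.
    nvidia-smi --query-gpu=name,index,pci.bus_id --format=csv > /tmp/gpu_count.txt
    assert_data "test_sysinfo.sh.data/${INSTANCE_TYPE}/gpu_count.txt" /tmp/gpu_count.txt
}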

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

weicongw marked this pull request as ready for review July 9, 2024 01:47
cartermckinnon (Member) left a comment

This looks fine overall

spec:
  containers:
    - name: unit-test-container
      image: public.ecr.aws/o5d5x8n6/weicongw:nvidia
cartermckinnon (Member):

How will we swap this out?

weicongw (Contributor, Author):

This was for my dev test, and I've changed it to use template variables. The image comes from e2e2/test/images/nvidia/Dockerfile, and it's almost the same as our previous NCCL test image.
I am also thinking of using the same docker image for all NVIDIA tests and putting the Dockerfile in e2e2/test/images. If you agree with this approach, I'll remove e2e2/test/images/Dockerfile.aws-efa-nccl-tests in the next PR.

cartermckinnon (Member):

Yeah, that makes sense to me; we should be able to get by with a single image. Let's do that in a followup.

cartermckinnon merged commit 5ae025d into aws:main on Jul 10, 2024
5 checks passed