Fix resnet50 test #42

Open
JPRichings opened this issue Aug 23, 2024 · 1 comment
Labels
bug Something isn't working

Comments

JPRichings commented Aug 23, 2024

The test is written with hard-coded paths to Python environments. These need to be moved to a central location or built as part of the test.
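One possible way to remove the hard-coded paths, sketched under assumptions: the environment root is read from an environment variable with a shared fallback. The variable name `EPCC_REFRAME_PYENV_ROOT`, the fallback path, and the helper `resolve_pyenv` are all hypothetical and do not come from the repository.

```python
import os
from pathlib import PurePosixPath

# Hypothetical names: neither the variable nor the fallback exists in the repo.
ENV_VAR = "EPCC_REFRAME_PYENV_ROOT"
DEFAULT_ROOT = "/path/to/shared/pyenvs"  # placeholder for a central location


def resolve_pyenv(name, environ=None):
    """Return the path of the activate script for the named Python environment.

    The root directory comes from ENV_VAR when it is set, otherwise from
    DEFAULT_ROOT, so the test itself never embeds a user-specific path.
    """
    environ = os.environ if environ is None else environ
    root = PurePosixPath(environ.get(ENV_VAR, DEFAULT_ROOT))
    return root / name / "bin" / "activate"


# Example: build the line a test could place in its pre-run commands.
activate = resolve_pyenv("torch-cpu", {ENV_VAR: "/tmp/envs"})
source_cmd = f"source {activate}"
```

The alternative mentioned above, building the environment as part of the test, would avoid the shared location entirely at the cost of a longer run.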

@RuiApostolo

I've run the resnet50 tests with the most recent commit (9f80583). On ARCHER2, the performance checks fail on CPU and there's a warning:

WARNING: reference unit ('archer2:compute:Throughput') for the performance variable 'Throughput' does not match the unit specified in the performance function ('inputs/s'): 'inputs/s' will be used

On Cirrus, the CPU tests aren't being run because there's no valid programming environment for the test, but the GPU test passes.
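The unit warning suggests that the unit stored in the `archer2:compute` reference for `Throughput` differs from the `inputs/s` declared by the performance function. Below is a rough sketch of a check for that kind of mismatch, using a plain dictionary in the `(value, lower, upper, unit)` shape that ReFrame references take. The numeric values are taken from the ARCHER2 output below; the `images/s` unit is a placeholder, since the warning doesn't show the unit currently in the reference, and `unit_mismatches` is a hypothetical helper.

```python
# Reference table in ReFrame's shape:
# {"system:partition": {variable: (value, lower, upper, unit)}}
# The "images/s" unit is an assumed placeholder for the mismatching entry.
reference = {
    "archer2:compute": {
        "Throughput": (200, -0.05, 0.10, "images/s"),
        "Epoch Length": (500, -0.05, 0.05, "s"),
    }
}

# Units declared by the performance functions, as printed in the test output.
perf_units = {"Throughput": "inputs/s", "Epoch Length": "s"}


def unit_mismatches(reference, perf_units):
    """List (partition, variable, reference_unit, function_unit) mismatches."""
    mismatches = []
    for partition, variables in reference.items():
        for var, (_value, _lower, _upper, unit) in variables.items():
            expected = perf_units.get(var)
            if expected is not None and unit != expected:
                mismatches.append((partition, var, unit, expected))
    return mismatches


for partition, var, ref_unit, fn_unit in unit_mismatches(reference, perf_units):
    print(f"{partition}:{var}: reference uses {ref_unit!r}, function uses {fn_unit!r}")
```

Setting the reference unit to `inputs/s` would likely silence the warning, although it is independent of the performance failure itself.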

ARCHER2:

rapostol@ln03:~/work/epcc-reframe> reframe -R -r -C configuration/archer2.py -c tests/mlperf/resnet50/
[ReFrame Setup]
  version:           4.2.1
  command:           '/work/y07/shared/utils/core/reframe/4.2.1/bin/reframe -R -r -C configuration/archer2.py -c tests/mlperf/resnet50/'
  launched by:       rapostol@ln03
  working directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe'
  settings files:    '<builtin>', 'configuration/archer2.py'
  check search path: (R) '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/tests/mlperf/resnet50'
  stage directory:   '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage'
  output directory:  '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/output'
  log files:         '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'

[==========] Running 2 check(s)
[==========] Started on Fri Sep 27 11:08:59 2024

[----------] start processing checks
[ RUN      ] ResNet50CPUCheck /baf749c9 @archer2:compute+PrgEnv-gnu
[ RUN      ] ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
[       OK ] (1/2) ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
P: Throughput: 280.82487335205076 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 8.829109907150269 s (r:0, l:None, u:None)
P: Delta Loss: -0.16565847396850586  (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 1.9883466614000003 s (r:0, l:None, u:None)
WARNING: reference unit ('archer2:compute:Throughput') for the performance variable 'Throughput' does not match the unit specified in the performance function ('inputs/s'): 'inputs/s' will be used
[     FAIL ] (2/2) ResNet50CPUCheck /baf749c9 @archer2:compute+PrgEnv-gnu
P: Throughput: 40.2118480682373 inputs/s (r:200, l:-0.05, u:0.1)
P: Epoch Length: 50.940342330932616 s (r:500, l:-0.05, u:0.05)
P: Delta Loss: -0.21529674530029297  (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0.3, l:-0.1, u:0.1)
P: Total IO Time: 0.5498948521000001 s (r:8.4, l:-0.1, u:0.1)
==> test failed during 'performance': test staged in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/ResNet50CPUCheck'
[----------] all spawned checks have finished

[  FAILED  ] Ran 2/2 test case(s) from 2 check(s) (1 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Fri Sep 27 11:20:01 2024
=================================================================================================================================================
SUMMARY OF FAILURES
-------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for ResNet50CPUCheck (run: 1/1)
  * Description: ResNet50 CPU Benchmark
  * System partition: archer2:compute
  * Environment: PrgEnv-gnu
  * Stage directory: /mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/ResNet50CPUCheck
  * Node list:
  * Job type: batch job (id=7689469)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: performance
  * Rerun with '-n /baf749c9 -p PrgEnv-gnu --system archer2:compute -r'
  * Reason: performance error: failed to meet reference: Throughput=40.2118480682373, expected 200 (l=190.0, u=220.00000000000003)
-------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'
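
The bounds in the failure reason are consistent with treating the thresholds as fractions of the reference value: 200 with `l=-0.05` and `u=0.1` gives roughly 190 to 220. Below is a small sketch of that arithmetic, assuming this is how the limits are derived; `perf_bounds` and `within` are hypothetical helpers, not ReFrame functions.

```python
def perf_bounds(ref, lower, upper):
    """Return (low, high) limits for a reference value.

    Thresholds are fractions of the reference; None means "no limit".
    """
    low = ref * (1 + lower) if lower is not None else None
    high = ref * (1 + upper) if upper is not None else None
    return low, high


def within(value, ref, lower, upper):
    """Check a measured value against the limits derived from the reference."""
    low, high = perf_bounds(ref, lower, upper)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


# ARCHER2 CPU values from the run above.
throughput_ok = within(40.2118480682373, 200, -0.05, 0.1)
epoch_ok = within(50.940342330932616, 500, -0.05, 0.05)
print(perf_bounds(200, -0.05, 0.1), throughput_ok, epoch_ok)
```

By the same arithmetic, the measured Epoch Length and Total IO Time also fall well outside their limits, so the CPU references as a whole may need to be revisited rather than Throughput alone.
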

CIRRUS:

rapostol@cirrus-login2:~/work/reframe_dev$ reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/
[ReFrame Setup]
  version:           4.6.0-dev.1
  command:           '/work/y07/shared/cirrus-software/reframe/bin/reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/'
  launched by:       rapostol@cirrus-login2
  working directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev'
  settings files:    '<builtin>', 'configuration/cirrus.py'
  check search path: (R) '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/tests/mlperf/resnet50'
  stage directory:   '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/stage'
  output directory:  '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/output'
  log files:         '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'

[==========] Running 1 check(s)
[==========] Started on Fri Sep 27 11:10:07 2024+0100

[----------] start processing checks
[ RUN      ] ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
[       OK ] (1/1) ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
P: Throughput: 45.628501892089844 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 44.92460060119629 s (r:0, l:None, u:None)
P: Delta Loss: 0.08161067962646484  (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 0.2923150888 s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Fri Sep 27 11:19:33 2024+0100
Log file(s) saved in '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
