Fix resnet50 test #42
Labels
bug
Something isn't working
Comments
I've run the
On Cirrus, the CPU tests aren't being run because there is no valid programming environment for the test, but the GPU test passes.
ARCHER2:
rapostol@ln03:~/work/epcc-reframe> reframe -R -r -C configuration/archer2.py -c tests/mlperf/resnet50/
[ReFrame Setup]
version: 4.2.1
command: '/work/y07/shared/utils/core/reframe/4.2.1/bin/reframe -R -r -C configuration/archer2.py -c tests/mlperf/resnet50/'
launched by: rapostol@ln03
working directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe'
settings files: '<builtin>', 'configuration/archer2.py'
check search path: (R) '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/tests/mlperf/resnet50'
stage directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage'
output directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/output'
log files: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'
[==========] Running 2 check(s)
[==========] Started on Fri Sep 27 11:08:59 2024
[----------] start processing checks
[ RUN ] ResNet50CPUCheck /baf749c9 @archer2:compute+PrgEnv-gnu
[ RUN ] ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
[ OK ] (1/2) ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
P: Throughput: 280.82487335205076 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 8.829109907150269 s (r:0, l:None, u:None)
P: Delta Loss: -0.16565847396850586 (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 1.9883466614000003 s (r:0, l:None, u:None)
WARNING: reference unit ('archer2:compute:Throughput') for the performance variable 'Throughput' does not match the unit specified in the performance function ('inputs/s'): 'inputs/s' will be used
[ FAIL ] (2/2) ResNet50CPUCheck /baf749c9 @archer2:compute+PrgEnv-gnu
P: Throughput: 40.2118480682373 inputs/s (r:200, l:-0.05, u:0.1)
P: Epoch Length: 50.940342330932616 s (r:500, l:-0.05, u:0.05)
P: Delta Loss: -0.21529674530029297 (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0.3, l:-0.1, u:0.1)
P: Total IO Time: 0.5498948521000001 s (r:8.4, l:-0.1, u:0.1)
==> test failed during 'performance': test staged in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/ResNet50CPUCheck'
[----------] all spawned checks have finished
[ FAILED ] Ran 2/2 test case(s) from 2 check(s) (1 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Fri Sep 27 11:20:01 2024
=================================================================================================================================================
SUMMARY OF FAILURES
-------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for ResNet50CPUCheck (run: 1/1)
* Description: ResNet50 CPU Benchmark
* System partition: archer2:compute
* Environment: PrgEnv-gnu
* Stage directory: /mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/ResNet50CPUCheck
* Node list:
* Job type: batch job (id=7689469)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: []
* Failing phase: performance
* Rerun with '-n /baf749c9 -p PrgEnv-gnu --system archer2:compute -r'
* Reason: performance error: failed to meet reference: Throughput=40.2118480682373, expected 200 (l=190.0, u=220.00000000000003)
-------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'
CIRRUS:
rapostol@cirrus-login2:~/work/reframe_dev$ reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/
[ReFrame Setup]
version: 4.6.0-dev.1
command: '/work/y07/shared/cirrus-software/reframe/bin/reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/'
launched by: rapostol@cirrus-login2
working directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev'
settings files: '<builtin>', 'configuration/cirrus.py'
check search path: (R) '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/tests/mlperf/resnet50'
stage directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/stage'
output directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/output'
log files: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
[==========] Running 1 check(s)
[==========] Started on Fri Sep 27 11:10:07 2024+0100
[----------] start processing checks
[ RUN ] ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
[ OK ] (1/1) ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
P: Throughput: 45.628501892089844 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 44.92460060119629 s (r:0, l:None, u:None)
P: Delta Loss: 0.08161067962646484 (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 0.2923150888 s (r:0, l:None, u:None)
[----------] all spawned checks have finished
[ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Fri Sep 27 11:19:33 2024+0100
Log file(s) saved in '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
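For anyone reading the ARCHER2 failure above: the reported limits (l=190.0, u=220.00000000000003) are consistent with scaling the reference value 200 by the relative thresholds -0.05 and 0.1 shown in the `(r:200, l:-0.05, u:0.1)` line. The following is a minimal sketch of that arithmetic, not ReFrame's own implementation. The function names are hypothetical, and it assumes a positive reference value and that `None` means "no bound".

```python
def reference_bounds(ref, lower, upper):
    """Turn a reference value and relative thresholds into absolute bounds.

    Assumption: ref is positive and a threshold of None means no bound.
    """
    low = ref * (1 + lower) if lower is not None else None
    high = ref * (1 + upper) if upper is not None else None
    return low, high


def within_reference(value, ref, lower, upper):
    """Return True if value lies inside the absolute bounds."""
    low, high = reference_bounds(ref, lower, upper)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


# Values taken from the ARCHER2 CPU run in the log above.
print(reference_bounds(200, -0.05, 0.1))
print(within_reference(40.2118480682373, 200, -0.05, 0.1))
```

With these inputs the measured throughput of about 40.2 inputs/s falls well below the lower bound, which matches the reported performance failure. The GPU tests use `(r:0, l:None, u:None)`, so under the same assumptions they have no bounds to violate.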
The test is written with hard-coded paths to Python environments, which need to be moved to a central location or built as part of the test.
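One possible way to remove the hard-coded paths is to resolve the environment location at run time from a single, centrally set variable. The sketch below is illustrative only: the variable name `RESNET50_PYTHON_ENV`, the helper names, and the example path are all hypothetical and do not come from the repository. It assumes the environment is a standard Python virtual environment with a `bin/activate` script.

```python
import os
from pathlib import Path


def resolve_python_env(env_var="RESNET50_PYTHON_ENV", default=None, environ=None):
    """Look up the Python environment root from an environment variable.

    env_var is a hypothetical name; default is used if the variable is unset.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var, default)
    if raw is None:
        raise RuntimeError(f"{env_var} is not set and no default was given")
    return Path(raw).expanduser()


def activate_command(env_root):
    """Build the shell command that activates the virtual environment."""
    return f"source {env_root / 'bin' / 'activate'}"


# Example with an explicit mapping instead of the real process environment.
root = resolve_python_env(environ={"RESNET50_PYTHON_ENV": "/opt/envs/resnet50"})
print(activate_command(root))
```

The resulting command string could then be added to whatever pre-run commands the test already uses, so that each system only needs to set the variable in one place.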