extract code from print_cfl into dss_hvtensor routine #6723

mt5555 · 2024-10-31T00:08:11Z

HOMME fortran init code refactor: important code hidden in print_cfl() is extracted into a new routine, dss_hvtensor

dss_hvtensor improves HV results by making the HV coefficients smooth accross elements. It also does a bilinear projection of the coefficients within each element to minimize oscillations in the tensor coefficieints.

[BFB]

rljacob · 2024-12-19T18:16:26Z

@tcclevenger I think this is ready

tcclevenger · 2024-12-19T18:23:20Z

@rljacob Currently trying to setup homme standalone tests to verify this fix. Will merge to next once I complete that.

HOMME fortran init code refactor: important code hidden in print_cfl() is extracted into a new routine, dss_hvtensor. dss_hvtensor improves HV results by making the HV coefficients smooth accross elements. It also does a bilinear projection of the coefficients within each element to minimize oscillations in the tensor coefficieints. [BFB]

tcclevenger · 2025-01-02T18:28:09Z

Merged to next

tcclevenger · 2025-01-03T13:46:54Z

I suspect this PR could be the reason for all the testing DIFFs. Reverting from next to investigate.

mt5555 · 2025-01-03T15:21:59Z

one guess: keep the new routine in the original file, global_norms_mod.F90, so it will be compiled with the same optimization.

tcclevenger · 2025-01-03T18:31:13Z

one guess: keep the new routine in the original file, global_norms_mod.F90, so it will be compiled with the same optimization.

Normalized errors seem to be too high for just optimization differences. I'm seeing them as large as 2e+1 in some cases.

rljacob · 2025-01-09T18:26:07Z

@tcclevenger Any new info?

tcclevenger · 2025-01-09T18:37:48Z

@tcclevenger Any new info?

Not yet, I'm going to carve out time next week to solve this.

rljacob · 2025-01-24T16:40:51Z

@tcclevenger any update?

tcclevenger · 2025-01-24T18:33:37Z

@tcclevenger any update?

Looked at this last week and it not obvious whats going wrong. I'm able to reproduce the error, hoping to have some time in the next week or two to figure it out.

HOMME fortran init code refactor: important code hidden in print_cfl() is extracted into a new routine, dss_hvtensor dss_hvtensor improves HV results by making the HV coefficients smooth accross elements. It also does a bilinear projection of the coefficients within each element to minimize oscillations in the tensor coefficieints.

tcclevenger · 2025-02-10T15:46:29Z

@mt5555 You were correct, it was fixed by keeping the impl in the same file. Looking back through the CDASH from when I merged to next, it looks to be only intel compiler that triggered the bug. An example of DIFFs I saw in SMS.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetanh_ftype1

1636  x2l_Faxa_snowc   (doml_nx,doml_ny,time)  t_index =      1     1
1637          27      384  (   341,     1,     1) (     3,     1,     1) (   341,     1,     1) (    53,     1,     1)
1638                  201   3.503695987285971E-05   0.000000000000000E+00 3.5E-05  3.503695987285971E-05 9.9E-02  4.259984908567328-110
1639                  201   1.829463226967772E-05   0.000000000000000E+00          0.000000000000000E+00          0.000000000000000E+00
1640                  384  (   374,     1,     1) (     3,     1,     1)
1641           avg abs field values:    2.751477713696158E-07    rms diff: 2.6E-06   avg rel diff(npos):  9.9E-02
1642                                    1.347734549979935E-07                        avg decimal digits(ndif):  0.3 worst:  0.0
1643  RMS x2l_Faxa_snowc                   2.5518E-06            NORMALIZED  1.2450E+01

1681  x2l_Faxa_swndf   (doml_nx,doml_ny,time)  t_index =      1     1
1682          77      384  (   278,     1,     1) (     3,     1,     1) (    77,     1,     1) (    77,     1,     1)
1683                  201   1.939413049475074E+02   0.000000000000000E+00 1.6E+02  1.701407560647017E+02 6.9E-02  1.701407560647017E+02
1684                  201   1.946504984163617E+02   0.000000000000000E+00          1.282623020327691E+01          1.282623020327691E+01
1685                  384  (   130,     1,     1) (     3,     1,     1)
1686           avg abs field values:    1.478772437679993E+01    rms diff: 1.7E+01   avg rel diff(npos):  6.9E-02
1687                                    1.320859658869269E+01                        avg decimal digits(ndif):  1.1 worst:  0.0
1688  RMS x2l_Faxa_swndf                   1.7031E+01            NORMALIZED  1.2167E+00

Should we be concerned about diffs this large just because of a file change?

If not, I'll merge to next today and create a PR adding dss_hvtensor() call to EAMxx.

mt5555 · 2025-02-10T23:28:48Z

So for now - you put the routine back in the original file (but kept the modularization) and it's now BFB?

Roundoff level changes (such as from optimization changes which can depend on the file) do grow exponentially. So by just looking at the numbers above, not possible to confirm if that is acceptable growth or not.

Can we be lazy and just leave this routine in global_norms_mod.F90 (and rely on BFB testing)?

If we really want to move it out of this routine, lets do a second PR that only moves the subroutine (so it will be obvious just from inspection that the change is roundoff and doesn't need more sophisticated testing).

tcclevenger · 2025-02-11T15:48:42Z

Can we be lazy and just leave this routine in global_norms_mod.F90 (and rely on BFB testing)?

For sure, PM-CPU didn't report on cdash so may have to wait a day, but otherwise looks BFB.

rljacob · 2025-02-11T19:38:04Z

This was placed on next yesterday (with the wrong PR number in the commit title) and its causing testing fails on an odd assortment of cases. Maybe a threading error?

ERROR: edgeVpack: Buffer overflow: edge%nlyr_max too small for kptr

 2: e3sm.exe           0000000006581FC7  shr_abort_mod_mp_         114  shr_abort_mod.F90
154  2: e3sm.exe           00000000025D085D  parallel_mod_mp_a         279  parallel_mod.F90
155  2: e3sm.exe           00000000024E1FF1  edge_mod_base_mp_         523  edge_mod_base.F90
156  2: e3sm.exe           000000000259446C  global_norms_mod_         444  global_norms_mod.F90
157  2: e3sm.exe           00000000025ECDDF  prim_driver_base_         986  prim_driver_base.F90
158  2: e3sm.exe           0000000002F4EDDF  dyn_comp_mp_dyn_i         380  dyn_comp.F90
159  2: libiomp5.so        000015554C4033F3  __kmp_invoke_micr     Unknown  Unknown
160  2: libiomp5.so        000015554C387273  Unknown               Unknown  Unknown
161  2: libiomp5.so        000015554C38621E  Unknown               Unknown  Unknown
162  2: libiomp5.so        000015554C4038CC  Unknown               Unknown  Unknown
163  2: libpthread-2.28.s  00001555483EA1CA  Unknown               Unknown  Unknown
164  2: libc-2.28.so       0000155548056E73  clone                 Unknown  Unknown

cases:
SMS_P12x2.ne4pg2_oQU480.WCYCL1850NS.chrysalis_intel.allactive-mach_mods
SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_LNDATM_CNPRDCTC_20TR.chrysalis_intel
SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_LNDATM_CNPRDCTC_1850.chrysalis_intel
SMS_Ln5.ne30pg2_ne30pg2.F2010-SCREAM-LR-DYAMOND2.chrysalis_intel
PET_Ln9_PS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-mach-pet
PEM_Ln5.ne4pg2_oQU480.F2010.chrysalis_intel
NCK.ne4pg2_oQU480.WCYCL1850NS.chrysalis_intel

tcclevenger · 2025-02-11T19:48:37Z

Reverted from next while I debug.

rljacob · 2025-02-11T19:50:36Z

Found a different error in a debugging case.

46: forrtl: severe (408): fort: (7): Attempt to use pointer DESC when it is not associated with a target
46:
46: Image PC Routine Line Source
46: libpnetcdf.so.3.0 0000155554F111A2 for_emit_diagnost Unknown Unknown
46: e3sm.exe 00000000041C8B23 edge_mod_base_mp_ 413 edge_mod_base.F90
46: e3sm.exe 000000000431FD27 global_norms_mod_ 444 global_norms_mod.F90
46: e3sm.exe 000000000443F683 prim_driver_base_ 986 prim_driver_base.F90
46: e3sm.exe 0000000005C2FF83 dyn_comp_mp_dyn_i 380 dyn_comp.F90
46: libiomp5.so 00001555531443F3 __kmp_invoke_micr Unknown Unknown
46: libiomp5.so 00001555530C8273 Unknown Unknown Unknown
46: libiomp5.so 00001555530C721E Unknown Unknown Unknown
46: libiomp5.so 00001555531448CC Unknown Unknown Unknown
46: libpthread-2.28.s 00001555483EA1CA Unknown Unknown Unknown
46: libc-2.28.so 0000155548056E73 clone Unknown Unknown
46: [chr-0506:2841896:0:2842014] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

cases:
SMS_D_Ln5.conusx4v1pg2_r05_IcoswISC30E3r5.F2010.chrysalis_intel

rljacob · 2025-02-11T19:54:21Z

@mt5555 should also help since it his branch.

mt5555 added the HOMME label Oct 31, 2024

mt5555 requested a review from tcclevenger October 31, 2024 00:08

mt5555 self-assigned this Oct 31, 2024

tcclevenger approved these changes Nov 13, 2024

View reviewed changes

rljacob assigned tcclevenger and unassigned mt5555 Nov 21, 2024

rljacob marked this pull request as draft January 24, 2025 18:36

mt5555 and others added 4 commits February 10, 2025 09:47

Fix: get dss_hvtensor from correct module

fd13a1b

Move dss_hvtensor back to global_norms_mod.F90

6695e29

Some cleanup

5789dc2

tcclevenger force-pushed the dss_hvtensor branch from 16d4e1b to 5789dc2 Compare February 10, 2025 14:55

tcclevenger marked this pull request as ready for review February 10, 2025 15:12

tcclevenger marked this pull request as draft February 13, 2025 17:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extract code from print_cfl into dss_hvtensor routine #6723

extract code from print_cfl into dss_hvtensor routine #6723

mt5555 commented Oct 31, 2024

rljacob commented Dec 19, 2024

tcclevenger commented Dec 19, 2024

tcclevenger commented Jan 2, 2025

tcclevenger commented Jan 3, 2025

mt5555 commented Jan 3, 2025

tcclevenger commented Jan 3, 2025

rljacob commented Jan 9, 2025

tcclevenger commented Jan 9, 2025

rljacob commented Jan 24, 2025

tcclevenger commented Jan 24, 2025

tcclevenger commented Feb 10, 2025

mt5555 commented Feb 10, 2025

tcclevenger commented Feb 11, 2025

rljacob commented Feb 11, 2025 •

edited

Loading

tcclevenger commented Feb 11, 2025

rljacob commented Feb 11, 2025

rljacob commented Feb 11, 2025

extract code from print_cfl into dss_hvtensor routine #6723

Are you sure you want to change the base?

extract code from print_cfl into dss_hvtensor routine #6723

Conversation

mt5555 commented Oct 31, 2024

rljacob commented Dec 19, 2024

tcclevenger commented Dec 19, 2024

tcclevenger commented Jan 2, 2025

tcclevenger commented Jan 3, 2025

mt5555 commented Jan 3, 2025

tcclevenger commented Jan 3, 2025

rljacob commented Jan 9, 2025

tcclevenger commented Jan 9, 2025

rljacob commented Jan 24, 2025

tcclevenger commented Jan 24, 2025

tcclevenger commented Feb 10, 2025

mt5555 commented Feb 10, 2025

tcclevenger commented Feb 11, 2025

rljacob commented Feb 11, 2025 • edited Loading

tcclevenger commented Feb 11, 2025

rljacob commented Feb 11, 2025

rljacob commented Feb 11, 2025

rljacob commented Feb 11, 2025 •

edited

Loading