Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract code from print_cfl into dss_hvtensor routine #6723

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mt5555
Copy link
Contributor

@mt5555 mt5555 commented Oct 31, 2024

HOMME fortran init code refactor: important code hidden in print_cfl() is extracted into a new routine, dss_hvtensor

dss_hvtensor improves HV results by making the HV coefficients smooth accross elements. It also does a bilinear projection of the coefficients within each element to minimize oscillations in the tensor coefficieints.

[BFB]

@mt5555 mt5555 added the HOMME label Oct 31, 2024
@mt5555 mt5555 requested a review from tcclevenger October 31, 2024 00:08
@mt5555 mt5555 self-assigned this Oct 31, 2024
@rljacob rljacob assigned tcclevenger and unassigned mt5555 Nov 21, 2024
@rljacob
Copy link
Member

rljacob commented Dec 19, 2024

@tcclevenger I think this is ready

@tcclevenger
Copy link
Contributor

@rljacob Currently trying to setup homme standalone tests to verify this fix. Will merge to next once I complete that.

tcclevenger added a commit that referenced this pull request Jan 2, 2025
HOMME fortran init code refactor: important code hidden in print_cfl() is extracted
into a new routine, dss_hvtensor.

dss_hvtensor improves HV results by making the HV coefficients smooth accross elements.
It also does a bilinear projection of the coefficients within each element to minimize
oscillations in the tensor coefficieints.

[BFB]
@tcclevenger
Copy link
Contributor

Merged to next

@tcclevenger
Copy link
Contributor

I suspect this PR could be the reason for all the testing DIFFs. Reverting from next to investigate.

@mt5555
Copy link
Contributor Author

mt5555 commented Jan 3, 2025

one guess: keep the new routine in the original file, global_norms_mod.F90, so it will be compiled with the same optimization.

@tcclevenger
Copy link
Contributor

one guess: keep the new routine in the original file, global_norms_mod.F90, so it will be compiled with the same optimization.

Normalized errors seem to be too high for just optimization differences. I'm seeing them as large as 2e+1 in some cases.

@rljacob
Copy link
Member

rljacob commented Jan 9, 2025

@tcclevenger Any new info?

@tcclevenger
Copy link
Contributor

@tcclevenger Any new info?

Not yet, I'm going to carve out time next week to solve this.

@rljacob
Copy link
Member

rljacob commented Jan 24, 2025

@tcclevenger any update?

@tcclevenger
Copy link
Contributor

@tcclevenger any update?

Looked at this last week and it not obvious whats going wrong. I'm able to reproduce the error, hoping to have some time in the next week or two to figure it out.

@rljacob rljacob marked this pull request as draft January 24, 2025 18:36
mt5555 and others added 4 commits February 10, 2025 09:47
HOMME fortran init code refactor: important code hidden in print_cfl()
is extracted into a new routine, dss_hvtensor

dss_hvtensor improves HV results by making the HV coefficients
smooth accross elements.  It also does a bilinear projection
of the coefficients within each element to minimize oscillations
in the tensor coefficieints.
@tcclevenger tcclevenger marked this pull request as ready for review February 10, 2025 15:12
@tcclevenger
Copy link
Contributor

@mt5555 You were correct, it was fixed by keeping the impl in the same file. Looking back through the CDASH from when I merged to next, it looks to be only intel compiler that triggered the bug. An example of DIFFs I saw in SMS.ne4pg2_oQU480.F2010.pm-cpu_intel.eam-thetanh_ftype1

1636  x2l_Faxa_snowc   (doml_nx,doml_ny,time)  t_index =      1     1
1637          27      384  (   341,     1,     1) (     3,     1,     1) (   341,     1,     1) (    53,     1,     1)
1638                  201   3.503695987285971E-05   0.000000000000000E+00 3.5E-05  3.503695987285971E-05 9.9E-02  4.259984908567328-110
1639                  201   1.829463226967772E-05   0.000000000000000E+00          0.000000000000000E+00          0.000000000000000E+00
1640                  384  (   374,     1,     1) (     3,     1,     1)
1641           avg abs field values:    2.751477713696158E-07    rms diff: 2.6E-06   avg rel diff(npos):  9.9E-02
1642                                    1.347734549979935E-07                        avg decimal digits(ndif):  0.3 worst:  0.0
1643  RMS x2l_Faxa_snowc                   2.5518E-06            NORMALIZED  1.2450E+01

1681  x2l_Faxa_swndf   (doml_nx,doml_ny,time)  t_index =      1     1
1682          77      384  (   278,     1,     1) (     3,     1,     1) (    77,     1,     1) (    77,     1,     1)
1683                  201   1.939413049475074E+02   0.000000000000000E+00 1.6E+02  1.701407560647017E+02 6.9E-02  1.701407560647017E+02
1684                  201   1.946504984163617E+02   0.000000000000000E+00          1.282623020327691E+01          1.282623020327691E+01
1685                  384  (   130,     1,     1) (     3,     1,     1)
1686           avg abs field values:    1.478772437679993E+01    rms diff: 1.7E+01   avg rel diff(npos):  6.9E-02
1687                                    1.320859658869269E+01                        avg decimal digits(ndif):  1.1 worst:  0.0
1688  RMS x2l_Faxa_swndf                   1.7031E+01            NORMALIZED  1.2167E+00

Should we be concerned about diffs this large just because of a file change?

If not, I'll merge to next today and create a PR adding dss_hvtensor() call to EAMxx.

@mt5555
Copy link
Contributor Author

mt5555 commented Feb 10, 2025

So for now - you put the routine back in the original file (but kept the modularization) and it's now BFB?

Roundoff level changes (such as from optimization changes which can depend on the file) do grow exponentially. So by just looking at the numbers above, not possible to confirm if that is acceptable growth or not.

Can we be lazy and just leave this routine in global_norms_mod.F90 (and rely on BFB testing)?

If we really want to move it out of this routine, lets do a second PR that only moves the subroutine (so it will be obvious just from inspection that the change is roundoff and doesn't need more sophisticated testing).

@tcclevenger
Copy link
Contributor

Can we be lazy and just leave this routine in global_norms_mod.F90 (and rely on BFB testing)?

For sure, PM-CPU didn't report on cdash so may have to wait a day, but otherwise looks BFB.

@rljacob
Copy link
Member

rljacob commented Feb 11, 2025

This was placed on next yesterday (with the wrong PR number in the commit title) and its causing testing fails on an odd assortment of cases. Maybe a threading error?

ERROR: edgeVpack: Buffer overflow: edge%nlyr_max too small for kptr

 2: e3sm.exe           0000000006581FC7  shr_abort_mod_mp_         114  shr_abort_mod.F90
154  2: e3sm.exe           00000000025D085D  parallel_mod_mp_a         279  parallel_mod.F90
155  2: e3sm.exe           00000000024E1FF1  edge_mod_base_mp_         523  edge_mod_base.F90
156  2: e3sm.exe           000000000259446C  global_norms_mod_         444  global_norms_mod.F90
157  2: e3sm.exe           00000000025ECDDF  prim_driver_base_         986  prim_driver_base.F90
158  2: e3sm.exe           0000000002F4EDDF  dyn_comp_mp_dyn_i         380  dyn_comp.F90
159  2: libiomp5.so        000015554C4033F3  __kmp_invoke_micr     Unknown  Unknown
160  2: libiomp5.so        000015554C387273  Unknown               Unknown  Unknown
161  2: libiomp5.so        000015554C38621E  Unknown               Unknown  Unknown
162  2: libiomp5.so        000015554C4038CC  Unknown               Unknown  Unknown
163  2: libpthread-2.28.s  00001555483EA1CA  Unknown               Unknown  Unknown
164  2: libc-2.28.so       0000155548056E73  clone                 Unknown  Unknown

cases:
SMS_P12x2.ne4pg2_oQU480.WCYCL1850NS.chrysalis_intel.allactive-mach_mods
SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_LNDATM_CNPRDCTC_20TR.chrysalis_intel
SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_LNDATM_CNPRDCTC_1850.chrysalis_intel
SMS_Ln5.ne30pg2_ne30pg2.F2010-SCREAM-LR-DYAMOND2.chrysalis_intel
PET_Ln9_PS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-mach-pet
PEM_Ln5.ne4pg2_oQU480.F2010.chrysalis_intel
NCK.ne4pg2_oQU480.WCYCL1850NS.chrysalis_intel

@tcclevenger
Copy link
Contributor

Reverted from next while I debug.

@rljacob
Copy link
Member

rljacob commented Feb 11, 2025

Found a different error in a debugging case.

46: forrtl: severe (408): fort: (7): Attempt to use pointer DESC when it is not associated with a target
46:
46: Image PC Routine Line Source
46: libpnetcdf.so.3.0 0000155554F111A2 for_emit_diagnost Unknown Unknown
46: e3sm.exe 00000000041C8B23 edge_mod_base_mp_ 413 edge_mod_base.F90
46: e3sm.exe 000000000431FD27 global_norms_mod_ 444 global_norms_mod.F90
46: e3sm.exe 000000000443F683 prim_driver_base_ 986 prim_driver_base.F90
46: e3sm.exe 0000000005C2FF83 dyn_comp_mp_dyn_i 380 dyn_comp.F90
46: libiomp5.so 00001555531443F3 __kmp_invoke_micr Unknown Unknown
46: libiomp5.so 00001555530C8273 Unknown Unknown Unknown
46: libiomp5.so 00001555530C721E Unknown Unknown Unknown
46: libiomp5.so 00001555531448CC Unknown Unknown Unknown
46: libpthread-2.28.s 00001555483EA1CA Unknown Unknown Unknown
46: libc-2.28.so 0000155548056E73 clone Unknown Unknown
46: [chr-0506:2841896:0:2842014] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

cases:
SMS_D_Ln5.conusx4v1pg2_r05_IcoswISC30E3r5.F2010.chrysalis_intel

@rljacob
Copy link
Member

rljacob commented Feb 11, 2025

@mt5555 should also help since it his branch.

@tcclevenger tcclevenger marked this pull request as draft February 13, 2025 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants