-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix SIGFPE crash (855) in rotxxx by adding volatile in aloha_functions.f #857
Conversation
…t channelId for susy_gg_t1t1 (fix issue madgraph5#826)
…ot channelId (and note that iconfig=1 is ok)
…g_t1t1 (will give zero cross section madgraph5#826)
… test a different iconfig In particular: the following triggers a SIGFPE reported in madgraph5#855 (crash in rotxxx that can be fixed adding volatile?) ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean This also triggers a similar SIGFPE (initially reported in madgraph5#826) ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
…SIGFPE madgraph5#855, and add volatile in aloha_functions.f to try to fix it The SIGFPE crash madgraph5#855 does seem to disappear in ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean However, there is now a DIFFERENT issue, an lhe file mismatch between fortran and cpp (madgraph5#856) This is probably due to the iconfig/channel mapping issue reported by Olivier in madgraph5#852
…ebug SIGFPE madgraph5#855, and add volatile in aloha_functions.f to try to fix it The SIGFPE crash madgraph5#855 does seem to disappear in ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean Then no cross section is printed also for this iconfig (same as madgraph5#826 for iconfig 1), but this is a DIFFERENT issue
…: note that SIGFPE madgraph5#855 is still fixed because volatile has been added
…adgraph5#855 and prepare codegen backport
…dgraph5#855 in rotxxx The issue was observed and fixed in gg_ttgg (iconfig 104) and susy_gg_t1t1 (iconfig 2), the backport as usual is from gg_tt Note that aloha_functions.f is now added to the list of files to include when preparing patch.common ./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/dsample.f gg_tt.mad/Source/DHELAS/aloha_functions.f gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common git diff --no-ext-diff -R gg_tt.mad/bin/internal/banner.py gg_tt.mad/bin/internal/gen_ximprove.py gg_tt.mad/bin/internal/madevent_interface.py >> CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 git checkout gg_tt.mad
…syggt1t1 to test madgraph5#855 fix while still exposing madgraph5#826 and madgraph5#856
… my O/S - no change, no new files
…pt on my O/S - three new jpg/html files Changes to be committed: modified: susy_gg_t1t1.mad/CODEGEN_mad_susy_gg_t1t1_log.txt new file: susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/card.jpg new file: susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/diagrams.html new file: susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/matrix11.jpg
…grams.html and matrix*.jpg files (created if ghostscript is installed)
…es three jpg/html files generated if ghostscript is installed
…o fix SIGFPE madgraph5#855 in rotxxx
STARTED AT Sun Jun 2 07:28:58 PM CEST 2024 ./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean ENDED(1) AT Sun Jun 2 09:05:36 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean ENDED(2) AT Sun Jun 2 09:23:42 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean ENDED(3) AT Sun Jun 2 09:31:49 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst ENDED(4) AT Sun Jun 2 09:34:34 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst ENDED(5) AT Sun Jun 2 09:37:17 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common ENDED(6) AT Sun Jun 2 09:40:05 PM CEST 2024 [Status=0] ./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean ENDED(7) AT Sun Jun 2 10:08:08 PM CEST 2024 [Status=0]
…res in heft madgraph5#833, susy madgraph5#825 madgraph5#826 and ggttgg madgraph5#856 for iconfig 104) STARTED AT Sun Jun 2 10:08:08 PM CEST 2024 (SM tests) ENDED(1) AT Mon Jun 3 02:38:15 AM CEST 2024 [Status=0] (BSM tests) ENDED(1) AT Mon Jun 3 02:47:18 AM CEST 2024 [Status=0] 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt 24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
…de with no volatile, to rerun tmad and expose SIGFPE madgraph5#855 git checkout upstream/master susy_gg_t1t1.mad gg_ttgg.mad
…se SIGFPE madgraph5#855 - will revert ./tmad/teeMadX.sh -mix -makeclean +10x -ggttgg -susyggt1t1
…h confirmed that SIGFPE madgraph5#855 was present and is now fixed Revert "[tmad] temporarely rerun tmad tests for ggttgg and susyggt1t1 to expose SIGFPE madgraph5#855 - will revert" This reverts commit 4fa1790. Revert "[tmad] in gg_ttgg.mad and susy_gg_t1t1.mad, temporarely go back to code with no volatile, to rerun tmad and expose SIGFPE madgraph5#855" This reverts commit 2f32ffd.
Hi @oliviermattelaer I completed my tests and I confirm that adding volatile fixes this SIGFPE in rotxxx (the code was crashing without and is not crashing with volatile). There is no performance change, or at most a 1-2% negligible change. Can you please review and approve? Thanks Andrea |
On this specific point: note that there is one difference:
It is far fetched, but one possible explanation is
|
…adgraph5#794 and valgrind fixes madgraph5#869): no change in the code
I have merged the latest upstream/master and have rerun the CI which now includes tmad tests. In upstream/master, the CI had 12 failures: 9 from rotxxx crashes, and 3 from no cross section in susy. In this branch, now the 9 rotxxx failures have disappeared, while the 3 no cross sections have stayed. (Note also: after removing rotxxx crashes, I was expecting other issues to appear, but they did not apart from #872 above. In particular I thought that #856 would appear. Maybe these only exist in manual tests, I will cross check... or maybe for that one I need to use iconfig 104.) |
…raph5#855 (prepare to move upstream to mg5amcnlo gpucpp)
…raph5#855 crashes in rotxxx (move this upstream as suggested by Olivier)
…is is now upstream in mg5amcnlo as suggested by Olivier) NB: patches should now be generated without including aloha_functions.f: ./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/dsample.f gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common git diff --no-ext-diff -R gg_tt.mad/bin/internal/banner.py gg_tt.mad/bin/internal/gen_ximprove.py gg_tt.mad/bin/internal/madevent_interface.py >> CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 git checkout gg_tt.mad
…latile patch from patch.common to mg5amcnlo upstream
I have now completed the move of 'volatile' from cudacpp patch.common to the upstream mg5amcnlo in the gpucpp branch, pending this PR mg5amcnlo/mg5amcnlo#113 The generated code is the same (well I just changed a comment), the results do not change As mentiond above, the rotxxx crashes disappear from the CI, while a new xsec discrepancy #872 appears. I will also try to make issue #856 appear in the CI (and possibly another one). This will soon be ready for review. |
…he comment about volatile
…in gg_ttgg tmad tests (LHE color mismatch madgraph5#856?)
I configured the tmad test for gg_ttgg to use iconfig=104. Issue #856 (LHE color mismatch in gg_ttgg for iconfig=104) is now appearing in the CI. (NB THIS ONLY APPEARS IF ROTXXX IS FIXED) See #856 (comment) NB there are now 9 errors in the CI, three fptype for each of the following three issues
|
…of known issues (e.g. remove rotxxx crashes): enable bypasses, tests should succeed
…ng of tmad known issues, the CI will fail, signaling these issues are pending
I have disabled bypasses, now 9 tests fail as expected. This is ready to be merged. @oliviermattelaer can you please review/approve? |
Yes let's merge this, Olivier |
Thanks! I will merge this now Andrea |
…ng rotxxx crashes) into susy Fix conflicts in MG5aMC/mg5amcnlo (keep the latest gpucpp_826 version including the recent gpucpp changes)
…t on Olivier's latest fix_826 commit d23e773 1) Note about Olivier's latest fix_826 commit d23e773 Olivier's 75c05c5 includes his initial 6 commits in fix_826: git log upstream/master --oneline -n1 0992927 (upstream/master, origin/color2, origin/actions) Merge pull request madgraph5#857 from valassi/tmad git log --oneline 0992927..75c05c5 75c05c5 Merge branch 'master_june24' into fix_826 92a8284 better comment in coloramps 2bcea76 trying to fix git issue 63494ef change to Andrea convention of naming (but removing step variable) 5b6d065 increase readibility and move from map to array 41ddc38 fix a issue for omp compilation bed2e12 try to fix the segfault on issue 826 Olivier's d23e773 is then a merge of the latest upstream/master in 75c05c5, fixing the MG5AMC conflict by setting it to 74fd166c1 git show d23e773 Merge: 75c05c5 0992927 update this branch with andrea fix in master diff --cc MG5aMC/mg5amcnlo - Subproject commit 10378b3c0971e1a241fd9dc365e592c92d1f13ba -Subproject commit f274cab55d5d983c5612ca7ab3417ee796aa1a8c ++Subproject commit 74fd166c1e22bde2dfe01b2e001ac3b177628165 2) Note that, in MG5AMC, 74fd166c1 (obsolete branch gpucpp_826) is the same as 09c96dd17 (branch gpucpp): git diff 74fd166c1 09c96dd17 [NO DIFF] git log --oneline e428e38c6..09c96dd17 09c96dd17 (origin/gpucpp) allow for second exporter to have access to all variable used in the fortran exporter 9abf6a3ad Merge pull request madgraph5#113 from valassi/valassi_volatile f274cab55 (ghav/valassi_volatile, valassi_volatile) Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations 0b8678984 Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 18696c1cf Merge pull request madgraph5#110 from valassi/valassi_leak109 4f8fbb7f3 (ghav/valassi_uninitialised111) Workaround for issue madgraph5#111 reported by valgrind (initialise goodjet array in function setclscales in reweight.f) f6d90fa58 (ghav/valassi_leak109, valassi_leak109) Fix memory leak madgraph5#109 in madevent_driver.f (close file dname.mg) f9f957918 (valgrind) Fix validity time check for UFO pickle (madgraph5#97) 619f5db45 avoid that some parameter switch type when loading model git log --oneline e428e38c6..74fd166c 74fd166c1 (HEAD, origin/gpucpp_826, gpucpp_826) Merge remote-tracking branch 'origin/gpucpp' (PR madgraph5#113 for madgraph5#855 crash in rotxxx) into gpucpp_826 9abf6a3ad Merge pull request madgraph5#113 from valassi/valassi_volatile f274cab55 (ghav/valassi_volatile, valassi_volatile) Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations e4d9df4ab Merge remote-tracking branch 'origin/gpucpp' (PRs madgraph5#110 and madgraph5#112 for issues madgraph5#109 and madgraph5#111) into gpucpp_826 0b8678984 Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 18696c1cf Merge pull request madgraph5#110 from valassi/valassi_leak109 4f8fbb7f3 (ghav/valassi_uninitialised111) Workaround for issue madgraph5#111 reported by valgrind (initialise goodjet array in function setclscales in reweight.f) f6d90fa58 (ghav/valassi_leak109, valassi_leak109) Fix memory leak madgraph5#109 in madevent_driver.f (close file dname.mg) 10378b3c0 allow for second exporter to have access to all variable used in the fortran exporter f9f957918 (valgrind) Fix validity time check for UFO pickle (madgraph5#97) 619f5db45 avoid that some parameter switch type when loading model 3) Note that color includes the following submodule updates, passing through 09c96dd17 to ba54a4153 git show --oneline upstream/master..color ../../MG5aMC/ 4b29496 [color] update MG5AMC to ba54a4153 in th egpuccp branch, with a minor fix in a comment for my icolamp patch Submodule MG5aMC/mg5amcnlo 99e064157..ba54a4153: > minor fix in a printout in my previous patch in export_cpp.py 1c2a02d [color] update MG5AMC to 99e064157, fixing bug madgraph5#856 (and related ones) about the icolamp array in coloramps.h Submodule MG5aMC/mg5amcnlo 09c96dd17..99e064157: > In export_cpp.py fix bug madgraph5#114 in get_icolamp_lines, resulting in different icolamp arrays for F77 and CPP (see madgraph5#873) 0a60262 [color] update MG5AMC to 09c96dd17: this is the latest gpucpp branch, now including Olivier's extra commit previously in gpucpp_826 Submodule MG5aMC/mg5amcnlo 10378b3c0...09c96dd17: > allow for second exporter to have access to all variable used in the fortran exporter > Merge pull request madgraph5#113 from valassi/valassi_volatile > Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 > Merge pull request madgraph5#110 from valassi/valassi_leak109 < allow for second exporter to have access to all variable used in the fortran exporter 16ff942 try to fix the segfault on issue 826 Submodule MG5aMC/mg5amcnlo f9f957918..10378b3c0: > allow for second exporter to have access to all variable used in the fortran exporter 4b12e79 [color] temporarely downgrade back MG5AMC to the common base of gpucpp and gpucpp_826, to allow cherry-picking Olivier's fix_826 changes > Submodule MG5aMC/mg5amcnlo f274cab55..f9f957918 (rewind): < Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations < Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 < Merge pull request madgraph5#110 from valassi/valassi_leak109 => Therefore I can simply merge origin/color into color2 and fix the MG5AMC conflict by setting it to ba54a4153 (valassi_icolamp114, before more recent changes)
Hi @oliviermattelaer I think this is ready for review, can you please have a look?
This fixes the SIGFPE crash in rotxxx (#855, also mentioned in #826 - and maybe this is also related to what you discuss as a crash in #852?). It uses an ugly workaround of just adding 'volatile' to disable optimizations (from SIMD?) deep inside the fortran code. Ugly, but it works, and has no performance penalties I think here. And this is also exactly the same technique I used to fix many SIGFPEs in cudacpp in the ixx/oxx functions last here.
It also slightly modifies my test scripts to enable testing a different config which is not 1. I do not understand why the SIGFPE crash only seems to happen with some specific iconfig (which may also have other issues), but it seems wuite clear that the issues are independent. Rephrasing: I would add volatile in any case, and then go on and investigate channel/iconfig mappings as in your #853.
Note, this IMO brings us one step of progress, but other issues remain open
As mentioned above, I would go ahead and add volatile, and then investigate the rest.
I will still run large scale tests tonight, but the code will almost certainly not change. So this is ready for review. Thanks.
Andrea