Different output for v3.0 and v3.1 #166

ec147 · 2023-05-24T11:28:47Z

I ran two calculations with CT-HYB, one with version 3.0 and one with version 3.1. Both use exactly the same parameters and the same G0(w) as input. Yet the G(tau) output of version 3.1 is very noisy and highly non-physical (first picture), while the output of version 3.0 is satisfactory (second picture). The calculation is parallelized over 2048 CPUs.

Do you have any idea what causes this discrepancy between the two versions?

I have attached the C++ code I used. It is part of the DFT code Abinit, which provides the G0(w) and the U matrix as input for CTHYB.

the-hampel · 2023-05-25T08:33:43Z

Dear @ec147,

that is indeed quite odd. I had a brief look at your code and at first glance it all looks good. In principle, the only changes from TRIQS 3.0 to 3.1 that could really influence this are the stat changes in TRIQS itself (@Wentzell correct me if I am wrong). Within cthyb the changes are minimal.

We have several benchmark scripts: https://github.com/TRIQS/benchmarks and I think they have been tested with 3.1.x without problems. Moreover, your 3.1.x result looks really wrong, so something must be going wrong here.

Did I see correctly that you stored the G0_iw in a text file? Are those files identical for both runs? Can you also provide the standard output from the solver? I would like to check whether the solver worked with the same local Hamiltonian, detected the same number of subspaces, and reported similar acceptance rates.

Best,
Alex

ec147 · 2023-05-25T14:29:24Z

Thanks for your feedback. I found the issue and fixed it easily. In the latest version of the mpi dependency, the MPI environment is only activated when the variable has_env is set to True, which happens if one of the following environment variables is found: OMPI_COMM_WORLD_RANK, PMI_RANK or CRAY_MPICH_VERSION. However, I'm using a SLURM environment, which sets a different environment variable (SLURM_PROCID, I think), so the MPI environment was not detected.

the-hampel · 2023-05-26T08:12:54Z

Glad to hear that the issue is resolved for you. May I ask how you solved it? In principle we rely on this MPI detection feature to work. If there is any cluster environment where it does not work out of the box please let us know. We are happy to add additional environment variable checks.
Best,
Alex

ec147 · 2023-05-26T11:40:53Z

Sure. I simply replaced line 44 of the mpi.hpp header file with:

if (std::getenv("SLURM_PROCID") != nullptr or std::getenv("OMPI_COMM_WORLD_RANK") != nullptr or std::getenv("PMI_RANK") != nullptr or std::getenv("CRAY_MPICH_VERSION") != nullptr)

the-hampel · 2023-05-27T10:52:05Z

Interesting. I understand that SLURM_PROCID will work here, but it is a bit dangerous for us to add this in general, since SLURM_PROCID could also be set for non-MPI jobs launched with srun (correct me if I am wrong). It is just the process ID assigned by SLURM. Are you using MPICH, openmpi, or something similar?

@Wentzell do you understand why our MPI detection fails in this case?

ec147 · 2023-05-30T07:27:05Z

Yes, I just checked and it seems that the environment variable SLURM_PROCID is set even for sequential runs, so my way is not the proper way to fix the issue. I just wanted an easy workaround without thinking too much about it. This is not a problem for me, since I always parallelize my runs and therefore always want the MPI environment to be activated. I'm really not an expert on SLURM environments, so unfortunately I cannot help you much further.

I'm using openmpi.

Wentzell · 2023-05-30T22:33:04Z

I agree that SLURM_PROCID is the wrong solution here. Which version of openmpi are you using?
It looks like OMPI_COMM_WORLD_RANK is not set, while it should be?

ec147 · 2023-05-31T07:32:23Z

I'm using v4.1.4.4 of openmpi.
If my understanding is correct, the variable OMPI_COMM_WORLD_RANK is set when the command mpirun is launched. However, my environment uses an abstraction layer (Bridge) to SLURM, and the MPI run is launched by the command ccc_mprun, thus the variable OMPI_COMM_WORLD_RANK is not set. This is very specific to my company, so I do not think this is a major issue for you.
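
For anyone hitting the same problem under a different launcher, a small diagnostic along these lines can show which of the variables discussed in this thread are actually exported to the processes. This is only a sketch: the helper name launcher_vars_set is made up for illustration and is not part of TRIQS or mpi.hpp.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical diagnostic helper (not part of TRIQS or mpi.hpp): returns the
// names of the launcher-related environment variables that are currently set.
inline std::vector<std::string> launcher_vars_set() {
  std::vector<std::string> found;
  for (const char *name : {"OMPI_COMM_WORLD_RANK", "PMI_RANK",
                           "CRAY_MPICH_VERSION", "SLURM_PROCID"}) {
    // std::getenv returns nullptr when the variable is not in the environment.
    if (std::getenv(name) != nullptr) found.emplace_back(name);
  }
  return found;
}
```

Calling this from a small program started through the job launcher (ccc_mprun in my case) and printing the returned names should make it clear whether any of the three variables checked by the detection is present.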

the-hampel · 2023-05-31T07:46:20Z

Okay, I see. I wonder if we should add a cmake flag to enforce the MPI init, skipping the detection of an MPI environment (like the way it was before we introduced this check) to have a quick workaround in those cases?

Wentzell · 2023-05-31T15:28:19Z

@the-hampel Maybe we could just check if TRIQS_FORCE_MPI_INIT was set in the environment (in the same line where we have the other checks)?
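
A minimal sketch of what such a check could look like, assuming the override is tested alongside the three existing variables. The function name detect_mpi_env is hypothetical and this is not the actual code in mpi.hpp. The sketch treats the override as active whenever the variable is set, whatever its value.

```cpp
#include <cstdlib>
#include <initializer_list>

// Hypothetical helper, not the actual mpi.hpp code: returns true when one of
// the launcher variables named in this thread, or the proposed override
// TRIQS_FORCE_MPI_INIT, is present in the environment.
inline bool detect_mpi_env() {
  for (const char *name : {"OMPI_COMM_WORLD_RANK", "PMI_RANK",
                           "CRAY_MPICH_VERSION", "TRIQS_FORCE_MPI_INIT"}) {
    // Only the presence of the variable is checked, not its value.
    if (std::getenv(name) != nullptr) return true;
  }
  return false;
}
```

With this shape, SLURM_PROCID is deliberately left out, so a sequential job started through SLURM is still treated as serial unless the user opts in explicitly.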

the-hampel · 2023-06-02T19:38:10Z

I think I like that idea. Let me add this and try it out.

the-hampel · 2023-06-05T09:16:27Z

I opened two PRs to add the feature: one in triqs, TRIQS/triqs#883, to perform the check in the Python layer, and one in triqs/mpi itself, TRIQS/mpi#11. With these changes, the following works:

(triqs-dev) >python sumk_test.py
Warning: could not identify MPI environment!
Starting serial run at: 2023-06-05 05:06:20.907482

(triqs-dev) >export TRIQS_FORCE_MPI_INIT=1

(triqs-dev) >python sumk_test.py
Starting run with 1 MPI rank(s) at : 2023-06-05 05:06:27.285073

If this looks good please merge.

Wentzell · 2023-06-05T16:24:31Z

Thank you @the-hampel, these pull requests have both been merged.
This resolves the problem described here, so I am closing the issue.

ec147 added the bug label May 24, 2023

ec147 changed the title ~~Bug report~~ Different output for v3.0 and v3.1 May 24, 2023

ec147 closed this as completed May 25, 2023

the-hampel reopened this May 27, 2023

This was referenced Jun 5, 2023

[mpi] add envar check to force MPI init TRIQS/mpi#11

Merged

[mpi] add envar check to force MPI init TRIQS/triqs#883

Merged

Wentzell closed this as completed Jun 5, 2023
