nightwatch and/or desiconda mismatch between NERSC and KPNO #335

Closed
sybenzvi opened this issue Feb 6, 2023 · 16 comments
Labels
bug Something isn't working

Comments

@sybenzvi
Contributor

sybenzvi commented Feb 6, 2023

On 20230205, Becky Canning pointed out that in exposure 166300, amp b4d is masked, resulting in missing B fluxes for fibers 2257-2499. The missing amp is visible in the KPNO Nightwatch QA pages for this exposure.

However, the same exposure at NERSC does not show b4d masked out. Check that the versions of Nightwatch and desiconda on desi-7 and cori/perlmutter are identical.
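
For readers less familiar with the amp label, here is a minimal sketch that decodes b4d and checks that the reported fibers sit on a single spectrograph. It assumes the label reads as camera band, spectrograph number, amplifier letter, and that fibers are numbered in blocks of 500 per spectrograph. Neither convention is spelled out in this thread, although both agree with the fiber range quoted above. The function names are hypothetical.

```python
# Assumption: fibers are numbered in contiguous blocks of 500 per spectrograph.
FIBERS_PER_SPECTROGRAPH = 500


def parse_amp(label):
    """Split an amp label such as 'b4d' into (camera, spectrograph, amp).

    Assumes a three-character label: band letter, digit, amplifier letter.
    """
    if len(label) != 3 or not label[1].isdigit():
        raise ValueError(f"unexpected amp label: {label!r}")
    return label[0].lower(), int(label[1]), label[2].upper()


def spectrograph_of(fiber):
    """Return the spectrograph index for a fiber under the 500-per-block assumption."""
    return fiber // FIBERS_PER_SPECTROGRAPH


camera, spectrograph, amp = parse_amp("b4d")
affected = range(2257, 2500)  # fibers 2257-2499 inclusive, from the report
same_spectrograph = all(spectrograph_of(f) == spectrograph for f in affected)
print(camera, spectrograph, amp, same_spectrograph)
```

Under these assumptions, every affected fiber maps to spectrograph 4, matching the digit in the amp label.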

sybenzvi added the bug label Feb 6, 2023
@sbailey
Collaborator

sbailey commented Feb 6, 2023

Another possibility is that the underlying CCD calibration files in $DESI_SPECTRO_CALIB and/or $DESI_SPECTRO_DARK weren't in sync. I think we've been pretty good about keeping CALIB in sync (in svn, easier), but I'm not sure about DARK (at NERSC, not in svn, needs a different sync procedure and I'm not sure anyone is doing that).

@sybenzvi
Contributor Author

sybenzvi commented Feb 6, 2023

Thanks for the quick check @sbailey. It does look like $DESI_SPECTRO_CALIB at KPNO is one commit behind. Here is the output of svn info on desi-7:

Working Copy Root Path: /software/datasystems/desi_spectro_calib
...
Revision: 1067
Node Kind: directory
Schedule: normal
Last Changed Author: EddieSchlafly
Last Changed Rev: 1067
Last Changed Date: 2023-01-10 13:10:40 -0700 (Tue, 10 Jan 2023)

vs at NERSC:

Working Copy Root Path: /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk
...
Revision: 1068
Node Kind: directory
Schedule: normal
Last Changed Author: rongpu
Last Changed Rev: 1068
Last Changed Date: 2023-01-12 13:23:50 -0800 (Thu, 12 Jan 2023)

Should Jose or I run svn up manually?
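
A check like this could be scripted rather than read by eye. The sketch below pulls the Revision field out of saved svn info output from each site and reports how far one working copy lags the other. The two text blocks are abridged from the output quoted above; the function names are hypothetical, and how the output is collected from each host is left out.

```python
import re


def svn_revision(svn_info_text):
    """Return the integer on the 'Revision:' line of `svn info` output."""
    match = re.search(r"^Revision:\s*(\d+)\s*$", svn_info_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Revision:' line found in svn info output")
    return int(match.group(1))


def revisions_behind(local_text, reference_text):
    """Number of revisions by which the local working copy lags the reference."""
    return svn_revision(reference_text) - svn_revision(local_text)


# Abridged from the output quoted in this comment.
KPNO_INFO = """\
Working Copy Root Path: /software/datasystems/desi_spectro_calib
Revision: 1067
Last Changed Rev: 1067
"""

NERSC_INFO = """\
Working Copy Root Path: /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk
Revision: 1068
Last Changed Rev: 1068
"""

print(revisions_behind(KPNO_INFO, NERSC_INFO))
```

The regular expression is anchored to the start of the line so that "Last Changed Rev:" is not picked up by mistake.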

@sbailey
Collaborator

sbailey commented Feb 6, 2023

Yes, please go ahead and update KPNO. We never purposefully have the two out of sync.

@sybenzvi
Contributor Author

sybenzvi commented Feb 6, 2023

I updated DESI_SPECTRO_CALIB on desi-7 and deleted+reprocessed night/expid 20230205/166300. The b4d mask did not go away, but at least the working copy of the calibrations is synced up with the repository.

Next guess: the desiconda projects are out of date. The DESICONDA_VERSION on desi-7 is 20200924. While it looks like desispec has been updated since that release, the version of desispec on desi-7 is ten months out of date (commit c8535d388bce194e020b244654b54f7256214ec8, 2022-Apr-13). Most of the other projects don't match the versions used on NERSC -- for example, fiberassign is v4.0.0 at KPNO vs. git main at NERSC.

Can/should we attempt an update of a few projects? Or all of desiconda?
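
One way to make the package comparison systematic is to dump versions at each site and diff them. This is a rough sketch only: installed_versions uses the standard-library importlib.metadata, which may report nothing for packages that are put on PYTHONPATH straight from a git checkout, so the per-site dictionaries may need to be filled in by hand. The function names and the examplepkg entry are hypothetical; the fiberassign values are the ones quoted above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if no metadata is found."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(site_a, site_b):
    """Return {package: (version_a, version_b)} for every package whose versions differ."""
    names = sorted(set(site_a) | set(site_b))
    return {
        name: (site_a.get(name), site_b.get(name))
        for name in names
        if site_a.get(name) != site_b.get(name)
    }


# fiberassign values are from this thread; examplepkg is a made-up matching entry.
kpno = {"fiberassign": "4.0.0", "examplepkg": "1.0"}
nersc = {"fiberassign": "main", "examplepkg": "1.0"}

for name, (a, b) in version_mismatches(kpno, nersc).items():
    print(f"{name}: KPNO={a} NERSC={b}")
```

Packages with identical versions are left out of the result, so the output lists only what needs attention.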

@sybenzvi
Contributor Author

sybenzvi commented Feb 6, 2023

Check of $DESI_SPECTRO_DARK: this variable is not defined at KPNO. Unsetting the variable at NERSC does not cause b4d to be masked out in expid 166300. A mismatch in desiconda packages seems more likely. Will try to set up against an older version of desiconda at NERSC to see if the masking error is reproduced.
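
Checks of this kind are easy to repeat on both hosts with a small report of the relevant environment variables. The sketch below is illustrative only: the variable list is limited to names mentioned in this thread so far, and the function names are hypothetical.

```python
import os

# Variables mentioned in this thread up to this point.
VARIABLES = ("DESI_SPECTRO_CALIB", "DESI_SPECTRO_DARK", "DESICONDA_VERSION")


def environment_report(environ=None, names=VARIABLES):
    """Return {name: value or None} for each variable of interest."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


def format_report(report):
    """Render the report one variable per line, flagging unset variables."""
    lines = []
    for name, value in report.items():
        lines.append(f"{name}={value}" if value is not None else f"{name} is not set")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(environment_report()))
```

Running the same script on desi-7 and at NERSC and diffing the two outputs would show at a glance which variables are missing or point somewhere different.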

@sybenzvi
Contributor Author

sybenzvi commented Feb 6, 2023

Downgrading desiconda to 22.2 at NERSC does not reproduce the error, but it does create processing problems for multiple exposures.

@sybenzvi
Contributor Author

sybenzvi commented Mar 7, 2023

Getting closer to solving this problem, with the 2.1.0-dev version of desiconda installed at KPNO with minor changes (see desihub/desiconda#59). I'm able to run nightwatch against the new installation but we will need $DESI_SPECTRO_DARK available on desi-7 as well.

@sybenzvi
Contributor Author

sybenzvi commented Mar 8, 2023

A copy of $DESI_SPECTRO_DARK is now set up on desi-7 and I am able to run nightwatch almost to completion on exposure 166300 on 20230205. Unfortunately, as processing finishes up the following error is produced over and over:

OMP: Error #13: Assertion failure at kmp_affinity.cpp(4313).

OMP: Hint Please submit a bug report with this message, compile and run commands used,
and machine configuration info including native compiler and operating system versions.
Faster response will be obtained by including all program sources. For information on
submitting this issue, please see http://www.intel.com/software/products/support/.

I'm not certain where this is arising; it could be in the Python multiprocessing module.

@sbailey
Collaborator

sbailey commented Mar 8, 2023

@marcelo-alvarez @tskisner @craigwarner-ufastro do you recognize this? Nightwatch at KPNO uses multiprocessing but not MPI, and not GPU, but it does touch some code with numba JIT kernels and uses numpy with OpenMP parallelization under-the-hood. At NERSC we set KMP_AFFINITY=disabled, but I don't think we ever needed to mess with that at KPNO.

@marcelo-alvarez

marcelo-alvarez commented Mar 8, 2023

@sbailey, I have not seen this. At NERSC, KMP_AFFINITY=disabled is set when the desispec, fastspecfit, and redrock modules are loaded. Each of these modules sets it, which is redundant when they are loaded together by source /global/common/software/desi/desi_environment.sh. Because that script loads the desispec and redrock modules automatically, the variable is also set at run time for nightwatch at NERSC.

I am not familiar enough with how the desiconda environment is set up at KPNO to know if the desispec and redrock modules are used, or if modules are used at all. @sybenzvi, was the environment variable KMP_AFFINITY set to disabled at runtime when you obtained the error you reported above at KPNO? If not, you could try that and see if it fixes it.

@sybenzvi
Contributor Author

sybenzvi commented Mar 8, 2023

@marcelo-alvarez, I had not defined KMP_AFFINITY so I just tried running

$> KMP_AFFINITY=disabled nightwatch run -o /exposures/nightwatch -n 20230205 -e 166300

and I get the same error as before.

In case it helps, I'm attaching the installation log for desiconda, which I installed using the README instructions (but with the hpsspy and mpi4py installations disabled).

install.log.gz

@marcelo-alvarez

@sybenzvi I don't see anything from the installation log that would explain the OMP: Error #13: Assertion failure at kmp_affinity.cpp(4313) message. If you provide the commands to reproduce the environment at KPNO (here or offline) in which it fails, then I can try to debug it.

@sybenzvi
Contributor Author

sybenzvi commented May 19, 2023

Potential update on this old ticket: today @jose-bermejo and I are testing the installation of desiconda with Rob Knop, and we ran into the same OMP assertion error. Googling around, we found a workaround based on setting the following environment variable:

KMP_INIT_AT_FORK=false

Will try this at desi-7 and report back. The error seems to be related to the version of the Intel compiler and may be fixed in newer versions.

@sybenzvi
Contributor Author

Confirming that KMP_INIT_AT_FORK=false does eliminate the OMP assertion error on desi-7.
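
For anyone launching the pipeline from a wrapper script, the workaround can be applied to the child process environment before it starts. The sketch below is a minimal illustration, not how Nightwatch itself is configured: the helper names are hypothetical, and the demo runs a stand-in Python one-liner rather than the real nightwatch command.

```python
import os
import subprocess
import sys


def openmp_safe_env(base_env=None):
    """Copy an environment and add KMP_INIT_AT_FORK=false unless already set."""
    env = dict(os.environ if base_env is None else base_env)
    # setdefault leaves any value the caller has already chosen untouched.
    env.setdefault("KMP_INIT_AT_FORK", "false")
    return env


def run_with_openmp_workaround(command, base_env=None):
    """Run `command` (a list of arguments) with the workaround in its environment."""
    return subprocess.run(command, env=openmp_safe_env(base_env), check=False)


if __name__ == "__main__":
    # Stand-in child process that just echoes the variable it received.
    child = [sys.executable, "-c", "import os; print(os.environ['KMP_INIT_AT_FORK'])"]
    run_with_openmp_workaround(child)
```

Setting the variable in the child's environment, rather than inside the running Python process, ensures it is in place before any OpenMP runtime is loaded.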

@marcelo-alvarez

@sybenzvi great Googling. In retrospect we should have anticipated this, since it's also set at NERSC via desimodules, i.e.

# may solve some OpenMP instabilities at NERSC
setenv KMP_INIT_AT_FORK FALSE

If you now have a desiconda that is working in practice at KPNO, it might make sense to close this issue and return to desihub/desiconda#60. What do you think?

@sybenzvi
Contributor Author

@marcelo-alvarez, I agree, let's close this issue in Nightwatch. I was clearly skipping a step in the setup at KPNO so all that's really needed is to test the install again using desimodules to configure the environment.

3 participants