Control temporary folder behavior? #814

Open
Dooruk opened this issue Aug 15, 2024 · 7 comments
Labels
question Further information is requested

Comments

@Dooruk

Dooruk commented Aug 15, 2024

I received an error while running coupled GEOSgcm on SLES15 and Milan nodes.

This is in the STDERR:

A call to mkdir was unable to create the desired directory:

  Directory: /tmp/ompi.borgk001.429860097
  Error:     No space left on device

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[borgk001:261267] [[24093,0],0] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 107
[borgk001:261267] [[24093,0],0] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 346
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS

I reached out to NCCS, and this was their response:

OpenMPI was trying to create a directory under /tmp and the filesystem was full for some reason. While I do see some older files there, it is far from full but I am going to clean up some older data there (the directory regularly gets scrubbed).
You will want to ensure you are not writing other data to /tmp, it is local to the system drive of the compute nodes and is not large.
There are other directories like $LOCAL_TMPDIR and $TSE_TMPDIR that can be used for temporary space rather than /tmp.

In SWELL, I execute bash scripts as a Python subprocess. Does that force the use of /tmp, or is this a GEOSgcm-level issue?
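
On the subprocess question: Python's subprocess module does not choose a temporary directory itself. The child process inherits the parent's environment unless an explicit env mapping is passed. Below is a minimal sketch of overriding TMPDIR for one child process only, assuming the bash script and whatever it launches honor TMPDIR. The helper name run_with_tmpdir is hypothetical and is not part of Swell.

```python
import os
import subprocess
import tempfile


def run_with_tmpdir(command, tmpdir):
    """Run `command` with TMPDIR set to `tmpdir` and return its stdout.

    Hypothetical helper for illustration; not taken from Swell.
    """
    os.makedirs(tmpdir, exist_ok=True)
    env = dict(os.environ)   # start from the environment the child would inherit anyway
    env["TMPDIR"] = tmpdir   # override only the temporary directory
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    # The child shell reports the TMPDIR it was given.
    print(run_with_tmpdir(["sh", "-c", 'echo "$TMPDIR"'], scratch))
```

Whether a value set this way survives through sbatch or Cylc to the compute node depends on how those tools propagate the environment, which this sketch does not address.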

Dooruk added the question label Aug 15, 2024
@mathomp4
Member

@Dooruk Yeah. The simplest thing might be to do something like:

mkdir -p "$NOBACKUP/tmpdir"
export TMPDIR="$NOBACKUP/tmpdir"

say, and put that in your .bashrc.

I think Open MPI respects $TMPDIR and you shouldn't run out in nobackup...usually.
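
Before relying on a new TMPDIR, it can help to confirm that the directory exists and has room. The following is a minimal sketch using only the Python standard library. The function name pick_tmpdir, the candidate order, and the 1 GiB threshold are assumptions for illustration; LOCAL_TMPDIR and TSE_TMPDIR are the variables NCCS mentioned above.

```python
import os
import shutil


def pick_tmpdir(candidates=("TMPDIR", "TSE_TMPDIR", "LOCAL_TMPDIR"),
                min_free_bytes=1 << 30,
                environ=None):
    """Return the first candidate directory with enough free space, else None.

    `candidates` are environment variable names, checked in order.
    Hypothetical helper; the order and threshold are illustrative only.
    """
    environ = os.environ if environ is None else environ
    for name in candidates:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            continue  # variable unset, or it points at a missing directory
        if shutil.disk_usage(path).free >= min_free_bytes:
            return path
    return None


if __name__ == "__main__":
    print(pick_tmpdir())
```

Returning None instead of silently falling back to /tmp leaves the decision to the caller, which seems safer given that a full /tmp is what triggered the error in the first place.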

@Dooruk
Author

Dooruk commented Aug 15, 2024

Thanks @mathomp4 , are these settings in gcm_run.j relevant to this?

setenv OMPI_MCA_sharedfp "^lockedfile,individual"
setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0

@mathomp4
Member

mathomp4 commented Aug 15, 2024

> Thanks @mathomp4 , are these settings in gcm_run.j relevant to this?
>
> setenv OMPI_MCA_sharedfp "^lockedfile,individual"
> setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0

No. The first one I think helps performance or something. The other one suppresses a warning if you are running on an NFS mount.

Oddly, on a mac you have to use /tmp as your TMPDIR because otherwise you hit some other weird bug.

Just pointing TMPDIR somewhere else should help. Some actually see a TMPDIR-is-/tmp issue with gcm_setup as well. Sometimes the mktemp we use in there goes nuts because /tmp has some issue.

@Dooruk
Author

Dooruk commented Aug 15, 2024

Hmm, I'm running GEOS with the .bashrc change, but $NOBACKUP/tmpdir is empty currently.

@mathomp4
Member

> Hmm, I'm running GEOS with the .bashrc change, but $NOBACKUP/tmpdir is empty currently.

Well, if Open MPI does its thing correctly, it should clean up after itself.

@mathomp4
Member

mathomp4 commented Sep 6, 2024

It is possible this might be helped by GEOS-ESM/GEOSgcm_App#644 by @weiyuan-jiang .

He was having other issues with /tmp and/or TMPDIR. Not sure.

@Dooruk
Author

Dooruk commented Sep 11, 2024

> It is possible this might be helped by GEOS-ESM/GEOSgcm_App#644 by @weiyuan-jiang .
>
> He was having other issues with /tmp and/or TMPDIR. Not sure.

Hmm, that is going to be a future version, right? I can see if that helps. In the meantime, we could make changes to the sbatch commands with Swell/Cylc.

This is somewhat related but I noticed another /tmp location reference in Swell:

https://github.com/GEOS-ESM/swell/blob/b021b93ddd18cc4ac4a5af3b9e4fb514b681ec95/src/swell/utilities/scripts/task_question_dicts_defaults.py#L205

I noticed the "temporary" YAML files are still there, e.g., /tmp/geos_ocean_task_questions_9dQUGdyw.yaml

I wonder if there is an actual /tmp directory on Discover that gets scrubbed.
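
On the leftover YAML files: one way to keep such files from accumulating is to remove them as soon as the caller is done. The sketch below does not reproduce what the linked Swell line does (that code is not shown in this thread); it assumes the file is a short-lived scratch file. The name scratch_yaml and the default prefix are hypothetical. With dir left as None, Python's tempfile module consults TMPDIR, TEMP, and TMP before falling back to a platform default such as /tmp, so the TMPDIR change discussed earlier would also move these files.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def scratch_yaml(text, prefix="task_questions_", directory=None):
    """Write `text` to a temporary .yaml file, yield its path, delete it on exit.

    Hypothetical helper for illustration; not taken from Swell.
    """
    # mkstemp creates the file securely and returns an open descriptor plus the path.
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".yaml", dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        yield path
    finally:
        # Remove the file even if the caller raised; ignore it if already gone.
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)


if __name__ == "__main__":
    with scratch_yaml("answer: 42\n") as yaml_path:
        print("exists inside the block:", os.path.exists(yaml_path))
    print("exists after the block:", os.path.exists(yaml_path))
```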
