Updates to v2.0.0 release container for Gaea #167

Open
5 tasks done
EdwardSnyder-NOAA opened this issue Oct 30, 2024 · 3 comments

EdwardSnyder-NOAA commented Oct 30, 2024

The Land DA v2.0.0 release container can run on Gaea after several modifications. The steps needed are:

  1. module use /ncrc/proj/epic/rocoto/modulefiles/
  2. module load rocoto
  3. For the setup_container.sh script, use: -c=intel-classic/2023.2.0 -m=cray-mpich/8.1.28
  4. sed -i 's|which mpiexec| which srun|g' land-DA_workflow/scripts/exlandda_*
  5. sed -i 's|${RUN_CMD} -n ${NPROCS_FORECAST}|${RUN_CMD} -n ${NPROCS_FORECAST} --mpi=pmi2 |g' land-DA_workflow/scripts/exlandda_forecast.sh
  6. sed -i 's|${RUN_CMD} -n ${NPROCS_ANALYSIS}|${RUN_CMD} -n ${NPROCS_ANALYSIS} --mpi=pmi2|g' land-DA_workflow/scripts/exlandda_analysis.sh
  7. sed -i '30 i module reset' land-DA_workflow/parm/task_load_modules_run_jjob.sh
  8. sed -i 's|which singularity|"/usr/bin/singularity"|g' land-DA_workflow/parm/run_container_executable.sh
  9. sed -i 's|<queue>batch</queue>|<native> --clusters=c5 --partition=batch --export=NONE</native>|g' land_analysis.xml
  10. Update the bind options in run_container_executable.sh to: -B $BINDDIR:/contrib -B $CONTAINERBASE:/contrib
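
Steps 5 and 6 apply the same kind of substitution to two different scripts. A minimal sketch of how that edit could be made safe to rerun is shown here. It assumes GNU sed (for `-i`); the helper name `add_mpi_flag` and the `./model.exe` executable in the demo line are hypothetical and do not come from land-DA_workflow.

```shell
#!/usr/bin/env bash
# Sketch: append --mpi=pmi2 after the launch command, idempotently.
# add_mpi_flag is a hypothetical helper, not part of land-DA_workflow.
add_mpi_flag() {
  local file="$1" nprocs_var="$2"
  # Skip files that already carry the flag so reruns do not duplicate it.
  if grep -q -- '--mpi=pmi2' "$file"; then
    return 0
  fi
  # "&" in the replacement reuses the matched text unchanged.
  sed -i "s|\${RUN_CMD} -n \${${nprocs_var}}|& --mpi=pmi2|g" "$file"
}

# Demo on a throwaway file; ./model.exe is a placeholder executable name.
demo=$(mktemp)
printf '%s\n' '${RUN_CMD} -n ${NPROCS_FORECAST} ./model.exe' > "$demo"
add_mpi_flag "$demo" NPROCS_FORECAST
add_mpi_flag "$demo" NPROCS_FORECAST   # second call changes nothing
result=$(cat "$demo")
echo "$result"
rm -f "$demo"
```

Checking for the flag before editing matters because a plain rerun of the sed command from step 5 would append `--mpi=pmi2` a second time.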

These steps are being added to the release notes. Action steps breakdown:

  • Steps 1-3 only need to be documented.
  • Steps 4-8 can be added to the setup_container.sh script.
  • Step 9 requires upgrading the uw tools to 2.4.2 to fix the native option bug. On Gaea, the Slurm configuration requires each task to specify the partition it runs on, and to do that, the native and core options need to be defined together in the workflow. The develop and release branches currently use uw tools 2.2.2, which contains the bug that prevents these options from being defined together. The suggestion is to upgrade to 2.4.2 whenever we can. In the meantime, users need to run that sed command before running the rocotorun command for the experiment.
  • Step 10 requires rebuilding the container with the /gpfs directory inside it.
  • After all these steps are completed, the container needs to be rebuilt, tested, and placed on all T1 platforms.
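
The interim workaround for step 9 could be wrapped so that it only touches the workflow XML when the original `<queue>` tag is still present. This is a sketch under assumptions: GNU sed is assumed, `patch_queue_for_gaea` is a hypothetical helper name, and the small `<task name="demo">` file stands in for the real land_analysis.xml.

```shell
#!/usr/bin/env bash
# Sketch of the interim step 9 workaround. patch_queue_for_gaea is a
# hypothetical helper name; GNU sed is assumed for the -i option.
patch_queue_for_gaea() {
  local xml="$1"
  # Only rewrite when the original <queue> tag is still present.
  if grep -q '<queue>batch</queue>' "$xml"; then
    sed -i 's|<queue>batch</queue>|<native> --clusters=c5 --partition=batch --export=NONE</native>|g' "$xml"
  fi
}

# Demo on a throwaway file standing in for land_analysis.xml.
xml=$(mktemp)
printf '%s\n' '<task name="demo">' '  <queue>batch</queue>' '</task>' > "$xml"
patch_queue_for_gaea "$xml"
patched=$(grep -c '<native> --clusters=c5 --partition=batch --export=NONE</native>' "$xml")
leftover=$(grep -c '<queue>batch</queue>' "$xml" || true)
echo "native lines: $patched, queue lines left: $leftover"
rm -f "$xml"
```

In a real experiment directory the call would be `patch_queue_for_gaea land_analysis.xml`, run once before rocotorun.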

EdwardSnyder-NOAA commented Jan 8, 2025

Added the Gaea changes to the Dockerfile and setup script, rebuilt the container, and pushed it to Docker Hub. It was then converted into a Singularity container on AWS here: /contrib/Edward.Snyder/landda-container/aug-update/release/gaea-update/ubuntu22.04-intel-landda-release-public-v2.0.0.img. The container can now run on Gaea if steps 1-3 and 9 are executed. Experiment dir: /gpfs/f5/epic/scratch/Edward.Snyder/landda-cont/release-v2/gaea-changes/land-DA_workflow/parm

   CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
200001040000                prep_obs                   135321093           SUCCEEDED                   0         1          40.0
200001040000                pre_anal                   135321091           SUCCEEDED                   0         1          45.0
200001040000                analysis                   135321095           SUCCEEDED                   0         1          52.0
200001040000               post_anal                   135321096           SUCCEEDED                   0         1          42.0
200001040000                forecast                   135321097           SUCCEEDED                   0         1          76.0
200001040000              plot_stats                   135321103           SUCCEEDED                   0         1         104.0
================================================================================================================================
200001050000                prep_obs                   135321092           SUCCEEDED                   0         1           7.0
200001050000                pre_anal                   135321104           SUCCEEDED                   0         1          43.0
200001050000                analysis                   135321106           SUCCEEDED                   0         1          50.0
200001050000               post_anal                   135321107           SUCCEEDED                   0         1          43.0
200001050000                forecast                   135321108           SUCCEEDED                   0         1          88.0
200001050000              plot_stats                   135321109           SUCCEEDED                   0         1          51.0
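
A status table like the one above can also be checked mechanically. The sketch below counts task rows whose STATE column is not SUCCEEDED. It assumes the column order shown in this comment (CYCLE, TASK, JOBID, STATE, ...); `count_unfinished` is a hypothetical helper name, and the sample rows are copied from the table above.

```shell
#!/usr/bin/env bash
# Sketch: count task rows that are not in the SUCCEEDED state.
# Assumes STATE is the fourth whitespace-separated field, as in the table above.
count_unfinished() {
  # Rows whose first field is all digits are task rows; the header and
  # the "====" separator lines are skipped.
  awk '$1 ~ /^[0-9]+$/ && $4 != "SUCCEEDED" { n++ } END { print n + 0 }'
}

# Demo using three lines taken from the table above.
unfinished=$(count_unfinished <<'EOF'
   CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
200001040000                prep_obs                   135321093           SUCCEEDED                   0         1          40.0
200001040000                forecast                   135321097           SUCCEEDED                   0         1          76.0
EOF
)
echo "tasks not SUCCEEDED: $unfinished"
```

With live output, the same function could read from a pipe instead of a here-document.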

@EdwardSnyder-NOAA

The new container was tested and passed on:

  • Hera: /scratch1/NCEPDEV/stmp4/Edward.Snyder/land-da/land-DA_workflow/parm
  • Orion: /work/noaa/epic/esnyder/landda-container/aug-update/gaea-update/land-DA_workflow/parm
  • AWS: /contrib/Edward.Snyder/landda-container/aug-update/release/gaea-update/land-DA_workflow/parm

Moved the containers to all T1 platforms and the s3 bucket.

@EdwardSnyder-NOAA

Created a pull request that addresses Steps 1-3 and Step 9.
