Commit

double check to make sure Gillian's comments and suggestions are in
weihuang-jedi committed Nov 5, 2024
1 parent 51f3faa commit 634a6f0
Showing 1 changed file with 109 additions and 105 deletions.
docs/source/noaa_csp.rst

########################################
Configuring NOAA Cloud Service Providers
########################################

The NOAA Cloud Service Providers (CSP) support the forecast-only,
coupled, and GEFS configurations for the global workflow.
Once a suitable CSP instance and cluster are defined/created,
the global-workflow may be executed similarly to the on-prem machines.
Currently the global-workflow supports the following
instance and storage types as a function of CSP and forecast
resolution.

*********************
Login to the NOAA CSP
*********************

Log in to the `NOAA CSP <http://noaa.parallel.works/login>`_ and into
the resources configuration. The user should arrive at the following
screen. Click the "blue" box pointed by the red arrow to login.
screen. Click the "blue" box indicated by the red arrow to login.

.. image:: _static/noaacsp_login_1.png

Fill in the ``Username / Email`` box with your username or NOAA email (usually in "FirstName.LastName" format).
Note that the ``Username or email`` field is case-sensitive.
Then enter the respective ``Pin + RSA`` combination using the same RSA token application used
for access to other RDHPCS machines (e.g., Hera, Gaea).

.. image:: _static/noaacsp_login_2.png

*******************************
Configure the NOAA CSP Instance
*******************************

Once logged into the NOAA CSP, navigate to the ``Marketplace`` section
in the left sidebar as indicated by the red arrow, and click it.
Scroll down and select "AWS EPIC Wei CentOS", circled in red.
Note that the current global-workflow still uses a CentOS-built spack-stack,
but it will be updated to Rocky 8 soon.

.. image:: _static/noaacsp_instance_1.png

Next, click "Forklatest" as shown in the red-circle.
Next, click "Fork latest" as shown in the red-circle.

.. image:: _static/noaacsp_instance_2.png

Please provide a unique name in the "New compute node" field for the instance
(see the box indicated by the red arrow).
Best practice suggests a name that is clear, concise, and relevant to the application.
Click ``Fork`` (in the red circle) to fork an instance.

.. image:: _static/noaacsp_instance_3.png

Now, an instance is forked, and it is time to configure the cluster.

1. Select a "Recource Account", usually it is: NOAA AWS Commercial Vault.
2. Select a "Group". Which will be something like: ca-epic, ca-sfs-emc, etc.
3. Copy and paste you Public key (.ssh/id_rsa.pub, .ssh/id_dsa.pu from your laptop).
4. Modify "User Bootstrap". If you are not use "ca-epic" group, please UNCOMMENT line 2.
5. Keep "health Check" as it is.
#. Select a "Resource Account"; usually it is *NOAA AWS Commercial Vault*.
#. Select a "Group", which will be something like: ca-epic, ca-sfs-emc, etc.
#. Copy and paste your public key (e.g., *.ssh/id_rsa.pub*, *.ssh/id_dsa.pu* from your laptop).
#. Modify "User Bootstrap". If you are not using the "ca-epic" group, please UNCOMMENT line 2.
#. Keep "Health Check" as it is.

Click "Save Changes" at top-right as shown in read circle.
Click "Save Changes" at top-right as shown in red circle.

.. image:: _static/noaacsp_instance_4.png

NOAA Parallel Works (PW) currently provides 3 CSPs:
**AWS** (Amazon Web Services), **Azure** (Microsoft Azure),
and **GCP** (Google Cloud Platform).
Existing clusters may also be modified.
However, it is best practice to fork an instance from the Marketplace that is similar to your requirements
(rather than modifying an existing cluster).

******************************
Add CSP Lustre Filesystem
******************************

To run the global-workflow on CSPs, we need to attach the ``/lustre`` filesystem as the run directory.
First, we need to add/define our ``/lustre`` filesystem.
To do so, navigate to the middle of the NOAA PW website's left side panel and select "Lustre"
(see the red arrow in :numref:`Figure %s <noaacsp_lustre_1>`), and then click "Add Storage"
at the top right, as shown in the red circle.

.. _noaacsp_lustre_1:
.. image:: _static/noaacsp_lustre_1.png

Select `FSx` for the AWS FSx ``/lustre`` filesystem, as shown in the red circle.
Define ``/lustre`` with:

#. A clear and meaningful `Resource name`, as shown by the first red arrow
#. A short sentence for `Description`, as shown by the second red arrow
#. **linux** for `Tag`, as shown by the third red arrow

Click "Add Storage" as in red-box at top-right corner.
This will create a "lustre" filesystem template.

.. image:: _static/noaacsp_lustre_2.png

After creating the template, we need to fill in the information for this Lustre filesystem.
To do so, go to the NOAA PW website, and click "Lustre" on the left side panel, as
indicated by red arrow 1. Then select the filesystem defined above by `Resource name`,
as shown in the red box. Here, the user can delete this resource if it is not needed by
clicking the trash can (indicated by red arrow 2).

.. image:: _static/noaacsp_lustre_3.png

By clicking the filesystem in the red box of the image above,
users will be led to the Lustre definition page.
Then follow the steps illustrated in the :ref:`image below <noaacsp_lustre_4>`:

#. Choose a size in the `Storage Capacity (GB)` box, as indicated by red arrow 1.
   AWS requires a minimum of 1200 GB. For the C48 ATM/GEFS case this will be enough.
   For the SFS C96 case, or the C768 ATM/S2S case, it should probably be increased to 12000 GB.
#. For `File System Deployment`, choose "SCRATCH_2" for now, as indicated by red arrow 2.
   Do not use SCRATCH_1, as it is used for testing by PW.
#. Choose **NONE** for `File System Compression`, as indicated by red arrow 3.
   Only choose LZ4 if you understand what it means.
#. Leave "S3 Import Path" and "S3 Export Path" blank for now.
#. Click **Save Changes** (in the red circle) to save the definition/changes.

.. _noaacsp_lustre_4:

.. image:: _static/noaacsp_lustre_4.png

For the storage to be allocated for the global-workflow application,
it is suggested that the ``Mount Point`` be ``/lustre``. Once the storage
has been configured, follow the steps below to attach the Lustre filesystem.

****************************
Attach CSP Lustre Filesystem
****************************

.. image:: _static/noaacsp_filesystem_2.png

Scroll down to the bottom, and click `Add Attached Filesystems`, as shown in the red circle.

.. image:: _static/noaacsp_filesystem_3.png

After clicking `Add Attached Filesystems`, an `Attached Filesystems settings` panel will appear.

1. In the `Storage` box, select the Lustre filesystem defined above, as shown by red arrow 1.
2. In the `Mount Point` box, name it `/lustre` (the common and default choice), as indicated by red arrow 2.
   If you choose a different name, make sure the global-workflow setup step
   uses the name chosen here.

If you have an S3 bucket, you can attach it as follows:

3. In the `Storage` box, select the bucket you want to use, as shown by red arrow 3.
4. In the `Mount Point` box, name it `/bucket` (the common and default choice), as indicated by red arrow 4.

.. image:: _static/noaacsp_filesystem_4.png
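
Once the cluster is up with these filesystems attached, a quick way to confirm the mounts from a terminal on the cluster is shown below (a sketch; include `/bucket` only if an S3 bucket was attached, and adjust the mount points if different names were chosen above):

.. code-block:: console

   # List the attached filesystems and their sizes.
   df -h /lustre /bucket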

Always remember to click `Save Changes` after making any changes to the cluster.

**************************
Using the NOAA CSP Cluster
**************************

To activate the cluster, click `Clusters` on the left panel of the NOAA PW website,
as indicated by the red arrow. Then click the `Sessions` button in the red square, and click the power
button in the red circle. The cluster status is denoted by the color-coded button
on the right: red means stopped; orange means requested; green means active. The amount of time required to start
the cluster varies and is not immediate; it may take several minutes (often 10-20) for the cluster to become active.

.. image:: _static/noaacsp_using_1.png

When the cluster is active, the user will see the following:

#. A green dot, pointed to by red arrow 1, means the cluster is active.
#. A green dot, pointed to by red arrow 2, means the cluster is active.
#. A green button, pointed to by red arrow 3, means the cluster is active.
#. Clicking the blue square with an arrow inside, pointed to by red arrow 4, will copy the cluster's IP address to the clipboard.
   You can then open an xterm/terminal window on your laptop and run `ssh username@the-ip-address` to connect
   to the AWS cluster and do your work there (see the example below this list).
#. The `username@the-ip-address` entry identifies your AWS PW cluster. Clicking it opens a PW web terminal at the
   bottom of the website, which you can use to work on your AWS cluster.
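
For example, from a laptop terminal (the username and IP address below are placeholders; use your own PW username and the IP copied in step 4):

.. code-block:: console

   # Connect to the active cluster; replace the placeholders with your
   # PW username and the copied IP address.
   ssh First.Last@1.2.3.4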

Please note that as soon as the cluster is activated, AWS/PW starts charging you for using the cluster.
Keep the cluster active while an experiment is running, as rocoto uses `crontab`, which needs the cluster to be active all the time.

.. image:: _static/noaacsp_using_2.png
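
To confirm that the experiment's cron entry is in place on the cluster (a quick check; the entry is installed by the `crontab` command shown later in this guide):

.. code-block:: console

   # List the current user's cron entries; the rocoto entry for your
   # experiment should appear here while the workflow is running.
   crontab -l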

After finishing your work on the AWS cluster, you should terminate/stop the cluster, unless you have reasons to keep it active.
To stop/terminate the cluster, go to the cluster session, and click the `green` power button. A window will pop up; click the
red `Turn Off` button to switch off the cluster.

.. image:: _static/noaacsp_using_3.png
Expand All @@ -236,64 +240,64 @@ Running the Global Workflow
Assuming you have an AWS cluster running, after logging in to the cluster through `ssh` from your laptop terminal,
or accessing the cluster from your web terminal, one can clone, compile, and run the global-workflow.

#. Clone the global-workflow (assuming you have set up access to GitHub):

   .. code-block:: console

      cd /contrib/$USER   # you should have a username, and a directory at /contrib where we save our permanent files
      git clone --recursive git@github.com:NOAA-EMC/global-workflow.git
      # or the development fork at EPIC:
      git clone --recursive git@github.com:NOAA-EPIC/global-workflow-cloud.git

#. Compile the global-workflow:

   .. code-block:: console

      cd /contrib/$USER/global-workflow
      cd sorc
      build_all.sh       # or a similar command to compile for GEFS, or others
      link_workflow.sh   # after build_all.sh finishes successfully

   As users may define a very small cluster as the controller, one may use a script similar to the following
   to compile on a compute node:

   .. code-block:: console

      #!/bin/bash
      #SBATCH --job-name=compile
      #SBATCH --account=$USER
      #SBATCH --qos=batch
      #SBATCH --partition=compute
      #SBATCH -t 04:15:00
      #SBATCH --nodes=1
      #SBATCH -o compile.%J.log
      #SBATCH --exclusive

      set -x

      gwhome=/contrib/Wei.Huang/src/global-workflow-cloud
      cd ${gwhome}/sorc
      source ${gwhome}/workflow/gw_setup.sh

      #build_all.sh

      build_all.sh -w

      link_workflow.sh

   Save the above lines in a file, say, com.slurm, and submit this job with the command "sbatch com.slurm".

#. Run the global-workflow C48 ATM test case (assuming the user has the /lustre filesystem attached;
   a monitoring example follows this list):

   .. code-block:: console

      cd /contrib/$USER/global-workflow
      HPC_ACCOUNT=${USER} pslot=c48atm RUNTESTS=/lustre/$USER/run \
      ./workflow/create_experiment.py \
      --yaml ci/cases/pr/C48_ATM.yaml

      cd /lustre/$USER/run/EXPDIR/c48atm
      crontab c48atm
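
Once the cron job begins submitting tasks, the experiment can be monitored with rocoto. A minimal sketch, assuming the workflow XML and database files in the experiment directory are named after the ``pslot`` chosen above (the exact file names may differ):

.. code-block:: console

   cd /lustre/$USER/run/EXPDIR/c48atm
   # rocotostat reports the state of each task in the workflow; the file
   # names below are assumptions based on pslot=c48atm.
   rocotostat -w c48atm.xml -d c48atm.db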

EPIC has copied the C48 and C96 ATM, GEFS, and some other data to AWS, and the current code is set up to use those data.
If users want to run their own case, they need to make changes to the IC path and other settings to make it work.
