Gsatellite demo 02
This is a sort of transcript of a Shelr video/cast showing the automatic restart of failed jobs by gsatellite. The video/cast is available here. Please read Gsatellite explained for reference.
This demo makes use of five hosts in total. There is one NFS server available:
nfs-server.asc
There are two NFS client systems available:
nfs-client1.asc
nfs-client2.asc
There are two GridFTP servers available:
gridftp.omicron.jupiter
gridftp.omicron.neptune
We start by logging into one of the two NFS client systems. As gsatellite can do inter-node IPC, it makes no difference which system a user connects to. For this demo we assume the gsatellite launch control (gsatlc) is running on nfs-client1.asc and the user logs into nfs-client2.asc.
$ ssh johndoe@nfs-client2
For this demo I use three basic jobs:
- The already known succeed.job (see demo01)
- A gtransfer job, gt_job.job:
johndoe@nfs-client2:~$ cat tmp/gsat_demo/gt_job.job
#!/bin/bash
#GSAT -T gtransfer
gt --guc-max-retries 1 --gt-max-retries 0 -s gsiftp://gridftp.omicron.jupiter:2811/mnt/scratch/johndoe/64x32MB/* -d gsiftp://gridftp.omicron.neptune:2811/mnt/scratch/johndoe/64x32MB/
- A gtransfer job using a nonexistent option (-t), gt_job__wrong_usage.job:
johndoe@nfs-client2:~$ cat tmp/gsat_demo/gt_job__wrong_usage.job
#!/bin/bash
#GSAT -T gtransfer
gt -s gsiftp://gridftp.omicron.jupiter:2811/mnt/scratch/johndoe/32x32MB/* -t gsiftp://gridftp.omicron.neptune:2811/mnt/scratch/johndoe/32x32MB/
Let's start with all old gsatellite jobs removed:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
Then submit the following jobs:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00089
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00090
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00091
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00092
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub gt_job.job
00093
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00094
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00095
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00096
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job
00097
johndoe@nfs-client2:~$ gqsub tmp/gsat_demo/gt_job__wrong_usage.job
00098
Let's check which jobs have terminated already:
johndoe@nfs-client2:~$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00089 nfs-client1.asc succeed.job
finished 00090 nfs-client1.asc succeed.job
running 00091 nfs-client1.asc succeed.job
queued 00092 succeed.job
queued 00093 gt_job.job
queued 00094 succeed.job
queued 00095 succeed.job
queued 00096 succeed.job
queued 00097 succeed.job
queued 00098 gt_job__wrong_usage.job
Then wait until the gtransfer job is started:
johndoe@nfs-client2:~$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00089 nfs-client1.asc succeed.job
finished 00090 nfs-client1.asc succeed.job
finished 00091 nfs-client1.asc succeed.job
finished 00092 nfs-client1.asc succeed.job
running 00093 nfs-client1.asc gt_job.job
queued 00094 succeed.job
queued 00095 succeed.job
queued 00096 succeed.job
queued 00097 succeed.job
queued 00098 gt_job__wrong_usage.job
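Instead of re-running gqstat by hand, the waiting step can be scripted. The following is only a minimal sketch: it assumes the gqstat column layout shown above (state in the first column, job id in the second). The helper names job_state and wait_for_state, the polling interval and the attempt limit are made up for illustration and are not part of gsatellite.

```shell
#!/bin/bash
# Print the state of a job as reported by gqstat.
# Assumes column 1 is job.state and column 2 is job.id, as in the listings above.
job_state() {
    local id="$1"
    gqstat | awk -v id="$id" '$2 == id { print $1 }'
}

# Poll until the job reaches the wanted state or the attempts run out.
# The interval (seconds) and the number of attempts are illustrative defaults.
wait_for_state() {
    local id="$1" wanted="$2" interval="${3:-5}" tries="${4:-60}"
    local i=0
    while [ "$i" -lt "$tries" ]; do
        [ "$(job_state "$id")" = "$wanted" ] && return 0
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_for_state 00093 running` returns successfully as soon as job 00093 is reported as running, and returns a non-zero status if that does not happen within the given number of attempts.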
To simulate a failed transfer, the easiest way is to kill the globus-gridftp-server (ggs) processes on the GridFTP servers (either source or destination) twice. I'll do this in a separate console session not shown in this video.
Back on nfs-client2.asc, gqstat gives the following output after the transfer was interrupted:
johndoe@nfs-client2:~$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00089 nfs-client1.asc succeed.job
finished 00090 nfs-client1.asc succeed.job
finished 00091 nfs-client1.asc succeed.job
finished 00092 nfs-client1.asc succeed.job
queued 00093 nfs-client1.asc gt_job.job
running 00094 nfs-client1.asc succeed.job
queued 00095 succeed.job
queued 00096 succeed.job
queued 00097 succeed.job
queued 00098 gt_job__wrong_usage.job
What has happened? gsatlc has detected, by evaluating the exit value of the gtransfer job, that the job failed temporarily (i.e. it can be restarted). The procedure for restarting a job after a temporary error is to block the failed job from execution (i.e. put a hold on the job), start the next job, and then unblock the failed job (i.e. release the hold on the job). Hence the gtransfer job is currently in state queued and the following job is in state running.
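The effect of this hold/release procedure on the execution order can be illustrated with a small model. This is not gsatlc code, only a sketch of the ordering described above. The function name is hypothetical, and the behaviour for an otherwise empty queue (the failed job simply runs next) is an assumption.

```shell
#!/bin/bash
# Usage: requeue_after_temporary_failure FAILED_ID [QUEUED_ID...]
# Prints the resulting execution order, one job id per line.
requeue_after_temporary_failure() {
    local failed="$1"
    shift
    if [ "$#" -eq 0 ]; then
        # Assumption: with nothing else queued, the failed job is simply next.
        echo "$failed"
        return 0
    fi
    local next="$1"
    shift
    echo "$next"      # failed job is on hold, so the next queued job starts first
    echo "$failed"    # hold released: the failed job is eligible again
    local id
    for id in "$@"; do
        echo "$id"    # remaining queued jobs keep their order
    done
}
```

Called as `requeue_after_temporary_failure 00093 00094 00095 00096`, the model prints 00094 first and 00093 second, which corresponds to the listing above where 00094 is running while 00093 is back in state queued.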
After the following job has terminated, the gtransfer job is restarted:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00064 nfs-client1.asc succeed.job
finished 00065 nfs-client1.asc succeed.job
running 00066 nfs-client1.asc gt_job.job
finished 00067 nfs-client1.asc succeed.job
queued 00068 succeed.job
queued 00069 succeed.job
queued 00070 succeed.job
Currently gsatellite restarts restartable jobs at most three times. So let's tamper with the transfer three more times. Kill the ggs processes and watch:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00064 nfs-client1.asc succeed.job
finished 00065 nfs-client1.asc succeed.job
queued 00066 nfs-client1.asc gt_job.job
finished 00067 nfs-client1.asc succeed.job
running 00068 succeed.job
queued 00069 succeed.job
queued 00070 succeed.job
Wait for the gtransfer job to be restarted:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00064 nfs-client1.asc succeed.job
finished 00065 nfs-client1.asc succeed.job
running 00066 nfs-client1.asc gt_job.job
finished 00067 nfs-client1.asc succeed.job
finished 00068 succeed.job
queued 00069 succeed.job
queued 00070 succeed.job
Kill the ggs processes another time, resulting in:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00064 nfs-client1.asc succeed.job
finished 00065 nfs-client1.asc succeed.job
queued 00066 nfs-client1.asc gt_job.job
finished 00067 nfs-client1.asc succeed.job
finished 00068 succeed.job
running 00069 succeed.job
queued 00070 succeed.job
Again wait for the gtransfer job to be restarted:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00064 nfs-client1.asc succeed.job
finished 00065 nfs-client1.asc succeed.job
running 00066 nfs-client1.asc gt_job.job
finished 00067 nfs-client1.asc succeed.job
finished 00068 nfs-client1.asc succeed.job
finished 00069 nfs-client1.asc succeed.job
queued 00070 succeed.job
Kill the ggs processes one last time. As gsatellite has already restarted the gtransfer job three times, it is now in state failed:
johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
job.state job.id job.execHost job.name
------------ ------------ ------------ ------------
finished 00064 nfs-client1.asc succeed.job
finished 00065 nfs-client1.asc succeed.job
failed 00066 nfs-client1.asc gt_job.job
finished 00067 nfs-client1.asc succeed.job
finished 00068 nfs-client1.asc succeed.job
finished 00069 nfs-client1.asc succeed.job
running 00070 succeed.job
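The restart limit seen in this walkthrough can be sketched as a small loop. This is an illustration under stated assumptions, not gsatlc's implementation. The exit value 75 for a temporary failure is borrowed from the sysexits.h convention (EX_TEMPFAIL) purely for this sketch, as the exit values that gsatlc actually evaluates are not given here. The function and variable names are hypothetical as well.

```shell
#!/bin/bash
# Hypothetical convention for this sketch only: exit value 75 marks a
# temporary (restartable) failure, any other non-zero value a permanent one.
TEMPFAIL=75
MAX_RESTARTS=3   # the restart limit stated in the text

# Run a command and restart it on temporary failure, at most MAX_RESTARTS times.
# Prints the final job state: "finished" or "failed".
run_with_restarts() {
    local restarts=0 rc
    while :; do
        "$@"
        rc=$?
        if [ "$rc" -eq 0 ]; then
            echo finished
            return 0
        fi
        # Permanent error, or the restart budget is used up: give up.
        if [ "$rc" -ne "$TEMPFAIL" ] || [ "$restarts" -ge "$MAX_RESTARTS" ]; then
            echo failed
            return 1
        fi
        restarts=$((restarts + 1))
    done
}
```

With this model a command that always fails temporarily is executed four times in total (the initial run plus three restarts) before it ends up as failed, while a command that exits with any other non-zero value fails after its first run.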
A gtransfer job is automatically restarted by gsatellite only if the job is considered restartable (i.e. a temporary error is assumed), but not if the job exits because of, e.g., wrong usage. Because of this, the second gtransfer job just fails and is not restarted automatically by gsatellite:
gqstat