Gsatellite demo 02

fscheiner edited this page Oct 20, 2016 · 10 revisions

This is a rough transcript of a Shelr video/cast showing how gsatellite automatically restarts failed jobs. The video/cast used to be hosted here, but the Shelr service is no longer available. Please read Gsatellite explained for reference.

GSATELLITE DEMO

This demo makes use of five hosts in total. There is one NFS server available:

  • nfs-server.asc

There are two NFS client systems available:

  • nfs-client1.asc
  • nfs-client2.asc

There are two GridFTP servers available:

  • gridftp.omicron.jupiter
  • gridftp.omicron.neptune

We start by logging into one of the two NFS client systems. Because gsatellite can do inter-node IPC, it makes no difference which system a user connects to. For this demo we assume the gsatellite launch control (gsatlc) is running on nfs-client1.asc and the user logs into nfs-client2.asc.

$ ssh johndoe@nfs-client2

For this demo I use three basic jobs:

  • The succeed.job already known from demo01
  • A gtransfer job, gt_job.job:
johndoe@nfs-client2:~$ cat tmp/gsat_demo/gt_job.job
#!/bin/bash
#GSAT -T gtransfer

gt --guc-max-retries 1 --gt-max-retries 0 -s gsiftp://gridftp.omicron.jupiter:2811/mnt/scratch/johndoe/64x32MB/* -d gsiftp://gridftp.omicron.neptune:2811/mnt/scratch/johndoe/64x32MB/
  • A gtransfer job that uses a nonexistent option (-t), gt_job__wrong_usage.job:
johndoe@nfs-client2:~$ cat tmp/gsat_demo/gt_job__wrong_usage.job 
#!/bin/bash
#GSAT -T gtransfer

gt -s gsiftp://gridftp.omicron.jupiter:2811/mnt/scratch/johndoe/32x32MB/* -t gsiftp://gridftp.omicron.neptune:2811/mnt/scratch/johndoe/32x32MB/

Let's start with all old gsatellite jobs removed:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------

Then submit the following jobs:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00089
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00090
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00091
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00092
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub gt_job.job
00093
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00094
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00095
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00096
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00097
johndoe@nfs-client2:~$ gqsub tmp/gsat_demo/gt_job__wrong_usage.job                   
00098
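
Repeated submissions like the ones above can also be scripted. The following is a minimal sketch, assuming only that gqsub takes a job file and prints the new job id, as the transcript shows; submit_n is a hypothetical helper name and not part of gsatellite.

```shell
#!/bin/bash
# Hypothetical helper, not part of gsatellite.

# Submit the job file $2 to gsatellite $1 times in a row.
# gqsub prints one job id per submission.
submit_n() {
    local n="$1" job="$2" i
    for ((i = 0; i < n; i++)); do
        gqsub "$job"
    done
}
```

For example, `submit_n 4 succeed.job` would replace the first four gqsub calls of this demo.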

Let's check which jobs have terminated already:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
running         00091           nfs-client1.asc succeed.job                          
queued          00092                           succeed.job                          
queued          00093                           gt_job.job                           
queued          00094                           succeed.job                          
queued          00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

Then wait until the gtransfer job is started:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
running         00093           nfs-client1.asc gt_job.job                           
queued          00094                           succeed.job                          
queued          00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job
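
Waiting for a state change does not have to be done by re-running gqstat by hand. The sketch below assumes that gqstat prints the job state in the first column and the job id in the second, as in the listings on this page; job_state and wait_for_state are hypothetical helper names and not part of gsatellite.

```shell
#!/bin/bash
# Hypothetical helpers, not part of gsatellite.

# Print the state of job $1, reading gqstat-style output from stdin.
# Ids are compared as strings so that leading zeros are significant.
job_state() {
    awk -v id="$1" '($2 "") == (id "") { print $1; exit }'
}

# Poll gqstat until job $1 reaches state $2.
# $3 is the poll interval in seconds (default: 5).
wait_for_state() {
    local id="$1" wanted="$2" interval="${3:-5}"
    until [ "$(gqstat | job_state "$id")" = "$wanted" ]; do
        sleep "$interval"
    done
}
```

In this demo, `wait_for_state 00093 running` would return once the gtransfer job has been started.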

The easiest way to simulate a failed transfer is to kill the globus-gridftp-server (ggs) processes on the GridFTP servers (either source or destination) twice. I do this in a separate console session that is not shown in this video.

Back on nfs-client2.asc, gqstat gives the following output after the transfer was interrupted:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
queued          00093           nfs-client1.asc gt_job.job                           
running         00094           nfs-client1.asc succeed.job                          
queued          00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

What has happened? By evaluating the exit value of the gtransfer job, gsatlc has detected that the job failed temporarily (i.e. it can be restarted). To restart a job after a temporary error, gsatlc blocks the failed job from execution (i.e. puts a hold on the job), starts the next job and then unblocks the failed job (i.e. releases the hold on the job). Hence the gtransfer job is currently in state queued and the following job is in state running.
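
The hold, start next, release sequence can be modelled in a few lines of shell. This is an illustrative sketch only and not gsatlc's code: the queue is reduced to a list of job ids, requeue_after_temp_failure is a hypothetical name, and it assumes the released job goes back to the front of the queue, which is what the gqstat listings on this page suggest.

```shell
#!/bin/bash
# Illustrative model only; this is not gsatlc's code.

# $1 = id of the temporarily failed job
# remaining arguments = ids of the jobs still queued, in order
# Prints the job that is started next and the resulting queue.
requeue_after_temp_failure() {
    local failed="$1"
    shift
    # 1. The failed job is on hold, so the next queued job is picked.
    local next="${1:-}"
    if [ "$#" -gt 0 ]; then
        shift
    fi
    # 2. The next job is started.
    echo "start: ${next:-none}"
    # 3. The hold is released: the failed job is eligible again and
    #    (assumption) sits in front of the jobs that are still waiting.
    local queue=("$failed" "$@")
    echo "queue: ${queue[*]}"
}
```

Calling `requeue_after_temp_failure 00093 00094 00095` prints `start: 00094` followed by `queue: 00093 00095`, which mirrors the listing above: 00094 is running while 00093 waits ahead of 00095.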

After the following job has terminated, the gtransfer job is restarted:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
running         00093           nfs-client1.asc gt_job.job                           
finished        00094           nfs-client1.asc succeed.job                          
queued          00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

Currently gsatellite restarts a restartable job at most three times. So let's tamper with the transfer three more times. Kill the ggs processes and watch:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
queued          00093           nfs-client1.asc gt_job.job                           
finished        00094           nfs-client1.asc succeed.job                          
running         00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

Wait for the gtransfer job to be restarted:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
running         00093           nfs-client1.asc gt_job.job                           
finished        00094           nfs-client1.asc succeed.job                          
finished        00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

Kill the ggs processes another time, which results in:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
queued          00093           nfs-client1.asc gt_job.job                           
finished        00094           nfs-client1.asc succeed.job                          
finished        00095           nfs-client1.asc succeed.job                          
running         00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

Again wait for the gtransfer job to be restarted:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
running         00093           nfs-client1.asc gt_job.job                           
finished        00094           nfs-client1.asc succeed.job                          
finished        00095           nfs-client1.asc succeed.job                          
finished        00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

Kill the ggs processes one last time. As gsatellite has already restarted the gtransfer job three times, the job is now in state failed:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
failed          00093           nfs-client1.asc gt_job.job                           
finished        00094           nfs-client1.asc succeed.job                          
finished        00095           nfs-client1.asc succeed.job                          
finished        00096           nfs-client1.asc succeed.job                          
running         00097           nfs-client1.asc succeed.job                          
queued          00098                           gt_job__wrong_usage.job
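
The restart limit seen in this run can be written down as a small decision function. This is a sketch under assumptions, not gsatellite's implementation: next_action is a hypothetical name, and the exit value 75 is only a placeholder (borrowed from EX_TEMPFAIL in sysexits.h) because this page does not state which exit value gsatlc treats as a temporary error.

```shell
#!/bin/bash
# Illustrative model only; this is not gsatlc's code.
RESTARTABLE_EXIT=75   # assumption: placeholder value, see EX_TEMPFAIL in sysexits.h
MAX_RESTARTS=3        # gsatellite restarts a job at most three times

# $1 = exit value of the job, $2 = number of restarts so far
# Prints "finished", "requeue" or "failed".
next_action() {
    local exit_value="$1" restarts="$2"
    if [ "$exit_value" -eq 0 ]; then
        echo finished
    elif [ "$exit_value" -eq "$RESTARTABLE_EXIT" ] &&
         [ "$restarts" -lt "$MAX_RESTARTS" ]; then
        echo requeue      # temporary error and restarts left
    else
        echo failed       # permanent error or restart limit reached
    fi
}
```

In this model the gtransfer job is requeued after its first three interruptions and marked failed after the fourth, matching the sequence of listings above.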

gsatellite automatically restarts a gtransfer job only if the job is considered restartable (i.e. a temporary error is assumed), but not if the job exits because of e.g. wrong usage. Because of this, the second gtransfer job simply fails and is not restarted:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
failed          00093           nfs-client1.asc gt_job.job                           
finished        00094           nfs-client1.asc succeed.job                          
finished        00095           nfs-client1.asc succeed.job                          
finished        00096           nfs-client1.asc succeed.job                          
finished        00097           nfs-client1.asc succeed.job                          
failed          00098           nfs-client1.asc gt_job__wrong_usage.job
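
A final listing like this one can be summarised per state. The sketch below assumes the gqstat layout shown on this page (two header lines, state in the first column); count_states is a hypothetical helper name and not part of gsatellite.

```shell
#!/bin/bash
# Hypothetical helper, not part of gsatellite.

# Count jobs per state from gqstat-style output on stdin.
# The first two lines (column names and separator) are skipped.
count_states() {
    awk 'NR > 2 && NF >= 3 { n[$1]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Running `gqstat | count_states` on the listing above would report two failed and eight finished jobs.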

END OF DEMO
