Gsatellite demo 02

This is a rough transcript of a Shelr video/cast showing the automatic restart of failed jobs by gsatellite. The video/cast is available here. Please read Gsatellite explained for reference.

GSATELLITE DEMO

This demo makes use of five hosts in total. There is one NFS server available:

nfs-server.asc

There are two NFS client systems available:

nfs-client1.asc
nfs-client2.asc

There are two GridFTP servers available:

gridftp.omicron.jupiter
gridftp.omicron.neptune

We start by logging into one of the two NFS client systems. Because gsatellite can do inter-node IPC, it makes no difference which system a user connects to. For this demo we assume the gsatellite launch control (gsatlc) is running on nfs-client1.asc and the user logs into nfs-client2.asc.

$ ssh johndoe@nfs-client2

For this demo I use three basic jobs:

The already known (see demo01) succeed.job
A gtransfer job, gt_job.job:

johndoe@nfs-client2:~$ cat tmp/gsat_demo/gt_job.job
#!/bin/bash
#GSAT -T gtransfer

gt --guc-max-retries 1 --gt-max-retries 0 -s gsiftp://gridftp.omicron.jupiter:2811/mnt/scratch/johndoe/64x32MB/* -d gsiftp://gridftp.omicron.neptune:2811/mnt/scratch/johndoe/64x32MB/

A gtransfer job using a nonexistent option (-t), gt_job__wrong_usage.job:

johndoe@nfs-client2:~$ cat tmp/gsat_demo/gt_job__wrong_usage.job 
#!/bin/bash
#GSAT -T gtransfer

gt -s gsiftp://gridftp.omicron.jupiter:2811/mnt/scratch/johndoe/32x32MB/* -t gsiftp://gridftp.omicron.neptune:2811/mnt/scratch/johndoe/32x32MB/
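
Both gtransfer job files carry a "#GSAT -T gtransfer" comment line, which appears to declare the job type to gsatellite. As a minimal sketch only (the helper name job_type is hypothetical and not part of gsatellite, and the exact directive syntax is assumed from the two files shown above), such a line could be read from a job file like this:

```shell
# job_type FILE: print the value given on the first "#GSAT -T" line of FILE.
# Prints nothing if the file has no such line.
# Hypothetical helper, not part of gsatellite; the directive syntax is
# assumed from the job files shown in this demo.
job_type() {
    sed -n 's/^#GSAT[[:space:]]*-T[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$1" | head -n 1
}

# Usage (with the job file from this demo):
#   job_type tmp/gsat_demo/gt_job.job
```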

Let's start with all old gsatellite jobs removed:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------

Then submit the following jobs:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00089
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00090
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00091
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00092
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub gt_job.job
00093
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00094
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00095
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00096
johndoe@nfs-client2:~/tmp/gsat_tests$ gqsub succeed.job 
00097
johndoe@nfs-client2:~$ gqsub tmp/gsat_demo/gt_job__wrong_usage.job                   
00098

Let's check which jobs have terminated already:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
running         00091           nfs-client1.asc succeed.job                          
queued          00092                           succeed.job                          
queued          00093                           gt_job.job                           
queued          00094                           succeed.job                          
queued          00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

Then wait until the gtransfer job is started:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
running         00093           nfs-client1.asc gt_job.job                           
queued          00094                           succeed.job                          
queued          00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job
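
Instead of calling gqstat by hand until the job state changes, the waiting can be scripted. The following is a rough sketch, assuming the column layout shown in the listings (state in the first column, job id in the second). The helpers job_state and wait_for_state and the 5 second polling interval are my own choices, not part of gsatellite:

```shell
# job_state ID: read gqstat output on stdin and print the state of job ID.
# Returns non-zero if the job id does not appear in the output.
# Both sides are compared as strings so that leading zeros are significant.
job_state() {
    awk -v id="$1" '($2 "") == (id "") { print $1; found = 1 } END { exit !found }'
}

# wait_for_state ID STATE: poll gqstat until job ID reaches STATE.
# Hypothetical helper; the polling interval is arbitrary.
wait_for_state() {
    until [ "$(gqstat | job_state "$1")" = "$2" ]; do
        sleep 5
    done
}

# Usage:
#   wait_for_state 00093 running
```

Note that wait_for_state loops forever if the job never reaches the requested state, so for unattended use a timeout would be needed.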

To simulate a failed transfer, the easiest way is to kill the globus-gridftp-server (ggs) processes on either the source or the destination GridFTP server twice. I do this in a separate console session not shown in this video.

Back on nfs-client2.asc, gqstat gives the following output after the transfer was interrupted:

johndoe@nfs-client2:~$ gqstat                                                        
   job.state          job.id    job.execHost        job.name                         
------------    ------------    ------------    ------------                         
finished        00089           nfs-client1.asc succeed.job                          
finished        00090           nfs-client1.asc succeed.job                          
finished        00091           nfs-client1.asc succeed.job                          
finished        00092           nfs-client1.asc succeed.job                          
queued          00093           nfs-client1.asc gt_job.job                           
running         00094           nfs-client1.asc succeed.job                          
queued          00095                           succeed.job                          
queued          00096                           succeed.job                          
queued          00097                           succeed.job                          
queued          00098                           gt_job__wrong_usage.job

What has happened? By evaluating the exit value of the gtransfer job, gsatlc has detected that the job failed temporarily (i.e. it can be restarted). The procedure for restarting a job after a temporary error is to block the failed job from execution (i.e. put a hold on the job), start the next job, and then unblock the failed job (i.e. release the hold on the job). Hence the gtransfer job is currently in state queued and the following job is in state running.
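
The effect of this hold/start/release procedure on the execution order can be modelled in a few lines. This is only a sketch of the observed ordering, not gsatlc's actual implementation, and the function name requeue is made up for illustration:

```shell
# requeue FAILED [WAITING...]: print the execution order after job FAILED
# has failed temporarily. The next waiting job runs first (while FAILED is
# on hold), then FAILED is released and runs before the remaining jobs.
# Illustrative model only, not gsatlc code.
requeue() {
    failed=$1
    shift
    if [ "$#" -gt 0 ]; then
        next=$1
        shift
        set -- "$next" "$failed" "$@"
    else
        # Nothing else is waiting, so the failed job is simply next again.
        set -- "$failed"
    fi
    echo "$*"
}

# Usage, with the job ids from the listing above:
#   requeue 00093 00094 00095 00096   # prints: 00094 00093 00095 00096
```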

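After the following job has terminated, the gtransfer job is restarted. Note that the listings from here on use different job IDs than the ones above; the gtransfer job now appears as 00066: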

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------
finished    	00064       	nfs-client1.asc	succeed.job 
finished    	00065       	nfs-client1.asc	succeed.job 
running     	00066       	nfs-client1.asc	gt_job.job  
finished    	00067       	nfs-client1.asc	succeed.job 
queued      	00068       	            	succeed.job 
queued      	00069       	            	succeed.job 
queued      	00070       	            	succeed.job

Currently gsatellite restarts a restartable job at most three times. So let's tamper with the transfer three more times. Kill the ggs processes and watch:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------
finished    	00064       	nfs-client1.asc	succeed.job 
finished    	00065       	nfs-client1.asc	succeed.job 
queued      	00066       	nfs-client1.asc	gt_job.job  
finished    	00067       	nfs-client1.asc	succeed.job 
running     	00068       	            	succeed.job 
queued      	00069       	            	succeed.job 
queued      	00070       	            	succeed.job

Wait for the gtransfer job to be restarted:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------
finished    	00064       	nfs-client1.asc	succeed.job 
finished    	00065       	nfs-client1.asc	succeed.job 
running     	00066       	nfs-client1.asc	gt_job.job  
finished    	00067       	nfs-client1.asc	succeed.job 
finished    	00068       	            	succeed.job 
queued      	00069       	            	succeed.job 
queued      	00070       	            	succeed.job

Kill the ggs processes another time, resulting in:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------
finished    	00064       	nfs-client1.asc	succeed.job 
finished    	00065       	nfs-client1.asc	succeed.job 
queued      	00066       	nfs-client1.asc	gt_job.job  
finished    	00067       	nfs-client1.asc	succeed.job 
finished    	00068       	            	succeed.job 
running     	00069       	            	succeed.job 
queued      	00070       	            	succeed.job

Again wait for the gtransfer job to be restarted:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------
finished    	00064       	nfs-client1.asc	succeed.job 
finished    	00065       	nfs-client1.asc	succeed.job 
running     	00066       	nfs-client1.asc	gt_job.job  
finished    	00067       	nfs-client1.asc	succeed.job 
finished    	00068       	nfs-client1.asc	succeed.job 
finished    	00069       	nfs-client1.asc	succeed.job 
queued      	00070       	            	succeed.job

Kill the ggs processes one last time. As gsatellite has already restarted the gtransfer job three times, the job is now in state failed:

johndoe@nfs-client2:~/tmp/gsat_tests$ gqstat
   job.state	      job.id	job.execHost	    job.name
------------	------------	------------	------------
finished    	00064       	nfs-client1.asc	succeed.job 
finished    	00065       	nfs-client1.asc	succeed.job 
failed      	00066       	nfs-client1.asc	gt_job.job  
finished    	00067       	nfs-client1.asc	succeed.job 
finished    	00068       	nfs-client1.asc	succeed.job 
finished    	00069       	nfs-client1.asc	succeed.job 
running     	00070       	            	succeed.job

A gtransfer job is automatically restarted by gsatellite only if the job is considered restartable (i.e. a temporary error is assumed), not if it exits because of, e.g., wrong usage. Because of this, the second gtransfer job just fails and is not restarted automatically by gsatellite:

gqstat
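
The restart decision shown in this demo (restart on a temporary error, at most three times, otherwise fail) can be summarized as a small function. This is a sketch under assumptions: the exit value 75 for a temporary failure is a placeholder I picked for illustration, not the value gtransfer or gsatlc really uses, and next_state is a hypothetical name:

```shell
# Placeholder values for illustration, NOT gsatellite's or gtransfer's
# real ones.
EXIT_OK=0
EXIT_TEMPORARY=75   # hypothetical exit value meaning "temporary failure"
MAX_RESTARTS=3      # the limit stated in this demo

# next_state EXIT_VALUE RESTARTS_SO_FAR: print the state a job gets after
# it exited with EXIT_VALUE and was already restarted RESTARTS_SO_FAR times.
next_state() {
    if [ "$1" -eq "$EXIT_OK" ]; then
        echo finished
    elif [ "$1" -eq "$EXIT_TEMPORARY" ] && [ "$2" -lt "$MAX_RESTARTS" ]; then
        echo queued     # restartable and restarts left: queue it again
    else
        echo failed     # permanent error (e.g. wrong usage) or limit reached
    fi
}

# Usage:
#   next_state 75 0   # prints: queued
#   next_state 75 3   # prints: failed
```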

END OF DEMO
