This repository has been archived by the owner on Aug 4, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path04-troubleshooting.html
107 lines (101 loc) · 7.21 KB
/
04-troubleshooting.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Troubleshooting failed jobs</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="https://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Troubleshooting failed jobs</h1>
<h2 id="objectives" class="objectives">Objectives</h2>
<ul>
<li>Learn how to troubleshoot failed jobs.</li>
<li>Learn how to periodically retry the failed jobs.</li>
</ul>
<h2 id="troubleshooting-techniques">Troubleshooting techniques</h2>
<h3 id="diagnostics-with-condor_q">Diagnostics with condor_q</h3>
<p>The <em>condor_q</em> command shows the status of the jobs and it can be used to diagnose why jobs are not running. The <em>condor_q</em> command with the option <code>-better-analyze</code> will return detailed information about the jobs. Since OSG Connect sends jobs to many places, we also need to specify a pool name with the “-pool” flag</p>
<pre><code>$ condor_q -better-analyze JOB-ID -pool osg-flock.grid.iu.edu</code></pre>
<p>The detailed information about a job may help us to identify why a job is not running properly. As an example, we’ll use the built-in <code>error101</code> tutorial:</p>
<pre><code>$ ssh [email protected]
$ tutorial error101
$ cd tutorial-erro101
$ nano error101_job.submit
$condor_submit error101_job.submit </code></pre>
<p>Let us check the job status by</p>
<pre><code>condor_q username</code></pre>
<p>output from <em>condor_q -better-analyze</em> that will give us additional detail.</p>
<pre><code>$ condor_q -better-analyze JOB-ID -pool osg-flock.grid.iu.edu
# Produces a long ouput.
# The following lines are part of the output regarding the job requirements.
The Requirements expression for your job reduces to these conditions:
Slots
Step Matched Condition
----- -------- ---------
[0] 0 Memory >= 51200 ######## BIG MEMORY, NOT AVAILABLE ######
[1] 14727 TARGET.Arch == "X86_64"
[2] 14727 TARGET.OpSys == "LINUX"
[3] 14727 TARGET.Disk >= RequestDisk
[4] 14727 TARGET.HasFileTransfer</code></pre>
<p>By looking through the conditions listed, it becomes apparent that the job requesting a very large memory, no machine available to meet the requirements. This is an error we introduced puposefully in the script by including a line <em>Requirements = (Memory >= 51200)</em> in the file “error101_job.submit”. The job is removed from the queue and re-submited after correcting the requirement in the submit file.</p>
<pre><code>$ condor_rm JOB-ID #remove the specific job that has problem in matching the resources.
$ nano file.submit # Edit the submit file and insert this line: Requirements = (Memory >= 512)
$ condor_submit file.submit</code></pre>
<p>or you can edit the resource requirement of a job while it is in the idle state.</p>
<pre><code>condor_qedit JOB-ID Requirements 'Requirements = (Memory >= 512)' </code></pre>
<h3 id="held-jobs-and-condor_release">Held jobs and condor_release</h3>
<p>The jobs go to the held state for several reasons. One of them is due to the failure to transfer the output files. If this is the case, you might see the output of the analysis something as follows</p>
<pre><code>$ condor_q -analyze 372993.0
-- Submitter: login01.osgconnect.net : <192.170.227.195:56174> : login01.osgconnect.net
---
372993.000: Request is held.
Hold reason: Error from [email protected]: STARTER at 10.3.11.39 failed to send file(s) to <192.170.227.195:40485>: error reading from /wntmp/condor/compute-6-28/execute/dir_9368/glide_J6I1HT/execute/dir_16393/outputfile: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <192.84.86.100:50805></code></pre>
<p>The important parts are the message indicating that HTCondor couldn’t transfer the job back (SHADOW failed to receive file(s)) and the part one before the last part is the name of the file or directory that HTCondor couldn’t find. This failure is probably due to your application encountering an error while executing and then exiting before writing it’s output files. If you think that the error is a transient one and won’t reoccur, you can run</p>
<pre><code>condor_release JOB-ID </code></pre>
<p>to requeue the job.</p>
<h3 id="retries-with-periodic_release">Retries with periodic_release</h3>
<p>We do computing on a hetrogenous environment. It is possible that the jobs are pre-empted due to a partcular resource limitation (For example, the machine wants to perform someother task or reboot while your job is still running). In such cases, we want to re-submit failed jobs without manual intervention. Condor supports the automatic deduction and retries of computing jobs in two steps. In the first step, a failed job is forced to be on the held state. Next, the held job is periodically re-submitted. These two steps are listed in the following condor submit file:</p>
<pre><code># Send the job to Held state on failure.
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)
# Periodically retry the jobs for 3 times with an interval 1 hour.
periodic_release = (NumJobStarts < 3) && ((CurrentTime - EnteredCurrentStatus) > (60*60))</code></pre>
<h2 id="keypoints">Keypoints</h2>
<ul>
<li><em>condor_q -better-analyze JOB-ID</em> command is useful to diagnose failed jobs.</li>
<li>Failed jobs are automatically deducted and periodically retried with <em>on_exit_old</em> and <em>periodic_release</em> in the condor submit files.</li>
</ul>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:[email protected]">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>