<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<title>5.2 Node Allocation</title>
</head>
<body>
<div class="sright">
<div class="sub-content-head">
Maui Scheduler
</div>
<div id="sub-content-rpt" class="sub-content-rpt">
<div class="tab-container docs" id="tab-container">
<div class="topNav">
<div class="docsSearch"></div>
<div class="navIcons topIcons">
<a href="index.html"><img src="home.png" title="Home" alt="Home" border="0"></a> <a href="5.0prioritization.html"><img src="upArrow.png" title="Up" alt="Up" border="0"></a> <a href="5.1.5prioritymanagement.html"><img src="prevArrow.png" title="Previous" alt="Previous" border="0"></a> <a href="5.3nodeaccess.html"><img src="nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>
<h1>5.2 Node Allocation</h1>
<p>While job prioritization allows a site to determine which job to run, node allocation policies allow a site to specify how available resources should be allocated to each job. The algorithm used is specified by the parameter <a href="a.fparameters.html#nodeallocationpolicy">NODEALLOCATIONPOLICY</a>. There are multiple node allocation policies to choose from, allowing selection based on reservation constraints, node configuration, available resource constraints, and other criteria. These policies can be specified with a system-wide default value and overridden on a per-job basis. The following algorithms are available and described in detail below: <a href="5.2nodeallocation.html#FIRSTAVAILABLE">FIRSTAVAILABLE</a>, <a href="5.2nodeallocation.html#LASTAVAILABLE">LASTAVAILABLE</a>, <a href="5.2nodeallocation.html#PRIORITY">PRIORITY</a>, <a href="5.2nodeallocation.html#CPULOAD">CPULOAD</a>, <a href="5.2nodeallocation.html#MINRESOURCE">MINRESOURCE</a>, <a href="5.2nodeallocation.html#CONTIGUOUS">CONTIGUOUS</a>, <a href="5.2nodeallocation.html#MAXBALANCE">MAXBALANCE</a>, <a href="5.2nodeallocation.html#FASTEST">FASTEST</a>, and <a href="5.2nodeallocation.html#LOCAL">LOCAL</a>.</p>
<ul>
<li><a href="5.2nodeallocation.html#node">5.2.1 Node Allocation Overview</a></li>
<li>
<a href="5.2nodeallocation.html#resource">5.2.2 Resource Based Algorithms</a>
<ul>
<li><a href="5.2nodeallocation.html#CPULOAD">5.2.2.1 CPULOAD</a></li>
<li><a href="5.2nodeallocation.html#FIRSTAVAILABLE">5.2.2.2 FIRSTAVAILABLE</a></li>
<li><a href="5.2nodeallocation.html#LASTAVAILABLE">5.2.2.3 LASTAVAILABLE</a></li>
<li><a href="5.2nodeallocation.html#PRIORITY">5.2.2.4 PRIORITY</a></li>
<li><a href="5.2nodeallocation.html#MINRESOURCE">5.2.2.5 MINRESOURCE</a></li>
<li><a href="5.2nodeallocation.html#CONTIGUOUS">5.2.2.6 CONTIGUOUS</a></li>
<li><a href="5.2nodeallocation.html#MAXBALANCE">5.2.2.7 MAXBALANCE</a></li>
<li><a href="5.2nodeallocation.html#FASTEST">5.2.2.8 FASTEST</a></li>
<li><a href="5.2nodeallocation.html#LOCAL">5.2.2.9 LOCAL</a></li>
</ul>
</li>
<li><a href="5.2nodeallocation.html#time">5.2.3 Time Based Algorithms</a></li>
<li><a href="5.2nodeallocation.html#LOCAL">5.2.4 Locally Defined Algorithms</a></li>
<li><a href="5.2nodeallocation.html#pref">5.2.5 Specifying 'Per Job' Resource Preferences</a></li>
</ul>
<p><img src="logo1.gif" height="24"> Node allocation policy, along with many many of settings can be set graphically with the <a href="http://docs.adaptivecomputing.com/mcm/index.html">Moab Cluster Manager</a><sup>TM</sup>.</p>
<p><a name="node" id="node"></a></p>
<h2>5.2.1 Node Allocation Overview</h2>
<p>Node allocation is the process of selecting the best resources to allocate to a job from a list of available resources. Making this decision intelligently is important in an environment which possesses one or more of the following attributes:</p>
<ul>
<li>heterogeneous resources (resources which vary from node to node in terms of quantity or quality)</li>
<li>shared nodes (nodes may be utilized by more than one job)</li>
<li>reservations or service guarantees</li>
<li>non-flat network (a network in which a perceptible performance degradation may potentially exist depending on workload placement)</li>
</ul>
<p><br></p>
<h3>5.2.1.1 Heterogeneous Resources</h3>
<p>If the available compute resources have differing configurations, and a subset of the submitted jobs cannot run on all of the nodes, then allocation decisions can significantly affect scheduling performance. For example, a system may be comprised of two nodes, A and B, which are identical in all respects except for RAM, possessing 256 MB and 1 GB of RAM respectively. Two single processor jobs, X and Y, are submitted, requesting at least 512 MB and at least 128 MB of RAM respectively. The scheduler could run job Y on node B, in which case job X would be blocked until job Y completes. A more intelligent approach is to allocate node A to job Y because it is the node with the fewest configured resources which still satisfies job Y's constraints, leaving node B free for job X. This is a 'bestfit' approach in the configured resource dimension and is essentially what is done by the 'MINRESOURCE' algorithm.</p>
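<p>As an illustration, the bestfit behavior described above could be enabled cluster-wide with a single line in <tt>maui.cfg</tt>. This is only a minimal sketch; any other settings in the file are assumed and omitted:</p>
<pre>
# allocate the nodes with the fewest configured resources
# that still satisfy each job's requirements
NODEALLOCATIONPOLICY  MINRESOURCE
</pre>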
<p><br></p>
<h3>5.2.1.2 Shared Nodes</h3>
<p>Shared node systems most often involve SMP nodes, although this is not mandatory. Regardless, when the resources of a given node are shared amongst tasks from more than one job, resource contention and fragmentation issues arise.</p>
<p>Most current systems still do not do a very good job of logically partitioning the resources (i.e., CPU, memory, network bandwidth, etc.) available on a given node. Consequently, contention often arises between tasks of independent jobs on the node. This can result in a slowdown for all jobs involved, which can have significant ramifications if large-way parallel jobs are involved.</p>
<p>On large-way SMP systems (i.e., &gt; 32 processors/node), job packing can result in intra-node fragmentation. For example, again take two nodes, A and B, each with 64 processors. Assume they are currently loaded with various jobs and have 24 and 12 processors free respectively. Two jobs are submitted: job X requesting 10 processors, and job Y requesting 20 processors. Job X can start on either node, but starting it on node A will prevent job Y from running. An algorithm to handle intra-node fragmentation is fairly straightforward for the single resource case, but what happens when jobs request a combination of processors, memory, and local disk? Determining the correct node suddenly becomes significantly more complex.</p>
<h3>5.2.1.3 Reservations or Service Guarantees</h3>
<p>A reservation based system adds the time dimension into the node allocation decision. With reservations, node resources must be viewed in a two-dimensional 'node-time' space. Allocating nodes to jobs fragments this node-time space and makes it more difficult to schedule jobs in the remaining, more constrained node-time slots. Allocation decisions should be made in such a way as to minimize this fragmentation and maximize the scheduler's ability to continue to start jobs in existing slots. The figure to the right illustrates this.<img src="nodespace.gif" border="2" height="146" width="271" align="right"> In this figure, job A and job B are already running. A reservation, X, has been created some time in the future. Assume that job A is 2 hours long and job B is 3 hours long. Again, two new single processor jobs are submitted, C and D; job C requires 3 hours of compute time while job D requires 5 hours. Either job will just fit in the free space located above job A or in the free space located below job B. If job C is placed above job A, job D, requiring 5 hours of time, will be prevented from running by the presence of reservation X. However, if job C is placed below job B, job D can still start immediately above job A. This example demonstrates the importance of time based reservation information in making node allocation decisions, both at the time of starting jobs and at the time of creating reservations. The impact of time based issues grows significantly with the number of reservations in place on a given system. The LASTAVAILABLE algorithm works on this premise, locating resources which have the smallest space between the end of a job under consideration and the start of a future reservation.</p>
<h3>5.2.1.4 Non-flat Network</h3>
<p>On systems where network connections do not resemble a flat 'all-to-all' topology, the placement of tasks may have a significant impact on the performance of communication intensive parallel jobs. If the latencies and bandwidth of the network between any two nodes vary significantly, the node allocation algorithm should attempt to pack the tasks of a given job as close to each other as possible to minimize the impact of these bandwidth and latency differences.</p><a name="resource" id="resource"></a>
<h2>5.2.2 Resource Based Algorithms</h2>
<p>Maui contains a number of allocation algorithms which address some of the needs described above. Additional 'homegrown' allocation algorithms may also be created and interfaced into the Maui scheduling system. The current suite of algorithms is described below.</p><a name="CPULOAD" id="CPULOAD"></a>
<h3>5.2.2.1 CPULOAD</h3>
<p>Nodes which have the maximum amount of available, unused CPU power (i.e., &lt;#of CPUs&gt; - &lt;CPU load&gt;) are selected. This is a good algorithm for timesharing node systems. It is only applied to jobs starting immediately; for the purpose of future reservations, the MINRESOURCE algorithm is used.<br></p><a name="FIRSTAVAILABLE" id="FIRSTAVAILABLE"></a>
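<p>A minimal <tt>maui.cfg</tt> sketch for a timesharing cluster might therefore look like the following (any other configuration in the file is assumed and omitted):</p>
<pre>
# immediately starting jobs go to the nodes with the most idle CPU power;
# future reservations fall back to MINRESOURCE style selection
NODEALLOCATIONPOLICY  CPULOAD
</pre>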
<h3>5.2.2.2 FIRSTAVAILABLE</h3>
<p>Simple first come, first served algorithm in which nodes are allocated in the order they are presented by the resource manager. This is a very simple and very fast algorithm.<br></p><a name="LASTAVAILABLE" id="LASTAVAILABLE"></a>
<h3>5.2.2.3 LASTAVAILABLE</h3>
<p>Algorithm which selects resources so as to minimize the amount of time between the end of the job and the start of the trailing reservation. This algorithm is a 'best fit in time' algorithm which minimizes the impact of reservation based node-time fragmentation. It is useful in systems where a large number of reservations (job, standing, or administrative) are in place.<br></p><a name="PRIORITY" id="PRIORITY"></a>
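<p>For a reservation-heavy system, a minimal <tt>maui.cfg</tt> sketch might look like the following (other configuration is assumed and omitted):</p>
<pre>
# pick nodes whose free node-time slot most tightly fits the job,
# leaving larger slots intact for later jobs and reservations
NODEALLOCATIONPOLICY  LASTAVAILABLE
</pre>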
<h3>5.2.2.4 PRIORITY</h3>
<p>This algorithm allows a site to specify the priority of various static and dynamic aspects of compute nodes and allocate them accordingly. It is highly flexible, allowing node attribute and usage information to be combined with reservation affinity. Using node allocation priority, the following priority components can be specified:<br></p>
<table border="1" width="100%" nosave="">
<tbody>
<tr>
<td><b>Component Name</b></td>
<td><b>Description</b></td>
</tr>
<tr>
<td><b>ADISK</b></td>
<td>local disk currently available to batch jobs</td>
</tr>
<tr>
<td><b>AMEM</b></td>
<td>real memory currently available to batch jobs</td>
</tr>
<tr>
<td><b>APROCS</b></td>
<td>processors currently available to batch jobs (configured procs - dedicated procs)</td>
</tr>
<tr>
<td><b>ASWAP</b></td>
<td>virtual memory currently available to batch jobs</td>
</tr>
<tr>
<td><b>CDISK</b></td>
<td>total local disk allocated for use by batch jobs</td>
</tr>
<tr>
<td><b>CMEM</b></td>
<td>total real memory on node</td>
</tr>
<tr>
<td><b>CPROCS</b></td>
<td>total processors on node</td>
</tr>
<tr>
<td><b>CSWAP</b></td>
<td>total virtual memory configured on node</td>
</tr>
<tr>
<td><b>JOBCOUNT</b></td>
<td>number of jobs currently running on node</td>
</tr>
<tr>
<td><b>LOAD</b></td>
<td>current 1 minute load average</td>
</tr>
<tr>
<td valign="top"><b>PREF</b><br></td>
<td valign="top">node meets job specific resource preferences<br></td>
</tr>
<tr>
<td><b>PRIORITY</b></td>
<td>admin specified node priority</td>
</tr>
<tr>
<td><b>RESAFFINITY</b></td>
<td>reservation affinity for job being evaluated</td>
</tr>
<tr>
<td><b>SPEED</b></td>
<td>node 'procspeed' if set; otherwise, relative node 'speed'</td>
</tr>
<tr>
<td><b>USAGE</b></td>
<td>percentage of time node has been running batch jobs since the last statistics initialization</td>
</tr>
</tbody>
</table>
<p>The node allocation priority function can be specified on a node-by-node or cluster-wide basis. In both cases, the recommended approach is to specify the <b>PRIORITYF</b> attribute with the <b><a href="a.fparameters.html#nodecfg">NODECFG</a></b> parameter. A few examples follow.</p>
<p>Example 1: Favor the fastest nodes with the most available memory which are running the fewest jobs.</p>
<p><tt><font size="-1">NODEALLOCATIONPOLICY PRIORITY<br>
NODECFG[DEFAULT] PRIORITYF='SPEED+ .01 * AMEM - 10 * JOBCOUNT'</font></tt></p>
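<p>To make the effect of this function concrete, consider a hypothetical node with SPEED=1000, 2048 MB of available memory (AMEM), and 2 running jobs (JOBCOUNT). Its allocation priority would evaluate to 1000 + .01 * 2048 - 10 * 2 = 1000.48, so it would be favored over an otherwise identical node running more jobs or holding less free memory.</p>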
<p>Example 2: A site has a batch system consisting of two dedicated 'batchX' nodes, as well as numerous desktop systems. The allocation function should favor batch nodes first, followed by desktop systems which are the least loaded and have received the least historical usage.</p>
<p><tt><font size="-1">NODEALLOCATIONPOLICY PRIORITY<br>
NODECFG[DEFAULT] PRIORITYF='-LOAD - 5*USAGE'</font></tt><br>
<tt><font size="-1">NODECFG[batch1] PRIORITY=1000 PRIORITYF='PRIORITY + APROCS'</font></tt><br>
<tt><font size="-1">NODECFG[batch2] PRIORITY=1000 PRIORITYF='PRIORITY + APROCS'</font></tt></p>
<p>Example 3: Pack tasks onto loaded nodes first.</p>
<p><tt><font size="-1">NODEALLOCATIONPOLICY PRIORITY<br>
NODECFG[DEFAULT] PRIORITYF=JOBCOUNT</font></tt></p>
<table class="note">
<tbody>
<tr>
<td class="noteIMG"><img src="note.png" title="Note" alt="Note"></td>
<td class="noteDetail">As in the example above, if spaces are placed within the priority function for readability, the priority function value will need to be quoted to allow proper parsing.</td>
</tr>
</tbody>
</table><a name="MINRESOURCE" id="MINRESOURCE"></a>
<h3>5.2.2.5 MINRESOURCE</h3>
<p>This algorithm prioritizes nodes according to the configured resources on each node. Those nodes with the fewest configured resources which still meet the job's resource constraints are selected.<br></p><a name="CONTIGUOUS" id="CONTIGUOUS"></a>
<h3>5.2.2.6 CONTIGUOUS</h3>
<p>This algorithm will allocate nodes in contiguous (linear) blocks as required by the Compaq RMS system.<br></p><a name="MAXBALANCE" id="MAXBALANCE"></a>
<h3>5.2.2.7 MAXBALANCE</h3>
<p>This algorithm will attempt to allocate the most 'balanced' set of nodes possible to a job. In most cases, but not all, the metric for balance of the nodes is node speed. Thus, if possible, nodes with identical speeds will be allocated to the job. If identical speed nodes cannot be found, the algorithm will allocate the set of nodes with the minimum node speed 'span' or range.<br></p><a name="FASTEST" id="FASTEST"></a>
<h3>5.2.2.8 FASTEST</h3>
<p>This algorithm will select nodes in 'fastest node first' order. Nodes will be selected by node <i>speed</i> if specified. If node speed is not specified, nodes will be selected by processor speed. If neither is specified, nodes will be selected in a random order.<br></p><a name="LOCAL" id="LOCAL"></a>
<h3>5.2.2.9 LOCAL</h3>
<p>This will call the locally created 'contrib' node allocation algorithm.</p>
<p><b>See also</b></p>
<p>N/A.</p><a name="time" id="time"></a>
<h2>5.2.3 Time Based Algorithms</h2>
<p>Under Construction</p>
<p><a name="locally" id="locally"></a></p>
<h2>5.2.4 Locally Defined Algorithms</h2>
<p>Under Construction</p>
<h2><a name="pref" id="pref"></a> 5.2.5 Specifying 'Per Job' Resource Preferences</h2>
<p>While the resource based node allocation algorithms can make a good guess at what compute resources would best satisfy a job, sites often possess a subset of jobs which benefit from more explicit resource allocation specification. For example, one job may perform best on a particular subset of nodes due to direct access to a tape drive, while another may be very memory intensive. Resource preferences are distinct from node requirements: requirements describe what a job needs in order to run at all, while preferences describe what the job needs in order to run well. In general, a scheduler must satisfy a job's node requirement specification and then, as far as possible, should satisfy the job's resource preferences.</p>
<h3>5.2.5.1 Specifying Resource Preferences</h3>
<p>A number of resource managers natively support the concept of resource preferences (e.g., LoadLeveler). When using these systems, the native preference keywords may be used. For systems which do not support resource preferences natively, Maui provides a resource manager extension keyword, '<b>PREF</b>', which may be utilized to specify desired resources. This extension allows specification of node feature, memory, swap, and disk space conditions which define whether or not the node is considered to be 'preferred'. (NOTE: Maui 3.2.5 only supports feature based preferences.)<br></p>
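<p>For example, with a PBS/TORQUE style resource manager, the PREF extension shown later in this section can also be placed in the job script itself. The following is only a sketch; the resource request, feature name, and executable are placeholders:</p>
<pre>
#PBS -l nodes=2,walltime=1:00:00
# prefer nodes advertising the (hypothetical) 'FAST' feature
#PBS -W x=PREF(FEATURE:FAST)

./my_application
</pre>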
<h3>5.2.5.2 Selecting 'Preferred' Resources</h3>
<p>Enforcing resource preferences is not completely straightforward. A site may have a number of potentially conflicting desires which the scheduler is asked to satisfy simultaneously. For example, a scheduler may be asked to maximize the proximity of the allocated nodes at the same time it is supposed to satisfy resource preferences and minimize node overcommitment. To allow site-specific 'weighting' of these varying desires, Maui allows resource preferences to be enabled through the '<a href="5.2nodeallocation.html#PRIORITY">PRIORITY</a>' node allocation algorithm. For example, to utilize resource preferences together with node load, the following configuration might be used:</p>
<pre>
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT]     PRIORITYF='5 * PREF - LOAD'
</pre>
<p>To request specific resource preferences, a user could then submit a job indicating those preferences. In the case of a PBS job, the following might work:</p>
<pre>
qsub -l nodes=4,walltime=1:00:00 -W x=PREF(FEATURE:FAST,FEATURE:TAPE)
</pre>
<div class="navIcons bottomIcons">
<a href="index.html"><img src="home.png" title="Home" alt="Home" border="0"></a> <a href="5.0prioritization.html"><img src="upArrow.png" title="Up" alt="Up" border="0"></a> <a href="5.1.5prioritymanagement.html"><img src="prevArrow.png" title="Previous" alt="Previous" border="0"></a> <a href="5.3nodeaccess.html"><img src="nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>
</div>
</div>
</div>
<div class="sub-content-btm"></div>
</div>
</body>
</html>