Merge branch 'akpm'
sfrothwell committed Nov 17, 2011
2 parents 34de998 + b3823ec commit 9cb45ed
Showing 145 changed files with 3,505 additions and 1,645 deletions.
50 changes: 50 additions & 0 deletions Documentation/DocBook/debugobjects.tmpl
@@ -96,6 +96,7 @@
<listitem><para>debug_object_deactivate</para></listitem>
<listitem><para>debug_object_destroy</para></listitem>
<listitem><para>debug_object_free</para></listitem>
<listitem><para>debug_object_assert_init</para></listitem>
</itemizedlist>
Each of these functions takes the address of the real object and
a pointer to the object type specific debug description
@@ -273,6 +274,26 @@
debug checks.
</para>
</sect1>

<sect1 id="debug_object_assert_init">
<title>debug_object_assert_init</title>
<para>
This function is called to assert that an object has been
initialized.
</para>
<para>
When the real object is not tracked by debugobjects, it calls
the fixup_assert_init callback of the object type description
structure provided by the caller, with the hardcoded object state
ODEBUG_STATE_NOTAVAILABLE. The fixup function can correct the
problem by calling debug_object_init and other type-specific
initialization functions.
</para>
<para>
When the real object is already tracked by debugobjects, the call
is ignored.
</para>
</sect1>
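
For illustration, a minimal sketch of a caller (the my_obj type and
helpers are hypothetical, not part of this commit):

static struct debug_obj_descr my_obj_debug_descr; /* fixup callbacks filled in elsewhere */

void my_obj_del(struct my_obj *obj)
{
        /*
         * Warn, and give the fixup_assert_init callback a chance to
         * repair things, if obj was never passed to debug_object_init().
         */
        debug_object_assert_init(obj, &my_obj_debug_descr);

        /* ... proceed with the real teardown ... */
}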
</chapter>
<chapter id="fixupfunctions">
<title>Fixup functions</title>
@@ -381,6 +402,35 @@
statistics.
</para>
</sect1>
<sect1 id="fixup_assert_init">
<title>fixup_assert_init</title>
<para>
This function is called from the debug code whenever a problem
in debug_object_assert_init is detected.
</para>
<para>
Called from debug_object_assert_init() with a hardcoded state
ODEBUG_STATE_NOTAVAILABLE when the object is not found in the
debug bucket.
</para>
<para>
The function returns 1 when the fixup was successful,
otherwise 0. The return value is used to update the
statistics.
</para>
<para>
Note, this function should make sure debug_object_init() is
called before returning.
</para>
<para>
The handling of statically initialized objects is a special
case. The fixup function should check whether this is a legitimate
case of a statically initialized object. If it is, only
debug_object_init() should be called to make the object known to
the tracker, and the function should return 0 because this is not
a real fixup.
</para>
</sect1>
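
A sketch of such a callback following the rules above (the my_obj
helpers are hypothetical; the kernel's timer code implements the same
pattern for struct timer_list):

static int my_obj_fixup_assert_init(void *addr, enum debug_obj_state state)
{
        struct my_obj *obj = addr;

        switch (state) {
        case ODEBUG_STATE_NOTAVAILABLE:
                if (my_obj_is_statically_initialized(obj)) {
                        /*
                         * Legitimate statically initialized object:
                         * just make it known to the tracker and return
                         * 0, because this is not a real fixup.
                         */
                        debug_object_init(obj, &my_obj_debug_descr);
                        return 0;
                }
                /* Real problem: initialize the object, report success. */
                my_obj_init(obj);
                return 1;
        default:
                return 0;
        }
}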
</chapter>
<chapter id="bugs">
<title>Known Bugs And Assumptions</title>
14 changes: 12 additions & 2 deletions Documentation/cgroups/cgroups.txt
@@ -605,7 +605,8 @@ called on a fork. If this method returns 0 (success) then this should
remain valid while the caller holds cgroup_mutex and it is ensured that either
attach() or cancel_attach() will be called in future.

int can_attach_task(struct cgroup *cgrp, struct task_struct *tsk);
int can_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
struct task_struct *tsk);
(cgroup_mutex held by caller)

As can_attach, but for operations that must be run once per task to be
@@ -622,6 +623,14 @@ function, so that the subsystem can implement a rollback. Otherwise it is not necessary.
This will be called only for subsystems whose can_attach() operation has
succeeded.

void cancel_attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
struct task_struct *tsk)
(cgroup_mutex held by caller)

As cancel_attach, but for operations that must be cancelled once per
task that wanted to be attached. This typically reverts the effect of
can_attach_task().
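
A hedged sketch of how a subsystem might pair the two per-task
callbacks, in the spirit of the task counter subsystem added by this
commit (the my_ss helpers are hypothetical):

static int my_ss_can_attach_task(struct cgroup *cgrp,
                                 struct cgroup *old_cgrp,
                                 struct task_struct *tsk)
{
        /* Speculatively account for tsk in the destination cgroup. */
        return my_ss_charge_one_task(cgrp);
}

static void my_ss_cancel_attach_task(struct cgroup *cgrp,
                                     struct cgroup *old_cgrp,
                                     struct task_struct *tsk)
{
        /* The attachment was aborted: revert can_attach_task(). */
        my_ss_uncharge_one_task(cgrp);
}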

void pre_attach(struct cgroup *cgrp);
(cgroup_mutex held by caller)

@@ -635,7 +644,8 @@ void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.

void attach_task(struct cgroup *cgrp, struct task_struct *tsk);
void attach_task(struct cgroup *cgrp, struct cgroup *old_cgrp,
struct task_struct *tsk);
(cgroup_mutex held by caller)

As attach, but for operations that must be run once per task to be attached,
20 changes: 19 additions & 1 deletion Documentation/cgroups/resource_counter.txt
@@ -76,14 +76,24 @@ to work with it.
limit_fail_at parameter is set to the particular res_counter element
where the charging failed.

It returns 0 on success and -1 on failure.

d. int res_counter_charge_locked
(struct res_counter *rc, unsigned long val)

The same as res_counter_charge(), but it must not acquire/release the
res_counter->lock internally (it must be called with res_counter->lock
held).

e. void res_counter_uncharge[_locked]
e. int res_counter_charge_until(struct res_counter *counter,
struct res_counter *limit, unsigned long val,
struct res_counter **limit_fail_at)

The same as res_counter_charge(), but the charge propagation up
the hierarchy stops at the res_counter given in the "limit" parameter.


f. void res_counter_uncharge[_locked]
(struct res_counter *rc, unsigned long val)

When a resource is released (freed) it should be de-accounted
@@ -92,6 +102,14 @@ to work with it.

The _locked routines imply that the res_counter->lock is taken.


g. void res_counter_uncharge_until(struct res_counter *counter,
struct res_counter *limit,
unsigned long val)

The same as res_counter_uncharge(), but the uncharge propagation up
the hierarchy stops at the res_counter given in the "limit" parameter.
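
A sketch of the pair in use, assuming a child -> parent -> root chain
of res_counters and that propagation stops when the "limit" counter is
reached (helper names are hypothetical):

struct res_counter *fail_at;

/* Charge "child", propagating up the hierarchy no further than
 * "parent"; the root counter is left untouched. */
if (res_counter_charge_until(&child, &parent, 1, &fail_at) < 0) {
        /* fail_at points to the counter whose limit was hit. */
        report_overlimit(fail_at);
        return -ENOMEM;
}

/* ... later, release the charge over the same span ... */
res_counter_uncharge_until(&child, &parent, 1);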

2.1 Other accounting routines

There are more routines that may help you with common needs, like
153 changes: 153 additions & 0 deletions Documentation/cgroups/task_counter.txt
@@ -0,0 +1,153 @@
Task counter subsystem

1. Description

The task counter subsystem limits the number of tasks running
inside a given cgroup. It behaves like the RLIMIT_NPROC rlimit,
but at the scope of a cgroup instead of a user.

It has two typical use cases, although more can probably be found:

1.1 Protection against forkbomb in a container

One use case is to protect against fork bombs that explode inside
a container, when that container is implemented using a cgroup. The
RLIMIT_NPROC rlimit is known to be a working protection against this
type of attack, but it is no longer suitable when we run containers
in parallel under the same user: one container could starve all the
others by spawning a high number of tasks close to the rlimit
boundary. So in this case we need the limitation to be enforced at
per-cgroup granularity.

Note this works by preventing the propagation of a fork bomb. It
doesn't cure the effects of a fork bomb that has already grown large
enough to make the system barely responsive. When defining the limit
on the number of tasks, it's up to the admin to find the right
balance between the possible needs of a container and the resources
the system can afford to provide.

Also, the RLIMIT_NPROC rlimit and this cgroup subsystem are
completely dissociated, but they can be complementary: the task
counter limits the individual containers, and the rlimit can provide
an upper bound on the whole set of containers.


1.2 Kill tasks inside a cgroup

Another use case comes along with the fork bomb prevention: the
ability to kill all tasks inside a cgroup without races. By setting
the limit of running tasks to 0, one can prevent any further fork
inside a cgroup and then kill all of its tasks without having to
retry an unbounded number of times due to races between kills and
forks running in parallel (more details in the "Kill a cgroup
safely" section).

This is useful to kill a fork bomb, for example. When its gazillions
of forks are competing with the kills, one needs to ensure the
operation won't run in a nearly endless retry loop.

And more generally it is useful to kill a cgroup in a bounded
number of passes.


2. Interface

When a hierarchy is mounted with the task counter subsystem bound,
it adds two files to each cgroup directory, except the root one:

- tasks.usage contains the number of tasks running inside a cgroup and
its children in the hierarchy (see paragraph about Inheritance).

- tasks.limit contains the maximum number of tasks that can run inside
a cgroup. We check this limit when a task forks or when it is migrated
to a cgroup.

Note that the tasks.limit value can be forced below tasks.usage, in which
case any new task in the cgroup will be rejected until the tasks.usage
value goes below tasks.limit.

For optimization reasons, the root directory of a hierarchy doesn't have
a task counter.


3. Inheritance

When a task is added to a cgroup, by way of a cgroup migration or a fork,
it increases the task counter of that cgroup and of all its ancestors.
Hence a cgroup is also subject to the limit of its ancestors.

In the following hierarchy:


A
|
B
/ \
C D


We have 1 task running in B, one running in C and none running in D.
It means we have tasks.usage = 1 in C and tasks.usage = 2 in B because
B counts its task and those of its children.

Now let's set tasks.limit = 2 in B and tasks.limit = 1 in D.
If we move a new task into D, it will be refused because the limit
in B has already been reached.


4. Kill a cgroup safely

As explained in the description, this subsystem is also helpful to
kill all tasks in a cgroup safely, after setting tasks.limit to 0,
so that we don't race against parallel forks in an unbounded number
of kill iterations.

But there is a small detail to be aware of when using this feature
that way.

A typical way to proceed would be:

echo 0 > tasks.limit
for TASK in $(cat cgroup.procs)
do
kill -KILL $TASK
done

However there is a small race window where a task can be in the
middle of being forked but has not yet completed the fork far enough
for its PID to appear in the cgroup.procs file.

The only way to get it right is to run a loop that reads tasks.usage,
kills all the tasks listed in cgroup.procs and exits the loop only if
the value in tasks.usage is the same as the number of tasks that were
in cgroup.procs, i.e. the number of tasks that were killed.

It works because the new child appears in tasks.usage right before we check,
in the fork path, whether the parent has a pending signal, in which case the
fork is cancelled anyway. So relying on tasks.usage is fine and non-racy.

This race window is tiny and unlikely to happen, so most of the time a single
kill iteration should be enough. But it's worth knowing about that corner
case spotted by Oleg Nesterov.

An example of safe use would be:

echo 0 > tasks.limit
END=false

while [ "$END" = false ]
do
NR_TASKS=$(cat tasks.usage)
NR_KILLED=0

for TASK in $(cat cgroup.procs)
do
let NR_KILLED=NR_KILLED+1
kill -KILL $TASK
done

if [ "$NR_TASKS" = "$NR_KILLED" ]
then
END=true
fi
done
16 changes: 16 additions & 0 deletions Documentation/sysctl/vm.txt
@@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm:
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- extra_free_kbytes
- hugepages_treat_as_movable
- hugetlb_shm_group
- laptop_mode
@@ -168,6 +169,21 @@ fragmentation index is <= extfrag_threshold. The default value is 500.

==============================================================

extra_free_kbytes

This parameter tells the VM to keep extra free memory between the threshold
where background reclaim (kswapd) kicks in, and the threshold where direct
reclaim (by allocating processes) kicks in.

This is useful for workloads that require low-latency memory
allocations and have a bounded burstiness in memory allocations. For
example, a realtime application that receives and transmits network
traffic (causing in-kernel memory allocations) with a maximum total
message burst size of 200MB may need 200MB of extra free memory to
avoid direct reclaim related latencies.
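
For example, a privileged launcher could reserve the extra headroom
before starting such a workload; a minimal sketch in C (the 200MB
figure is just the example above):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/vm/extra_free_kbytes", "w");

        if (!f) {
                perror("extra_free_kbytes");
                return 1;
        }
        fprintf(f, "%d\n", 200 * 1024); /* 200MB, expressed in kbytes */
        fclose(f);
        return 0;
}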

==============================================================

hugepages_treat_as_movable

This parameter is only useful when kernelcore= is specified at boot time to
12 changes: 6 additions & 6 deletions Documentation/trace/events-kmem.txt
@@ -40,8 +40,8 @@ but the call_site can usually be used to extrapolate that information.
==================
mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
mm_page_free_direct page=%p pfn=%lu order=%d
mm_pagevec_free page=%p pfn=%lu order=%d cold=%d
mm_page_free page=%p pfn=%lu order=%d
mm_page_free_batched page=%p pfn=%lu order=%d cold=%d

These four events deal with page allocation and freeing. mm_page_alloc is
a simple indicator of page allocator activity. Pages may be allocated from
@@ -53,13 +53,13 @@ amounts of activity imply high activity on the zone->lock. Taking this lock
impairs performance by disabling interrupts, dirtying cache lines between
CPUs and serialising many CPUs.

When a page is freed directly by the caller, the mm_page_free_direct event
When a page is freed directly by the caller, the mm_page_free event
is triggered. Significant amounts of activity here could indicate that the
callers should be batching their activities.

When pages are freed using a pagevec, the mm_pagevec_free is
triggered. Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a pagevec. Significant amounts of activity here could
When pages are freed in batch, the mm_page_free_batched event is
triggered. Broadly speaking, pages are taken off the LRU in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
contention on the zone->lru_lock.
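
A small sketch that enables the two renamed events through the tracing
filesystem, assuming debugfs is mounted at /sys/kernel/debug:

#include <stdio.h>

static int enable_kmem_event(const char *name)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/tracing/events/kmem/%s/enable", name);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fputs("1\n", f);
        fclose(f);
        return 0;
}

int main(void)
{
        enable_kmem_event("mm_page_free");
        enable_kmem_event("mm_page_free_batched");
        return 0;
}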
