packages (LINUX_2_6_38): kernel/kernel-small_fixes.patch, kernel/kernel.spe...
arekm
arekm at pld-linux.org
Thu Nov 10 06:57:46 CET 2011
Author: arekm Date: Thu Nov 10 05:57:46 2011 GMT
Module: packages Tag: LINUX_2_6_38
---- Log message:
- rel 6; cgroup memory limit fixes
---- Files affected:
packages/kernel:
kernel-small_fixes.patch (1.25.2.7 -> 1.25.2.8) , kernel.spec (1.924.2.10 -> 1.924.2.11)
---- Diffs:
================================================================
Index: packages/kernel/kernel-small_fixes.patch
diff -u packages/kernel/kernel-small_fixes.patch:1.25.2.7 packages/kernel/kernel-small_fixes.patch:1.25.2.8
--- packages/kernel/kernel-small_fixes.patch:1.25.2.7 Mon Sep 26 09:04:03 2011
+++ packages/kernel/kernel-small_fixes.patch Thu Nov 10 06:57:40 2011
@@ -454,3 +454,458 @@
* Flags for inode locking.
* Bit ranges: 1<<1 - 1<<16-1 -- iolock/ilock modes (bitfield)
* 1<<16 - 1<<32-1 -- lockdep annotation (integers)
+commit 79dfdaccd1d5b40ff7cf4a35a0e63696ebb78b4d
+Author: Michal Hocko <mhocko at suse.cz>
+Date: Tue Jul 26 16:08:23 2011 -0700
+
+ memcg: make oom_lock 0 and 1 based rather than counter
+
+ Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
+ counter which is incremented by mem_cgroup_oom_lock when we are about to
+ handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep
+ if oom_lock > 1 to prevent multiple oom kills from running at the same
+ time. The counter is then decremented by mem_cgroup_oom_unlock, called
+ from the same function.
+
+ This works correctly but it can lead to serious starvation when we have
+ many processes triggering OOM and many CPUs available for them (I have
+ tested with 16 CPUs).
+
+ Consider a process (call it A) which gets the oom_lock (the first one
+ that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
+ processes that are blocked on the mutex. When A releases the mutex and
+ calls mem_cgroup_out_of_memory, the others wake up (one after another),
+ increase the counter and fall asleep on memcg_oom_waitq.
+
+ Once A finishes mem_cgroup_out_of_memory, it takes the mutex again,
+ decreases oom_lock and wakes the other tasks (if memory released by
+ somebody else - e.g. the killed process - hasn't done so already).
+
+ A testcase would look like:
+ Assume malloc XXX is a program that allocates XXX megabytes of memory
+ and touches all allocated pages in a tight loop (a minimal sketch of such
+ a helper is given after the command list).
+ # swapoff SWAP_DEVICE
+ # cgcreate -g memory:A
+ # cgset -r memory.oom_control=0 A
+ # cgset -r memory.limit_in_bytes=200M A
+ # for i in `seq 100`
+ # do
+ # cgexec -g memory:A malloc 10 &
+ # done
+
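+ A minimal sketch of such a helper (purely illustrative - the real test
+ program is not part of this patch; the binary name "malloc" and the
+ touch-by-increment loop are assumptions that simply follow the
+ description above):
+
+     /* malloc.c - allocate <megabytes> MB and keep touching every page
+      * in a tight loop so the memory stays resident and charged to the
+      * cgroup the process runs in.  Build: cc -o malloc malloc.c
+      */
+     #include <stdio.h>
+     #include <stdlib.h>
+     #include <unistd.h>
+
+     int main(int argc, char **argv)
+     {
+         size_t size, i;
+         long page = sysconf(_SC_PAGESIZE);
+         /* volatile so the compiler cannot drop the touch loop */
+         volatile char *buf;
+
+         if (argc != 2) {
+             fprintf(stderr, "usage: %s <megabytes>\n", argv[0]);
+             return 1;
+         }
+         size = strtoul(argv[1], NULL, 10) << 20;
+
+         buf = malloc(size);
+         if (!buf) {
+             perror("malloc");
+             return 1;
+         }
+
+         for (;;)                /* touch all pages in a tight loop */
+             for (i = 0; i < size; i += page)
+                 buf[i]++;
+     }
+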
+ The main problem here is that all processes still race for the mutex and
+ there is no guarantee that the counter gets back to 0 for those that got
+ back to mem_cgroup_handle_oom. In the end the whole convoy keeps
+ incrementing and decrementing the counter, but it never reaches 1, which
+ would enable killing, so nothing useful can be done. The time is
+ basically unbounded because it depends heavily on scheduling and on the
+ ordering of mutex acquisition (I have seen this taking hours...).
+
+ This patch replaces the counter by a simple {un}lock semantic. As
+ mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to
+ make sure that nobody else races with us, which is guaranteed by
+ memcg_oom_mutex.
+
+ We have to be careful while locking subtrees because we can encounter a
+ subtree which is already locked. Consider the hierarchy:
+
+ A
+ / \
+ B \
+ /\ \
+ C D E
+
+ The B - C - D subtree might already be locked. While we want to allow
+ locking the E subtree, because OOM situations there cannot influence each
+ other, we definitely do not want to allow locking A.
+
+ Therefore we have to refuse the lock if any part of the subtree is
+ already locked, and clear the lock for all nodes that have been set up to
+ the failure point.
+
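+ A simplified userspace sketch of this refuse-and-roll-back rule
+ (illustrative only - the struct node type, the child/next pointers and
+ the recursive walk are assumptions made here; the kernel code below
+ instead walks the memcg hierarchy with for_each_mem_cgroup_tree() while
+ holding memcg_oom_mutex):
+
+     #include <stdbool.h>
+     #include <stddef.h>
+
+     struct node {
+         bool oom_lock;
+         struct node *child;    /* first child */
+         struct node *next;     /* next sibling */
+     };
+
+     /* Pre-order walk of the subtree rooted at n; returns the first node
+      * that is already locked (the failure point), or NULL on success. */
+     static struct node *try_lock_subtree(struct node *n)
+     {
+         struct node *c, *failed;
+
+         if (n->oom_lock)
+             return n;                  /* already locked: refuse */
+         n->oom_lock = true;
+         for (c = n->child; c; c = c->next) {
+             failed = try_lock_subtree(c);
+             if (failed)
+                 return failed;
+         }
+         return NULL;
+     }
+
+     /* Undo the marks in the same pre-order, stopping as soon as the
+      * failure point is reached so nodes locked by somebody else are
+      * never touched.  Returns true once 'failed' has been seen. */
+     static bool unlock_up_to(struct node *n, struct node *failed)
+     {
+         struct node *c;
+
+         if (n == failed)
+             return true;
+         n->oom_lock = false;
+         for (c = n->child; c; c = c->next)
+             if (unlock_up_to(c, failed))
+                 return true;
+         return false;
+     }
+
+     static bool subtree_oom_trylock(struct node *root)
+     {
+         struct node *failed = try_lock_subtree(root);
+
+         if (!failed)
+             return true;               /* whole subtree is now locked */
+         unlock_up_to(root, failed);    /* roll back to the failure point */
+         return false;
+     }
+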
+ On the other hand we have to make sure that the rest of the world will
+ recognize that a group is under OOM even though it doesn't hold the lock.
+ Therefore we introduce an under_oom variable which is incremented and
+ decremented for the whole subtree when we enter and leave
+ mem_cgroup_handle_oom, respectively. under_oom, unlike oom_lock, doesn't
+ need to be updated under memcg_oom_mutex because its users only check a
+ single group and they use atomic operations for that.
+
+ This can be checked easily by the following test case:
+
+ # cgcreate -g memory:A
+ # cgset -r memory.use_hierarchy=1 A
+ # cgset -r memory.oom_control=1 A
+ # cgset -r memory.limit_in_bytes=100M A
+ # cgset -r memory.memsw.limit_in_bytes=100M A
+ # cgcreate -g memory:A/B
+ # cgset -r memory.oom_control=1 A/B
+ # cgset -r memory.limit_in_bytes=20M
+ # cgset -r memory.memsw.limit_in_bytes=20M
+ # cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
+ # cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A
+
+ While B gets the oom_lock, A will not get it. Both of them go to sleep
+ and wait for an external action. We can make the limit higher for A to
+ force waking it up:
+
+ # cgset -r memory.memsw.limit_in_bytes=300M A
+ # cgset -r memory.limit_in_bytes=300M A
+
+ malloc in A has to wake up even though it doesn't have oom_lock.
+
+ Finally, the unlock path is very easy because we always unlock only the
+ subtree we have previously locked, while under_oom is always decremented.
+
+ Signed-off-by: Michal Hocko <mhocko at suse.cz>
+ Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu at jp.fujitsu.com>
+ Cc: Balbir Singh <bsingharora at gmail.com>
+ Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
+ Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
+
+diff --git a/mm/memcontrol.c b/mm/memcontrol.c
+index 8559966..95d6c25 100644
+--- a/mm/memcontrol.c
++++ b/mm/memcontrol.c
+@@ -246,7 +246,10 @@ struct mem_cgroup {
+ * Should the accounting and control be hierarchical, per subtree?
+ */
+ bool use_hierarchy;
+- atomic_t oom_lock;
++
++ bool oom_lock;
++ atomic_t under_oom;
++
+ atomic_t refcnt;
+
+ int swappiness;
+@@ -1722,37 +1725,83 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+ /*
+ * Check OOM-Killer is already running under our hierarchy.
+ * If someone is running, return false.
++ * Has to be called with memcg_oom_mutex
+ */
+ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+ {
+- int x, lock_count = 0;
+- struct mem_cgroup *iter;
++ int lock_count = -1;
++ struct mem_cgroup *iter, *failed = NULL;
++ bool cond = true;
+
+- for_each_mem_cgroup_tree(iter, mem) {
+- x = atomic_inc_return(&iter->oom_lock);
+- lock_count = max(x, lock_count);
++ for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++ bool locked = iter->oom_lock;
++
++ iter->oom_lock = true;
++ if (lock_count == -1)
++ lock_count = iter->oom_lock;
++ else if (lock_count != locked) {
++ /*
++ * this subtree of our hierarchy is already locked
++ * so we cannot give a lock.
++ */
++ lock_count = 0;
++ failed = iter;
++ cond = false;
++ }
+ }
+
+- if (lock_count == 1)
+- return true;
+- return false;
++ if (!failed)
++ goto done;
++
++ /*
++ * OK, we failed to lock the whole subtree so we have to clean up
++ * what we set up to the failing subtree
++ */
++ cond = true;
++ for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++ if (iter == failed) {
++ cond = false;
++ continue;
++ }
++ iter->oom_lock = false;
++ }
++done:
++ return lock_count;
+ }
+
++/*
++ * Has to be called with memcg_oom_mutex
++ */
+ static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+ {
+ struct mem_cgroup *iter;
+
++ for_each_mem_cgroup_tree(iter, mem)
++ iter->oom_lock = false;
++ return 0;
++}
++
++static void mem_cgroup_mark_under_oom(struct mem_cgroup *mem)
++{
++ struct mem_cgroup *iter;
++
++ for_each_mem_cgroup_tree(iter, mem)
++ atomic_inc(&iter->under_oom);
++}
++
++static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
++{
++ struct mem_cgroup *iter;
++
+ /*
+ * When a new child is created while the hierarchy is under oom,
+ * mem_cgroup_oom_lock() may not be called. We have to use
+ * atomic_add_unless() here.
+ */
+ for_each_mem_cgroup_tree(iter, mem)
+- atomic_add_unless(&iter->oom_lock, -1, 0);
+- return 0;
++ atomic_add_unless(&iter->under_oom, -1, 0);
+ }
+
+-
+ static DEFINE_MUTEX(memcg_oom_mutex);
+ static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
+
+@@ -1794,7 +1843,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
+
+ static void memcg_oom_recover(struct mem_cgroup *mem)
+ {
+- if (mem && atomic_read(&mem->oom_lock))
++ if (mem && atomic_read(&mem->under_oom))
+ memcg_wakeup_oom(mem);
+ }
+
+@@ -1812,6 +1861,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+ need_to_kill = true;
++ mem_cgroup_mark_under_oom(mem);
++
+ /* At first, try to OOM lock hierarchy under mem.*/
+ mutex_lock(&memcg_oom_mutex);
+ locked = mem_cgroup_oom_lock(mem);
+@@ -1835,10 +1886,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+ }
+ mutex_lock(&memcg_oom_mutex);
+- mem_cgroup_oom_unlock(mem);
++ if (locked)
++ mem_cgroup_oom_unlock(mem);
+ memcg_wakeup_oom(mem);
+ mutex_unlock(&memcg_oom_mutex);
+
++ mem_cgroup_unmark_under_oom(mem);
++
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ return false;
+ /* Give chance to dying process */
+@@ -4505,7 +4559,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
+ list_add(&event->list, &memcg->oom_notify);
+
+ /* already in OOM ? */
+- if (atomic_read(&memcg->oom_lock))
++ if (atomic_read(&memcg->under_oom))
+ eventfd_signal(eventfd, 1);
+ mutex_unlock(&memcg_oom_mutex);
+
+@@ -4540,7 +4594,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
+
+ cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
+
+- if (atomic_read(&mem->oom_lock))
++ if (atomic_read(&mem->under_oom))
+ cb->fill(cb, "under_oom", 1);
+ else
+ cb->fill(cb, "under_oom", 0);
+commit 1d65f86db14806cf7b1218c7b4ecb8b4db5af27d
+Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu at jp.fujitsu.com>
+Date: Mon Jul 25 17:12:27 2011 -0700
+
+ mm: preallocate page before lock_page() at filemap COW
+
+ Currently we keep the faulted page locked throughout the whole __do_fault
+ call (except for the page_mkwrite code path) after calling the file
+ system's fault code. If we do early COW, we allocate a new page which has
+ to be charged to a memcg (mem_cgroup_newpage_charge).
+
+ This function, however, might block for an unbounded amount of time if
+ the memcg oom killer is disabled or a fork-bomb is running, because the
+ only way out of the OOM situation is either an external event or a fix of
+ the OOM situation itself.
+
+ In the end we keep the faulted page locked and block other processes from
+ faulting it in, which is not good at all because we are basically
+ punishing a potentially unrelated process for an OOM condition in a
+ different group (I have seen a system get stuck because ld-2.11.1.so was
+ locked).
+
+ We can test this easily:
+
+ % cgcreate -g memory:A
+ % cgset -r memory.limit_in_bytes=64M A
+ % cgset -r memory.memsw.limit_in_bytes=64M A
+ % cd kernel_dir; cgexec -g memory:A make -j
+
+ Then the whole system will livelock until you kill 'make -j' by hand (or
+ push reboot...). This is because some important pages of a shared library
+ are locked.
+
+ Thinking about it again, the new page does not need to be allocated with
+ lock_page() held. And the usual page allocation may dive into a long
+ memory reclaim loop while holding lock_page(), which can cause very long
+ latency.
+
+ There are 3 ways:
+ 1. do allocation/charge before lock_page()
+    Pros. - simple and can handle page allocation in the same manner.
+            This will reduce the holding time of lock_page() in general.
+    Cons. - we do the page allocation even if ->fault() returns an error.
+
+ 2. do charge after unlock_page(). Even if the charge fails, it's just OOM.
+    Pros. - no impact to the non-memcg path.
+    Cons. - the implementation requires special care of the LRU and we need
+            to modify page_add_new_anon_rmap()...
+
+ 3. do the unlock->charge->lock again method.
+    Pros. - no impact to the non-memcg path.
+    Cons. - this may kill the LOCK_PAGE_RETRY optimization. We need to
+            release the lock and take it again...
+
+ This patch moves the charge and the memory allocation for the COW page
+ before lock_page(). Then we can avoid scanning the LRU while holding a
+ lock on a page, and latency under lock_page() is reduced.
+
+ With that, the livelock described above disappears.
+
+ [akpm at linux-foundation.org: fix code layout]
+ Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu at jp.fujitsu.com>
+ Reported-by: Lutz Vieweg <lvml at 5t9.de>
+ Original-idea-by: Michal Hocko <mhocko at suse.cz>
+ Cc: Michal Hocko <mhocko at suse.cz>
+ Cc: Ying Han <yinghan at google.com>
+ Cc: Johannes Weiner <hannes at cmpxchg.org>
+ Cc: Daisuke Nishimura <nishimura at mxp.nes.nec.co.jp>
+ Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
+ Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
+
+diff --git a/mm/memory.c b/mm/memory.c
+index a58bbeb..3c9f3aa 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -3093,14 +3093,34 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ pte_t *page_table;
+ spinlock_t *ptl;
+ struct page *page;
++ struct page *cow_page;
+ pte_t entry;
+ int anon = 0;
+- int charged = 0;
+ struct page *dirty_page = NULL;
+ struct vm_fault vmf;
+ int ret;
+ int page_mkwrite = 0;
+
++ /*
++ * If we do COW later, allocate page before taking lock_page()
++ * on the file cache page. This will reduce lock holding time.
++ */
++ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
++
++ if (unlikely(anon_vma_prepare(vma)))
++ return VM_FAULT_OOM;
++
++ cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
++ if (!cow_page)
++ return VM_FAULT_OOM;
++
++ if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
++ page_cache_release(cow_page);
++ return VM_FAULT_OOM;
++ }
++ } else
++ cow_page = NULL;
++
+ vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.pgoff = pgoff;
+ vmf.flags = flags;
+@@ -3109,12 +3129,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ ret = vma->vm_ops->fault(vma, &vmf);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ VM_FAULT_RETRY)))
+- return ret;
++ goto uncharge_out;
+
+ if (unlikely(PageHWPoison(vmf.page))) {
+ if (ret & VM_FAULT_LOCKED)
+ unlock_page(vmf.page);
+- return VM_FAULT_HWPOISON;
++ ret = VM_FAULT_HWPOISON;
++ goto uncharge_out;
+ }
+
+ /*
+@@ -3132,23 +3153,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ page = vmf.page;
+ if (flags & FAULT_FLAG_WRITE) {
+ if (!(vma->vm_flags & VM_SHARED)) {
++ page = cow_page;
+ anon = 1;
+- if (unlikely(anon_vma_prepare(vma))) {
+- ret = VM_FAULT_OOM;
+- goto out;
+- }
+- page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+- vma, address);
+- if (!page) {
+- ret = VM_FAULT_OOM;
+- goto out;
+- }
+- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+- ret = VM_FAULT_OOM;
+- page_cache_release(page);
+- goto out;
+- }
+- charged = 1;
+ copy_user_highpage(page, vmf.page, address, vma);
+ __SetPageUptodate(page);
+ } else {
+@@ -3217,8 +3223,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, address, page_table);
+ } else {
+- if (charged)
+- mem_cgroup_uncharge_page(page);
++ if (cow_page)
++ mem_cgroup_uncharge_page(cow_page);
+ if (anon)
+ page_cache_release(page);
+ else
+@@ -3227,7 +3233,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+
+ pte_unmap_unlock(page_table, ptl);
+
+-out:
+ if (dirty_page) {
+ struct address_space *mapping = page->mapping;
+
+@@ -3257,6 +3262,13 @@ out:
+ unwritable_page:
+ page_cache_release(page);
+ return ret;
++uncharge_out:
++ /* the fs's fault handler returned an error */
++ if (cow_page) {
++ mem_cgroup_uncharge_page(cow_page);
++ page_cache_release(cow_page);
++ }
++ return ret;
+ }
+
+ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
================================================================
Index: packages/kernel/kernel.spec
diff -u packages/kernel/kernel.spec:1.924.2.10 packages/kernel/kernel.spec:1.924.2.11
--- packages/kernel/kernel.spec:1.924.2.10 Wed Oct 5 21:39:08 2011
+++ packages/kernel/kernel.spec Thu Nov 10 06:57:40 2011
@@ -95,7 +95,7 @@
%define basever 2.6.38
%define postver .8
-%define rel 5
+%define rel 6
%define _enable_debug_packages 0
@@ -1577,6 +1577,9 @@
All persons listed below can be reached at <cvs_login>@pld-linux.org
$Log$
+Revision 1.924.2.11 2011/11/10 05:57:40 arekm
+- rel 6; cgroup memory limit fixes
+
Revision 1.924.2.10 2011/10/05 19:39:08 glen
- patch-2.5.9-3.amd64 fails to apply grsec patch, 2.6.1 applies fine, so assume 2.6.0 is needed
================================================================
---- CVS-web:
http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel-small_fixes.patch?r1=1.25.2.7&r2=1.25.2.8&f=u
http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel.spec?r1=1.924.2.10&r2=1.924.2.11&f=u