packages (LINUX_2_6_38): kernel/kernel-small_fixes.patch, kernel/kernel.spe...
arekm
arekm at pld-linux.org
Thu Nov 10 06:57:46 CET 2011
Author: arekm Date: Thu Nov 10 05:57:46 2011 GMT
Module: packages Tag: LINUX_2_6_38
---- Log message:
- rel 6; cgroup memory limit fixes
---- Files affected:
packages/kernel:
kernel-small_fixes.patch (1.25.2.7 -> 1.25.2.8) , kernel.spec (1.924.2.10 -> 1.924.2.11)
---- Diffs:
================================================================
Index: packages/kernel/kernel-small_fixes.patch
diff -u packages/kernel/kernel-small_fixes.patch:1.25.2.7 packages/kernel/kernel-small_fixes.patch:1.25.2.8
--- packages/kernel/kernel-small_fixes.patch:1.25.2.7 Mon Sep 26 09:04:03 2011
+++ packages/kernel/kernel-small_fixes.patch Thu Nov 10 06:57:40 2011
@@ -454,3 +454,458 @@
* Flags for inode locking.
* Bit ranges: 1<<1 - 1<<16-1 -- iolock/ilock modes (bitfield)
* 1<<16 - 1<<32-1 -- lockdep annotation (integers)
+commit 79dfdaccd1d5b40ff7cf4a35a0e63696ebb78b4d
+Author: Michal Hocko <mhocko at suse.cz>
+Date: Tue Jul 26 16:08:23 2011 -0700
+
+ memcg: make oom_lock 0 and 1 based rather than counter
+
+ Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
+ counter which is incremented by mem_cgroup_oom_lock when we are about to
+ handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep
+ if oom_lock > 1 to prevent multiple oom kills from running at the same
+ time. The counter is then decremented by mem_cgroup_oom_unlock, called
+ from the same function.
+
+ This works correctly but it can lead to serious starvation when we have
+ many processes triggering OOM and many CPUs available for them (I have
+ tested with 16 CPUs).
+
+ Consider a process (call it A) which gets the oom_lock (the first one
+ that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
+ processes that are blocked on the mutex. When A releases the mutex and
+ calls mem_cgroup_out_of_memory, the others wake up (one after another),
+ increase the counter and fall asleep on memcg_oom_waitq.
+
+ Once A finishes mem_cgroup_out_of_memory, it takes the mutex again,
+ decreases oom_lock and wakes the other tasks (if memory released by
+ somebody else - e.g. the killed process - hasn't done so already).
+
+ A testcase would look like:
+ Assume malloc XXX is a program that allocates XXX megabytes of memory
+ and touches all allocated pages in a tight loop (a minimal sketch of such
+ a helper is given after the command list).
+ # swapoff SWAP_DEVICE
+ # cgcreate -g memory:A
+ # cgset -r memory.oom_control=0 A
+ # cgset -r memory.limit_in_bytes=200M A
+ # for i in `seq 100`
+ # do
+ # cgexec -g memory:A malloc 10 &
+ # done
+
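+ A minimal sketch of such a helper (purely illustrative - the real test
+ program is not part of this patch; the binary name "malloc" and the
+ touch-by-increment loop are assumptions that simply follow the
+ description above):
+
+     /* malloc.c - allocate <megabytes> MB and keep touching every page
+      * in a tight loop so the memory stays resident and charged to the
+      * cgroup the process runs in.  Build: cc -o malloc malloc.c
+      */
+     #include <stdio.h>
+     #include <stdlib.h>
+     #include <unistd.h>
+
+     int main(int argc, char **argv)
+     {
+         size_t size, i;
+         long page = sysconf(_SC_PAGESIZE);
+         /* volatile so the compiler cannot drop the touch loop */
+         volatile char *buf;
+
+         if (argc != 2) {
+             fprintf(stderr, "usage: %s <megabytes>\n", argv[0]);
+             return 1;
+         }
+         size = strtoul(argv[1], NULL, 10) << 20;
+
+         buf = malloc(size);
+         if (!buf) {
+             perror("malloc");
+             return 1;
+         }
+
+         for (;;)                /* touch all pages in a tight loop */
+             for (i = 0; i < size; i += page)
+                 buf[i]++;
+     }
+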
+ The main problem here is that all processes still race for the mutex and
+ there is no guarantee that the counter gets back to 0 for those that got
+ back to mem_cgroup_handle_oom. In the end the whole convoy keeps
+ incrementing and decrementing the counter, but it never reaches 1, which
+ would enable killing, so nothing useful can be done. The time is
+ basically unbounded because it depends heavily on scheduling and on the
+ ordering of mutex acquisition (I have seen this taking hours...).
+
+ This patch replaces the counter by a simple {un}lock semantic. As
+ mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to
+ make sure that nobody else races with us, which is guaranteed by
+ memcg_oom_mutex.
+
+ We have to be careful while locking subtrees because we can encounter a
+ subtree which is already locked. Consider the hierarchy:
+
+ A
+ / \
+ B \
+ /\ \
+ C D E
+
+ The B - C - D subtree might already be locked. While we want to allow
+ locking the E subtree, because OOM situations there cannot influence each
+ other, we definitely do not want to allow locking A.
+
+ Therefore we have to refuse the lock if any part of the subtree is
+ already locked, and clear the lock for all nodes that have been set up to
+ the failure point.
+
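+ A simplified userspace sketch of this refuse-and-roll-back rule
+ (illustrative only - the struct node type, the child/next pointers and
+ the recursive walk are assumptions made here; the kernel code below
+ instead walks the memcg hierarchy with for_each_mem_cgroup_tree() while
+ holding memcg_oom_mutex):
+
+     #include <stdbool.h>
+     #include <stddef.h>
+
+     struct node {
+         bool oom_lock;
+         struct node *child;    /* first child */
+         struct node *next;     /* next sibling */
+     };
+
+     /* Pre-order walk of the subtree rooted at n; returns the first node
+      * that is already locked (the failure point), or NULL on success. */
+     static struct node *try_lock_subtree(struct node *n)
+     {
+         struct node *c, *failed;
+
+         if (n->oom_lock)
+             return n;                  /* already locked: refuse */
+         n->oom_lock = true;
+         for (c = n->child; c; c = c->next) {
+             failed = try_lock_subtree(c);
+             if (failed)
+                 return failed;
+         }
+         return NULL;
+     }
+
+     /* Undo the marks in the same pre-order, stopping as soon as the
+      * failure point is reached so nodes locked by somebody else are
+      * never touched.  Returns true once 'failed' has been seen. */
+     static bool unlock_up_to(struct node *n, struct node *failed)
+     {
+         struct node *c;
+
+         if (n == failed)
+             return true;
+         n->oom_lock = false;
+         for (c = n->child; c; c = c->next)
+             if (unlock_up_to(c, failed))
+                 return true;
+         return false;
+     }
+
+     static bool subtree_oom_trylock(struct node *root)
+     {
+         struct node *failed = try_lock_subtree(root);
+
+         if (!failed)
+             return true;               /* whole subtree is now locked */
+         unlock_up_to(root, failed);    /* roll back to the failure point */
+         return false;
+     }
+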
+ On the other hand we have to make sure that the rest of the world will
+ recognize that a group is under OOM even though it doesn't hold the lock.
+ Therefore we introduce an under_oom variable which is incremented and
+ decremented for the whole subtree when we enter and leave
+ mem_cgroup_handle_oom, respectively. under_oom, unlike oom_lock, doesn't
+ need to be updated under memcg_oom_mutex because its users only check a
+ single group and they use atomic operations for that.
+
+ This can be checked easily by the following test case:
+
+ # cgcreate -g memory:A
+ # cgset -r memory.use_hierarchy=1 A
+ # cgset -r memory.oom_control=1 A
+ # cgset -r memory.limit_in_bytes=100M A
+ # cgset -r memory.memsw.limit_in_bytes=100M A
+ # cgcreate -g memory:A/B
+ # cgset -r memory.oom_control=1 A/B
+ # cgset -r memory.limit_in_bytes=20M
+ # cgset -r memory.memsw.limit_in_bytes=20M
+ # cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
+ # cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A
+
+ While B gets the oom_lock, A will not get it. Both of them go to sleep
+ and wait for an external action. We can make the limit higher for A to
+ force waking it up:
+
+ # cgset -r memory.memsw.limit_in_bytes=300M A
+ # cgset -r memory.limit_in_bytes=300M A
+
+ malloc in A has to wake up even though it doesn't have oom_lock.
+
+ Finally, the unlock path is very easy because we always unlock only the
+ subtree we have previously locked, while under_oom is always decremented.
+
+ Signed-off-by: Michal Hocko <mhocko at suse.cz>
+ Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu at jp.fujitsu.com>
+ Cc: Balbir Singh <bsingharora at gmail.com>
+ Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
+ Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
+
+diff --git a/mm/memcontrol.c b/mm/memcontrol.c
+index 8559966..95d6c25 100644
+--- a/mm/memcontrol.c
++++ b/mm/memcontrol.c
+@@ -246,7 +246,10 @@ struct mem_cgroup {
+ * Should the accounting and control be hierarchical, per subtree?
+ */
+ bool use_hierarchy;
+- atomic_t oom_lock;
++
++ bool oom_lock;
++ atomic_t under_oom;
++
+ atomic_t refcnt;
+
+ int swappiness;
+@@ -1722,37 +1725,83 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+ /*
+ * Check OOM-Killer is already running under our hierarchy.
+ * If someone is running, return false.
++ * Has to be called with memcg_oom_mutex
+ */
+ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+ {
+- int x, lock_count = 0;
+- struct mem_cgroup *iter;
++ int lock_count = -1;
++ struct mem_cgroup *iter, *failed = NULL;
++ bool cond = true;
+
+- for_each_mem_cgroup_tree(iter, mem) {
+- x = atomic_inc_return(&iter->oom_lock);
+- lock_count = max(x, lock_count);
++ for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++ bool locked = iter->oom_lock;
++
++ iter->oom_lock = true;
++ if (lock_count == -1)
++ lock_count = iter->oom_lock;
++ else if (lock_count != locked) {
++ /*
++ * this subtree of our hierarchy is already locked
++ * so we cannot give a lock.
++ */
++ lock_count = 0;
++ failed = iter;
++ cond = false;
++ }
+ }
+
+- if (lock_count == 1)
+- return true;
+- return false;
++ if (!failed)
++ goto done;
++
++ /*
++ * OK, we failed to lock the whole subtree so we have to clean up
++ * what we set up to the failing subtree
++ */
++ cond = true;
++ for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++ if (iter == failed) {
++ cond = false;
++ continue;
++ }
++ iter->oom_lock = false;
++ }
++done:
++ return lock_count;
+ }
+
++/*
++ * Has to be called with memcg_oom_mutex
++ */
+ static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+ {
+ struct mem_cgroup *iter;
+
++ for_each_mem_cgroup_tree(iter, mem)
++ iter->oom_lock = false;
++ return 0;
++}
++
++static void mem_cgroup_mark_under_oom(struct mem_cgroup *mem)
++{
++ struct mem_cgroup *iter;
++
++ for_each_mem_cgroup_tree(iter, mem)
++ atomic_inc(&iter->under_oom);
++}
++
++static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
++{
++ struct mem_cgroup *iter;
++
+ /*
+ * When a new child is created while the hierarchy is under oom,
+ * mem_cgroup_oom_lock() may not be called. We have to use
+ * atomic_add_unless() here.
+ */
+ for_each_mem_cgroup_tree(iter, mem)
+- atomic_add_unless(&iter->oom_lock, -1, 0);
+- return 0;
++ atomic_add_unless(&iter->under_oom, -1, 0);
+ }
+
+-
+ static DEFINE_MUTEX(memcg_oom_mutex);
+ static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
+
+@@ -1794,7 +1843,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
+
+ static void memcg_oom_recover(struct mem_cgroup *mem)
+ {
+- if (mem && atomic_read(&mem->oom_lock))
++ if (mem && atomic_read(&mem->under_oom))
+ memcg_wakeup_oom(mem);
+ }
+
+@@ -1812,6 +1861,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+ need_to_kill = true;
++ mem_cgroup_mark_under_oom(mem);
++
+ /* At first, try to OOM lock hierarchy under mem.*/
+ mutex_lock(&memcg_oom_mutex);
+ locked = mem_cgroup_oom_lock(mem);
+@@ -1835,10 +1886,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+ }
+ mutex_lock(&memcg_oom_mutex);
+- mem_cgroup_oom_unlock(mem);
++ if (locked)
++ mem_cgroup_oom_unlock(mem);
+ memcg_wakeup_oom(mem);
+ mutex_unlock(&memcg_oom_mutex);
+
++ mem_cgroup_unmark_under_oom(mem);
++
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ return false;
+ /* Give chance to dying process */
+@@ -4505,7 +4559,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
+ list_add(&event->list, &memcg->oom_notify);
+
+ /* already in OOM ? */
+- if (atomic_read(&memcg->oom_lock))
++ if (atomic_read(&memcg->under_oom))
+ eventfd_signal(eventfd, 1);
+ mutex_unlock(&memcg_oom_mutex);
+
+@@ -4540,7 +4594,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
+
+ cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
+
+- if (atomic_read(&mem->oom_lock))
++ if (atomic_read(&mem->under_oom))
+ cb->fill(cb, "under_oom", 1);
+ else
+ cb->fill(cb, "under_oom", 0);
+commit 1d65f86db14806cf7b1218c7b4ecb8b4db5af27d
+Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu at jp.fujitsu.com>
+Date: Mon Jul 25 17:12:27 2011 -0700
+
+ mm: preallocate page before lock_page() at filemap COW
+
+ Currently we keep the faulted page locked throughout the whole __do_fault
+ call (except for the page_mkwrite code path) after calling the file
+ system's fault code. If we do early COW, we allocate a new page which has
+ to be charged to a memcg (mem_cgroup_newpage_charge).
+
+ This function, however, might block for an unbounded amount of time if
+ the memcg oom killer is disabled or a fork-bomb is running, because the
+ only way out of the OOM situation is either an external event or a fix of
+ the OOM situation itself.
+
+ In the end we keep the faulted page locked and block other processes from
+ faulting it in, which is not good at all because we are basically
+ punishing a potentially unrelated process for an OOM condition in a
+ different group (I have seen a system get stuck because ld-2.11.1.so was
+ locked).
+
+ We can test this easily:
+
+ % cgcreate -g memory:A
+ % cgset -r memory.limit_in_bytes=64M A
+ % cgset -r memory.memsw.limit_in_bytes=64M A
+ % cd kernel_dir; cgexec -g memory:A make -j
+
+ Then the whole system will livelock until you kill 'make -j' by hand (or
+ push reboot...). This is because some important pages of a shared library
+ are locked.
+
+ Thinking about it again, the new page does not need to be allocated with
+ lock_page() held. And the usual page allocation may dive into a long
+ memory reclaim loop while holding lock_page(), which can cause very long
+ latency.
+
+ There are 3 ways:
+ 1. do allocation/charge before lock_page()
+    Pros. - simple and can handle page allocation in the same manner.
+            This will reduce the holding time of lock_page() in general.
+    Cons. - we do the page allocation even if ->fault() returns an error.
+
+ 2. do charge after unlock_page(). Even if the charge fails, it's just OOM.
+    Pros. - no impact to the non-memcg path.
+    Cons. - the implementation requires special care of the LRU and we need
+            to modify page_add_new_anon_rmap()...
+
+ 3. do the unlock->charge->lock again method.
+    Pros. - no impact to the non-memcg path.
+    Cons. - this may kill the LOCK_PAGE_RETRY optimization. We need to
+            release the lock and take it again...
+
+ This patch moves the charge and the memory allocation for the COW page
+ before lock_page(). Then we can avoid scanning the LRU while holding a
+ lock on a page, and latency under lock_page() is reduced.
+
+ With that, the livelock described above disappears.
+
+ [akpm at linux-foundation.org: fix code layout]
+ Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu at jp.fujitsu.com>
+ Reported-by: Lutz Vieweg <lvml at 5t9.de>
+ Original-idea-by: Michal Hocko <mhocko at suse.cz>
+ Cc: Michal Hocko <mhocko at suse.cz>
+ Cc: Ying Han <yinghan at google.com>
+ Cc: Johannes Weiner <hannes at cmpxchg.org>
+ Cc: Daisuke Nishimura <nishimura at mxp.nes.nec.co.jp>
+ Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
+ Signed-off-by: Linus Torvalds <torvalds at linux-foundation.org>
+
+diff --git a/mm/memory.c b/mm/memory.c
+index a58bbeb..3c9f3aa 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -3093,14 +3093,34 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ pte_t *page_table;
+ spinlock_t *ptl;
+ struct page *page;
++ struct page *cow_page;
+ pte_t entry;
+ int anon = 0;
+- int charged = 0;
+ struct page *dirty_page = NULL;
+ struct vm_fault vmf;
+ int ret;
+ int page_mkwrite = 0;
+
++ /*
++ * If we do COW later, allocate page before taking lock_page()
++ * on the file cache page. This will reduce lock holding time.
++ */
++ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
++
++ if (unlikely(anon_vma_prepare(vma)))
++ return VM_FAULT_OOM;
++
++ cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
++ if (!cow_page)
++ return VM_FAULT_OOM;
++
++ if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
++ page_cache_release(cow_page);
++ return VM_FAULT_OOM;
++ }
++ } else
++ cow_page = NULL;
++
+ vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.pgoff = pgoff;
+ vmf.flags = flags;
+@@ -3109,12 +3129,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ ret = vma->vm_ops->fault(vma, &vmf);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ VM_FAULT_RETRY)))
+- return ret;
++ goto uncharge_out;
+
+ if (unlikely(PageHWPoison(vmf.page))) {
+ if (ret & VM_FAULT_LOCKED)
+ unlock_page(vmf.page);
+- return VM_FAULT_HWPOISON;
++ ret = VM_FAULT_HWPOISON;
++ goto uncharge_out;
+ }
+
+ /*
+@@ -3132,23 +3153,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ page = vmf.page;
+ if (flags & FAULT_FLAG_WRITE) {
+ if (!(vma->vm_flags & VM_SHARED)) {
++ page = cow_page;
+ anon = 1;
+- if (unlikely(anon_vma_prepare(vma))) {
+- ret = VM_FAULT_OOM;
+- goto out;
+- }
+- page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+- vma, address);
+- if (!page) {
+- ret = VM_FAULT_OOM;
+- goto out;
+- }
+- if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+- ret = VM_FAULT_OOM;
+- page_cache_release(page);
+- goto out;
+- }
+- charged = 1;
+ copy_user_highpage(page, vmf.page, address, vma);
+ __SetPageUptodate(page);
+ } else {
+@@ -3217,8 +3223,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, address, page_table);
+ } else {
+- if (charged)
+- mem_cgroup_uncharge_page(page);
++ if (cow_page)
++ mem_cgroup_uncharge_page(cow_page);
+ if (anon)
+ page_cache_release(page);
+ else
+@@ -3227,7 +3233,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+
+ pte_unmap_unlock(page_table, ptl);
+
+-out:
+ if (dirty_page) {
+ struct address_space *mapping = page->mapping;
+
+@@ -3257,6 +3262,13 @@ out:
+ unwritable_page:
+ page_cache_release(page);
+ return ret;
++uncharge_out:
++ /* the fs's fault handler returned an error */
++ if (cow_page) {
++ mem_cgroup_uncharge_page(cow_page);
++ page_cache_release(cow_page);
++ }
++ return ret;
+ }
+
+ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
================================================================
Index: packages/kernel/kernel.spec
diff -u packages/kernel/kernel.spec:1.924.2.10 packages/kernel/kernel.spec:1.924.2.11
--- packages/kernel/kernel.spec:1.924.2.10 Wed Oct 5 21:39:08 2011
+++ packages/kernel/kernel.spec Thu Nov 10 06:57:40 2011
@@ -95,7 +95,7 @@
%define basever 2.6.38
%define postver .8
-%define rel 5
+%define rel 6
%define _enable_debug_packages 0
@@ -1577,6 +1577,9 @@
All persons listed below can be reached at <cvs_login>@pld-linux.org
$Log$
+Revision 1.924.2.11 2011/11/10 05:57:40 arekm
+- rel 6; cgroup memory limit fixes
+
Revision 1.924.2.10 2011/10/05 19:39:08 glen
- patch-2.5.9-3.amd64 fails to apply grsec patch, 2.6.1 applies fine, so assume 2.6.0 is needed
================================================================
---- CVS-web:
http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel-small_fixes.patch?r1=1.25.2.7&r2=1.25.2.8&f=u
http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel.spec?r1=1.924.2.10&r2=1.924.2.11&f=u