SOURCES: kernel-desktop-preempt-rt.patch - up to 2.6.22-rc6

Thu Aug 2 14:07:30 CEST 2007

Author: czarny                       Date: Thu Aug  2 12:07:30 2007 GMT
Module: SOURCES                       Tag: HEAD
---- Log message:
- up to 2.6.22-rc6

---- Files affected:
SOURCES:
   kernel-desktop-preempt-rt.patch (1.25 -> 1.26)

---- Diffs:

================================================================
Index: SOURCES/kernel-desktop-preempt-rt.patch
diff -u SOURCES/kernel-desktop-preempt-rt.patch:1.25 SOURCES/kernel-desktop-preempt-rt.patch:1.26

--- SOURCES/kernel-desktop-preempt-rt.patch:1.25	Tue Nov 21 18:00:38 2006
+++ SOURCES/kernel-desktop-preempt-rt.patch	Thu Aug  2 14:07:25 2007
@@ -1,887 +1,52 @@
-Index: linux/Documentation/hrtimer/highres.txt
-===================================================================
---- /dev/null
-+++ linux/Documentation/hrtimer/highres.txt
-@@ -0,0 +1,249 @@
-+High resolution timers and dynamic ticks design notes
-+-----------------------------------------------------
-+
-+Further information can be found in the paper of the OLS 2006 talk "hrtimers
-+and beyond". The paper is part of the OLS 2006 Proceedings Volume 1, which can
-+be found on the OLS website:
-+http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf
-+
-+The slides to this talk are available from:
-+http://tglx.de/projects/hrtimers/ols2006-hrtimers.pdf
-+
-+The slides contain five figures (pages 2, 15, 18, 20, 22), which illustrate the
-+changes in the time(r) related Linux subsystems. Figure #1 (p. 2) shows the
-+design of the Linux time(r) system before hrtimers and other building blocks
-+got merged into mainline.
-+
-+Note: the paper and the slides are talking about "clock event source", while we
-+switched to the name "clock event devices" in meantime.
-+
-+The design contains the following basic building blocks:
-+
-+- hrtimer base infrastructure
-+- timeofday and clock source management
-+- clock event management
-+- high resolution timer functionality
-+- dynamic ticks
-+
-+
-+hrtimer base infrastructure
-+---------------------------
-+
-+The hrtimer base infrastructure was merged into the 2.6.16 kernel. Details of
-+the base implementation are covered in Documentation/hrtimer/hrtimer.txt. See
-+also figure #2 (OLS slides p. 15)
-+
-+The main differences to the timer wheel, which holds the armed timer_list type
-+timers are:
-+       - time ordered enqueueing into a rb-tree
-+       - independent of ticks (the processing is based on nanoseconds)
-+
-+
-+timeofday and clock source management
-+-------------------------------------
-+
-+John Stultz's Generic Time Of Day (GTOD) framework moves a large portion of
-+code out of the architecture-specific areas into a generic management
-+framework, as illustrated in figure #3 (OLS slides p. 18). The architecture
-+specific portion is reduced to the low level hardware details of the clock
-+sources, which are registered in the framework and selected on a quality based
-+decision. The low level code provides hardware setup and readout routines and
-+initializes data structures, which are used by the generic time keeping code to
-+convert the clock ticks to nanosecond based time values. All other time keeping
-+related functionality is moved into the generic code. The GTOD base patch got
-+merged into the 2.6.18 kernel.
-+
-+Further information about the Generic Time Of Day framework is available in the
-+OLS 2005 Proceedings Volume 1:
-+http://www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf
-+
-+The paper "We Are Not Getting Any Younger: A New Approach to Time and
-+Timers" was written by J. Stultz, D.V. Hart, & N. Aravamudan.
-+
-+Figure #3 (OLS slides p.18) illustrates the transformation.
-+
-+
-+clock event management
-+----------------------
-+
-+While clock sources provide read access to the monotonically increasing time
-+value, clock event devices are used to schedule the next event
-+interrupt(s). The next event is currently defined to be periodic, with its
-+period defined at compile time. The setup and selection of the event device
-+for various event driven functionalities is hardwired into the architecture
-+dependent code. This results in duplicated code across all architectures and
-+makes it extremely difficult to change the configuration of the system to use
-+event interrupt devices other than those already built into the
-+architecture. Another implication of the current design is that it is necessary
-+to touch all the architecture-specific implementations in order to provide new
-+functionality like high resolution timers or dynamic ticks.
-+
-+The clock events subsystem tries to address this problem by providing a generic
-+solution to manage clock event devices and their usage for the various clock
-+event driven kernel functionalities. The goal of the clock event subsystem is
-+to minimize the clock event related architecture dependent code to the pure
-+hardware related handling and to allow easy addition and utilization of new
-+clock event devices. It also minimizes the duplicated code across the
-+architectures as it provides generic functionality down to the interrupt
-+service handler, which is almost inherently hardware dependent.
-+
-+Clock event devices are registered either by the architecture dependent boot
-+code or at module insertion time. Each clock event device fills a data
-+structure with clock-specific property parameters and callback functions. The
-+clock event management decides, by using the specified property parameters, the
-+set of system functions a clock event device will be used to support. This
-+includes the distinction of per-CPU and per-system global event devices.
-+
-+System-level global event devices are used for the Linux periodic tick. Per-CPU
-+event devices are used to provide local CPU functionality such as process
-+accounting, profiling, and high resolution timers.
-+
-+The management layer assignes one or more of the folliwing functions to a clock
-+event device:
-+      - system global periodic tick (jiffies update)
-+      - cpu local update_process_times
-+      - cpu local profiling
-+      - cpu local next event interrupt (non periodic mode)
-+
-+The clock event device delegates the selection of those timer interrupt related
-+functions completely to the management layer. The clock management layer stores
-+a function pointer in the device description structure, which has to be called
-+from the hardware level handler. This removes a lot of duplicated code from the
-+architecture specific timer interrupt handlers and hands the control over the
-+clock event devices and the assignment of timer interrupt related functionality
-+to the core code.
-+
-+The clock event layer API is rather small. Aside from the clock event device
-+registration interface it provides functions to schedule the next event
-+interrupt, clock event device notification service and support for suspend and
-+resume.
-+
-+The framework adds about 700 lines of code which results in a 2KB increase of
-+the kernel binary size. The conversion of i386 removes about 100 lines of
-+code. The binary size decrease is in the range of 400 byte. We believe that the
-+increase of flexibility and the avoidance of duplicated code across
-+architectures justifies the slight increase of the binary size.
-+
-+The conversion of an architecture has no functional impact, but allows to
-+utilize the high resolution and dynamic tick functionalites without any change
-+to the clock event device and timer interrupt code. After the conversion the
-+enabling of high resolution timers and dynamic ticks is simply provided by
-+adding the kernel/time/Kconfig file to the architecture specific Kconfig and
-+adding the dynamic tick specific calls to the idle routine (a total of 3 lines
-+added to the idle function and the Kconfig file)
-+
-+Figure #4 (OLS slides p.20) illustrates the transformation.
-+
-+
-+high resolution timer functionality
-+-----------------------------------
-+
-+During system boot it is not possible to use the high resolution timer
-+functionality, while making it possible would be difficult and would serve no
-+useful function. The initialization of the clock event device framework, the
-+clock source framework (GTOD) and hrtimers itself has to be done and
-+appropriate clock sources and clock event devices have to be registered before
-+the high resolution functionality can work. Up to the point where hrtimers are
-+initialized, the system works in the usual low resolution periodic mode. The
-+clock source and the clock event device layers provide notification functions
-+which inform hrtimers about availability of new hardware. hrtimers validates
-+the usability of the registered clock sources and clock event devices before
-+switching to high resolution mode. This ensures also that a kernel which is
-+configured for high resolution timers can run on a system which lacks the
-+necessary hardware support.
-+
-+The high resolution timer code does not support SMP machines which have only
-+global clock event devices. The support of such hardware would involve IPI
-+calls when an interrupt happens. The overhead would be much larger than the
-+benefit. This is the reason why we currently disable high resolution and
-+dynamic ticks on i386 SMP systems which stop the local APIC in C3 power
-+state. A workaround is available as an idea, but the problem has not been
-+tackled yet.
-+
-+The time ordered insertion of timers provides all the infrastructure to decide
-+whether the event device has to be reprogrammed when a timer is added. The
-+decision is made per timer base and synchronized across per-cpu timer bases in
-+a support function. The design allows the system to utilize separate per-CPU
-+clock event devices for the per-CPU timer bases, but currently only one
-+reprogrammable clock event device per-CPU is utilized.
-+
-+When the timer interrupt happens, the next event interrupt handler is called
-+from the clock event distribution code and moves expired timers from the
-+red-black tree to a separate double linked list and invokes the softirq
-+handler. An additional mode field in the hrtimer structure allows the system to
-+execute callback functions directly from the next event interrupt handler. This
-+is restricted to code which can safely be executed in the hard interrupt
-+context. This applies, for example, to the common case of a wakeup function as
-+used by nanosleep. The advantage of executing the handler in the interrupt
-+context is the avoidance of up to two context switches - from the interrupted
-+context to the softirq and to the task which is woken up by the expired
-+timer.
-+
-+Once a system has switched to high resolution mode, the periodic tick is
-+switched off. This disables the per system global periodic clock event device -
-+e.g. the PIT on i386 SMP systems.
-+
-+The periodic tick functionality is provided by an per-cpu hrtimer. The callback
-+function is executed in the next event interrupt context and updates jiffies
-+and calls update_process_times and profiling. The implementation of the hrtimer
-+based periodic tick is designed to be extended with dynamic tick functionality.
-+This allows to use a single clock event device to schedule high resolution
-+timer and periodic events (jiffies tick, profiling, process accounting) on UP
-+systems. This has been proved to work with the PIT on i386 and the Incrementer
-+on PPC.
-+
-+The softirq for running the hrtimer queues and executing the callbacks has been
-+separated from the tick bound timer softirq to allow accurate delivery of high
-+resolution timer signals which are used by itimer and POSIX interval
-+timers. The execution of this softirq can still be delayed by other softirqs,
-+but the overall latencies have been significantly improved by this separation.
-+
-+Figure #5 (OLS slides p.22) illustrates the transformation.
-+
-+
-+dynamic ticks
-+-------------
-+
-+Dynamic ticks are the logical consequence of the hrtimer based periodic tick
-+replacement (sched_tick). The functionality of the sched_tick hrtimer is
-+extended by three functions:
-+
-+- hrtimer_stop_sched_tick
-+- hrtimer_restart_sched_tick
-+- hrtimer_update_jiffies
-+
-+hrtimer_stop_sched_tick() is called when a CPU goes into idle state. The code
-+evaluates the next scheduled timer event (from both hrtimers and the timer
-+wheel) and in case that the next event is further away than the next tick it
-+reprograms the sched_tick to this future event, to allow longer idle sleeps
-+without worthless interruption by the periodic tick. The function is also
-+called when an interrupt happens during the idle period, which does not cause a
-+reschedule. The call is necessary as the interrupt handler might have armed a
-+new timer whose expiry time is before the time which was identified as the
-+nearest event in the previous call to hrtimer_stop_sched_tick.
-+
-+hrtimer_restart_sched_tick() is called when the CPU leaves the idle state before
-+it calls schedule(). hrtimer_restart_sched_tick() resumes the periodic tick,
-+which is kept active until the next call to hrtimer_stop_sched_tick().
-+
-+hrtimer_update_jiffies() is called from irq_enter() when an interrupt happens
-+in the idle period to make sure that jiffies are up to date and the interrupt
-+handler has not to deal with an eventually stale jiffy value.
-+
-+The dynamic tick feature provides statistical values which are exported to
-+userspace via /proc/stats and can be made available for enhanced power
-+management control.
-+
-+The implementation leaves room for further development like full tickless
-+systems, where the time slice is controlled by the scheduler, variable
-+frequency profiling, and a complete removal of jiffies in the future.
-+
-+
-+Aside the current initial submission of i386 support, the patchset has been
-+extended to x86_64 and ARM already. Initial (work in progress) support is also
-+available for MIPS and PowerPC.
-+
-+	  Thomas, Ingo
-+
-+
-+
-Index: linux/Documentation/hrtimer/hrtimers.txt
-===================================================================
---- /dev/null
-+++ linux/Documentation/hrtimer/hrtimers.txt
-@@ -0,0 +1,178 @@
-+
-+hrtimers - subsystem for high-resolution kernel timers
-+----------------------------------------------------
-+
-+This patch introduces a new subsystem for high-resolution kernel timers.
-+
-+One might ask the question: we already have a timer subsystem
-+(kernel/timers.c), why do we need two timer subsystems? After a lot of
-+back and forth trying to integrate high-resolution and high-precision
-+features into the existing timer framework, and after testing various
-+such high-resolution timer implementations in practice, we came to the
-+conclusion that the timer wheel code is fundamentally not suitable for
-+such an approach. We initially didnt believe this ('there must be a way
-+to solve this'), and spent a considerable effort trying to integrate
-+things into the timer wheel, but we failed. In hindsight, there are
-+several reasons why such integration is hard/impossible:
-+
-+- the forced handling of low-resolution and high-resolution timers in
-+  the same way leads to a lot of compromises, macro magic and #ifdef
-+  mess. The timers.c code is very "tightly coded" around jiffies and
-+  32-bitness assumptions, and has been honed and micro-optimized for a
-+  relatively narrow use case (jiffies in a relatively narrow HZ range)
-+  for many years - and thus even small extensions to it easily break
-+  the wheel concept, leading to even worse compromises. The timer wheel
-+  code is very good and tight code, there's zero problems with it in its
-+  current usage - but it is simply not suitable to be extended for
-+  high-res timers.
-+
-+- the unpredictable [O(N)] overhead of cascading leads to delays which
-+  necessiate a more complex handling of high resolution timers, which
-+  in turn decreases robustness. Such a design still led to rather large
-+  timing inaccuracies. Cascading is a fundamental property of the timer
-+  wheel concept, it cannot be 'designed out' without unevitably
-+  degrading other portions of the timers.c code in an unacceptable way.
-+
-+- the implementation of the current posix-timer subsystem on top of
-+  the timer wheel has already introduced a quite complex handling of
-+  the required readjusting of absolute CLOCK_REALTIME timers at
-+  settimeofday or NTP time - further underlying our experience by
-+  example: that the timer wheel data structure is too rigid for high-res
-+  timers.
-+
-+- the timer wheel code is most optimal for use cases which can be
-+  identified as "timeouts". Such timeouts are usually set up to cover
-+  error conditions in various I/O paths, such as networking and block
-+  I/O. The vast majority of those timers never expire and are rarely
-+  recascaded because the expected correct event arrives in time so they
-+  can be removed from the timer wheel before any further processing of
-+  them becomes necessary. Thus the users of these timeouts can accept
-+  the granularity and precision tradeoffs of the timer wheel, and
-+  largely expect the timer subsystem to have near-zero overhead.
-+  Accurate timing for them is not a core purpose - in fact most of the
-+  timeout values used are ad-hoc. For them it is at most a necessary
-+  evil to guarantee the processing of actual timeout completions
-+  (because most of the timeouts are deleted before completion), which
-+  should thus be as cheap and unintrusive as possible.
-+
-+The primary users of precision timers are user-space applications that
-+utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
-+users like drivers and subsystems which require precise timed events
-+(e.g. multimedia) can benefit from the availability of a seperate
-+high-resolution timer subsystem as well.
-+
-+While this subsystem does not offer high-resolution clock sources just
-+yet, the hrtimer subsystem can be easily extended with high-resolution
-+clock capabilities, and patches for that exist and are maturing quickly.
-+The increasing demand for realtime and multimedia applications along
-+with other potential users for precise timers gives another reason to
-+separate the "timeout" and "precise timer" subsystems.
-+
-+Another potential benefit is that such a seperation allows even more
-+special-purpose optimization of the existing timer wheel for the low
-+resolution and low precision use cases - once the precision-sensitive
-+APIs are separated from the timer wheel and are migrated over to
-+hrtimers. E.g. we could decrease the frequency of the timeout subsystem
-+from 250 Hz to 100 HZ (or even smaller).
-+
-+hrtimer subsystem implementation details
-+----------------------------------------
-+
-+the basic design considerations were:
-+
-+- simplicity
-+
-+- data structure not bound to jiffies or any other granularity. All the
-+  kernel logic works at 64-bit nanoseconds resolution - no compromises.
-+
-+- simplification of existing, timing related kernel code
-+
-+another basic requirement was the immediate enqueueing and ordering of
-+timers at activation time. After looking at several possible solutions
-+such as radix trees and hashes, we chose the red black tree as the basic
-+data structure. Rbtrees are available as a library in the kernel and are
-+used in various performance-critical areas of e.g. memory management and
-+file systems. The rbtree is solely used for time sorted ordering, while
-+a separate list is used to give the expiry code fast access to the
-+queued timers, without having to walk the rbtree.
-+
-+(This seperate list is also useful for later when we'll introduce
-+high-resolution clocks, where we need seperate pending and expired
-+queues while keeping the time-order intact.)
-+
-+Time-ordered enqueueing is not purely for the purposes of
-+high-resolution clocks though, it also simplifies the handling of
-+absolute timers based on a low-resolution CLOCK_REALTIME. The existing
-+implementation needed to keep an extra list of all armed absolute
-+CLOCK_REALTIME timers along with complex locking. In case of
-+settimeofday and NTP, all the timers (!) had to be dequeued, the
-+time-changing code had to fix them up one by one, and all of them had to
-+be enqueued again. The time-ordered enqueueing and the storage of the
-+expiry time in absolute time units removes all this complex and poorly
-+scaling code from the posix-timer implementation - the clock can simply
-+be set without having to touch the rbtree. This also makes the handling
-+of posix-timers simpler in general.
-+
-+The locking and per-CPU behavior of hrtimers was mostly taken from the
-+existing timer wheel code, as it is mature and well suited. Sharing code
-+was not really a win, due to the different data structures. Also, the
-+hrtimer functions now have clearer behavior and clearer names - such as
-+hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
-+equivalent to del_timer() and del_timer_sync()] - so there's no direct
-+1:1 mapping between them on the algorithmical level, and thus no real
-+potential for code sharing either.
-+
-+Basic data types: every time value, absolute or relative, is in a
-+special nanosecond-resolution type: ktime_t. The kernel-internal
-+representation of ktime_t values and operations is implemented via
-+macros and inline functions, and can be switched between a "hybrid
-+union" type and a plain "scalar" 64bit nanoseconds representation (at
-+compile time). The hybrid union type optimizes time conversions on 32bit
-+CPUs. This build-time-selectable ktime_t storage format was implemented
-+to avoid the performance impact of 64-bit multiplications and divisions
-+on 32bit CPUs. Such operations are frequently necessary to convert
-+between the storage formats provided by kernel and userspace interfaces
-+and the internal time format. (See include/linux/ktime.h for further
-+details.)
-+
-+hrtimers - rounding of timer values
-+-----------------------------------
-+
-+the hrtimer code will round timer events to lower-resolution clocks
-+because it has to. Otherwise it will do no artificial rounding at all.
-+
-+one question is, what resolution value should be returned to the user by
-+the clock_getres() interface. This will return whatever real resolution
-+a given clock has - be it low-res, high-res, or artificially-low-res.
-+
-+hrtimers - testing and verification
-+----------------------------------
-+
-+We used the high-resolution clock subsystem ontop of hrtimers to verify
-+the hrtimer implementation details in praxis, and we also ran the posix
-+timer tests in order to ensure specification compliance. We also ran
-+tests on low-resolution clocks.
-+
-+The hrtimer patch converts the following kernel functionality to use
-+hrtimers:
-+
-+ - nanosleep
-+ - itimers
-+ - posix-timers
-+
-+The conversion of nanosleep and posix-timers enabled the unification of
-+nanosleep and clock_nanosleep.
-+
-+The code was successfully compiled for the following platforms:
-+
-+ i386, x86_64, ARM, PPC, PPC64, IA64
-+
-+The code was run-tested on the following platforms:
-+
-+ i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
-+
-+hrtimers were also integrated into the -rt tree, along with a
-+hrtimers-based high-resolution clock implementation, so the hrtimers
-+code got a healthy amount of testing and use in practice.
-+
-+	Thomas Gleixner, Ingo Molnar
-Index: linux/Documentation/hrtimer/timer_stats.txt
-===================================================================
---- /dev/null
-+++ linux/Documentation/hrtimer/timer_stats.txt
-@@ -0,0 +1,68 @@
-+timer_stats - timer usage statistics
-+------------------------------------
-+
-+timer_stats is a debugging facility to make the timer (ab)usage in a Linux
-+system visible to kernel and userspace developers. It is not intended for
-+production usage as it adds significant overhead to the (hr)timer code and the
-+(hr)timer data structures.
-+
-+timer_stats should be used by kernel and userspace developers to verify that
-+their code does not make unduly use of timers. This helps to avoid unnecessary
-+wakeups, which should be avoided to optimize power consumption.
-+
-+It can be enabled by CONFIG_TIMER_STATS in the "Kernel hacking" configuration
-+section.
-+
-+timer_stats collects information about the timer events which are fired in a
-+Linux system over a sample period:
-+
-+- the pid of the task(process) which initialized the timer
-+- the name of the process which initialized the timer
-+- the function where the timer was intialized
-+- the callback function which is associated to the timer
-+- the number of events (callbacks)
-+
-+timer_stats adds an entry to /proc: /proc/timer_stats
-+
-+This entry is used to control the statistics functionality and to read out the
-+sampled information.
-+
-+The timer_stats functionality is inactive on bootup.
-+
-+To activate a sample period issue:
-+# echo 1 >/proc/timer_stats
-+
-+To stop a sample period issue:
-+# echo 0 >/proc/timer_stats
-+
-+The statistics can be retrieved by:
-+# cat /proc/timer_stats
-+
-+The readout of /proc/timer_stats automatically disables sampling. The sampled
-+information is kept until a new sample period is started. This allows multiple
-+readouts.
-+
-+Sample output of /proc/timer_stats:
-+
-+Timerstats sample period: 3.888770 s
-+  12,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
-+  15,     1 swapper          hcd_submit_urb (rh_timer_func)
-+   4,   959 kedac            schedule_timeout (process_timeout)
-+   1,     0 swapper          page_writeback_init (wb_timer_fn)
-+  28,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
-+  22,  2948 IRQ 4            tty_flip_buffer_push (delayed_work_timer_fn)
-+   3,  3100 bash             schedule_timeout (process_timeout)
-+   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
-+   1,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
-+   1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
-+   1,  2292 ip               __netdev_watchdog_up (dev_watchdog)
-+   1,    23 events/1         do_cache_clean (delayed_work_timer_fn)
-+90 total events, 30.0 events/sec
-+
-+The first column is the number of events, the second column the pid, the third
-+column is the name of the process. The forth column shows the function which
-+initialized the timer and in parantheses the callback function which was
-+executed on expiry.
-+
-+    Thomas, Ingo
-+
-Index: linux/Documentation/hrtimers.txt
-===================================================================
---- linux.orig/Documentation/hrtimers.txt
-+++ /dev/null
-@@ -1,178 +0,0 @@
--
--hrtimers - subsystem for high-resolution kernel timers
------------------------------------------------------
--
--This patch introduces a new subsystem for high-resolution kernel timers.
--
--One might ask the question: we already have a timer subsystem
--(kernel/timers.c), why do we need two timer subsystems? After a lot of
--back and forth trying to integrate high-resolution and high-precision
--features into the existing timer framework, and after testing various
--such high-resolution timer implementations in practice, we came to the
--conclusion that the timer wheel code is fundamentally not suitable for
--such an approach. We initially didnt believe this ('there must be a way
--to solve this'), and spent a considerable effort trying to integrate
--things into the timer wheel, but we failed. In hindsight, there are
--several reasons why such integration is hard/impossible:
--
--- the forced handling of low-resolution and high-resolution timers in
--  the same way leads to a lot of compromises, macro magic and #ifdef
--  mess. The timers.c code is very "tightly coded" around jiffies and
--  32-bitness assumptions, and has been honed and micro-optimized for a
--  relatively narrow use case (jiffies in a relatively narrow HZ range)
--  for many years - and thus even small extensions to it easily break
--  the wheel concept, leading to even worse compromises. The timer wheel
--  code is very good and tight code, there's zero problems with it in its
--  current usage - but it is simply not suitable to be extended for
--  high-res timers.
--
--- the unpredictable [O(N)] overhead of cascading leads to delays which
--  necessiate a more complex handling of high resolution timers, which
--  in turn decreases robustness. Such a design still led to rather large
--  timing inaccuracies. Cascading is a fundamental property of the timer
--  wheel concept, it cannot be 'designed out' without unevitably
--  degrading other portions of the timers.c code in an unacceptable way.
--
--- the implementation of the current posix-timer subsystem on top of
--  the timer wheel has already introduced a quite complex handling of
--  the required readjusting of absolute CLOCK_REALTIME timers at
--  settimeofday or NTP time - further underlying our experience by
--  example: that the timer wheel data structure is too rigid for high-res
--  timers.
--
--- the timer wheel code is most optimal for use cases which can be
--  identified as "timeouts". Such timeouts are usually set up to cover
--  error conditions in various I/O paths, such as networking and block
--  I/O. The vast majority of those timers never expire and are rarely
--  recascaded because the expected correct event arrives in time so they
--  can be removed from the timer wheel before any further processing of
--  them becomes necessary. Thus the users of these timeouts can accept
--  the granularity and precision tradeoffs of the timer wheel, and
--  largely expect the timer subsystem to have near-zero overhead.
--  Accurate timing for them is not a core purpose - in fact most of the
--  timeout values used are ad-hoc. For them it is at most a necessary
--  evil to guarantee the processing of actual timeout completions
--  (because most of the timeouts are deleted before completion), which
--  should thus be as cheap and unintrusive as possible.
--
--The primary users of precision timers are user-space applications that
--utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
--users like drivers and subsystems which require precise timed events
--(e.g. multimedia) can benefit from the availability of a seperate
--high-resolution timer subsystem as well.
--
--While this subsystem does not offer high-resolution clock sources just
--yet, the hrtimer subsystem can be easily extended with high-resolution
--clock capabilities, and patches for that exist and are maturing quickly.
--The increasing demand for realtime and multimedia applications along
--with other potential users for precise timers gives another reason to
--separate the "timeout" and "precise timer" subsystems.
--
--Another potential benefit is that such a seperation allows even more
--special-purpose optimization of the existing timer wheel for the low
--resolution and low precision use cases - once the precision-sensitive
--APIs are separated from the timer wheel and are migrated over to
--hrtimers. E.g. we could decrease the frequency of the timeout subsystem
--from 250 Hz to 100 HZ (or even smaller).
<<Diff was trimmed, longer than 597 lines>>

---- CVS-web:
    http://cvs.pld-linux.org/SOURCES/kernel-desktop-preempt-rt.patch?r1=1.25&r2=1.26&f=u