Mirror of https://github.com/genesi/linux-legacy.git, synced 2026-02-04 00:04:43 +00:00.

BFS: back out BFS scheduler for stability reasons
It works faster sometimes and slower other times; the code is quite old now.

@@ -1,326 +0,0 @@
BFS - The Brain Fuck Scheduler by Con Kolivas.

Goals.

The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
completely do away with the complex designs of the past for the cpu process
scheduler and instead implement one that is very simple in basic design.
The main focus of BFS is to achieve excellent desktop interactivity and
responsiveness without heuristics and tuning knobs that are difficult to
understand, impossible to model and predict the effect of, and when tuned to
one workload cause massive detriment to another.

Design summary.

BFS is best described as a single runqueue, O(n) lookup, earliest effective
virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
deadline first) and my previous Staircase Deadline scheduler. Each component
shall be described in order to understand the significance of, and reasoning for
it. The codebase when the first stable version was released was approximately
9000 lines less code than the existing mainline linux kernel scheduler (in
2.6.31). This does not even take into account the removal of documentation and
the cgroups code that is not used.

Design reasoning.

The single runqueue refers to the queued but not running processes for the
entire system, regardless of the number of CPUs. The reason for going back to
a single runqueue design is that once multiple runqueues are introduced,
per-CPU or otherwise, there will be complex interactions as each runqueue will
be responsible for the scheduling latency and fairness of the tasks only on its
own runqueue, and to achieve fairness and low latency across multiple CPUs, any
advantage in throughput of having CPU local tasks causes other disadvantages.
This is due to requiring a very complex balancing system to at best achieve some
semblance of fairness across CPUs, and it can only maintain relatively low
latency for tasks bound to the same CPUs, not across them. To improve said
fairness and latency across CPUs, the advantage of local runqueue locking, which
makes for better scalability, is lost due to having to grab multiple locks.

A significant feature of BFS is that all accounting is done purely based on CPU
used and nowhere is sleep time used in any way to determine entitlement or
interactivity. Interactivity "estimators" that use some kind of sleep/run
algorithm are doomed to fail to detect all interactive tasks, and to falsely tag
tasks that aren't interactive as being so. The reason for this is that it is
close to impossible to determine, when a task is sleeping, whether it is doing
so voluntarily, as in a userspace application waiting for input in the form of
a mouse click or otherwise, or involuntarily, because it is waiting for another
thread, process, I/O, kernel activity or whatever. Thus, such an estimator will
introduce corner cases, and more heuristics will be required to cope with those
corner cases, introducing more corner cases and failed interactivity detection
and so on. Interactivity in BFS is built into the design by virtue of the fact
that tasks that are waking up have not used up their quota of CPU time, and
have earlier effective deadlines, thereby making it very likely they will
preempt any CPU bound task of equivalent nice level. See below for more
information on the virtual deadline mechanism. Even if they do not preempt a
running task, because the rr interval is guaranteed to have a bounded upper
limit on how long a task will wait, it will be scheduled within a timeframe
that will not cause visible interface jitter.

Design details.

Task insertion.

BFS inserts tasks into each relevant queue as an O(1) insertion into a doubly
linked list. On insertion, *every* running queue is checked to see if the newly
queued task can run on any idle queue, or preempt the lowest running task on the
system. This is how the cross-CPU scheduling of BFS achieves significantly lower
latency per extra CPU the system has. In this case the lookup is, in the worst
case scenario, O(n) where n is the number of CPUs on the system.

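As an editorial illustration only (not the kernel code, and with every name
made up), a toy userspace model of that cross-CPU check might look like this:
an O(1) deadline is attached to the task, and an O(number-of-CPUs) scan picks
an idle CPU first, otherwise the running task with the latest deadline that the
newcomer could preempt.

/* Toy model of BFS-style insertion placement: idle CPU wins, otherwise
 * preempt the running task with the latest (worst) virtual deadline. */
#include <stdio.h>

#define NR_CPUS 4

struct task { unsigned long deadline; };

static struct task *cpu_current[NR_CPUS];     /* NULL means the CPU is idle */

static int best_cpu_for(const struct task *t)
{
    int cpu, best = -1;
    unsigned long latest = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!cpu_current[cpu])
            return cpu;                           /* idle CPU: take it */
        if (cpu_current[cpu]->deadline > t->deadline &&
            cpu_current[cpu]->deadline >= latest) {
            latest = cpu_current[cpu]->deadline;  /* weakest preemption victim */
            best = cpu;
        }
    }
    return best;   /* -1: nothing to preempt, the task just waits in the queue */
}

int main(void)
{
    struct task a = { 100 }, b = { 300 }, newcomer = { 150 };

    cpu_current[0] = &a;
    cpu_current[1] = &b;                  /* CPUs 2 and 3 left idle */
    printf("newcomer should go to cpu %d\n", best_cpu_for(&newcomer));
    return 0;
}
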
Data protection.

BFS has one single lock protecting the process local data of every task in the
global queue. Thus every insertion, removal and modification of task data in the
global runqueue needs to grab the global lock. However, once a task is taken by
a CPU, the CPU has its own local data copy of the running process' accounting
information which only that CPU accesses and modifies (such as during a
timer tick), thus allowing the accounting data to be updated locklessly. Once a
CPU has taken a task to run, it removes it from the global queue. Thus the
global queue only ever has, at most,

(number of tasks requesting cpu time) - (number of logical CPUs) + 1

tasks in it. This value is relevant for the time taken to look up tasks during
scheduling. It will increase if many tasks have a CPU affinity set in their
policy that limits which CPUs they're allowed to run on, and those tasks
outnumber the number of CPUs. The +1 is because when rescheduling a task, the
CPU's currently running task is put back on the queue. Lookup will be described
after the virtual deadline mechanism is explained.

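A quick worked illustration (an editorial addition, not from the original
text): on a machine with 4 logical CPUs and 12 tasks requesting cpu time, the
global queue holds at most 12 - 4 + 1 = 9 tasks, since 4 tasks are running on
the CPUs and the +1 accounts for a currently running task being put back on
the queue while its CPU reschedules.
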
Virtual deadline.

The key to achieving low latency, scheduling fairness, and "nice level"
distribution in BFS is entirely in the virtual deadline mechanism. The one
tunable in BFS is the rr_interval, or "round robin interval". This is the
maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
tasks of the same nice level will be running for, or looking at it the other
way around, the longest duration two tasks of the same nice level will be
delayed for. When a task requests cpu time, it is given a quota (time_slice)
equal to the rr_interval and a virtual deadline. The virtual deadline is
offset from the current time in jiffies by this equation:

jiffies + (prio_ratio * rr_interval)

The prio_ratio is determined as a ratio compared to the baseline of nice -20
and increases by 10% per nice level. The deadline is a virtual one only in that
no guarantee is placed that a task will actually be scheduled by this time, but
it is used to compare which task should go next. There are three components to
how a task is next chosen. First is time_slice expiration. If a task runs out
of its time_slice, it is descheduled, the time_slice is refilled, and the
deadline reset to that formula above. Second is sleep, where a task is no
longer requesting CPU for whatever reason. The time_slice and deadline are
_not_ adjusted in this case and are just carried over for when the task is next
scheduled. Third is preemption, and that is when a newly waking task is deemed
higher priority than a currently running task on any cpu by virtue of the fact
that it has an earlier virtual deadline than the currently running task. The
earlier deadline is the key to which task is next chosen for the first and
second cases. Once a task is descheduled, it is put back on the queue, and an
O(n) lookup of all queued-but-not-running tasks is done to determine which has
the earliest deadline, and that task is chosen to receive CPU next.

The CPU proportion of different nice tasks works out to be approximately

(prio_ratio difference)^2

The reason it is squared is that a task's deadline does not change while it is
running unless it runs out of time_slice. Thus, even if the time actually
passes the deadline of another task that is queued, it will not get CPU time
unless the currently running task deschedules, and the time "base" (jiffies)
is constantly moving.

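As an editorial sketch of how those two formulas interact, the small userspace
program below tabulates the deadline offset and the approximate relative CPU
share per nice level, assuming the 10% increase per nice level is multiplicative
and the default 6 ms rr_interval; it is an illustration, not the kernel's
arithmetic.

/* Illustration: deadline offset and approximate relative share vs nice level,
 * per the prio_ratio and squared-proportion rules described above. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double rr_interval_ms = 6.0;           /* default rr_interval */
    int nice;

    for (nice = -20; nice <= 19; nice += 13) {
        double ratio = pow(1.10, nice + 20);     /* +10% per nice level */
        printf("nice %3d: deadline offset ~%6.1f ms, share vs nice -20 ~%.3f\n",
               nice, ratio * rr_interval_ms, 1.0 / (ratio * ratio));
    }
    return 0;
}

Under this reading, two SCHED_NORMAL tasks five nice levels apart have
prio_ratios differing by a factor of roughly 1.1^5 = 1.61, so by the squared
relationship their CPU shares should differ by roughly 1.61^2, about 2.6 to 1.
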
Task lookup.

BFS has 103 priority queues. 100 of these are dedicated to the static priority
of realtime tasks, and the remaining 3 are, in order of best to worst priority,
SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
scheduling). When a task of these priorities is queued, a bitmap of running
priorities is set showing which of these priorities has tasks waiting for CPU
time. When a CPU is made to reschedule, the lookup for the next task to get
CPU time is performed in the following way:

First the bitmap is checked to see what static priority tasks are queued. If
any realtime priorities are found, the corresponding queue is checked and the
first task listed there is taken (provided CPU affinity is suitable) and lookup
is complete. If the priority corresponds to a SCHED_ISO task, they are also
taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds
to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
stage, every task in the runlist that corresponds to that priority is checked
to see which has the earliest set deadline, and (provided it has suitable CPU
affinity) it is taken off the runqueue and given the CPU. If a task has an
expired deadline, it is taken immediately and the rest of the lookup is aborted
(as expired-deadline tasks are chosen in FIFO order).

Thus, the lookup is O(n) in the worst case only, where n is as described
earlier, as tasks may be chosen before the whole task list is looked over.

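The following toy userspace model (an editorial addition; array sizes and names
are invented) mirrors the lookup order just described: FIFO take for the
realtime and ISO levels, and an O(n) earliest-virtual-deadline scan for the
SCHED_NORMAL/SCHED_IDLEPRIO levels.

/* Toy model of the BFS lookup: a linear walk over 103 priority levels stands
 * in for the priority bitmap. */
#include <stdio.h>

#define PRIO_LEVELS 103        /* 100 realtime + ISO + NORMAL + IDLEPRIO */
#define ISO_LEVEL   100
#define MAX_QUEUED  16

struct task { const char *name; unsigned long deadline; };

static struct task *queue[PRIO_LEVELS][MAX_QUEUED];
static int qlen[PRIO_LEVELS];

static struct task *pick_next(void)
{
    int prio, i;

    for (prio = 0; prio < PRIO_LEVELS; prio++) {   /* stands in for the bitmap */
        if (!qlen[prio])
            continue;
        if (prio <= ISO_LEVEL)                      /* realtime and ISO: FIFO */
            return queue[prio][0];
        /* SCHED_NORMAL / SCHED_IDLEPRIO: O(n) earliest-deadline scan */
        struct task *best = queue[prio][0];
        for (i = 1; i < qlen[prio]; i++)
            if (queue[prio][i]->deadline < best->deadline)
                best = queue[prio][i];
        return best;
    }
    return NULL;                                    /* nothing is queued */
}

int main(void)
{
    struct task encode = { "encoder", 480 }, ui = { "browser", 120 };
    int normal = ISO_LEVEL + 1;

    queue[normal][qlen[normal]++] = &encode;
    queue[normal][qlen[normal]++] = &ui;
    printf("next task: %s\n", pick_next()->name);   /* browser: earlier deadline */
    return 0;
}
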
Scalability.

The major limitation of BFS will be that of scalability, as the separate
runqueue designs will have less lock contention as the number of CPUs rises.
However they do not scale linearly even with separate runqueues, as multiple
runqueues will need to be locked concurrently on such designs to be able to
achieve fair CPU balancing, to try and achieve some sort of nice-level fairness
across CPUs, and to achieve low enough latency for tasks on a busy CPU when
other CPUs would be more suited. BFS has the advantage that it requires no
balancing algorithm whatsoever, as balancing occurs by proxy simply because
all CPUs draw off the global runqueue, in priority and deadline order. Despite
the fact that scalability is _not_ the prime concern of BFS, it both shows very
good scalability to smaller numbers of CPUs and is likely a more scalable design
at these numbers of CPUs.

It also has some very low overhead scalability features built into the design,
added where their overhead was deemed so marginal that they're worth having.
The first is the local copy of the running process' data to the CPU it's running
on, allowing that data to be updated locklessly where possible. Then there is
deference paid to the last CPU a task was running on, by trying that CPU first
when looking for an idle CPU to use the next time it's scheduled. Finally there
is the notion of "sticky" tasks that are flagged when they are involuntarily
descheduled, meaning they still want further CPU time. This sticky flag is
used to bias heavily against those tasks being scheduled on a different CPU
unless that CPU would be otherwise idle. When a cpu frequency governor is used
that scales with CPU load, such as ondemand, sticky tasks are not scheduled
on a different CPU at all, preferring instead to go idle. This means the CPU
they were bound to is more likely to increase its speed while the other CPU
will go idle, thus speeding up total task execution time and likely decreasing
power usage. This is the only scenario where BFS will allow a CPU to go idle
in preference to scheduling a task on the earliest available spare CPU.

The real cost of migrating a task from one CPU to another is entirely dependent
on the cache footprint of the task, how cache intensive the task is, how long
it's been running on that CPU to take up the bulk of its cache, how big the CPU
cache is, how fast and how layered the CPU cache is, how fast a context switch
is... and so on. In other words, it's close to random in the real world where we
do more than just one sole workload. The only thing we can be sure of is that
it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
utilising idle CPUs is more important than cache locality, and cache locality
only plays a part after that.

Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
However this benchmarking was performed on an earlier design that was far less
scalable than the current one, so it's hard to know how scalable it is in terms
of both CPUs (due to the global runqueue) and heavily loaded machines (due to
O(n) lookup) at this stage. Note that in terms of scalability, the number of
_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
quad core (4x) hyperthreaded (2x) machine is effectively a 16x. Newer benchmark
results are very promising indeed, without needing to tweak any knobs, features
or options. Benchmark contributions are most welcome.

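The sticky-task bias described above boils down to a small placement decision.
The toy function below is an editorial paraphrase of that decision as the text
states it, with invented names and no claim to match the kernel's actual code
paths:

/* Toy decision model of the "sticky" bias: take a sticky task onto a
 * different CPU only if that CPU would otherwise be idle, and never when a
 * load-scaling cpufreq governor (e.g. ondemand) is in use. */
#include <stdbool.h>
#include <stdio.h>

static bool take_sticky_task(bool this_is_tasks_last_cpu,
                             bool this_cpu_otherwise_idle,
                             bool scaling_governor_active)
{
    if (this_is_tasks_last_cpu)
        return true;                 /* no migration cost at all */
    if (scaling_governor_active)
        return false;                /* let the task's original CPU ramp up */
    return this_cpu_otherwise_idle;  /* only steal it if we would sit idle */
}

int main(void)
{
    printf("%d\n", take_sticky_task(false, true, true));   /* 0: stay put */
    return 0;
}
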
Features

As the initial prime target audience for BFS was the average desktop user, it
was designed to not need tweaking, tuning or have features set to obtain benefit
from it. Thus the number of knobs and features has been kept to an absolute
minimum and should not require extra user input for the vast majority of cases.
There are precisely 2 tunables and 2 extra scheduling policies: the rr_interval
and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
support for CGROUPS. The average user should neither need to know what these
are, nor should they need to be using them to have good desktop behaviour.

rr_interval

There is only one "scheduler" tunable, the round robin interval. This can be
accessed in

/proc/sys/kernel/rr_interval

The value is in milliseconds, and the default value is set to 6ms. Valid values
are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
decreasing throughput, while increasing it will improve throughput, but at the
cost of worsening latencies. The accuracy of the rr interval is limited by the
HZ resolution of the kernel configuration. Thus, the worst case latencies are
usually slightly higher than this actual value. BFS uses "dithering" to try and
minimise the effect the HZ limitation has. The default value of 6 is not an
arbitrary one. It is based on the fact that humans can detect jitter at
approximately 7ms, so aiming for much lower latencies is pointless under most
circumstances. It is worth noting this fact when comparing the latency
performance of BFS to other schedulers. Worst case latencies being higher than
7ms are far worse than average latencies not being in the microsecond range.
Experimentation has shown that increasing the rr interval up to 300 can improve
throughput, but beyond that, scheduling noise from elsewhere prevents further
demonstrable throughput gains.

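For completeness, a minimal userspace sketch of reading and adjusting this
tunable (an editorial example; writing requires root, and 300 is just the
throughput-oriented value mentioned above):

/* Read /proc/sys/kernel/rr_interval, then raise it for a throughput run. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/rr_interval", "r+");
    int current_ms;

    if (!f) {
        perror("rr_interval (is this a BFS kernel?)");
        return 1;
    }
    if (fscanf(f, "%d", &current_ms) == 1)
        printf("rr_interval is %d ms\n", current_ms);
    rewind(f);
    fprintf(f, "300\n");        /* new value; see the valid range above */
    fclose(f);
    return 0;
}
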
Isochronous scheduling.

Isochronous scheduling is a unique scheduling policy designed to provide
near-real-time performance to unprivileged (ie non-root) users without the
ability to starve the machine indefinitely. Isochronous tasks (isochronous
means "same time") are set using, for example, the schedtool application like
so:

schedtool -I -e amarok

This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works
is that it has a priority level between true realtime tasks and SCHED_NORMAL,
which allows ISO tasks to preempt all normal tasks, in a SCHED_RR fashion (ie,
if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
rate). However, if ISO tasks run for more than a tunable finite amount of time,
they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
time is a percentage of the _total CPU_ available across the machine,
configurable as a percentage in the following "resource handling" tunable (as
opposed to a scheduler tunable):

/proc/sys/kernel/iso_cpu

and is set to 70% by default. It is calculated over a rolling 5 second average.
Because it is the total CPU available, it means that on a multi CPU machine, it
is possible to have an ISO task running as realtime scheduling indefinitely on
just one CPU, as the other CPUs will be available. Setting this to 100 is the
equivalent of giving all users SCHED_RR access, and setting it to 0 removes the
ability to run any pseudo-realtime tasks.

A feature of BFS is that it detects when an application tries to obtain a
realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
appropriate privileges to use those policies. When it detects this, it will
give the task SCHED_ISO policy instead. Thus it is transparent to the user.
Because some applications constantly set their policy as well as their nice
level, there is potential for them to undo the override specified by the user
on the command line of setting the policy to SCHED_ISO. To counter this, once
a task has been set to SCHED_ISO policy, it needs superuser privileges to set
it back to SCHED_NORMAL. This will ensure the task remains ISO, and all child
processes and threads will also inherit the ISO policy.

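For programs that set their own policy rather than relying on schedtool, a
small sketch using sched_setscheduler() is shown below. SCHED_ISO is not
defined in standard userspace headers; the value 4 is taken from the
CONFIG_SCHED_BFS block of the sched.h hunk further down in this commit, and the
rest of the example is an editorial illustration.

/* Request SCHED_ISO for the current process on a BFS kernel. */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SCHED_ISO
#define SCHED_ISO 4            /* BFS-only policy value, per this commit */
#endif

int main(void)
{
    struct sched_param sp = { .sched_priority = 0 };   /* 0 for non-RT policies */

    if (sched_setscheduler(getpid(), SCHED_ISO, &sp) == -1) {
        perror("sched_setscheduler(SCHED_ISO)");
        return 1;
    }
    printf("pid %d now running as SCHED_ISO\n", getpid());
    /* exec the real workload here, e.g. execlp("amarok", "amarok", NULL); */
    return 0;
}
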
Idleprio scheduling.

Idleprio scheduling is a scheduling policy designed to give out CPU to a task
_only_ when the CPU would be otherwise idle. The idea behind this is to allow
ultra low priority tasks to be run in the background that have virtually no
effect on the foreground tasks. This is ideally suited to distributed computing
clients (like setiathome, folding, mprime etc) but can also be used to start
a video encode or so on without any slowdown of other tasks. To prevent tasks
under this policy from grabbing shared resources and holding them indefinitely,
if BFS detects a state where the task is waiting on I/O, the machine is about
to suspend to ram, and so on, it will transiently schedule them as SCHED_NORMAL.
As per the Isochronous task management, once a task has been scheduled as
IDLEPRIO, it cannot be put back to SCHED_NORMAL without superuser privileges.
Tasks can be set to start as SCHED_IDLEPRIO with the schedtool command like so:

schedtool -D -e ./mprime

Subtick accounting.

It is surprisingly difficult to get accurate CPU accounting, and in many cases,
the accounting is done by simply determining what is happening at the precise
moment a timer tick fires off. This becomes increasingly inaccurate as the
timer tick frequency (HZ) is lowered. It is possible to create an application
which uses almost 100% CPU, yet by being descheduled at the right time, records
zero CPU usage. While the main problem with this is that there are possible
security implications, it is also difficult to determine how much CPU a task
really does use. BFS tries to use the sub-tick accounting from the TSC clock,
where possible, to determine real CPU usage. This is not entirely reliable, but
is far more likely to produce accurate CPU usage data than the existing designs
and will not show tasks as consuming no CPU usage when they actually are. Thus,
the amount of CPU reported as being used by BFS will more accurately represent
how much CPU the task itself is using (as is shown for example by the 'time'
application), so the reported values may be quite different from those of other
schedulers. Values reported as the 'load' are more prone to problems with this
design, but per process values are closer to real usage. When comparing
throughput of BFS to other designs, it is important to compare the actual
completed work in terms of total wall clock time taken and total work done,
rather than the reported "cpu usage".

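The "almost 100% CPU yet zero recorded usage" claim can be illustrated with an
assumption-laden userspace sketch (an editorial addition): burn CPU for most of
each 10 ms window but sleep across the window boundary, so purely tick-based
accounting that samples at those boundaries can under-report the usage. It
assumes HZ=100 and that tick boundaries roughly line up with CLOCK_MONOTONIC;
neither needs to hold exactly on a real kernel.

/* Tick-dodging illustration: ~80% real CPU use, asleep at every tick boundary. */
#include <time.h>

int main(void)
{
    const long window_ns = 10 * 1000 * 1000;      /* 10 ms, i.e. HZ=100 */

    for (;;) {
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_nsec % window_ns < window_ns - 2000000)
            continue;                             /* keep burning CPU */

        /* within 2 ms of the boundary: sleep over it so the tick sees us idle */
        struct timespec nap = { 0, 3000000 };     /* 3 ms */
        nanosleep(&nap, NULL);
    }
}
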
Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011

@@ -27,7 +27,6 @@ show up in /proc/sys/kernel:
- domainname
- hostname
- hotplug
- iso_cpu
- java-appletviewer [ binfmt_java, obsolete ]
- java-interpreter [ binfmt_java, obsolete ]
- kstack_depth_to_print [ X86 only ]
@@ -50,7 +49,6 @@ show up in /proc/sys/kernel:
- randomize_va_space
- real-root-dev ==> Documentation/initrd.txt
- reboot-cmd [ SPARC only ]
- rr_interval
- rtsig-max
- rtsig-nr
- sem
@@ -173,16 +171,6 @@ Default value is "/sbin/hotplug".

==============================================================

iso_cpu: (BFS CPU scheduler only).

This sets the percentage cpu that the unprivileged SCHED_ISO tasks can
run effectively at realtime priority, averaged over a rolling five
seconds over the -whole- system, meaning all cpus.

Set to 70 (percent) by default.

==============================================================

l2cr: (PPC only)

This flag controls the L2 cache of G3 processor boards. If
@@ -345,20 +333,6 @@ rebooting. ???

==============================================================

rr_interval: (BFS CPU scheduler only)

This is the smallest duration that any cpu process scheduling unit
will run for. Increasing this value can increase throughput of cpu
bound tasks substantially but at the expense of increased latencies
overall. Conversely decreasing it will decrease average and maximum
latencies but at the expense of throughput. This value is in
milliseconds and the default value chosen depends on the number of
cpus available at scheduler initialisation with a minimum of 6.

Valid values are from 1-5000.

==============================================================

rtsig-max & rtsig-nr:

The file rtsig-max can be used to tune the maximum number

@@ -1,7 +1,7 @@
|
||||
#
|
||||
# Automatically generated make config: don't edit
|
||||
# Linux kernel version: 2.6.31.14.27
|
||||
# Mon Nov 19 11:59:35 2012
|
||||
# Mon Dec 10 20:11:59 2012
|
||||
#
|
||||
CONFIG_ARM=y
|
||||
CONFIG_HAVE_PWM=y
|
||||
@@ -31,7 +31,6 @@ CONFIG_CONSTRUCTORS=y
|
||||
#
|
||||
# General setup
|
||||
#
|
||||
CONFIG_SCHED_BFS=y
|
||||
CONFIG_EXPERIMENTAL=y
|
||||
CONFIG_BROKEN_ON_SMP=y
|
||||
CONFIG_INIT_ENV_ARG_LIMIT=32
|
||||
@@ -59,6 +58,7 @@ CONFIG_RCU_FANOUT=32
|
||||
CONFIG_IKCONFIG=y
|
||||
CONFIG_IKCONFIG_PROC=y
|
||||
CONFIG_LOG_BUF_SHIFT=16
|
||||
# CONFIG_GROUP_SCHED is not set
|
||||
CONFIG_CGROUPS=y
|
||||
# CONFIG_CGROUP_DEBUG is not set
|
||||
CONFIG_CGROUP_NS=y
|
||||
@@ -66,6 +66,7 @@ CONFIG_CGROUP_FREEZER=y
|
||||
# CONFIG_CGROUP_DEVICE is not set
|
||||
CONFIG_CPUSETS=y
|
||||
# CONFIG_PROC_PID_CPUSET is not set
|
||||
CONFIG_CGROUP_CPUACCT=y
|
||||
CONFIG_RESOURCE_COUNTERS=y
|
||||
# CONFIG_CGROUP_MEM_RES_CTLR is not set
|
||||
# CONFIG_SYSFS_DEPRECATED_V2 is not set
|
||||
@@ -280,7 +281,7 @@ CONFIG_VMSPLIT_2G=y
|
||||
# CONFIG_VMSPLIT_1G is not set
|
||||
CONFIG_PAGE_OFFSET=0x80000000
|
||||
# CONFIG_PREEMPT is not set
|
||||
CONFIG_HZ=256
|
||||
CONFIG_HZ=100
|
||||
CONFIG_AEABI=y
|
||||
# CONFIG_OABI_COMPAT is not set
|
||||
# CONFIG_ARCH_SPARSEMEM_DEFAULT is not set
|
||||
|
||||
@@ -61,6 +61,11 @@ static struct task_struct *spusched_task;
|
||||
static struct timer_list spusched_timer;
|
||||
static struct timer_list spuloadavg_timer;
|
||||
|
||||
/*
|
||||
* Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
|
||||
*/
|
||||
#define NORMAL_PRIO 120
|
||||
|
||||
/*
|
||||
* Frequency of the spu scheduler tick. By default we do one SPU scheduler
|
||||
* tick for every 10 CPU scheduler ticks.
|
||||
|
||||
@@ -444,10 +444,8 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
|
||||
freq_target = 5;
|
||||
|
||||
this_dbs_info->requested_freq += freq_target;
|
||||
if (this_dbs_info->requested_freq >= policy->max) {
|
||||
if (this_dbs_info->requested_freq > policy->max)
|
||||
this_dbs_info->requested_freq = policy->max;
|
||||
cpu_nonscaling(policy->cpu);
|
||||
}
|
||||
|
||||
__cpufreq_driver_target(policy, this_dbs_info->requested_freq,
|
||||
CPUFREQ_RELATION_H);
|
||||
@@ -472,7 +470,6 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
|
||||
if (policy->cur == policy->min)
|
||||
return;
|
||||
|
||||
cpu_scaling(policy->cpu);
|
||||
__cpufreq_driver_target(policy, this_dbs_info->requested_freq,
|
||||
CPUFREQ_RELATION_H);
|
||||
return;
|
||||
@@ -588,7 +585,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
|
||||
dbs_timer_init(this_dbs_info);
|
||||
|
||||
cpu_scaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_STOP:
|
||||
@@ -610,7 +606,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
|
||||
mutex_unlock(&dbs_mutex);
|
||||
|
||||
cpu_nonscaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_LIMITS:
|
||||
|
||||
@@ -470,7 +470,6 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
|
||||
if (freq_next < policy->min)
|
||||
freq_next = policy->min;
|
||||
|
||||
cpu_scaling(policy->cpu);
|
||||
if (!dbs_tuners_ins.powersave_bias) {
|
||||
__cpufreq_driver_target(policy, freq_next,
|
||||
CPUFREQ_RELATION_L);
|
||||
@@ -594,7 +593,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
mutex_unlock(&dbs_mutex);
|
||||
|
||||
dbs_timer_init(this_dbs_info);
|
||||
cpu_scaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_STOP:
|
||||
@@ -606,7 +604,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
dbs_enable--;
|
||||
mutex_unlock(&dbs_mutex);
|
||||
|
||||
cpu_nonscaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_LIMITS:
|
||||
|
||||
@@ -23,7 +23,6 @@
|
||||
#include <linux/fs.h>
|
||||
#include <linux/sysfs.h>
|
||||
#include <linux/mutex.h>
|
||||
#include <linux/sched.h>
|
||||
|
||||
/**
|
||||
* A few values needed by the userspace governor
|
||||
@@ -98,10 +97,6 @@ static int cpufreq_set(struct cpufreq_policy *policy, unsigned int freq)
|
||||
* cpufreq_governor_userspace (lock userspace_mutex)
|
||||
*/
|
||||
ret = __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
|
||||
if (freq == cpu_max_freq)
|
||||
cpu_nonscaling(policy->cpu);
|
||||
else
|
||||
cpu_scaling(policy->cpu);
|
||||
|
||||
err:
|
||||
mutex_unlock(&userspace_mutex);
|
||||
@@ -147,7 +142,6 @@ static int cpufreq_governor_userspace(struct cpufreq_policy *policy,
|
||||
per_cpu(cpu_cur_freq, cpu));
|
||||
|
||||
mutex_unlock(&userspace_mutex);
|
||||
cpu_scaling(cpu);
|
||||
break;
|
||||
case CPUFREQ_GOV_STOP:
|
||||
mutex_lock(&userspace_mutex);
|
||||
@@ -164,7 +158,6 @@ static int cpufreq_governor_userspace(struct cpufreq_policy *policy,
|
||||
per_cpu(cpu_set_freq, cpu) = 0;
|
||||
dprintk("managing cpu %u stopped\n", cpu);
|
||||
mutex_unlock(&userspace_mutex);
|
||||
cpu_nonscaling(cpu);
|
||||
break;
|
||||
case CPUFREQ_GOV_LIMITS:
|
||||
mutex_lock(&userspace_mutex);
|
||||
|
||||
@@ -366,7 +366,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
|
||||
static int proc_pid_schedstat(struct task_struct *task, char *buffer)
|
||||
{
|
||||
return sprintf(buffer, "%llu %llu %lu\n",
|
||||
(unsigned long long)tsk_seruntime(task),
|
||||
(unsigned long long)task->se.sum_exec_runtime,
|
||||
(unsigned long long)task->sched_info.run_delay,
|
||||
task->sched_info.pcount);
|
||||
}
|
||||
|
||||
@@ -109,68 +109,6 @@ extern struct cred init_cred;
|
||||
* INIT_TASK is used to set up the first task table, touch at
|
||||
* your own risk!. Base=0, limit=0x1fffff (=2MB)
|
||||
*/
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#define INIT_TASK(tsk) \
|
||||
{ \
|
||||
.state = 0, \
|
||||
.stack = &init_thread_info, \
|
||||
.usage = ATOMIC_INIT(2), \
|
||||
.flags = PF_KTHREAD, \
|
||||
.lock_depth = -1, \
|
||||
.prio = NORMAL_PRIO, \
|
||||
.static_prio = MAX_PRIO-20, \
|
||||
.normal_prio = NORMAL_PRIO, \
|
||||
.deadline = 0, \
|
||||
.policy = SCHED_NORMAL, \
|
||||
.cpus_allowed = CPU_MASK_ALL, \
|
||||
.mm = NULL, \
|
||||
.active_mm = &init_mm, \
|
||||
.run_list = LIST_HEAD_INIT(tsk.run_list), \
|
||||
.time_slice = HZ, \
|
||||
.tasks = LIST_HEAD_INIT(tsk.tasks), \
|
||||
.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
|
||||
.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
|
||||
.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
|
||||
.real_parent = &tsk, \
|
||||
.parent = &tsk, \
|
||||
.children = LIST_HEAD_INIT(tsk.children), \
|
||||
.sibling = LIST_HEAD_INIT(tsk.sibling), \
|
||||
.group_leader = &tsk, \
|
||||
.real_cred = &init_cred, \
|
||||
.cred = &init_cred, \
|
||||
.cred_guard_mutex = \
|
||||
__MUTEX_INITIALIZER(tsk.cred_guard_mutex), \
|
||||
.comm = "swapper", \
|
||||
.thread = INIT_THREAD, \
|
||||
.fs = &init_fs, \
|
||||
.files = &init_files, \
|
||||
.signal = &init_signals, \
|
||||
.sighand = &init_sighand, \
|
||||
.nsproxy = &init_nsproxy, \
|
||||
.pending = { \
|
||||
.list = LIST_HEAD_INIT(tsk.pending.list), \
|
||||
.signal = {{0}}}, \
|
||||
.blocked = {{0}}, \
|
||||
.alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \
|
||||
.journal_info = NULL, \
|
||||
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
|
||||
.fs_excl = ATOMIC_INIT(0), \
|
||||
.pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \
|
||||
.timer_slack_ns = 50000, /* 50 usec default slack */ \
|
||||
.pids = { \
|
||||
[PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \
|
||||
[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \
|
||||
[PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \
|
||||
}, \
|
||||
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
|
||||
INIT_IDS \
|
||||
INIT_PERF_COUNTERS(tsk) \
|
||||
INIT_TRACE_IRQFLAGS \
|
||||
INIT_LOCKDEP \
|
||||
INIT_FTRACE_GRAPH \
|
||||
INIT_TRACE_RECURSION \
|
||||
}
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
#define INIT_TASK(tsk) \
|
||||
{ \
|
||||
.state = 0, \
|
||||
@@ -230,14 +168,13 @@ extern struct cred init_cred;
|
||||
}, \
|
||||
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
|
||||
INIT_IDS \
|
||||
INIT_PERF_EVENTS(tsk) \
|
||||
INIT_PERF_COUNTERS(tsk) \
|
||||
INIT_TRACE_IRQFLAGS \
|
||||
INIT_LOCKDEP \
|
||||
INIT_FTRACE_GRAPH \
|
||||
INIT_TRACE_RECURSION \
|
||||
INIT_TASK_RCU_PREEMPT(tsk) \
|
||||
}
|
||||
#endif /* CONFIG_SCHED_BFS */
|
||||
|
||||
|
||||
#define INIT_CPU_TIMERS(cpu_timers) \
|
||||
{ \
|
||||
|
||||
@@ -64,8 +64,6 @@ static inline int task_ioprio_class(struct io_context *ioc)
|
||||
|
||||
static inline int task_nice_ioprio(struct task_struct *task)
|
||||
{
|
||||
if (iso_task(task))
|
||||
return 0;
|
||||
return (task_nice(task) + 20) / 5;
|
||||
}
|
||||
|
||||
|
||||
@@ -164,7 +164,7 @@ static inline u64 get_jiffies_64(void)
|
||||
* Have the 32 bit jiffies value wrap 5 minutes after boot
|
||||
* so jiffies wrap bugs show up earlier.
|
||||
*/
|
||||
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-10*HZ))
|
||||
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
|
||||
|
||||
/*
|
||||
* Change timeval to jiffies, trying to avoid the
|
||||
|
||||
@@ -36,16 +36,8 @@
|
||||
#define SCHED_FIFO 1
|
||||
#define SCHED_RR 2
|
||||
#define SCHED_BATCH 3
|
||||
/* SCHED_ISO: Implemented on BFS only */
|
||||
/* SCHED_ISO: reserved but not implemented yet */
|
||||
#define SCHED_IDLE 5
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#define SCHED_ISO 4
|
||||
#define SCHED_IDLEPRIO SCHED_IDLE
|
||||
|
||||
#define SCHED_MAX (SCHED_IDLEPRIO)
|
||||
#define SCHED_RANGE(policy) ((policy) <= SCHED_MAX)
|
||||
#endif
|
||||
|
||||
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
|
||||
#define SCHED_RESET_ON_FORK 0x40000000
|
||||
|
||||
@@ -148,7 +140,7 @@ extern int nr_processes(void);
|
||||
extern unsigned long nr_running(void);
|
||||
extern unsigned long nr_uninterruptible(void);
|
||||
extern unsigned long nr_iowait(void);
|
||||
extern void calc_global_load(void);
|
||||
extern void calc_global_load(unsigned long ticks);
|
||||
extern u64 cpu_nr_migrations(int cpu);
|
||||
|
||||
extern unsigned long get_parent_ip(unsigned long addr);
|
||||
@@ -264,6 +256,9 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
|
||||
extern void init_idle(struct task_struct *idle, int cpu);
|
||||
extern void init_idle_bootup_task(struct task_struct *idle);
|
||||
|
||||
extern int runqueue_is_locked(void);
|
||||
extern void task_rq_unlock_wait(struct task_struct *p);
|
||||
|
||||
extern cpumask_var_t nohz_cpu_mask;
|
||||
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
|
||||
extern int select_nohz_load_balancer(int cpu);
|
||||
@@ -1028,6 +1023,148 @@ struct uts_namespace;
|
||||
struct rq;
|
||||
struct sched_domain;
|
||||
|
||||
struct sched_class {
|
||||
const struct sched_class *next;
|
||||
|
||||
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
|
||||
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
|
||||
void (*yield_task) (struct rq *rq);
|
||||
|
||||
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);
|
||||
|
||||
struct task_struct * (*pick_next_task) (struct rq *rq);
|
||||
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
|
||||
|
||||
#ifdef CONFIG_SMP
|
||||
int (*select_task_rq)(struct task_struct *p, int sync);
|
||||
|
||||
unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
|
||||
struct rq *busiest, unsigned long max_load_move,
|
||||
struct sched_domain *sd, enum cpu_idle_type idle,
|
||||
int *all_pinned, int *this_best_prio);
|
||||
|
||||
int (*move_one_task) (struct rq *this_rq, int this_cpu,
|
||||
struct rq *busiest, struct sched_domain *sd,
|
||||
enum cpu_idle_type idle);
|
||||
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
|
||||
int (*needs_post_schedule) (struct rq *this_rq);
|
||||
void (*post_schedule) (struct rq *this_rq);
|
||||
void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);
|
||||
|
||||
void (*set_cpus_allowed)(struct task_struct *p,
|
||||
const struct cpumask *newmask);
|
||||
|
||||
void (*rq_online)(struct rq *rq);
|
||||
void (*rq_offline)(struct rq *rq);
|
||||
#endif
|
||||
|
||||
void (*set_curr_task) (struct rq *rq);
|
||||
void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
|
||||
void (*task_new) (struct rq *rq, struct task_struct *p);
|
||||
|
||||
void (*switched_from) (struct rq *this_rq, struct task_struct *task,
|
||||
int running);
|
||||
void (*switched_to) (struct rq *this_rq, struct task_struct *task,
|
||||
int running);
|
||||
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
|
||||
int oldprio, int running);
|
||||
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
void (*moved_group) (struct task_struct *p);
|
||||
#endif
|
||||
};
|
||||
|
||||
struct load_weight {
|
||||
unsigned long weight, inv_weight;
|
||||
};
|
||||
|
||||
/*
|
||||
* CFS stats for a schedulable entity (task, task-group etc)
|
||||
*
|
||||
* Current field usage histogram:
|
||||
*
|
||||
* 4 se->block_start
|
||||
* 4 se->run_node
|
||||
* 4 se->sleep_start
|
||||
* 6 se->load.weight
|
||||
*/
|
||||
struct sched_entity {
|
||||
struct load_weight load; /* for load-balancing */
|
||||
struct rb_node run_node;
|
||||
struct list_head group_node;
|
||||
unsigned int on_rq;
|
||||
|
||||
u64 exec_start;
|
||||
u64 sum_exec_runtime;
|
||||
u64 vruntime;
|
||||
u64 prev_sum_exec_runtime;
|
||||
|
||||
u64 last_wakeup;
|
||||
u64 avg_overlap;
|
||||
|
||||
u64 nr_migrations;
|
||||
|
||||
u64 start_runtime;
|
||||
u64 avg_wakeup;
|
||||
|
||||
#ifdef CONFIG_SCHEDSTATS
|
||||
u64 wait_start;
|
||||
u64 wait_max;
|
||||
u64 wait_count;
|
||||
u64 wait_sum;
|
||||
|
||||
u64 sleep_start;
|
||||
u64 sleep_max;
|
||||
s64 sum_sleep_runtime;
|
||||
|
||||
u64 block_start;
|
||||
u64 block_max;
|
||||
u64 exec_max;
|
||||
u64 slice_max;
|
||||
|
||||
u64 nr_migrations_cold;
|
||||
u64 nr_failed_migrations_affine;
|
||||
u64 nr_failed_migrations_running;
|
||||
u64 nr_failed_migrations_hot;
|
||||
u64 nr_forced_migrations;
|
||||
u64 nr_forced2_migrations;
|
||||
|
||||
u64 nr_wakeups;
|
||||
u64 nr_wakeups_sync;
|
||||
u64 nr_wakeups_migrate;
|
||||
u64 nr_wakeups_local;
|
||||
u64 nr_wakeups_remote;
|
||||
u64 nr_wakeups_affine;
|
||||
u64 nr_wakeups_affine_attempts;
|
||||
u64 nr_wakeups_passive;
|
||||
u64 nr_wakeups_idle;
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
struct sched_entity *parent;
|
||||
/* rq on which this entity is (to be) queued: */
|
||||
struct cfs_rq *cfs_rq;
|
||||
/* rq "owned" by this entity/group: */
|
||||
struct cfs_rq *my_q;
|
||||
#endif
|
||||
};
|
||||
|
||||
struct sched_rt_entity {
|
||||
struct list_head run_list;
|
||||
unsigned long timeout;
|
||||
unsigned int time_slice;
|
||||
int nr_cpus_allowed;
|
||||
|
||||
struct sched_rt_entity *back;
|
||||
#ifdef CONFIG_RT_GROUP_SCHED
|
||||
struct sched_rt_entity *parent;
|
||||
/* rq on which this entity is (to be) queued: */
|
||||
struct rt_rq *rt_rq;
|
||||
/* rq "owned" by this entity/group: */
|
||||
struct rt_rq *my_q;
|
||||
#endif
|
||||
};
|
||||
|
||||
struct task_struct {
|
||||
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
|
||||
void *stack;
|
||||
@@ -1037,33 +1174,17 @@ struct task_struct {
|
||||
|
||||
int lock_depth; /* BKL lock depth */
|
||||
|
||||
#ifndef CONFIG_SCHED_BFS
|
||||
#ifdef CONFIG_SMP
|
||||
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
|
||||
int oncpu;
|
||||
#endif
|
||||
#endif
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
int oncpu;
|
||||
#endif
|
||||
|
||||
int prio, static_prio, normal_prio;
|
||||
unsigned int rt_priority;
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
int time_slice;
|
||||
u64 deadline;
|
||||
struct list_head run_list;
|
||||
u64 last_ran;
|
||||
u64 sched_time; /* sched_clock time spent running */
|
||||
#ifdef CONFIG_SMP
|
||||
int sticky; /* Soft affined flag */
|
||||
#endif
|
||||
unsigned long rt_timeout;
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
const struct sched_class *sched_class;
|
||||
struct sched_entity se;
|
||||
struct sched_rt_entity rt;
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_PREEMPT_NOTIFIERS
|
||||
/* list of struct preempt_notifier: */
|
||||
@@ -1158,9 +1279,6 @@ struct task_struct {
|
||||
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
|
||||
|
||||
cputime_t utime, stime, utimescaled, stimescaled;
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
unsigned long utime_pc, stime_pc;
|
||||
#endif
|
||||
cputime_t gtime;
|
||||
cputime_t prev_utime, prev_stime;
|
||||
unsigned long nvcsw, nivcsw; /* context switch counts */
|
||||
@@ -1370,66 +1488,6 @@ struct task_struct {
|
||||
#endif /* CONFIG_TRACING */
|
||||
};
|
||||
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
extern int grunqueue_is_locked(void);
|
||||
extern void grq_unlock_wait(void);
|
||||
extern void cpu_scaling(int cpu);
|
||||
extern void cpu_nonscaling(int cpu);
|
||||
#define tsk_seruntime(t) ((t)->sched_time)
|
||||
#define tsk_rttimeout(t) ((t)->rt_timeout)
|
||||
#define task_rq_unlock_wait(tsk) grq_unlock_wait()
|
||||
|
||||
static inline void set_oom_timeslice(struct task_struct *p)
|
||||
{
|
||||
p->time_slice = HZ;
|
||||
}
|
||||
|
||||
static inline void tsk_cpus_current(struct task_struct *p)
|
||||
{
|
||||
}
|
||||
|
||||
#define runqueue_is_locked(cpu) grunqueue_is_locked()
|
||||
|
||||
static inline void print_scheduler_version(void)
|
||||
{
|
||||
printk(KERN_INFO"BFS CPU scheduler v0.376 by Con Kolivas.\n");
|
||||
}
|
||||
|
||||
static inline int iso_task(struct task_struct *p)
|
||||
{
|
||||
return (p->policy == SCHED_ISO);
|
||||
}
|
||||
#else
|
||||
extern int runqueue_is_locked(int cpu);
|
||||
extern void task_rq_unlock_wait(struct task_struct *p);
|
||||
#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
|
||||
#define tsk_rttimeout(t) ((t)->rt.timeout)
|
||||
|
||||
static inline void sched_exit(struct task_struct *p)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void set_oom_timeslice(struct task_struct *p)
|
||||
{
|
||||
p->rt.time_slice = HZ;
|
||||
}
|
||||
|
||||
static inline void tsk_cpus_current(struct task_struct *p)
|
||||
{
|
||||
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
|
||||
}
|
||||
|
||||
static inline void print_scheduler_version(void)
|
||||
{
|
||||
printk(KERN_INFO"CFS CPU scheduler.\n");
|
||||
}
|
||||
|
||||
static inline int iso_task(struct task_struct *p)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
/* Future-safe accessor for struct task_struct's cpus_allowed. */
|
||||
#define tsk_cpumask(tsk) (&(tsk)->cpus_allowed)
|
||||
|
||||
@@ -1448,19 +1506,9 @@ static inline int iso_task(struct task_struct *p)
|
||||
|
||||
#define MAX_USER_RT_PRIO 100
|
||||
#define MAX_RT_PRIO MAX_USER_RT_PRIO
|
||||
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
|
||||
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#define PRIO_RANGE (40)
|
||||
#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE)
|
||||
#define ISO_PRIO (MAX_RT_PRIO)
|
||||
#define NORMAL_PRIO (MAX_RT_PRIO + 1)
|
||||
#define IDLE_PRIO (MAX_RT_PRIO + 2)
|
||||
#define PRIO_LIMIT ((IDLE_PRIO) + 1)
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
#define MAX_PRIO (MAX_RT_PRIO + 40)
|
||||
#define NORMAL_PRIO DEFAULT_PRIO
|
||||
#endif /* CONFIG_SCHED_BFS */
|
||||
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
|
||||
|
||||
static inline int rt_prio(int prio)
|
||||
{
|
||||
@@ -1743,7 +1791,7 @@ task_sched_runtime(struct task_struct *task);
|
||||
extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
|
||||
|
||||
/* sched_exec is called by processes performing an exec */
|
||||
#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_BFS)
|
||||
#ifdef CONFIG_SMP
|
||||
extern void sched_exec(void);
|
||||
#else
|
||||
#define sched_exec() {}
|
||||
@@ -1897,9 +1945,6 @@ extern void wake_up_new_task(struct task_struct *tsk,
|
||||
static inline void kick_process(struct task_struct *tsk) { }
|
||||
#endif
|
||||
extern void sched_fork(struct task_struct *p, int clone_flags);
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
extern void sched_exit(struct task_struct *p);
|
||||
#endif
|
||||
extern void sched_dead(struct task_struct *p);
|
||||
|
||||
extern void proc_caches_init(void);
|
||||
|
||||
init/Kconfig (17 lines changed)
@@ -23,19 +23,6 @@ config CONSTRUCTORS
|
||||
|
||||
menu "General setup"
|
||||
|
||||
config SCHED_BFS
|
||||
bool "BFS cpu scheduler"
|
||||
---help---
|
||||
The Brain Fuck CPU Scheduler for excellent interactivity and
|
||||
responsiveness on the desktop and solid scalability on normal
|
||||
hardware. Not recommended for 4096 CPUs.
|
||||
|
||||
Currently incompatible with the Group CPU scheduler, and RCU TORTURE
|
||||
TEST so these options are disabled.
|
||||
|
||||
Say Y here.
|
||||
default y
|
||||
|
||||
config EXPERIMENTAL
|
||||
bool "Prompt for development and/or incomplete code/drivers"
|
||||
---help---
|
||||
@@ -456,7 +443,7 @@ config HAVE_UNSTABLE_SCHED_CLOCK
|
||||
|
||||
config GROUP_SCHED
|
||||
bool "Group CPU scheduler"
|
||||
depends on EXPERIMENTAL && !SCHED_BFS
|
||||
depends on EXPERIMENTAL
|
||||
default n
|
||||
help
|
||||
This feature lets CPU scheduler recognize task groups and control CPU
|
||||
@@ -572,7 +559,7 @@ config PROC_PID_CPUSET
|
||||
|
||||
config CGROUP_CPUACCT
|
||||
bool "Simple CPU accounting cgroup subsystem"
|
||||
depends on CGROUPS && !SCHED_BFS
|
||||
depends on CGROUPS
|
||||
help
|
||||
Provides a simple Resource Controller for monitoring the
|
||||
total CPU consumed by the tasks in a cgroup.
|
||||
|
||||
@@ -840,8 +840,6 @@ static noinline int init_post(void)
|
||||
system_state = SYSTEM_RUNNING;
|
||||
numa_default_policy();
|
||||
|
||||
print_scheduler_version();
|
||||
|
||||
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
|
||||
printk(KERN_WARNING "Warning: unable to open an initial console.\n");
|
||||
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
# Makefile for the linux kernel.
|
||||
#
|
||||
|
||||
obj-y = sched_bfs.o fork.o exec_domain.o panic.o printk.o \
|
||||
obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
|
||||
cpu.o exit.o itimer.o time.o softirq.o resource.o \
|
||||
sysctl.o capability.o ptrace.o timer.o user.o \
|
||||
signal.o sys.o kmod.o workqueue.o pid.o \
|
||||
@@ -107,7 +107,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
|
||||
# me. I suspect most platforms don't need this, but until we know that for sure
|
||||
# I turn this off for IA-64 only. Andreas Schwab says it's also needed on m68k
|
||||
# to get a correct value for the wait-channel (WCHAN in ps). --davidm
|
||||
CFLAGS_sched_bfs.o := $(PROFILING) -fno-omit-frame-pointer
|
||||
CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
|
||||
endif
|
||||
|
||||
$(obj)/configs.o: $(obj)/config_data.h
|
||||
|
||||
@@ -127,7 +127,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
|
||||
*/
|
||||
t1 = tsk->sched_info.pcount;
|
||||
t2 = tsk->sched_info.run_delay;
|
||||
t3 = tsk_seruntime(tsk);
|
||||
t3 = tsk->se.sum_exec_runtime;
|
||||
|
||||
d->cpu_count += t1;
|
||||
|
||||
|
||||
@@ -120,7 +120,7 @@ static void __exit_signal(struct task_struct *tsk)
|
||||
sig->inblock += task_io_get_inblock(tsk);
|
||||
sig->oublock += task_io_get_oublock(tsk);
|
||||
task_io_accounting_add(&sig->ioac, &tsk->ioac);
|
||||
sig->sum_sched_runtime += tsk_seruntime(tsk);
|
||||
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
|
||||
sig = NULL; /* Marker for below. */
|
||||
}
|
||||
|
||||
|
||||
@@ -1199,7 +1199,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
|
||||
* parent's CPU). This avoids alot of nasty races.
|
||||
*/
|
||||
p->cpus_allowed = current->cpus_allowed;
|
||||
tsk_cpus_current(p);
|
||||
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
|
||||
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
|
||||
!cpu_online(task_cpu(p))))
|
||||
set_task_cpu(p, smp_processor_id());
|
||||
|
||||
@@ -16,7 +16,7 @@
|
||||
#include <linux/mutex.h>
|
||||
#include <trace/events/sched.h>
|
||||
|
||||
#define KTHREAD_NICE_LEVEL (0)
|
||||
#define KTHREAD_NICE_LEVEL (-5)
|
||||
|
||||
static DEFINE_SPINLOCK(kthread_create_lock);
|
||||
static LIST_HEAD(kthread_create_list);
|
||||
@@ -170,6 +170,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
|
||||
}
|
||||
set_task_cpu(k, cpu);
|
||||
k->cpus_allowed = cpumask_of_cpu(cpu);
|
||||
k->rt.nr_cpus_allowed = 1;
|
||||
k->flags |= PF_THREAD_BOUND;
|
||||
}
|
||||
EXPORT_SYMBOL(kthread_bind);
|
||||
|
||||
@@ -249,7 +249,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
|
||||
do {
|
||||
times->utime = cputime_add(times->utime, t->utime);
|
||||
times->stime = cputime_add(times->stime, t->stime);
|
||||
times->sum_exec_runtime += tsk_seruntime(t);
|
||||
times->sum_exec_runtime += t->se.sum_exec_runtime;
|
||||
|
||||
t = next_thread(t);
|
||||
} while (t != tsk);
|
||||
@@ -516,7 +516,7 @@ static void cleanup_timers(struct list_head *head,
|
||||
void posix_cpu_timers_exit(struct task_struct *tsk)
|
||||
{
|
||||
cleanup_timers(tsk->cpu_timers,
|
||||
tsk->utime, tsk->stime, tsk_seruntime(tsk));
|
||||
tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
|
||||
|
||||
}
|
||||
void posix_cpu_timers_exit_group(struct task_struct *tsk)
|
||||
@@ -526,7 +526,7 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
|
||||
cleanup_timers(tsk->signal->cpu_timers,
|
||||
cputime_add(tsk->utime, sig->utime),
|
||||
cputime_add(tsk->stime, sig->stime),
|
||||
tsk_seruntime(tsk) + sig->sum_sched_runtime);
|
||||
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
|
||||
}
|
||||
|
||||
static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
|
||||
@@ -1017,7 +1017,7 @@ static void check_thread_timers(struct task_struct *tsk,
|
||||
struct cpu_timer_list *t = list_first_entry(timers,
|
||||
struct cpu_timer_list,
|
||||
entry);
|
||||
if (!--maxfire || tsk_seruntime(tsk) < t->expires.sched) {
|
||||
if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
|
||||
tsk->cputime_expires.sched_exp = t->expires.sched;
|
||||
break;
|
||||
}
|
||||
@@ -1033,7 +1033,7 @@ static void check_thread_timers(struct task_struct *tsk,
|
||||
unsigned long *soft = &sig->rlim[RLIMIT_RTTIME].rlim_cur;
|
||||
|
||||
if (hard != RLIM_INFINITY &&
|
||||
tsk_rttimeout(tsk) > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
|
||||
tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
|
||||
/*
|
||||
* At the hard limit, we just die.
|
||||
* No need to calculate anything else now.
|
||||
@@ -1041,7 +1041,7 @@ static void check_thread_timers(struct task_struct *tsk,
|
||||
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
|
||||
return;
|
||||
}
|
||||
if (tsk_rttimeout(tsk) > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
|
||||
if (tsk->rt.timeout > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
|
||||
/*
|
||||
* At the soft limit, send a SIGXCPU every second.
|
||||
*/
|
||||
@@ -1357,7 +1357,7 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
|
||||
struct task_cputime task_sample = {
|
||||
.utime = tsk->utime,
|
||||
.stime = tsk->stime,
|
||||
.sum_exec_runtime = tsk_seruntime(tsk)
|
||||
.sum_exec_runtime = tsk->se.sum_exec_runtime
|
||||
};
|
||||
|
||||
if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
|
||||
|
||||
@@ -1,6 +1,3 @@
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#include "sched_bfs.c"
|
||||
#else
|
||||
/*
|
||||
* kernel/sched.c
|
||||
*
|
||||
@@ -10801,4 +10798,3 @@ struct cgroup_subsys cpuacct_subsys = {
|
||||
.subsys_id = cpuacct_subsys_id,
|
||||
};
|
||||
#endif /* CONFIG_CGROUP_CPUACCT */
|
||||
#endif /* CONFIG_SCHED_BFS */
|
||||
|
||||
kernel/sched_bfs.c (6737 lines changed; file diff suppressed because it is too large)
@@ -100,15 +100,10 @@ static int neg_one = -1;
|
||||
#endif
|
||||
|
||||
static int zero;
|
||||
static int __maybe_unused one = 1;
|
||||
static int __maybe_unused two = 2;
|
||||
static unsigned long one_ul = 1;
|
||||
static int __read_mostly one = 1;
|
||||
static int __read_mostly one_hundred = 100;
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
extern int rr_interval;
|
||||
extern int sched_iso_cpu;
|
||||
static int __read_mostly one_thousand = 1000;
|
||||
#endif
|
||||
static int one_hundred = 100;
|
||||
|
||||
/* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
|
||||
static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
|
||||
@@ -243,7 +238,7 @@ static struct ctl_table root_table[] = {
|
||||
{ .ctl_name = 0 }
|
||||
};
|
||||
|
||||
#if defined(CONFIG_SCHED_DEBUG) && !defined(CONFIG_SCHED_BFS)
|
||||
#ifdef CONFIG_SCHED_DEBUG
|
||||
static int min_sched_granularity_ns = 100000; /* 100 usecs */
|
||||
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
||||
static int min_wakeup_granularity_ns; /* 0 usecs */
|
||||
@@ -251,15 +246,6 @@ static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
||||
#endif
|
||||
|
||||
static struct ctl_table kern_table[] = {
|
||||
#ifndef CONFIG_SCHED_BFS
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_child_runs_first",
|
||||
.data = &sysctl_sched_child_runs_first,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
#ifdef CONFIG_SCHED_DEBUG
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
@@ -312,6 +298,14 @@ static struct ctl_table kern_table[] = {
|
||||
.strategy = &sysctl_intvec,
|
||||
.extra1 = &zero,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_child_runs_first",
|
||||
.data = &sysctl_sched_child_runs_first,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_features",
|
||||
@@ -336,14 +330,6 @@ static struct ctl_table kern_table[] = {
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_time_avg",
|
||||
.data = &sysctl_sched_time_avg,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "timer_migration",
|
||||
@@ -380,7 +366,6 @@ static struct ctl_table kern_table[] = {
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
#endif /* !CONFIG_SCHED_BFS */
|
||||
#ifdef CONFIG_PROVE_LOCKING
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
@@ -813,30 +798,6 @@ static struct ctl_table kern_table[] = {
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
#endif
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "rr_interval",
|
||||
.data = &rr_interval,
|
||||
.maxlen = sizeof (int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec_minmax,
|
||||
.strategy = &sysctl_intvec,
|
||||
.extra1 = &one,
|
||||
.extra2 = &one_thousand,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "iso_cpu",
|
||||
.data = &sched_iso_cpu,
|
||||
.maxlen = sizeof (int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec_minmax,
|
||||
.strategy = &sysctl_intvec,
|
||||
.extra1 = &zero,
|
||||
.extra2 = &one_hundred,
|
||||
},
|
||||
#endif
|
||||
#if defined(CONFIG_S390) && defined(CONFIG_SMP)
|
||||
{
|
||||
.ctl_name = KERN_SPIN_RETRY,
|
||||
|
||||
@@ -1153,7 +1153,8 @@ void update_process_times(int user_tick)
|
||||
struct task_struct *p = current;
|
||||
int cpu = smp_processor_id();
|
||||
|
||||
/* Accounting is done within sched_bfs.c */
|
||||
/* Note: this timer irq context must be accounted for as well. */
|
||||
account_process_tick(p, user_tick);
|
||||
run_local_timers();
|
||||
if (rcu_pending(cpu))
|
||||
rcu_check_callbacks(cpu, user_tick);
|
||||
@@ -1197,7 +1198,7 @@ void do_timer(unsigned long ticks)
|
||||
{
|
||||
jiffies_64 += ticks;
|
||||
update_wall_time();
|
||||
calc_global_load();
|
||||
calc_global_load(ticks);
|
||||
}
|
||||
|
||||
#ifdef __ARCH_WANT_SYS_ALARM
|
||||
|
||||
@@ -275,10 +275,10 @@ unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
|
||||
void trace_wake_up(void)
|
||||
{
|
||||
/*
|
||||
* The grunqueue_is_locked() can fail, but this is the best we
|
||||
* The runqueue_is_locked() can fail, but this is the best we
|
||||
* have for now:
|
||||
*/
|
||||
if (!(trace_flags & TRACE_ITER_BLOCK) && !grunqueue_is_locked())
|
||||
if (!(trace_flags & TRACE_ITER_BLOCK) && !runqueue_is_locked())
|
||||
wake_up(&trace_wait);
|
||||
}
|
||||
|
||||
|
||||
@@ -317,6 +317,8 @@ static int worker_thread(void *__cwq)
|
||||
if (cwq->wq->freezeable)
|
||||
set_freezable();
|
||||
|
||||
set_user_nice(current, -5);
|
||||
|
||||
for (;;) {
|
||||
prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
|
||||
if (!freezing(current) &&
|
||||
|
||||
@@ -723,37 +723,6 @@ config RCU_TORTURE_TEST_RUNNABLE
|
||||
Say N here if you want the RCU torture tests to start only
|
||||
after being manually enabled via /proc.
|
||||
|
||||
config RCU_TORTURE_TEST
|
||||
tristate "torture tests for RCU"
|
||||
depends on DEBUG_KERNEL && !SCHED_BFS
|
||||
default n
|
||||
help
|
||||
This option provides a kernel module that runs torture tests
|
||||
on the RCU infrastructure. The kernel module may be built
|
||||
after the fact on the running kernel to be tested, if desired.
|
||||
|
||||
Say Y here if you want RCU torture tests to be built into
|
||||
the kernel.
|
||||
Say M if you want the RCU torture tests to build as a module.
|
||||
Say N if you are unsure.
|
||||
|
||||
config RCU_TORTURE_TEST_RUNNABLE
|
||||
bool "torture tests for RCU runnable by default"
|
||||
depends on RCU_TORTURE_TEST = y
|
||||
default n
|
||||
help
|
||||
This option provides a way to build the RCU torture tests
|
||||
directly into the kernel without them starting up at boot
|
||||
time. You can use /proc/sys/kernel/rcutorture_runnable
|
||||
to manually override this setting. This /proc file is
|
||||
available only when the RCU torture tests have been built
|
||||
into the kernel.
|
||||
|
||||
Say Y here if you want the RCU torture tests to start during
|
||||
boot (you probably don't).
|
||||
Say N here if you want the RCU torture tests to start only
|
||||
after being manually enabled via /proc.
|
||||
|
||||
config RCU_CPU_STALL_DETECTOR
|
||||
bool "Check for stalled CPUs delaying RCU grace periods"
|
||||
depends on CLASSIC_RCU || TREE_RCU
|
||||
|
||||
@@ -338,7 +338,7 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
|
||||
* all the memory it needs. That way it should be able to
|
||||
* exit() and clear out its resources quickly...
|
||||
*/
|
||||
p->time_slice = HZ;
|
||||
p->rt.time_slice = HZ;
|
||||
set_tsk_thread_flag(p, TIF_MEMDIE);
|
||||
|
||||
force_sig(SIGKILL, p);
|
||||
|
||||