BFS: back out BFS scheduler for stability reasons

It works faster sometimes, other times not, and the code is so old now.
Matt Sealey
2012-12-10 20:18:47 -06:00
parent 102933f39b
commit c50431b567
28 changed files with 198 additions and 7401 deletions

View File

@@ -1,326 +0,0 @@
BFS - The Brain Fuck Scheduler by Con Kolivas.
Goals.
The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
completely do away with the complex designs of the past for the cpu process
scheduler and instead implement one that is very simple in basic design.
The main focus of BFS is to achieve excellent desktop interactivity and
responsiveness without heuristics and tuning knobs that are difficult to
understand, impossible to model and predict the effect of, and when tuned to
one workload cause massive detriment to another.
Design summary.
BFS is best described as a single runqueue, O(n) lookup, earliest effective
virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
deadline first) and my previous Staircase Deadline scheduler. Each component
shall be described in order to understand the significance of, and reasoning for
it. The codebase when the first stable version was released was approximately
9000 lines less code than the existing mainline linux kernel scheduler (in
2.6.31). This does not even take into account the removal of documentation and
the cgroups code that is not used.
Design reasoning.
The single runqueue refers to the queued but not running processes for the
entire system, regardless of the number of CPUs. The reason for going back to
a single runqueue design is that once multiple runqueues are introduced,
per-CPU or otherwise, there will be complex interactions: each runqueue is
responsible for the scheduling latency and fairness only of the tasks on its
own runqueue, so to achieve fairness and low latency across multiple CPUs, any
throughput advantage of keeping tasks CPU-local brings other disadvantages.
A very complex balancing system is required to achieve, at best, some
semblance of fairness across CPUs, and it can only maintain relatively low
latency for tasks bound to the same CPUs, not across them. To improve that
fairness and latency across CPUs, the advantage of local runqueue locking,
which makes for better scalability, is lost because multiple locks must be
grabbed.
A significant feature of BFS is that all accounting is done purely based on CPU
used and nowhere is sleep time used in any way to determine entitlement or
interactivity. Interactivity "estimators" that use some kind of sleep/run
algorithm are doomed to fail to detect all interactive tasks, and to falsely tag
tasks that aren't interactive as being so. The reason for this is that it is
close to impossible to determine, when a task is sleeping, whether it is
doing so voluntarily, as in a userspace application waiting for input in the
form of a mouse click or otherwise, or involuntarily, because it is waiting for
another thread, process, I/O, kernel activity or whatever. Thus, such an
estimator will introduce corner cases, and more heuristics will be required to
cope with those corner cases, introducing more corner cases and failed
interactivity detection and so on. Interactivity in BFS is built into the design
by virtue of the fact that tasks that are waking up have not used up their quota
of CPU time, and have earlier effective deadlines, thereby making it very likely
they will preempt any CPU bound task of equivalent nice level. See below for
more information on the virtual deadline mechanism. Even if they do not preempt
a running task, because the rr interval guarantees a bounded upper limit on
how long a task will wait, it will be scheduled within a timeframe
that will not cause visible interface jitter.
Design details.
Task insertion.
BFS inserts tasks into each relevant queue as an O(1) insertion into a double
linked list. On insertion, *every* running queue is checked to see if the newly
queued task can run on any idle queue, or preempt the lowest running task on the
system. This is how the cross-CPU scheduling of BFS achieves significantly lower
latency per extra CPU the system has. In this case the lookup is, in the worst
case scenario, O(n) where n is the number of CPUs on the system.
Data protection.
BFS has one single lock protecting the process local data of every task in the
global queue. Thus every insertion, removal and modification of task data in the
global runqueue needs to grab the global lock. However, once a task is taken by
a CPU, the CPU has its own local data copy of the running process' accounting
information which only that CPU accesses and modifies (such as during a
timer tick) thus allowing the accounting data to be updated lockless. Once a
CPU has taken a task to run, it removes it from the global queue. Thus the
global queue only ever has, at most,
(number of tasks requesting cpu time) - (number of logical CPUs) + 1
tasks in it. This value is relevant for the time taken to look up
tasks during scheduling. It can increase if many tasks have a CPU affinity set
in their policy that limits which CPUs they're allowed to run on, and those
tasks outnumber the CPUs they can use. The +1 is because, when rescheduling a
task, the CPU's
currently running task is put back on the queue. Lookup will be described after
the virtual deadline mechanism is explained.
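As a worked example (the numbers are purely illustrative): on a machine with 4
logical CPUs and 10 tasks requesting CPU time, 4 tasks are running on CPUs and
are therefore off the global queue, so the queue holds at most
10 - 4 + 1 = 7 tasks, the +1 appearing only transiently while a rescheduling
CPU has put its current task back but not yet picked the next one.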
Virtual deadline.
The key to achieving low latency, scheduling fairness, and "nice level"
distribution in BFS is entirely in the virtual deadline mechanism. The one
tunable in BFS is the rr_interval, or "round robin interval". This is the
maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
tasks of the same nice level will be running for, or looking at it the other
way around, the longest duration two tasks of the same nice level will be
delayed for. When a task requests cpu time, it is given a quota (time_slice)
equal to the rr_interval and a virtual deadline. The virtual deadline is
offset from the current time in jiffies by this equation:
jiffies + (prio_ratio * rr_interval)
The prio_ratio is determined as a ratio compared to the baseline of nice -20
and increases by 10% per nice level. The deadline is a virtual one only in that
no guarantee is placed that a task will actually be scheduled by this time, but
it is used to compare which task should go next. There are three components to
how a task is next chosen. First is time_slice expiration. If a task runs out
of its time_slice, it is descheduled, the time_slice is refilled, and the
deadline reset to that formula above. Second is sleep, where a task no longer
is requesting CPU for whatever reason. The time_slice and deadline are _not_
adjusted in this case and are just carried over for when the task is next
scheduled. Third is preemption, and that is when a newly waking task is deemed
higher priority than a currently running task on any cpu by virtue of the fact
that it has an earlier virtual deadline than the currently running task. The
earlier deadline is the key to which task is next chosen for the first and
second cases. Once a task is descheduled, it is put back on the queue, and an
O(n) lookup of all queued-but-not-running tasks is done to determine which has
the earliest deadline and that task is chosen to receive CPU next.
The CPU proportion of different nice tasks works out to be approximately the
(prio_ratio difference)^2
The reason it is squared is that a task's deadline does not change while it is
running unless it runs out of time_slice. Thus, even if the time actually
passes the deadline of another task that is queued, it will not get CPU time
unless the currently running task deschedules, and the time "base" (jiffies) is
constantly moving.
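
To make this concrete, below is a minimal illustrative sketch in plain C of how
such a deadline could be computed and compared. It is not the BFS kernel code;
the 10%-per-nice-level ratio table and the fixed-point base of 100 are
assumptions made only for this example.

/* Illustrative sketch only -- not the BFS implementation. */
#include <stdio.h>

#define NICE_LEVELS 40                  /* nice -20 .. +19 */

static unsigned int prio_ratios[NICE_LEVELS];

static void init_prio_ratios(void)
{
        int i;

        prio_ratios[0] = 100;           /* nice -20 baseline, fixed point /100 */
        for (i = 1; i < NICE_LEVELS; i++)
                prio_ratios[i] = prio_ratios[i - 1] * 11 / 10;  /* +10% per level */
}

/* deadline = jiffies + (prio_ratio * rr_interval), as described above */
static unsigned long task_deadline(unsigned long jiffies_now,
                                   unsigned long rr_interval_jiffies, int nice)
{
        return jiffies_now + prio_ratios[nice + 20] * rr_interval_jiffies / 100;
}

/* Wrap-safe "earlier deadline" comparison, in the style of time_before(). */
static int deadline_before(unsigned long a, unsigned long b)
{
        return (long)(a - b) < 0;
}

int main(void)
{
        unsigned long now = 1000, rr = 6;       /* e.g. HZ=1000, rr_interval of 6ms */

        init_prio_ratios();
        printf("nice -20 deadline: %lu\n", task_deadline(now, rr, -20));
        printf("nice   0 deadline: %lu\n", task_deadline(now, rr, 0));
        printf("nice +19 deadline: %lu\n", task_deadline(now, rr, 19));
        printf("nice 0 earlier than nice 19? %d\n",
               deadline_before(task_deadline(now, rr, 0),
                               task_deadline(now, rr, 19)));
        return 0;
}

A waking task that still has time_slice left keeps its old, earlier deadline,
which is what makes it likely to win this comparison against a CPU bound task.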
Task lookup.
BFS has 103 priority queues. 100 of these are dedicated to the static priority
of realtime tasks, and the remaining 3 are, in order of best to worst priority,
SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
scheduling). When a task of these priorities is queued, a bitmap of running
priorities is set showing which of these priorities has tasks waiting for CPU
time. When a CPU is made to reschedule, the lookup for the next task to get
CPU time is performed in the following way:
First the bitmap is checked to see what static priority tasks are queued. If
any realtime priorities are found, the corresponding queue is checked and the
first task listed there is taken (provided CPU affinity is suitable) and lookup
is complete. If the priority corresponds to SCHED_ISO tasks, they are also
taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds
to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
stage, every task in the runlist that corresponds to that priority is checked
to see which has the earliest set deadline, and (provided it has suitable CPU
affinity) it is taken off the runqueue and given the CPU. If a task has an
expired deadline, it is taken and the rest of the lookup aborted (as they are
chosen in FIFO order).
Thus, the lookup is O(n) in the worst case only, where n is as described
earlier, as tasks may be chosen before the whole task list is looked over.
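
The lookup order just described can be sketched roughly as follows. This is
purely illustrative C, not the kernel's data structures: the "bitmap" is
simplified to an array of flags, CPU affinity to a single boolean, and the
per-priority queues to singly linked lists.

/* Illustrative sketch only -- not the BFS data structures. */
#include <stddef.h>

#define NR_PRIO_QUEUES  103     /* 100 realtime + ISO + NORMAL + IDLEPRIO */
#define PRIO_ISO        100
#define PRIO_NORMAL     101
#define PRIO_IDLEPRIO   102

struct fake_task {
        struct fake_task *next;
        unsigned long deadline;         /* virtual deadline, in jiffies */
        int can_run_on_this_cpu;        /* stand-in for the affinity check */
};

struct fake_grq {
        unsigned char nonempty[NR_PRIO_QUEUES]; /* the "bitmap" of queued prios */
        struct fake_task *queue[NR_PRIO_QUEUES];
};

struct fake_task *pick_next(struct fake_grq *grq, unsigned long now)
{
        int prio;

        for (prio = 0; prio < NR_PRIO_QUEUES; prio++) {
                struct fake_task *t, *best = NULL;

                if (!grq->nonempty[prio])
                        continue;

                for (t = grq->queue[prio]; t; t = t->next) {
                        if (!t->can_run_on_this_cpu)
                                continue;
                        /* Realtime and ISO queues are taken FIFO. */
                        if (prio <= PRIO_ISO)
                                return t;
                        /* An already expired deadline ends the scan early. */
                        if ((long)(t->deadline - now) <= 0)
                                return t;
                        /* Otherwise remember the earliest deadline so far. */
                        if (!best || (long)(t->deadline - best->deadline) < 0)
                                best = t;
                }
                if (best)
                        return best;
        }
        return NULL;    /* nothing runnable: the CPU goes idle */
}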
Scalability.
The major limitation of BFS will be that of scalability, as separate-runqueue
designs will have less lock contention as the number of CPUs rises.
However they do not scale linearly even with separate runqueues as multiple
runqueues will need to be locked concurrently on such designs to be able to
achieve fair CPU balancing, to try and achieve some sort of nice-level fairness
across CPUs, and to achieve low enough latency for tasks on a busy CPU when
other CPUs would be more suited. BFS has the advantage that it requires no
balancing algorithm whatsoever, as balancing occurs by proxy simply because
all CPUs draw off the global runqueue, in priority and deadline order. Despite
the fact that scalability is _not_ the prime concern of BFS, it both shows very
good scalability to smaller numbers of CPUs and is likely a more scalable design
at these numbers of CPUs.
It also has some very low overhead scalability features built into the design,
added only where their overhead was deemed marginal enough to be worthwhile.
The first is the local copy of the running process' data to the CPU it's running
on to allow that data to be updated lockless where possible. Then there is
deference paid to the last CPU a task was running on, by trying that CPU first
when looking for an idle CPU to use the next time it's scheduled. Finally there
is the notion of "sticky" tasks that are flagged when they are involuntarily
descheduled, meaning they still want further CPU time. This sticky flag is
used to bias heavily against those tasks being scheduled on a different CPU
unless that CPU would be otherwise idle. When a cpu frequency governor is used
that scales with CPU load, such as ondemand, sticky tasks are not scheduled
on a different CPU at all, preferring instead to go idle. This means the CPU
they were bound to is more likely to increase its speed while the other CPU
will go idle, thus speeding up total task execution time and likely decreasing
power usage. This is the only scenario where BFS will allow a CPU to go idle
in preference to scheduling a task on the earliest available spare CPU.
The real cost of migrating a task from one CPU to another is entirely dependent
on the cache footprint of the task, how cache intensive the task is, how long
it's been running on that CPU to take up the bulk of its cache, how big the CPU
cache is, how fast and how layered the CPU cache is, how fast a context switch
is... and so on. In other words, it's close to random in the real world where we
do more than just one sole workload. The only thing we can be sure of is that
it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
utilising idle CPUs is more important than cache locality, and cache locality
only plays a part after that.
Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
However this benchmarking was performed on an earlier design that was far less
scalable than the current one so it's hard to know how scalable it is in terms
of both CPUs (due to the global runqueue) and heavily loaded machines (due to
O(n) lookup) at this stage. Note that in terms of scalability, the number of
_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
quad core (4x) hyperthreaded (2x) machine is effectively 16x. Newer benchmark
results are very promising indeed, without needing to tweak any knobs, features
or options. Benchmark contributions are most welcome.
Features
As the initial prime target audience for BFS was the average desktop user, it
was designed not to need tweaking, tuning or feature-setting to obtain benefit
from it. Thus the number of knobs and features has been kept to an absolute
minimum and should not require extra user input for the vast majority of cases.
There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval
and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
support for CGROUPS. The average user should neither need to know what these
are, nor should they need to be using them to have good desktop behaviour.
rr_interval
There is only one "scheduler" tunable, the round robin interval. This can be
accessed in
/proc/sys/kernel/rr_interval
The value is in milliseconds, and the default value is set to 6ms. Valid values
are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
decreasing throughput, while increasing it will improve throughput, but at the
cost of worsening latencies. The accuracy of the rr interval is limited by the
HZ resolution of the kernel configuration. Thus, the worst case latencies are
usually slightly higher than this actual value. BFS uses "dithering" to try and
minimise the effect the HZ limitation has. The default value of 6 is not an
arbitrary one. It is based on the fact that humans can detect jitter at
approximately 7ms, so aiming for much lower latencies is pointless under most
circumstances. It is worth noting this fact when comparing the latency
performance of BFS to other schedulers. Worst case latencies being higher than
7ms are far worse than average latencies not being in the microsecond range.
Experimentation has shown that increasing the rr interval up to 300 can
improve throughput, but beyond that, scheduling noise from elsewhere prevents
further demonstrable gains.
Isochronous scheduling.
Isochronous scheduling is a unique scheduling policy designed to provide
near-real-time performance to unprivileged (ie non-root) users without the
ability to starve the machine indefinitely. Isochronous tasks (which means
"same time") are set using, for example, the schedtool application like so:
schedtool -I -e amarok
This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works
is that it has a priority level between true realtime tasks and SCHED_NORMAL
which allows ISO tasks to preempt all normal tasks, in a SCHED_RR fashion (ie,
if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
rate). However if ISO tasks run for more than a tunable finite amount of time,
they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
time is the percentage of _total CPU_ available across the machine, configurable
as a percentage in the following "resource handling" tunable (as opposed to a
scheduler tunable):
/proc/sys/kernel/iso_cpu
and is set to 70% by default. It is calculated over a rolling 5 second average.
Because it is the total CPU available, it means that on a multi CPU machine, it
is possible to have an ISO task running with realtime scheduling indefinitely on
just one CPU, as the other CPUs will be available. Setting this to 100 is the
equivalent of giving all users SCHED_RR access and setting it to 0 removes the
ability to run any pseudo-realtime tasks.
A feature of BFS is that it detects when an application tries to obtain a
realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
appropriate privileges to use those policies. When it detects this, it will
give the task SCHED_ISO policy instead. Thus it is transparent to the user.
Because some applications constantly set their policy as well as their nice
level, there is potential for them to undo the override the user specified on
the command line to set the policy to SCHED_ISO. To counter this, once
a task has been set to SCHED_ISO policy, it needs superuser privileges to set
it back to SCHED_NORMAL. This will ensure the task remains ISO and all child
processes and threads will also inherit the ISO policy.
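
As a rough sketch of the demotion rule just described (assumed semantics only,
not the actual BFS code), the decision boils down to comparing a rolling
measure of ISO CPU usage against the iso_cpu limit:

/* Illustrative sketch only -- assumed semantics, not the BFS code. */
#define SCHED_NORMAL    0
#define SCHED_ISO       4       /* BFS-only policy number, from sched.h */

static int sched_iso_cpu = 70;  /* /proc/sys/kernel/iso_cpu, default 70 */

/*
 * iso_usage_pct: percentage of *total* CPU, across all logical CPUs,
 * consumed by ISO tasks, averaged over a rolling five second window.
 */
int effective_policy(int policy, unsigned int iso_usage_pct)
{
        if (policy == SCHED_ISO && iso_usage_pct > (unsigned int)sched_iso_cpu)
                return SCHED_NORMAL;    /* demoted until usage drops again */
        return policy;
}

With the default of 70, a single ISO task saturating one CPU of a dual core
machine accounts for only 50% of total CPU and is therefore never demoted,
which is the multi-CPU behaviour noted above.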
Idleprio scheduling.
Idleprio scheduling is a scheduling policy designed to give out CPU to a task
_only_ when the CPU would be otherwise idle. The idea behind this is to allow
ultra low priority tasks to be run in the background that have virtually no
effect on the foreground tasks. This is ideally suited to distributed computing
clients (like setiathome, folding, mprime etc) but can also be used to start
a video encode and so on without any slowdown of other tasks. To prevent this
policy from grabbing shared resources and holding them indefinitely, if it
detects a state where the task is waiting on I/O, the machine is about to
suspend to ram and so on, it will transiently schedule the task as SCHED_NORMAL. As
per the Isochronous task management, once a task has been scheduled as IDLEPRIO,
it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can
be set to start as SCHED_IDLEPRIO with the schedtool command like so:
schedtool -D -e ./mprime
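
What schedtool is doing in both the -I and -D examples above amounts to a
sched_setscheduler() call before exec. The snippet below is a minimal, hedged
equivalent: the SCHED_ISO value (4) is taken from the include/linux/sched.h
hunk removed later in this commit (SCHED_IDLEPRIO, aliased to SCHED_IDLE = 5,
works the same way), and on a kernel without BFS the call will simply fail.

/* Minimal sketch of "schedtool -I -e <program>"; illustrative only. */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SCHED_ISO
#define SCHED_ISO 4     /* BFS-only policy, per the sched.h hunk below */
#endif

int main(int argc, char *argv[])
{
        struct sched_param sp = { .sched_priority = 0 };

        if (argc < 2) {
                fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
                return 1;
        }
        /* Unprivileged callers may set SCHED_ISO on a BFS kernel. */
        if (sched_setscheduler(0, SCHED_ISO, &sp) == -1)
                perror("sched_setscheduler(SCHED_ISO)");
        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}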
Subtick accounting.
It is surprisingly difficult to get accurate CPU accounting, and in many cases,
the accounting is done by simply determining what is happening at the precise
moment a timer tick fires off. This becomes increasingly inaccurate as the
timer tick frequency (HZ) is lowered. It is possible to create an application
which uses almost 100% CPU, yet by being descheduled at the right time, records
zero CPU usage. While the main problem with this is that there are possible
security implications, it is also difficult to determine how much CPU a task
really does use. BFS tries to use the sub-tick accounting from the TSC clock,
where possible, to determine real CPU usage. This is not entirely reliable, but
is far more likely to produce accurate CPU usage data than the existing designs
and will not show tasks as consuming no CPU time when they actually are. Thus,
the amount of CPU reported as being used by BFS will more accurately represent
how much CPU the task itself is using (as is shown for example by the 'time'
application), so the reported values may be quite different to other schedulers.
Values reported as the 'load' are more prone to problems with this design, but
per process values are closer to real usage. When comparing throughput of BFS
to other designs, it is important to compare the actual completed work in terms
of total wall clock time taken and total work done, rather than the reported
"cpu usage".
Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011

View File

@@ -27,7 +27,6 @@ show up in /proc/sys/kernel:
- domainname
- hostname
- hotplug
- iso_cpu
- java-appletviewer [ binfmt_java, obsolete ]
- java-interpreter [ binfmt_java, obsolete ]
- kstack_depth_to_print [ X86 only ]
@@ -50,7 +49,6 @@ show up in /proc/sys/kernel:
- randomize_va_space
- real-root-dev ==> Documentation/initrd.txt
- reboot-cmd [ SPARC only ]
- rr_interval
- rtsig-max
- rtsig-nr
- sem
@@ -173,16 +171,6 @@ Default value is "/sbin/hotplug".
==============================================================
iso_cpu: (BFS CPU scheduler only).
This sets the percentage of cpu that unprivileged SCHED_ISO tasks can
run effectively at realtime priority, averaged over a rolling five
seconds over the -whole- system, meaning all cpus.
Set to 70 (percent) by default.
==============================================================
l2cr: (PPC only)
This flag controls the L2 cache of G3 processor boards. If
@@ -345,20 +333,6 @@ rebooting. ???
==============================================================
rr_interval: (BFS CPU scheduler only)
This is the smallest duration that any cpu process scheduling unit
will run for. Increasing this value can increase throughput of cpu
bound tasks substantially but at the expense of increased latencies
overall. Conversely decreasing it will decrease average and maximum
latencies but at the expense of throughput. This value is in
milliseconds and the default value chosen depends on the number of
cpus available at scheduler initialisation with a minimum of 6.
Valid values are from 1-5000.
==============================================================
rtsig-max & rtsig-nr:
The file rtsig-max can be used to tune the maximum number

View File

@@ -1,7 +1,7 @@
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.31.14.27
# Mon Nov 19 11:59:35 2012
# Mon Dec 10 20:11:59 2012
#
CONFIG_ARM=y
CONFIG_HAVE_PWM=y
@@ -31,7 +31,6 @@ CONFIG_CONSTRUCTORS=y
#
# General setup
#
CONFIG_SCHED_BFS=y
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
@@ -59,6 +58,7 @@ CONFIG_RCU_FANOUT=32
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
# CONFIG_GROUP_SCHED is not set
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
@@ -66,6 +66,7 @@ CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CPUSETS=y
# CONFIG_PROC_PID_CPUSET is not set
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
# CONFIG_CGROUP_MEM_RES_CTLR is not set
# CONFIG_SYSFS_DEPRECATED_V2 is not set
@@ -280,7 +281,7 @@ CONFIG_VMSPLIT_2G=y
# CONFIG_VMSPLIT_1G is not set
CONFIG_PAGE_OFFSET=0x80000000
# CONFIG_PREEMPT is not set
CONFIG_HZ=256
CONFIG_HZ=100
CONFIG_AEABI=y
# CONFIG_OABI_COMPAT is not set
# CONFIG_ARCH_SPARSEMEM_DEFAULT is not set

View File

@@ -61,6 +61,11 @@ static struct task_struct *spusched_task;
static struct timer_list spusched_timer;
static struct timer_list spuloadavg_timer;
/*
* Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
*/
#define NORMAL_PRIO 120
/*
* Frequency of the spu scheduler tick. By default we do one SPU scheduler
* tick for every 10 CPU scheduler ticks.

View File

@@ -444,10 +444,8 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
freq_target = 5;
this_dbs_info->requested_freq += freq_target;
if (this_dbs_info->requested_freq >= policy->max) {
if (this_dbs_info->requested_freq > policy->max)
this_dbs_info->requested_freq = policy->max;
cpu_nonscaling(policy->cpu);
}
__cpufreq_driver_target(policy, this_dbs_info->requested_freq,
CPUFREQ_RELATION_H);
@@ -472,7 +470,6 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
if (policy->cur == policy->min)
return;
cpu_scaling(policy->cpu);
__cpufreq_driver_target(policy, this_dbs_info->requested_freq,
CPUFREQ_RELATION_H);
return;
@@ -588,7 +585,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
dbs_timer_init(this_dbs_info);
cpu_scaling(cpu);
break;
case CPUFREQ_GOV_STOP:
@@ -610,7 +606,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
mutex_unlock(&dbs_mutex);
cpu_nonscaling(cpu);
break;
case CPUFREQ_GOV_LIMITS:

View File

@@ -470,7 +470,6 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
if (freq_next < policy->min)
freq_next = policy->min;
cpu_scaling(policy->cpu);
if (!dbs_tuners_ins.powersave_bias) {
__cpufreq_driver_target(policy, freq_next,
CPUFREQ_RELATION_L);
@@ -594,7 +593,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
mutex_unlock(&dbs_mutex);
dbs_timer_init(this_dbs_info);
cpu_scaling(cpu);
break;
case CPUFREQ_GOV_STOP:
@@ -606,7 +604,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
dbs_enable--;
mutex_unlock(&dbs_mutex);
cpu_nonscaling(cpu);
break;
case CPUFREQ_GOV_LIMITS:

View File

@@ -23,7 +23,6 @@
#include <linux/fs.h>
#include <linux/sysfs.h>
#include <linux/mutex.h>
#include <linux/sched.h>
/**
* A few values needed by the userspace governor
@@ -98,10 +97,6 @@ static int cpufreq_set(struct cpufreq_policy *policy, unsigned int freq)
* cpufreq_governor_userspace (lock userspace_mutex)
*/
ret = __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
if (freq == cpu_max_freq)
cpu_nonscaling(policy->cpu);
else
cpu_scaling(policy->cpu);
err:
mutex_unlock(&userspace_mutex);
@@ -147,7 +142,6 @@ static int cpufreq_governor_userspace(struct cpufreq_policy *policy,
per_cpu(cpu_cur_freq, cpu));
mutex_unlock(&userspace_mutex);
cpu_scaling(cpu);
break;
case CPUFREQ_GOV_STOP:
mutex_lock(&userspace_mutex);
@@ -164,7 +158,6 @@ static int cpufreq_governor_userspace(struct cpufreq_policy *policy,
per_cpu(cpu_set_freq, cpu) = 0;
dprintk("managing cpu %u stopped\n", cpu);
mutex_unlock(&userspace_mutex);
cpu_nonscaling(cpu);
break;
case CPUFREQ_GOV_LIMITS:
mutex_lock(&userspace_mutex);

View File

@@ -366,7 +366,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
static int proc_pid_schedstat(struct task_struct *task, char *buffer)
{
return sprintf(buffer, "%llu %llu %lu\n",
(unsigned long long)tsk_seruntime(task),
(unsigned long long)task->se.sum_exec_runtime,
(unsigned long long)task->sched_info.run_delay,
task->sched_info.pcount);
}

View File

@@ -109,68 +109,6 @@ extern struct cred init_cred;
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
*/
#ifdef CONFIG_SCHED_BFS
#define INIT_TASK(tsk) \
{ \
.state = 0, \
.stack = &init_thread_info, \
.usage = ATOMIC_INIT(2), \
.flags = PF_KTHREAD, \
.lock_depth = -1, \
.prio = NORMAL_PRIO, \
.static_prio = MAX_PRIO-20, \
.normal_prio = NORMAL_PRIO, \
.deadline = 0, \
.policy = SCHED_NORMAL, \
.cpus_allowed = CPU_MASK_ALL, \
.mm = NULL, \
.active_mm = &init_mm, \
.run_list = LIST_HEAD_INIT(tsk.run_list), \
.time_slice = HZ, \
.tasks = LIST_HEAD_INIT(tsk.tasks), \
.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
.real_parent = &tsk, \
.parent = &tsk, \
.children = LIST_HEAD_INIT(tsk.children), \
.sibling = LIST_HEAD_INIT(tsk.sibling), \
.group_leader = &tsk, \
.real_cred = &init_cred, \
.cred = &init_cred, \
.cred_guard_mutex = \
__MUTEX_INITIALIZER(tsk.cred_guard_mutex), \
.comm = "swapper", \
.thread = INIT_THREAD, \
.fs = &init_fs, \
.files = &init_files, \
.signal = &init_signals, \
.sighand = &init_sighand, \
.nsproxy = &init_nsproxy, \
.pending = { \
.list = LIST_HEAD_INIT(tsk.pending.list), \
.signal = {{0}}}, \
.blocked = {{0}}, \
.alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \
.journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
.fs_excl = ATOMIC_INIT(0), \
.pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \
.timer_slack_ns = 50000, /* 50 usec default slack */ \
.pids = { \
[PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \
[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \
[PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \
}, \
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
INIT_IDS \
INIT_PERF_COUNTERS(tsk) \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
INIT_FTRACE_GRAPH \
INIT_TRACE_RECURSION \
}
#else /* CONFIG_SCHED_BFS */
#define INIT_TASK(tsk) \
{ \
.state = 0, \
@@ -230,14 +168,13 @@ extern struct cred init_cred;
}, \
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
INIT_IDS \
INIT_PERF_EVENTS(tsk) \
INIT_PERF_COUNTERS(tsk) \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
INIT_FTRACE_GRAPH \
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
}
#endif /* CONFIG_SCHED_BFS */
#define INIT_CPU_TIMERS(cpu_timers) \
{ \

View File

@@ -64,8 +64,6 @@ static inline int task_ioprio_class(struct io_context *ioc)
static inline int task_nice_ioprio(struct task_struct *task)
{
if (iso_task(task))
return 0;
return (task_nice(task) + 20) / 5;
}

View File

@@ -164,7 +164,7 @@ static inline u64 get_jiffies_64(void)
* Have the 32 bit jiffies value wrap 5 minutes after boot
* so jiffies wrap bugs show up earlier.
*/
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-10*HZ))
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
/*
* Change timeval to jiffies, trying to avoid the

View File

@@ -36,16 +36,8 @@
#define SCHED_FIFO 1
#define SCHED_RR 2
#define SCHED_BATCH 3
/* SCHED_ISO: Implemented on BFS only */
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#ifdef CONFIG_SCHED_BFS
#define SCHED_ISO 4
#define SCHED_IDLEPRIO SCHED_IDLE
#define SCHED_MAX (SCHED_IDLEPRIO)
#define SCHED_RANGE(policy) ((policy) <= SCHED_MAX)
#endif
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK 0x40000000
@@ -148,7 +140,7 @@ extern int nr_processes(void);
extern unsigned long nr_running(void);
extern unsigned long nr_uninterruptible(void);
extern unsigned long nr_iowait(void);
extern void calc_global_load(void);
extern void calc_global_load(unsigned long ticks);
extern u64 cpu_nr_migrations(int cpu);
extern unsigned long get_parent_ip(unsigned long addr);
@@ -264,6 +256,9 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
extern void init_idle(struct task_struct *idle, int cpu);
extern void init_idle_bootup_task(struct task_struct *idle);
extern int runqueue_is_locked(void);
extern void task_rq_unlock_wait(struct task_struct *p);
extern cpumask_var_t nohz_cpu_mask;
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
extern int select_nohz_load_balancer(int cpu);
@@ -1028,6 +1023,148 @@ struct uts_namespace;
struct rq;
struct sched_domain;
struct sched_class {
const struct sched_class *next;
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
void (*yield_task) (struct rq *rq);
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);
struct task_struct * (*pick_next_task) (struct rq *rq);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int sync);
unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
struct rq *busiest, unsigned long max_load_move,
struct sched_domain *sd, enum cpu_idle_type idle,
int *all_pinned, int *this_best_prio);
int (*move_one_task) (struct rq *this_rq, int this_cpu,
struct rq *busiest, struct sched_domain *sd,
enum cpu_idle_type idle);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
int (*needs_post_schedule) (struct rq *this_rq);
void (*post_schedule) (struct rq *this_rq);
void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);
void (*set_cpus_allowed)(struct task_struct *p,
const struct cpumask *newmask);
void (*rq_online)(struct rq *rq);
void (*rq_offline)(struct rq *rq);
#endif
void (*set_curr_task) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
void (*task_new) (struct rq *rq, struct task_struct *p);
void (*switched_from) (struct rq *this_rq, struct task_struct *task,
int running);
void (*switched_to) (struct rq *this_rq, struct task_struct *task,
int running);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
int oldprio, int running);
#ifdef CONFIG_FAIR_GROUP_SCHED
void (*moved_group) (struct task_struct *p);
#endif
};
struct load_weight {
unsigned long weight, inv_weight;
};
/*
* CFS stats for a schedulable entity (task, task-group etc)
*
* Current field usage histogram:
*
* 4 se->block_start
* 4 se->run_node
* 4 se->sleep_start
* 6 se->load.weight
*/
struct sched_entity {
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
struct list_head group_node;
unsigned int on_rq;
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 last_wakeup;
u64 avg_overlap;
u64 nr_migrations;
u64 start_runtime;
u64 avg_wakeup;
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
u64 wait_count;
u64 wait_sum;
u64 sleep_start;
u64 sleep_max;
s64 sum_sleep_runtime;
u64 block_start;
u64 block_max;
u64 exec_max;
u64 slice_max;
u64 nr_migrations_cold;
u64 nr_failed_migrations_affine;
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
u64 nr_forced_migrations;
u64 nr_forced2_migrations;
u64 nr_wakeups;
u64 nr_wakeups_sync;
u64 nr_wakeups_migrate;
u64 nr_wakeups_local;
u64 nr_wakeups_remote;
u64 nr_wakeups_affine;
u64 nr_wakeups_affine_attempts;
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
};
struct sched_rt_entity {
struct list_head run_list;
unsigned long timeout;
unsigned int time_slice;
int nr_cpus_allowed;
struct sched_rt_entity *back;
#ifdef CONFIG_RT_GROUP_SCHED
struct sched_rt_entity *parent;
/* rq on which this entity is (to be) queued: */
struct rt_rq *rt_rq;
/* rq "owned" by this entity/group: */
struct rt_rq *my_q;
#endif
};
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1037,33 +1174,17 @@ struct task_struct {
int lock_depth; /* BKL lock depth */
#ifndef CONFIG_SCHED_BFS
#ifdef CONFIG_SMP
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
int oncpu;
#endif
#endif
#else /* CONFIG_SCHED_BFS */
int oncpu;
#endif
int prio, static_prio, normal_prio;
unsigned int rt_priority;
#ifdef CONFIG_SCHED_BFS
int time_slice;
u64 deadline;
struct list_head run_list;
u64 last_ran;
u64 sched_time; /* sched_clock time spent running */
#ifdef CONFIG_SMP
int sticky; /* Soft affined flag */
#endif
unsigned long rt_timeout;
#else /* CONFIG_SCHED_BFS */
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
#endif
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* list of struct preempt_notifier: */
@@ -1158,9 +1279,6 @@ struct task_struct {
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
cputime_t utime, stime, utimescaled, stimescaled;
#ifdef CONFIG_SCHED_BFS
unsigned long utime_pc, stime_pc;
#endif
cputime_t gtime;
cputime_t prev_utime, prev_stime;
unsigned long nvcsw, nivcsw; /* context switch counts */
@@ -1370,66 +1488,6 @@ struct task_struct {
#endif /* CONFIG_TRACING */
};
#ifdef CONFIG_SCHED_BFS
extern int grunqueue_is_locked(void);
extern void grq_unlock_wait(void);
extern void cpu_scaling(int cpu);
extern void cpu_nonscaling(int cpu);
#define tsk_seruntime(t) ((t)->sched_time)
#define tsk_rttimeout(t) ((t)->rt_timeout)
#define task_rq_unlock_wait(tsk) grq_unlock_wait()
static inline void set_oom_timeslice(struct task_struct *p)
{
p->time_slice = HZ;
}
static inline void tsk_cpus_current(struct task_struct *p)
{
}
#define runqueue_is_locked(cpu) grunqueue_is_locked()
static inline void print_scheduler_version(void)
{
printk(KERN_INFO"BFS CPU scheduler v0.376 by Con Kolivas.\n");
}
static inline int iso_task(struct task_struct *p)
{
return (p->policy == SCHED_ISO);
}
#else
extern int runqueue_is_locked(int cpu);
extern void task_rq_unlock_wait(struct task_struct *p);
#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
#define tsk_rttimeout(t) ((t)->rt.timeout)
static inline void sched_exit(struct task_struct *p)
{
}
static inline void set_oom_timeslice(struct task_struct *p)
{
p->rt.time_slice = HZ;
}
static inline void tsk_cpus_current(struct task_struct *p)
{
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
}
static inline void print_scheduler_version(void)
{
printk(KERN_INFO"CFS CPU scheduler.\n");
}
static inline int iso_task(struct task_struct *p)
{
return 0;
}
#endif
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpumask(tsk) (&(tsk)->cpus_allowed)
@@ -1448,19 +1506,9 @@ static inline int iso_task(struct task_struct *p)
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
#ifdef CONFIG_SCHED_BFS
#define PRIO_RANGE (40)
#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE)
#define ISO_PRIO (MAX_RT_PRIO)
#define NORMAL_PRIO (MAX_RT_PRIO + 1)
#define IDLE_PRIO (MAX_RT_PRIO + 2)
#define PRIO_LIMIT ((IDLE_PRIO) + 1)
#else /* CONFIG_SCHED_BFS */
#define MAX_PRIO (MAX_RT_PRIO + 40)
#define NORMAL_PRIO DEFAULT_PRIO
#endif /* CONFIG_SCHED_BFS */
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
static inline int rt_prio(int prio)
{
@@ -1743,7 +1791,7 @@ task_sched_runtime(struct task_struct *task);
extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
/* sched_exec is called by processes performing an exec */
#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_BFS)
#ifdef CONFIG_SMP
extern void sched_exec(void);
#else
#define sched_exec() {}
@@ -1897,9 +1945,6 @@ extern void wake_up_new_task(struct task_struct *tsk,
static inline void kick_process(struct task_struct *tsk) { }
#endif
extern void sched_fork(struct task_struct *p, int clone_flags);
#ifdef CONFIG_SCHED_BFS
extern void sched_exit(struct task_struct *p);
#endif
extern void sched_dead(struct task_struct *p);
extern void proc_caches_init(void);

View File

@@ -23,19 +23,6 @@ config CONSTRUCTORS
menu "General setup"
config SCHED_BFS
bool "BFS cpu scheduler"
---help---
The Brain Fuck CPU Scheduler for excellent interactivity and
responsiveness on the desktop and solid scalability on normal
hardware. Not recommended for 4096 CPUs.
Currently incompatible with the Group CPU scheduler, and RCU TORTURE
TEST so these options are disabled.
Say Y here.
default y
config EXPERIMENTAL
bool "Prompt for development and/or incomplete code/drivers"
---help---
@@ -456,7 +443,7 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config GROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL && !SCHED_BFS
depends on EXPERIMENTAL
default n
help
This feature lets CPU scheduler recognize task groups and control CPU
@@ -572,7 +559,7 @@ config PROC_PID_CPUSET
config CGROUP_CPUACCT
bool "Simple CPU accounting cgroup subsystem"
depends on CGROUPS && !SCHED_BFS
depends on CGROUPS
help
Provides a simple Resource Controller for monitoring the
total CPU consumed by the tasks in a cgroup.

View File

@@ -840,8 +840,6 @@ static noinline int init_post(void)
system_state = SYSTEM_RUNNING;
numa_default_policy();
print_scheduler_version();
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
printk(KERN_WARNING "Warning: unable to open an initial console.\n");

View File

@@ -2,7 +2,7 @@
# Makefile for the linux kernel.
#
obj-y = sched_bfs.o fork.o exec_domain.o panic.o printk.o \
obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
cpu.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
@@ -107,7 +107,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# me. I suspect most platforms don't need this, but until we know that for sure
# I turn this off for IA-64 only. Andreas Schwab says it's also needed on m68k
# to get a correct value for the wait-channel (WCHAN in ps). --davidm
CFLAGS_sched_bfs.o := $(PROFILING) -fno-omit-frame-pointer
CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
endif
$(obj)/configs.o: $(obj)/config_data.h

View File

@@ -127,7 +127,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
*/
t1 = tsk->sched_info.pcount;
t2 = tsk->sched_info.run_delay;
t3 = tsk_seruntime(tsk);
t3 = tsk->se.sum_exec_runtime;
d->cpu_count += t1;

View File

@@ -120,7 +120,7 @@ static void __exit_signal(struct task_struct *tsk)
sig->inblock += task_io_get_inblock(tsk);
sig->oublock += task_io_get_oublock(tsk);
task_io_accounting_add(&sig->ioac, &tsk->ioac);
sig->sum_sched_runtime += tsk_seruntime(tsk);
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
sig = NULL; /* Marker for below. */
}

View File

@@ -1199,7 +1199,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
* parent's CPU). This avoids alot of nasty races.
*/
p->cpus_allowed = current->cpus_allowed;
tsk_cpus_current(p);
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
!cpu_online(task_cpu(p))))
set_task_cpu(p, smp_processor_id());

View File

@@ -16,7 +16,7 @@
#include <linux/mutex.h>
#include <trace/events/sched.h>
#define KTHREAD_NICE_LEVEL (0)
#define KTHREAD_NICE_LEVEL (-5)
static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
@@ -170,6 +170,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
}
set_task_cpu(k, cpu);
k->cpus_allowed = cpumask_of_cpu(cpu);
k->rt.nr_cpus_allowed = 1;
k->flags |= PF_THREAD_BOUND;
}
EXPORT_SYMBOL(kthread_bind);

View File

@@ -249,7 +249,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
do {
times->utime = cputime_add(times->utime, t->utime);
times->stime = cputime_add(times->stime, t->stime);
times->sum_exec_runtime += tsk_seruntime(t);
times->sum_exec_runtime += t->se.sum_exec_runtime;
t = next_thread(t);
} while (t != tsk);
@@ -516,7 +516,7 @@ static void cleanup_timers(struct list_head *head,
void posix_cpu_timers_exit(struct task_struct *tsk)
{
cleanup_timers(tsk->cpu_timers,
tsk->utime, tsk->stime, tsk_seruntime(tsk));
tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -526,7 +526,7 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
cleanup_timers(tsk->signal->cpu_timers,
cputime_add(tsk->utime, sig->utime),
cputime_add(tsk->stime, sig->stime),
tsk_seruntime(tsk) + sig->sum_sched_runtime);
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
}
static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
@@ -1017,7 +1017,7 @@ static void check_thread_timers(struct task_struct *tsk,
struct cpu_timer_list *t = list_first_entry(timers,
struct cpu_timer_list,
entry);
if (!--maxfire || tsk_seruntime(tsk) < t->expires.sched) {
if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
tsk->cputime_expires.sched_exp = t->expires.sched;
break;
}
@@ -1033,7 +1033,7 @@ static void check_thread_timers(struct task_struct *tsk,
unsigned long *soft = &sig->rlim[RLIMIT_RTTIME].rlim_cur;
if (hard != RLIM_INFINITY &&
tsk_rttimeout(tsk) > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
/*
* At the hard limit, we just die.
* No need to calculate anything else now.
@@ -1041,7 +1041,7 @@ static void check_thread_timers(struct task_struct *tsk,
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
return;
}
if (tsk_rttimeout(tsk) > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
if (tsk->rt.timeout > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
/*
* At the soft limit, send a SIGXCPU every second.
*/
@@ -1357,7 +1357,7 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
struct task_cputime task_sample = {
.utime = tsk->utime,
.stime = tsk->stime,
.sum_exec_runtime = tsk_seruntime(tsk)
.sum_exec_runtime = tsk->se.sum_exec_runtime
};
if (task_cputime_expired(&task_sample, &tsk->cputime_expires))

View File

@@ -1,6 +1,3 @@
#ifdef CONFIG_SCHED_BFS
#include "sched_bfs.c"
#else
/*
* kernel/sched.c
*
@@ -10801,4 +10798,3 @@ struct cgroup_subsys cpuacct_subsys = {
.subsys_id = cpuacct_subsys_id,
};
#endif /* CONFIG_CGROUP_CPUACCT */
#endif /* CONFIG_SCHED_BFS */

File diff suppressed because it is too large

View File

@@ -100,15 +100,10 @@ static int neg_one = -1;
#endif
static int zero;
static int __maybe_unused one = 1;
static int __maybe_unused two = 2;
static unsigned long one_ul = 1;
static int __read_mostly one = 1;
static int __read_mostly one_hundred = 100;
#ifdef CONFIG_SCHED_BFS
extern int rr_interval;
extern int sched_iso_cpu;
static int __read_mostly one_thousand = 1000;
#endif
static int one_hundred = 100;
/* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
@@ -243,7 +238,7 @@ static struct ctl_table root_table[] = {
{ .ctl_name = 0 }
};
#if defined(CONFIG_SCHED_DEBUG) && !defined(CONFIG_SCHED_BFS)
#ifdef CONFIG_SCHED_DEBUG
static int min_sched_granularity_ns = 100000; /* 100 usecs */
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
static int min_wakeup_granularity_ns; /* 0 usecs */
@@ -251,15 +246,6 @@ static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
#endif
static struct ctl_table kern_table[] = {
#ifndef CONFIG_SCHED_BFS
{
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec,
},
#ifdef CONFIG_SCHED_DEBUG
{
.ctl_name = CTL_UNNUMBERED,
@@ -312,6 +298,14 @@ static struct ctl_table kern_table[] = {
.strategy = &sysctl_intvec,
.extra1 = &zero,
},
{
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_child_runs_first",
.data = &sysctl_sched_child_runs_first,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec,
},
{
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_features",
@@ -336,14 +330,6 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
{
.ctl_name = CTL_UNNUMBERED,
.procname = "sched_time_avg",
.data = &sysctl_sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = &proc_dointvec,
},
{
.ctl_name = CTL_UNNUMBERED,
.procname = "timer_migration",
@@ -380,7 +366,6 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
#endif /* !CONFIG_SCHED_BFS */
#ifdef CONFIG_PROVE_LOCKING
{
.ctl_name = CTL_UNNUMBERED,
@@ -813,30 +798,6 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
#ifdef CONFIG_SCHED_BFS
{
.ctl_name = CTL_UNNUMBERED,
.procname = "rr_interval",
.data = &rr_interval,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,
.strategy = &sysctl_intvec,
.extra1 = &one,
.extra2 = &one_thousand,
},
{
.ctl_name = CTL_UNNUMBERED,
.procname = "iso_cpu",
.data = &sched_iso_cpu,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec_minmax,
.strategy = &sysctl_intvec,
.extra1 = &zero,
.extra2 = &one_hundred,
},
#endif
#if defined(CONFIG_S390) && defined(CONFIG_SMP)
{
.ctl_name = KERN_SPIN_RETRY,

View File

@@ -1153,7 +1153,8 @@ void update_process_times(int user_tick)
struct task_struct *p = current;
int cpu = smp_processor_id();
/* Accounting is done within sched_bfs.c */
/* Note: this timer irq context must be accounted for as well. */
account_process_tick(p, user_tick);
run_local_timers();
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_tick);
@@ -1197,7 +1198,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time();
calc_global_load();
calc_global_load(ticks);
}
#ifdef __ARCH_WANT_SYS_ALARM

View File

@@ -275,10 +275,10 @@ unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
void trace_wake_up(void)
{
/*
* The grunqueue_is_locked() can fail, but this is the best we
* The runqueue_is_locked() can fail, but this is the best we
* have for now:
*/
if (!(trace_flags & TRACE_ITER_BLOCK) && !grunqueue_is_locked())
if (!(trace_flags & TRACE_ITER_BLOCK) && !runqueue_is_locked())
wake_up(&trace_wait);
}

View File

@@ -317,6 +317,8 @@ static int worker_thread(void *__cwq)
if (cwq->wq->freezeable)
set_freezable();
set_user_nice(current, -5);
for (;;) {
prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
if (!freezing(current) &&

View File

@@ -723,37 +723,6 @@ config RCU_TORTURE_TEST_RUNNABLE
Say N here if you want the RCU torture tests to start only
after being manually enabled via /proc.
config RCU_TORTURE_TEST
tristate "torture tests for RCU"
depends on DEBUG_KERNEL && !SCHED_BFS
default n
help
This option provides a kernel module that runs torture tests
on the RCU infrastructure. The kernel module may be built
after the fact on the running kernel to be tested, if desired.
Say Y here if you want RCU torture tests to be built into
the kernel.
Say M if you want the RCU torture tests to build as a module.
Say N if you are unsure.
config RCU_TORTURE_TEST_RUNNABLE
bool "torture tests for RCU runnable by default"
depends on RCU_TORTURE_TEST = y
default n
help
This option provides a way to build the RCU torture tests
directly into the kernel without them starting up at boot
time. You can use /proc/sys/kernel/rcutorture_runnable
to manually override this setting. This /proc file is
available only when the RCU torture tests have been built
into the kernel.
Say Y here if you want the RCU torture tests to start during
boot (you probably don't).
Say N here if you want the RCU torture tests to start only
after being manually enabled via /proc.
config RCU_CPU_STALL_DETECTOR
bool "Check for stalled CPUs delaying RCU grace periods"
depends on CLASSIC_RCU || TREE_RCU

View File

@@ -338,7 +338,7 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->time_slice = HZ;
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);