Mirror of https://github.com/genesi/linux-legacy.git, synced 2026-02-04 00:04:43 +00:00.

BFS: back out BFS scheduler for stability reasons
It works faster sometimes and slower other times; the code is quite old now.

@@ -1,326 +0,0 @@
BFS - The Brain Fuck Scheduler by Con Kolivas.

Goals.

The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
completely do away with the complex designs of the past for the cpu process
scheduler and instead implement one that is very simple in basic design.
The main focus of BFS is to achieve excellent desktop interactivity and
responsiveness without heuristics and tuning knobs that are difficult to
understand, impossible to model and predict the effect of, and when tuned to
one workload cause massive detriment to another.

Design summary.

BFS is best described as a single runqueue, O(n) lookup, earliest effective
virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
deadline first) and my previous Staircase Deadline scheduler. Each component
shall be described in order to understand the significance of, and reasoning for
it. The codebase when the first stable version was released was approximately
9000 lines less code than the existing mainline linux kernel scheduler (in
2.6.31). This does not even take into account the removal of documentation and
the cgroups code that is not used.

Design reasoning.

The single runqueue refers to the queued but not running processes for the
entire system, regardless of the number of CPUs. The reason for going back to
a single runqueue design is that once multiple runqueues are introduced,
per-CPU or otherwise, there will be complex interactions as each runqueue will
be responsible for the scheduling latency and fairness of the tasks only on its
own runqueue, and to achieve fairness and low latency across multiple CPUs, any
advantage in throughput of having CPU local tasks causes other disadvantages.
This is due to requiring a very complex balancing system to at best achieve some
semblance of fairness across CPUs, and it can only maintain relatively low
latency for tasks bound to the same CPUs, not across them. To improve said
fairness and latency across CPUs, the advantage of local runqueue locking, which
makes for better scalability, is lost due to having to grab multiple locks.

A significant feature of BFS is that all accounting is done purely based on CPU
used and nowhere is sleep time used in any way to determine entitlement or
interactivity. Interactivity "estimators" that use some kind of sleep/run
algorithm are doomed to fail to detect all interactive tasks, and to falsely tag
tasks that aren't interactive as being so. The reason for this is that it is
close to impossible to determine, when a task is sleeping, whether it is doing
so voluntarily, as in a userspace application waiting for input in the form of
a mouse click or otherwise, or involuntarily, because it is waiting for another
thread, process, I/O, kernel activity or whatever. Thus, such an estimator will
introduce corner cases, and more heuristics will be required to cope with those
corner cases, introducing more corner cases and failed interactivity detection
and so on. Interactivity in BFS is built into the design by virtue of the fact
that tasks that are waking up have not used up their quota of CPU time, and
have earlier effective deadlines, thereby making it very likely they will
preempt any CPU bound task of equivalent nice level. See below for more
information on the virtual deadline mechanism. Even if they do not preempt a
running task, because the rr interval is guaranteed to have a bounded upper
limit on how long a task will wait, it will be scheduled within a timeframe
that will not cause visible interface jitter.

Design details.

Task insertion.

BFS inserts tasks into each relevant queue as an O(1) insertion into a doubly
linked list. On insertion, *every* running queue is checked to see if the newly
queued task can run on any idle queue, or preempt the lowest running task on the
system. This is how the cross-CPU scheduling of BFS achieves significantly lower
latency per extra CPU the system has. In this case the lookup is, in the worst
case scenario, O(n) where n is the number of CPUs on the system.

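As an editorial illustration only (not the kernel code, and with every name
made up), a toy userspace model of that cross-CPU check might look like this:
an O(1) deadline is attached to the task, and an O(number-of-CPUs) scan picks
an idle CPU first, otherwise the running task with the latest deadline that the
newcomer could preempt.

/* Toy model of BFS-style insertion placement: idle CPU wins, otherwise
 * preempt the running task with the latest (worst) virtual deadline. */
#include <stdio.h>

#define NR_CPUS 4

struct task { unsigned long deadline; };

static struct task *cpu_current[NR_CPUS];     /* NULL means the CPU is idle */

static int best_cpu_for(const struct task *t)
{
    int cpu, best = -1;
    unsigned long latest = 0;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!cpu_current[cpu])
            return cpu;                           /* idle CPU: take it */
        if (cpu_current[cpu]->deadline > t->deadline &&
            cpu_current[cpu]->deadline >= latest) {
            latest = cpu_current[cpu]->deadline;  /* weakest preemption victim */
            best = cpu;
        }
    }
    return best;   /* -1: nothing to preempt, the task just waits in the queue */
}

int main(void)
{
    struct task a = { 100 }, b = { 300 }, newcomer = { 150 };

    cpu_current[0] = &a;
    cpu_current[1] = &b;                  /* CPUs 2 and 3 left idle */
    printf("newcomer should go to cpu %d\n", best_cpu_for(&newcomer));
    return 0;
}
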
Data protection.

BFS has one single lock protecting the process local data of every task in the
global queue. Thus every insertion, removal and modification of task data in the
global runqueue needs to grab the global lock. However, once a task is taken by
a CPU, the CPU has its own local data copy of the running process' accounting
information which only that CPU accesses and modifies (such as during a
timer tick), thus allowing the accounting data to be updated locklessly. Once a
CPU has taken a task to run, it removes it from the global queue. Thus the
global queue only ever has, at most,

(number of tasks requesting cpu time) - (number of logical CPUs) + 1

tasks in it. This value is relevant for the time taken to look up tasks during
scheduling. It will increase if many tasks have a CPU affinity set in their
policy that limits which CPUs they're allowed to run on, and those tasks
outnumber the number of CPUs. The +1 is because when rescheduling a task, the
CPU's currently running task is put back on the queue. Lookup will be described
after the virtual deadline mechanism is explained.

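A quick worked illustration (an editorial addition, not from the original
text): on a machine with 4 logical CPUs and 12 tasks requesting cpu time, the
global queue holds at most 12 - 4 + 1 = 9 tasks, since 4 tasks are running on
the CPUs and the +1 accounts for a currently running task being put back on
the queue while its CPU reschedules.
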
Virtual deadline.

The key to achieving low latency, scheduling fairness, and "nice level"
distribution in BFS is entirely in the virtual deadline mechanism. The one
tunable in BFS is the rr_interval, or "round robin interval". This is the
maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
tasks of the same nice level will be running for, or looking at it the other
way around, the longest duration two tasks of the same nice level will be
delayed for. When a task requests cpu time, it is given a quota (time_slice)
equal to the rr_interval and a virtual deadline. The virtual deadline is
offset from the current time in jiffies by this equation:

jiffies + (prio_ratio * rr_interval)

The prio_ratio is determined as a ratio compared to the baseline of nice -20
and increases by 10% per nice level. The deadline is a virtual one only in that
no guarantee is placed that a task will actually be scheduled by this time, but
it is used to compare which task should go next. There are three components to
how a task is next chosen. First is time_slice expiration. If a task runs out
of its time_slice, it is descheduled, the time_slice is refilled, and the
deadline reset to that formula above. Second is sleep, where a task is no
longer requesting CPU for whatever reason. The time_slice and deadline are
_not_ adjusted in this case and are just carried over for when the task is next
scheduled. Third is preemption, and that is when a newly waking task is deemed
higher priority than a currently running task on any cpu by virtue of the fact
that it has an earlier virtual deadline than the currently running task. The
earlier deadline is the key to which task is next chosen for the first and
second cases. Once a task is descheduled, it is put back on the queue, and an
O(n) lookup of all queued-but-not-running tasks is done to determine which has
the earliest deadline, and that task is chosen to receive CPU next.

The CPU proportion of different nice tasks works out to be approximately

(prio_ratio difference)^2

The reason it is squared is that a task's deadline does not change while it is
running unless it runs out of time_slice. Thus, even if the time actually
passes the deadline of another task that is queued, it will not get CPU time
unless the currently running task deschedules, and the time "base" (jiffies)
is constantly moving.

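As an editorial sketch of how those two formulas interact, the small userspace
program below tabulates the deadline offset and the approximate relative CPU
share per nice level, assuming the 10% increase per nice level is multiplicative
and the default 6 ms rr_interval; it is an illustration, not the kernel's
arithmetic.

/* Illustration: deadline offset and approximate relative share vs nice level,
 * per the prio_ratio and squared-proportion rules described above. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double rr_interval_ms = 6.0;           /* default rr_interval */
    int nice;

    for (nice = -20; nice <= 19; nice += 13) {
        double ratio = pow(1.10, nice + 20);     /* +10% per nice level */
        printf("nice %3d: deadline offset ~%6.1f ms, share vs nice -20 ~%.3f\n",
               nice, ratio * rr_interval_ms, 1.0 / (ratio * ratio));
    }
    return 0;
}

Under this reading, two SCHED_NORMAL tasks five nice levels apart have
prio_ratios differing by a factor of roughly 1.1^5 = 1.61, so by the squared
relationship their CPU shares should differ by roughly 1.61^2, about 2.6 to 1.
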
Task lookup.

BFS has 103 priority queues. 100 of these are dedicated to the static priority
of realtime tasks, and the remaining 3 are, in order of best to worst priority,
SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
scheduling). When a task of these priorities is queued, a bitmap of running
priorities is set showing which of these priorities has tasks waiting for CPU
time. When a CPU is made to reschedule, the lookup for the next task to get
CPU time is performed in the following way:

First the bitmap is checked to see what static priority tasks are queued. If
any realtime priorities are found, the corresponding queue is checked and the
first task listed there is taken (provided CPU affinity is suitable) and lookup
is complete. If the priority corresponds to a SCHED_ISO task, they are also
taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds
to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
stage, every task in the runlist that corresponds to that priority is checked
to see which has the earliest set deadline, and (provided it has suitable CPU
affinity) it is taken off the runqueue and given the CPU. If a task has an
expired deadline, it is taken immediately and the rest of the lookup is aborted
(as expired-deadline tasks are chosen in FIFO order).

Thus, the lookup is O(n) in the worst case only, where n is as described
earlier, as tasks may be chosen before the whole task list is looked over.

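The following toy userspace model (an editorial addition; array sizes and names
are invented) mirrors the lookup order just described: FIFO take for the
realtime and ISO levels, and an O(n) earliest-virtual-deadline scan for the
SCHED_NORMAL/SCHED_IDLEPRIO levels.

/* Toy model of the BFS lookup: a linear walk over 103 priority levels stands
 * in for the priority bitmap. */
#include <stdio.h>

#define PRIO_LEVELS 103        /* 100 realtime + ISO + NORMAL + IDLEPRIO */
#define ISO_LEVEL   100
#define MAX_QUEUED  16

struct task { const char *name; unsigned long deadline; };

static struct task *queue[PRIO_LEVELS][MAX_QUEUED];
static int qlen[PRIO_LEVELS];

static struct task *pick_next(void)
{
    int prio, i;

    for (prio = 0; prio < PRIO_LEVELS; prio++) {   /* stands in for the bitmap */
        if (!qlen[prio])
            continue;
        if (prio <= ISO_LEVEL)                      /* realtime and ISO: FIFO */
            return queue[prio][0];
        /* SCHED_NORMAL / SCHED_IDLEPRIO: O(n) earliest-deadline scan */
        struct task *best = queue[prio][0];
        for (i = 1; i < qlen[prio]; i++)
            if (queue[prio][i]->deadline < best->deadline)
                best = queue[prio][i];
        return best;
    }
    return NULL;                                    /* nothing is queued */
}

int main(void)
{
    struct task encode = { "encoder", 480 }, ui = { "browser", 120 };
    int normal = ISO_LEVEL + 1;

    queue[normal][qlen[normal]++] = &encode;
    queue[normal][qlen[normal]++] = &ui;
    printf("next task: %s\n", pick_next()->name);   /* browser: earlier deadline */
    return 0;
}
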
Scalability.

The major limitation of BFS will be that of scalability, as the separate
runqueue designs will have less lock contention as the number of CPUs rises.
However they do not scale linearly even with separate runqueues, as multiple
runqueues will need to be locked concurrently on such designs to be able to
achieve fair CPU balancing, to try and achieve some sort of nice-level fairness
across CPUs, and to achieve low enough latency for tasks on a busy CPU when
other CPUs would be more suited. BFS has the advantage that it requires no
balancing algorithm whatsoever, as balancing occurs by proxy simply because
all CPUs draw off the global runqueue, in priority and deadline order. Despite
the fact that scalability is _not_ the prime concern of BFS, it both shows very
good scalability to smaller numbers of CPUs and is likely a more scalable design
at these numbers of CPUs.

It also has some very low overhead scalability features built into the design,
added where their overhead was deemed so marginal that they're worth having.
The first is the local copy of the running process' data to the CPU it's running
on, allowing that data to be updated locklessly where possible. Then there is
deference paid to the last CPU a task was running on, by trying that CPU first
when looking for an idle CPU to use the next time it's scheduled. Finally there
is the notion of "sticky" tasks that are flagged when they are involuntarily
descheduled, meaning they still want further CPU time. This sticky flag is
used to bias heavily against those tasks being scheduled on a different CPU
unless that CPU would be otherwise idle. When a cpu frequency governor is used
that scales with CPU load, such as ondemand, sticky tasks are not scheduled
on a different CPU at all, preferring instead to go idle. This means the CPU
they were bound to is more likely to increase its speed while the other CPU
will go idle, thus speeding up total task execution time and likely decreasing
power usage. This is the only scenario where BFS will allow a CPU to go idle
in preference to scheduling a task on the earliest available spare CPU.

The real cost of migrating a task from one CPU to another is entirely dependent
on the cache footprint of the task, how cache intensive the task is, how long
it's been running on that CPU to take up the bulk of its cache, how big the CPU
cache is, how fast and how layered the CPU cache is, how fast a context switch
is... and so on. In other words, it's close to random in the real world where we
do more than just one sole workload. The only thing we can be sure of is that
it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
utilising idle CPUs is more important than cache locality, and cache locality
only plays a part after that.

Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
However this benchmarking was performed on an earlier design that was far less
scalable than the current one, so it's hard to know how scalable it is in terms
of both CPUs (due to the global runqueue) and heavily loaded machines (due to
O(n) lookup) at this stage. Note that in terms of scalability, the number of
_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
quad core (4x) hyperthreaded (2x) machine is effectively a 16x. Newer benchmark
results are very promising indeed, without needing to tweak any knobs, features
or options. Benchmark contributions are most welcome.

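The sticky-task bias described above boils down to a small placement decision.
The toy function below is an editorial paraphrase of that decision as the text
states it, with invented names and no claim to match the kernel's actual code
paths:

/* Toy decision model of the "sticky" bias: take a sticky task onto a
 * different CPU only if that CPU would otherwise be idle, and never when a
 * load-scaling cpufreq governor (e.g. ondemand) is in use. */
#include <stdbool.h>
#include <stdio.h>

static bool take_sticky_task(bool this_is_tasks_last_cpu,
                             bool this_cpu_otherwise_idle,
                             bool scaling_governor_active)
{
    if (this_is_tasks_last_cpu)
        return true;                 /* no migration cost at all */
    if (scaling_governor_active)
        return false;                /* let the task's original CPU ramp up */
    return this_cpu_otherwise_idle;  /* only steal it if we would sit idle */
}

int main(void)
{
    printf("%d\n", take_sticky_task(false, true, true));   /* 0: stay put */
    return 0;
}
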
Features

As the initial prime target audience for BFS was the average desktop user, it
was designed to not need tweaking, tuning or have features set to obtain benefit
from it. Thus the number of knobs and features has been kept to an absolute
minimum and should not require extra user input for the vast majority of cases.
There are precisely 2 tunables and 2 extra scheduling policies: the rr_interval
and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
support for CGROUPS. The average user should neither need to know what these
are, nor should they need to be using them to have good desktop behaviour.

rr_interval

There is only one "scheduler" tunable, the round robin interval. This can be
accessed in

/proc/sys/kernel/rr_interval

The value is in milliseconds, and the default value is set to 6ms. Valid values
are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
decreasing throughput, while increasing it will improve throughput, but at the
cost of worsening latencies. The accuracy of the rr interval is limited by the
HZ resolution of the kernel configuration. Thus, the worst case latencies are
usually slightly higher than this actual value. BFS uses "dithering" to try and
minimise the effect the HZ limitation has. The default value of 6 is not an
arbitrary one. It is based on the fact that humans can detect jitter at
approximately 7ms, so aiming for much lower latencies is pointless under most
circumstances. It is worth noting this fact when comparing the latency
performance of BFS to other schedulers. Worst case latencies being higher than
7ms are far worse than average latencies not being in the microsecond range.
Experimentation has shown that increasing the rr interval up to 300 can improve
throughput, but beyond that, scheduling noise from elsewhere prevents further
demonstrable throughput gains.

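For completeness, a minimal userspace sketch of reading and adjusting this
tunable (an editorial example; writing requires root, and 300 is just the
throughput-oriented value mentioned above):

/* Read /proc/sys/kernel/rr_interval, then raise it for a throughput run. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/rr_interval", "r+");
    int current_ms;

    if (!f) {
        perror("rr_interval (is this a BFS kernel?)");
        return 1;
    }
    if (fscanf(f, "%d", &current_ms) == 1)
        printf("rr_interval is %d ms\n", current_ms);
    rewind(f);
    fprintf(f, "300\n");        /* new value; see the valid range above */
    fclose(f);
    return 0;
}
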
Isochronous scheduling.

Isochronous scheduling is a unique scheduling policy designed to provide
near-real-time performance to unprivileged (ie non-root) users without the
ability to starve the machine indefinitely. Isochronous tasks (isochronous
means "same time") are set using, for example, the schedtool application like
so:

schedtool -I -e amarok

This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works
is that it has a priority level between true realtime tasks and SCHED_NORMAL,
which allows ISO tasks to preempt all normal tasks, in a SCHED_RR fashion (ie,
if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
rate). However, if ISO tasks run for more than a tunable finite amount of time,
they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
time is a percentage of the _total CPU_ available across the machine,
configurable as a percentage in the following "resource handling" tunable (as
opposed to a scheduler tunable):

/proc/sys/kernel/iso_cpu

and is set to 70% by default. It is calculated over a rolling 5 second average.
Because it is the total CPU available, it means that on a multi CPU machine, it
is possible to have an ISO task running as realtime scheduling indefinitely on
just one CPU, as the other CPUs will be available. Setting this to 100 is the
equivalent of giving all users SCHED_RR access, and setting it to 0 removes the
ability to run any pseudo-realtime tasks.

A feature of BFS is that it detects when an application tries to obtain a
realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
appropriate privileges to use those policies. When it detects this, it will
give the task SCHED_ISO policy instead. Thus it is transparent to the user.
Because some applications constantly set their policy as well as their nice
level, there is potential for them to undo the override specified by the user
on the command line of setting the policy to SCHED_ISO. To counter this, once
a task has been set to SCHED_ISO policy, it needs superuser privileges to set
it back to SCHED_NORMAL. This will ensure the task remains ISO, and all child
processes and threads will also inherit the ISO policy.

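For programs that set their own policy rather than relying on schedtool, a
small sketch using sched_setscheduler() is shown below. SCHED_ISO is not
defined in standard userspace headers; the value 4 is taken from the
CONFIG_SCHED_BFS block of the sched.h hunk further down in this commit, and the
rest of the example is an editorial illustration.

/* Request SCHED_ISO for the current process on a BFS kernel. */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SCHED_ISO
#define SCHED_ISO 4            /* BFS-only policy value, per this commit */
#endif

int main(void)
{
    struct sched_param sp = { .sched_priority = 0 };   /* 0 for non-RT policies */

    if (sched_setscheduler(getpid(), SCHED_ISO, &sp) == -1) {
        perror("sched_setscheduler(SCHED_ISO)");
        return 1;
    }
    printf("pid %d now running as SCHED_ISO\n", getpid());
    /* exec the real workload here, e.g. execlp("amarok", "amarok", NULL); */
    return 0;
}
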
Idleprio scheduling.

Idleprio scheduling is a scheduling policy designed to give out CPU to a task
_only_ when the CPU would be otherwise idle. The idea behind this is to allow
ultra low priority tasks to be run in the background that have virtually no
effect on the foreground tasks. This is ideally suited to distributed computing
clients (like setiathome, folding, mprime etc) but can also be used to start
a video encode or so on without any slowdown of other tasks. To prevent tasks
under this policy from grabbing shared resources and holding them indefinitely,
if BFS detects a state where the task is waiting on I/O, the machine is about
to suspend to ram, and so on, it will transiently schedule them as SCHED_NORMAL.
As per the Isochronous task management, once a task has been scheduled as
IDLEPRIO, it cannot be put back to SCHED_NORMAL without superuser privileges.
Tasks can be set to start as SCHED_IDLEPRIO with the schedtool command like so:

schedtool -D -e ./mprime

Subtick accounting.

It is surprisingly difficult to get accurate CPU accounting, and in many cases,
the accounting is done by simply determining what is happening at the precise
moment a timer tick fires off. This becomes increasingly inaccurate as the
timer tick frequency (HZ) is lowered. It is possible to create an application
which uses almost 100% CPU, yet by being descheduled at the right time, records
zero CPU usage. While the main problem with this is that there are possible
security implications, it is also difficult to determine how much CPU a task
really does use. BFS tries to use the sub-tick accounting from the TSC clock,
where possible, to determine real CPU usage. This is not entirely reliable, but
is far more likely to produce accurate CPU usage data than the existing designs
and will not show tasks as consuming no CPU usage when they actually are. Thus,
the amount of CPU reported as being used by BFS will more accurately represent
how much CPU the task itself is using (as is shown for example by the 'time'
application), so the reported values may be quite different from those of other
schedulers. Values reported as the 'load' are more prone to problems with this
design, but per process values are closer to real usage. When comparing
throughput of BFS to other designs, it is important to compare the actual
completed work in terms of total wall clock time taken and total work done,
rather than the reported "cpu usage".

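The "almost 100% CPU yet zero recorded usage" claim can be illustrated with an
assumption-laden userspace sketch (an editorial addition): burn CPU for most of
each 10 ms window but sleep across the window boundary, so purely tick-based
accounting that samples at those boundaries can under-report the usage. It
assumes HZ=100 and that tick boundaries roughly line up with CLOCK_MONOTONIC;
neither needs to hold exactly on a real kernel.

/* Tick-dodging illustration: ~80% real CPU use, asleep at every tick boundary. */
#include <time.h>

int main(void)
{
    const long window_ns = 10 * 1000 * 1000;      /* 10 ms, i.e. HZ=100 */

    for (;;) {
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        if (now.tv_nsec % window_ns < window_ns - 2000000)
            continue;                             /* keep burning CPU */

        /* within 2 ms of the boundary: sleep over it so the tick sees us idle */
        struct timespec nap = { 0, 3000000 };     /* 3 ms */
        nanosleep(&nap, NULL);
    }
}
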
Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011

@@ -27,7 +27,6 @@ show up in /proc/sys/kernel:
- domainname
- hostname
- hotplug
- iso_cpu
- java-appletviewer [ binfmt_java, obsolete ]
- java-interpreter [ binfmt_java, obsolete ]
- kstack_depth_to_print [ X86 only ]
@@ -50,7 +49,6 @@ show up in /proc/sys/kernel:
- randomize_va_space
- real-root-dev ==> Documentation/initrd.txt
- reboot-cmd [ SPARC only ]
- rr_interval
- rtsig-max
- rtsig-nr
- sem
@@ -173,16 +171,6 @@ Default value is "/sbin/hotplug".

==============================================================

iso_cpu: (BFS CPU scheduler only).

This sets the percentage cpu that the unprivileged SCHED_ISO tasks can
run effectively at realtime priority, averaged over a rolling five
seconds over the -whole- system, meaning all cpus.

Set to 70 (percent) by default.

==============================================================

l2cr: (PPC only)

This flag controls the L2 cache of G3 processor boards. If
@@ -345,20 +333,6 @@ rebooting. ???

==============================================================

rr_interval: (BFS CPU scheduler only)

This is the smallest duration that any cpu process scheduling unit
will run for. Increasing this value can increase throughput of cpu
bound tasks substantially but at the expense of increased latencies
overall. Conversely decreasing it will decrease average and maximum
latencies but at the expense of throughput. This value is in
milliseconds and the default value chosen depends on the number of
cpus available at scheduler initialisation with a minimum of 6.

Valid values are from 1-5000.

==============================================================

rtsig-max & rtsig-nr:

The file rtsig-max can be used to tune the maximum number

@@ -1,7 +1,7 @@
|
||||
#
|
||||
# Automatically generated make config: don't edit
|
||||
# Linux kernel version: 2.6.31.14.27
|
||||
# Mon Nov 19 11:59:35 2012
|
||||
# Mon Dec 10 20:11:59 2012
|
||||
#
|
||||
CONFIG_ARM=y
|
||||
CONFIG_HAVE_PWM=y
|
||||
@@ -31,7 +31,6 @@ CONFIG_CONSTRUCTORS=y
|
||||
#
|
||||
# General setup
|
||||
#
|
||||
CONFIG_SCHED_BFS=y
|
||||
CONFIG_EXPERIMENTAL=y
|
||||
CONFIG_BROKEN_ON_SMP=y
|
||||
CONFIG_INIT_ENV_ARG_LIMIT=32
|
||||
@@ -59,6 +58,7 @@ CONFIG_RCU_FANOUT=32
|
||||
CONFIG_IKCONFIG=y
|
||||
CONFIG_IKCONFIG_PROC=y
|
||||
CONFIG_LOG_BUF_SHIFT=16
|
||||
# CONFIG_GROUP_SCHED is not set
|
||||
CONFIG_CGROUPS=y
|
||||
# CONFIG_CGROUP_DEBUG is not set
|
||||
CONFIG_CGROUP_NS=y
|
||||
@@ -66,6 +66,7 @@ CONFIG_CGROUP_FREEZER=y
|
||||
# CONFIG_CGROUP_DEVICE is not set
|
||||
CONFIG_CPUSETS=y
|
||||
# CONFIG_PROC_PID_CPUSET is not set
|
||||
CONFIG_CGROUP_CPUACCT=y
|
||||
CONFIG_RESOURCE_COUNTERS=y
|
||||
# CONFIG_CGROUP_MEM_RES_CTLR is not set
|
||||
# CONFIG_SYSFS_DEPRECATED_V2 is not set
|
||||
@@ -280,7 +281,7 @@ CONFIG_VMSPLIT_2G=y
|
||||
# CONFIG_VMSPLIT_1G is not set
|
||||
CONFIG_PAGE_OFFSET=0x80000000
|
||||
# CONFIG_PREEMPT is not set
|
||||
CONFIG_HZ=256
|
||||
CONFIG_HZ=100
|
||||
CONFIG_AEABI=y
|
||||
# CONFIG_OABI_COMPAT is not set
|
||||
# CONFIG_ARCH_SPARSEMEM_DEFAULT is not set
|
||||
|
||||
@@ -61,6 +61,11 @@ static struct task_struct *spusched_task;
|
||||
static struct timer_list spusched_timer;
|
||||
static struct timer_list spuloadavg_timer;
|
||||
|
||||
/*
|
||||
* Priority of a normal, non-rt, non-niced'd process (aka nice level 0).
|
||||
*/
|
||||
#define NORMAL_PRIO 120
|
||||
|
||||
/*
|
||||
* Frequency of the spu scheduler tick. By default we do one SPU scheduler
|
||||
* tick for every 10 CPU scheduler ticks.
|
||||
|
||||
@@ -444,10 +444,8 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
|
||||
freq_target = 5;
|
||||
|
||||
this_dbs_info->requested_freq += freq_target;
|
||||
if (this_dbs_info->requested_freq >= policy->max) {
|
||||
if (this_dbs_info->requested_freq > policy->max)
|
||||
this_dbs_info->requested_freq = policy->max;
|
||||
cpu_nonscaling(policy->cpu);
|
||||
}
|
||||
|
||||
__cpufreq_driver_target(policy, this_dbs_info->requested_freq,
|
||||
CPUFREQ_RELATION_H);
|
||||
@@ -472,7 +470,6 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
|
||||
if (policy->cur == policy->min)
|
||||
return;
|
||||
|
||||
cpu_scaling(policy->cpu);
|
||||
__cpufreq_driver_target(policy, this_dbs_info->requested_freq,
|
||||
CPUFREQ_RELATION_H);
|
||||
return;
|
||||
@@ -588,7 +585,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
|
||||
dbs_timer_init(this_dbs_info);
|
||||
|
||||
cpu_scaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_STOP:
|
||||
@@ -610,7 +606,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
|
||||
mutex_unlock(&dbs_mutex);
|
||||
|
||||
cpu_nonscaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_LIMITS:
|
||||
|
||||
@@ -470,7 +470,6 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
|
||||
if (freq_next < policy->min)
|
||||
freq_next = policy->min;
|
||||
|
||||
cpu_scaling(policy->cpu);
|
||||
if (!dbs_tuners_ins.powersave_bias) {
|
||||
__cpufreq_driver_target(policy, freq_next,
|
||||
CPUFREQ_RELATION_L);
|
||||
@@ -594,7 +593,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
mutex_unlock(&dbs_mutex);
|
||||
|
||||
dbs_timer_init(this_dbs_info);
|
||||
cpu_scaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_STOP:
|
||||
@@ -606,7 +604,6 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
|
||||
dbs_enable--;
|
||||
mutex_unlock(&dbs_mutex);
|
||||
|
||||
cpu_nonscaling(cpu);
|
||||
break;
|
||||
|
||||
case CPUFREQ_GOV_LIMITS:
|
||||
|
||||
@@ -23,7 +23,6 @@
|
||||
#include <linux/fs.h>
|
||||
#include <linux/sysfs.h>
|
||||
#include <linux/mutex.h>
|
||||
#include <linux/sched.h>
|
||||
|
||||
/**
|
||||
* A few values needed by the userspace governor
|
||||
@@ -98,10 +97,6 @@ static int cpufreq_set(struct cpufreq_policy *policy, unsigned int freq)
|
||||
* cpufreq_governor_userspace (lock userspace_mutex)
|
||||
*/
|
||||
ret = __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
|
||||
if (freq == cpu_max_freq)
|
||||
cpu_nonscaling(policy->cpu);
|
||||
else
|
||||
cpu_scaling(policy->cpu);
|
||||
|
||||
err:
|
||||
mutex_unlock(&userspace_mutex);
|
||||
@@ -147,7 +142,6 @@ static int cpufreq_governor_userspace(struct cpufreq_policy *policy,
|
||||
per_cpu(cpu_cur_freq, cpu));
|
||||
|
||||
mutex_unlock(&userspace_mutex);
|
||||
cpu_scaling(cpu);
|
||||
break;
|
||||
case CPUFREQ_GOV_STOP:
|
||||
mutex_lock(&userspace_mutex);
|
||||
@@ -164,7 +158,6 @@ static int cpufreq_governor_userspace(struct cpufreq_policy *policy,
|
||||
per_cpu(cpu_set_freq, cpu) = 0;
|
||||
dprintk("managing cpu %u stopped\n", cpu);
|
||||
mutex_unlock(&userspace_mutex);
|
||||
cpu_nonscaling(cpu);
|
||||
break;
|
||||
case CPUFREQ_GOV_LIMITS:
|
||||
mutex_lock(&userspace_mutex);
|
||||
|
||||
@@ -366,7 +366,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
|
||||
static int proc_pid_schedstat(struct task_struct *task, char *buffer)
|
||||
{
|
||||
return sprintf(buffer, "%llu %llu %lu\n",
|
||||
(unsigned long long)tsk_seruntime(task),
|
||||
(unsigned long long)task->se.sum_exec_runtime,
|
||||
(unsigned long long)task->sched_info.run_delay,
|
||||
task->sched_info.pcount);
|
||||
}
|
||||
|
||||
@@ -109,68 +109,6 @@ extern struct cred init_cred;
|
||||
* INIT_TASK is used to set up the first task table, touch at
|
||||
* your own risk!. Base=0, limit=0x1fffff (=2MB)
|
||||
*/
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#define INIT_TASK(tsk) \
|
||||
{ \
|
||||
.state = 0, \
|
||||
.stack = &init_thread_info, \
|
||||
.usage = ATOMIC_INIT(2), \
|
||||
.flags = PF_KTHREAD, \
|
||||
.lock_depth = -1, \
|
||||
.prio = NORMAL_PRIO, \
|
||||
.static_prio = MAX_PRIO-20, \
|
||||
.normal_prio = NORMAL_PRIO, \
|
||||
.deadline = 0, \
|
||||
.policy = SCHED_NORMAL, \
|
||||
.cpus_allowed = CPU_MASK_ALL, \
|
||||
.mm = NULL, \
|
||||
.active_mm = &init_mm, \
|
||||
.run_list = LIST_HEAD_INIT(tsk.run_list), \
|
||||
.time_slice = HZ, \
|
||||
.tasks = LIST_HEAD_INIT(tsk.tasks), \
|
||||
.pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \
|
||||
.ptraced = LIST_HEAD_INIT(tsk.ptraced), \
|
||||
.ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \
|
||||
.real_parent = &tsk, \
|
||||
.parent = &tsk, \
|
||||
.children = LIST_HEAD_INIT(tsk.children), \
|
||||
.sibling = LIST_HEAD_INIT(tsk.sibling), \
|
||||
.group_leader = &tsk, \
|
||||
.real_cred = &init_cred, \
|
||||
.cred = &init_cred, \
|
||||
.cred_guard_mutex = \
|
||||
__MUTEX_INITIALIZER(tsk.cred_guard_mutex), \
|
||||
.comm = "swapper", \
|
||||
.thread = INIT_THREAD, \
|
||||
.fs = &init_fs, \
|
||||
.files = &init_files, \
|
||||
.signal = &init_signals, \
|
||||
.sighand = &init_sighand, \
|
||||
.nsproxy = &init_nsproxy, \
|
||||
.pending = { \
|
||||
.list = LIST_HEAD_INIT(tsk.pending.list), \
|
||||
.signal = {{0}}}, \
|
||||
.blocked = {{0}}, \
|
||||
.alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \
|
||||
.journal_info = NULL, \
|
||||
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
|
||||
.fs_excl = ATOMIC_INIT(0), \
|
||||
.pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \
|
||||
.timer_slack_ns = 50000, /* 50 usec default slack */ \
|
||||
.pids = { \
|
||||
[PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \
|
||||
[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \
|
||||
[PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \
|
||||
}, \
|
||||
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
|
||||
INIT_IDS \
|
||||
INIT_PERF_COUNTERS(tsk) \
|
||||
INIT_TRACE_IRQFLAGS \
|
||||
INIT_LOCKDEP \
|
||||
INIT_FTRACE_GRAPH \
|
||||
INIT_TRACE_RECURSION \
|
||||
}
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
#define INIT_TASK(tsk) \
|
||||
{ \
|
||||
.state = 0, \
|
||||
@@ -230,14 +168,13 @@ extern struct cred init_cred;
|
||||
}, \
|
||||
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
|
||||
INIT_IDS \
|
||||
INIT_PERF_EVENTS(tsk) \
|
||||
INIT_PERF_COUNTERS(tsk) \
|
||||
INIT_TRACE_IRQFLAGS \
|
||||
INIT_LOCKDEP \
|
||||
INIT_FTRACE_GRAPH \
|
||||
INIT_TRACE_RECURSION \
|
||||
INIT_TASK_RCU_PREEMPT(tsk) \
|
||||
}
|
||||
#endif /* CONFIG_SCHED_BFS */
|
||||
|
||||
|
||||
#define INIT_CPU_TIMERS(cpu_timers) \
|
||||
{ \
|
||||
|
||||
@@ -64,8 +64,6 @@ static inline int task_ioprio_class(struct io_context *ioc)
|
||||
|
||||
static inline int task_nice_ioprio(struct task_struct *task)
|
||||
{
|
||||
if (iso_task(task))
|
||||
return 0;
|
||||
return (task_nice(task) + 20) / 5;
|
||||
}
|
||||
|
||||
|
||||
@@ -164,7 +164,7 @@ static inline u64 get_jiffies_64(void)
|
||||
* Have the 32 bit jiffies value wrap 5 minutes after boot
|
||||
* so jiffies wrap bugs show up earlier.
|
||||
*/
|
||||
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-10*HZ))
|
||||
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
|
||||
|
||||
/*
|
||||
* Change timeval to jiffies, trying to avoid the
|
||||
|
||||
@@ -36,16 +36,8 @@
|
||||
#define SCHED_FIFO 1
|
||||
#define SCHED_RR 2
|
||||
#define SCHED_BATCH 3
|
||||
/* SCHED_ISO: Implemented on BFS only */
|
||||
/* SCHED_ISO: reserved but not implemented yet */
|
||||
#define SCHED_IDLE 5
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#define SCHED_ISO 4
|
||||
#define SCHED_IDLEPRIO SCHED_IDLE
|
||||
|
||||
#define SCHED_MAX (SCHED_IDLEPRIO)
|
||||
#define SCHED_RANGE(policy) ((policy) <= SCHED_MAX)
|
||||
#endif
|
||||
|
||||
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
|
||||
#define SCHED_RESET_ON_FORK 0x40000000
|
||||
|
||||
@@ -148,7 +140,7 @@ extern int nr_processes(void);
|
||||
extern unsigned long nr_running(void);
|
||||
extern unsigned long nr_uninterruptible(void);
|
||||
extern unsigned long nr_iowait(void);
|
||||
extern void calc_global_load(void);
|
||||
extern void calc_global_load(unsigned long ticks);
|
||||
extern u64 cpu_nr_migrations(int cpu);
|
||||
|
||||
extern unsigned long get_parent_ip(unsigned long addr);
|
||||
@@ -264,6 +256,9 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
|
||||
extern void init_idle(struct task_struct *idle, int cpu);
|
||||
extern void init_idle_bootup_task(struct task_struct *idle);
|
||||
|
||||
extern int runqueue_is_locked(void);
|
||||
extern void task_rq_unlock_wait(struct task_struct *p);
|
||||
|
||||
extern cpumask_var_t nohz_cpu_mask;
|
||||
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
|
||||
extern int select_nohz_load_balancer(int cpu);
|
||||
@@ -1028,6 +1023,148 @@ struct uts_namespace;
|
||||
struct rq;
|
||||
struct sched_domain;
|
||||
|
||||
struct sched_class {
|
||||
const struct sched_class *next;
|
||||
|
||||
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
|
||||
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
|
||||
void (*yield_task) (struct rq *rq);
|
||||
|
||||
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int sync);
|
||||
|
||||
struct task_struct * (*pick_next_task) (struct rq *rq);
|
||||
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
|
||||
|
||||
#ifdef CONFIG_SMP
|
||||
int (*select_task_rq)(struct task_struct *p, int sync);
|
||||
|
||||
unsigned long (*load_balance) (struct rq *this_rq, int this_cpu,
|
||||
struct rq *busiest, unsigned long max_load_move,
|
||||
struct sched_domain *sd, enum cpu_idle_type idle,
|
||||
int *all_pinned, int *this_best_prio);
|
||||
|
||||
int (*move_one_task) (struct rq *this_rq, int this_cpu,
|
||||
struct rq *busiest, struct sched_domain *sd,
|
||||
enum cpu_idle_type idle);
|
||||
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
|
||||
int (*needs_post_schedule) (struct rq *this_rq);
|
||||
void (*post_schedule) (struct rq *this_rq);
|
||||
void (*task_wake_up) (struct rq *this_rq, struct task_struct *task);
|
||||
|
||||
void (*set_cpus_allowed)(struct task_struct *p,
|
||||
const struct cpumask *newmask);
|
||||
|
||||
void (*rq_online)(struct rq *rq);
|
||||
void (*rq_offline)(struct rq *rq);
|
||||
#endif
|
||||
|
||||
void (*set_curr_task) (struct rq *rq);
|
||||
void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
|
||||
void (*task_new) (struct rq *rq, struct task_struct *p);
|
||||
|
||||
void (*switched_from) (struct rq *this_rq, struct task_struct *task,
|
||||
int running);
|
||||
void (*switched_to) (struct rq *this_rq, struct task_struct *task,
|
||||
int running);
|
||||
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
|
||||
int oldprio, int running);
|
||||
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
void (*moved_group) (struct task_struct *p);
|
||||
#endif
|
||||
};
|
||||
|
||||
struct load_weight {
|
||||
unsigned long weight, inv_weight;
|
||||
};
|
||||
|
||||
/*
|
||||
* CFS stats for a schedulable entity (task, task-group etc)
|
||||
*
|
||||
* Current field usage histogram:
|
||||
*
|
||||
* 4 se->block_start
|
||||
* 4 se->run_node
|
||||
* 4 se->sleep_start
|
||||
* 6 se->load.weight
|
||||
*/
|
||||
struct sched_entity {
|
||||
struct load_weight load; /* for load-balancing */
|
||||
struct rb_node run_node;
|
||||
struct list_head group_node;
|
||||
unsigned int on_rq;
|
||||
|
||||
u64 exec_start;
|
||||
u64 sum_exec_runtime;
|
||||
u64 vruntime;
|
||||
u64 prev_sum_exec_runtime;
|
||||
|
||||
u64 last_wakeup;
|
||||
u64 avg_overlap;
|
||||
|
||||
u64 nr_migrations;
|
||||
|
||||
u64 start_runtime;
|
||||
u64 avg_wakeup;
|
||||
|
||||
#ifdef CONFIG_SCHEDSTATS
|
||||
u64 wait_start;
|
||||
u64 wait_max;
|
||||
u64 wait_count;
|
||||
u64 wait_sum;
|
||||
|
||||
u64 sleep_start;
|
||||
u64 sleep_max;
|
||||
s64 sum_sleep_runtime;
|
||||
|
||||
u64 block_start;
|
||||
u64 block_max;
|
||||
u64 exec_max;
|
||||
u64 slice_max;
|
||||
|
||||
u64 nr_migrations_cold;
|
||||
u64 nr_failed_migrations_affine;
|
||||
u64 nr_failed_migrations_running;
|
||||
u64 nr_failed_migrations_hot;
|
||||
u64 nr_forced_migrations;
|
||||
u64 nr_forced2_migrations;
|
||||
|
||||
u64 nr_wakeups;
|
||||
u64 nr_wakeups_sync;
|
||||
u64 nr_wakeups_migrate;
|
||||
u64 nr_wakeups_local;
|
||||
u64 nr_wakeups_remote;
|
||||
u64 nr_wakeups_affine;
|
||||
u64 nr_wakeups_affine_attempts;
|
||||
u64 nr_wakeups_passive;
|
||||
u64 nr_wakeups_idle;
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_FAIR_GROUP_SCHED
|
||||
struct sched_entity *parent;
|
||||
/* rq on which this entity is (to be) queued: */
|
||||
struct cfs_rq *cfs_rq;
|
||||
/* rq "owned" by this entity/group: */
|
||||
struct cfs_rq *my_q;
|
||||
#endif
|
||||
};
|
||||
|
||||
struct sched_rt_entity {
|
||||
struct list_head run_list;
|
||||
unsigned long timeout;
|
||||
unsigned int time_slice;
|
||||
int nr_cpus_allowed;
|
||||
|
||||
struct sched_rt_entity *back;
|
||||
#ifdef CONFIG_RT_GROUP_SCHED
|
||||
struct sched_rt_entity *parent;
|
||||
/* rq on which this entity is (to be) queued: */
|
||||
struct rt_rq *rt_rq;
|
||||
/* rq "owned" by this entity/group: */
|
||||
struct rt_rq *my_q;
|
||||
#endif
|
||||
};
|
||||
|
||||
struct task_struct {
|
||||
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
|
||||
void *stack;
|
||||
@@ -1037,33 +1174,17 @@ struct task_struct {
|
||||
|
||||
int lock_depth; /* BKL lock depth */
|
||||
|
||||
#ifndef CONFIG_SCHED_BFS
|
||||
#ifdef CONFIG_SMP
|
||||
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
|
||||
int oncpu;
|
||||
#endif
|
||||
#endif
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
int oncpu;
|
||||
#endif
|
||||
|
||||
int prio, static_prio, normal_prio;
|
||||
unsigned int rt_priority;
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
int time_slice;
|
||||
u64 deadline;
|
||||
struct list_head run_list;
|
||||
u64 last_ran;
|
||||
u64 sched_time; /* sched_clock time spent running */
|
||||
#ifdef CONFIG_SMP
|
||||
int sticky; /* Soft affined flag */
|
||||
#endif
|
||||
unsigned long rt_timeout;
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
const struct sched_class *sched_class;
|
||||
struct sched_entity se;
|
||||
struct sched_rt_entity rt;
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_PREEMPT_NOTIFIERS
|
||||
/* list of struct preempt_notifier: */
|
||||
@@ -1158,9 +1279,6 @@ struct task_struct {
|
||||
int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */
|
||||
|
||||
cputime_t utime, stime, utimescaled, stimescaled;
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
unsigned long utime_pc, stime_pc;
|
||||
#endif
|
||||
cputime_t gtime;
|
||||
cputime_t prev_utime, prev_stime;
|
||||
unsigned long nvcsw, nivcsw; /* context switch counts */
|
||||
@@ -1370,66 +1488,6 @@ struct task_struct {
|
||||
#endif /* CONFIG_TRACING */
|
||||
};
|
||||
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
extern int grunqueue_is_locked(void);
|
||||
extern void grq_unlock_wait(void);
|
||||
extern void cpu_scaling(int cpu);
|
||||
extern void cpu_nonscaling(int cpu);
|
||||
#define tsk_seruntime(t) ((t)->sched_time)
|
||||
#define tsk_rttimeout(t) ((t)->rt_timeout)
|
||||
#define task_rq_unlock_wait(tsk) grq_unlock_wait()
|
||||
|
||||
static inline void set_oom_timeslice(struct task_struct *p)
|
||||
{
|
||||
p->time_slice = HZ;
|
||||
}
|
||||
|
||||
static inline void tsk_cpus_current(struct task_struct *p)
|
||||
{
|
||||
}
|
||||
|
||||
#define runqueue_is_locked(cpu) grunqueue_is_locked()
|
||||
|
||||
static inline void print_scheduler_version(void)
|
||||
{
|
||||
printk(KERN_INFO"BFS CPU scheduler v0.376 by Con Kolivas.\n");
|
||||
}
|
||||
|
||||
static inline int iso_task(struct task_struct *p)
|
||||
{
|
||||
return (p->policy == SCHED_ISO);
|
||||
}
|
||||
#else
|
||||
extern int runqueue_is_locked(int cpu);
|
||||
extern void task_rq_unlock_wait(struct task_struct *p);
|
||||
#define tsk_seruntime(t) ((t)->se.sum_exec_runtime)
|
||||
#define tsk_rttimeout(t) ((t)->rt.timeout)
|
||||
|
||||
static inline void sched_exit(struct task_struct *p)
|
||||
{
|
||||
}
|
||||
|
||||
static inline void set_oom_timeslice(struct task_struct *p)
|
||||
{
|
||||
p->rt.time_slice = HZ;
|
||||
}
|
||||
|
||||
static inline void tsk_cpus_current(struct task_struct *p)
|
||||
{
|
||||
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
|
||||
}
|
||||
|
||||
static inline void print_scheduler_version(void)
|
||||
{
|
||||
printk(KERN_INFO"CFS CPU scheduler.\n");
|
||||
}
|
||||
|
||||
static inline int iso_task(struct task_struct *p)
|
||||
{
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
/* Future-safe accessor for struct task_struct's cpus_allowed. */
|
||||
#define tsk_cpumask(tsk) (&(tsk)->cpus_allowed)
|
||||
|
||||
@@ -1448,19 +1506,9 @@ static inline int iso_task(struct task_struct *p)
|
||||
|
||||
#define MAX_USER_RT_PRIO 100
|
||||
#define MAX_RT_PRIO MAX_USER_RT_PRIO
|
||||
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
|
||||
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#define PRIO_RANGE (40)
|
||||
#define MAX_PRIO (MAX_RT_PRIO + PRIO_RANGE)
|
||||
#define ISO_PRIO (MAX_RT_PRIO)
|
||||
#define NORMAL_PRIO (MAX_RT_PRIO + 1)
|
||||
#define IDLE_PRIO (MAX_RT_PRIO + 2)
|
||||
#define PRIO_LIMIT ((IDLE_PRIO) + 1)
|
||||
#else /* CONFIG_SCHED_BFS */
|
||||
#define MAX_PRIO (MAX_RT_PRIO + 40)
|
||||
#define NORMAL_PRIO DEFAULT_PRIO
|
||||
#endif /* CONFIG_SCHED_BFS */
|
||||
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
|
||||
|
||||
static inline int rt_prio(int prio)
|
||||
{
|
||||
@@ -1743,7 +1791,7 @@ task_sched_runtime(struct task_struct *task);
|
||||
extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
|
||||
|
||||
/* sched_exec is called by processes performing an exec */
|
||||
#if defined(CONFIG_SMP) && !defined(CONFIG_SCHED_BFS)
|
||||
#ifdef CONFIG_SMP
|
||||
extern void sched_exec(void);
|
||||
#else
|
||||
#define sched_exec() {}
|
||||
@@ -1897,9 +1945,6 @@ extern void wake_up_new_task(struct task_struct *tsk,
|
||||
static inline void kick_process(struct task_struct *tsk) { }
|
||||
#endif
|
||||
extern void sched_fork(struct task_struct *p, int clone_flags);
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
extern void sched_exit(struct task_struct *p);
|
||||
#endif
|
||||
extern void sched_dead(struct task_struct *p);
|
||||
|
||||
extern void proc_caches_init(void);
|
||||
|
||||
init/Kconfig (17 lines changed)
@@ -23,19 +23,6 @@ config CONSTRUCTORS
|
||||
|
||||
menu "General setup"
|
||||
|
||||
config SCHED_BFS
|
||||
bool "BFS cpu scheduler"
|
||||
---help---
|
||||
The Brain Fuck CPU Scheduler for excellent interactivity and
|
||||
responsiveness on the desktop and solid scalability on normal
|
||||
hardware. Not recommended for 4096 CPUs.
|
||||
|
||||
Currently incompatible with the Group CPU scheduler, and RCU TORTURE
|
||||
TEST so these options are disabled.
|
||||
|
||||
Say Y here.
|
||||
default y
|
||||
|
||||
config EXPERIMENTAL
|
||||
bool "Prompt for development and/or incomplete code/drivers"
|
||||
---help---
|
||||
@@ -456,7 +443,7 @@ config HAVE_UNSTABLE_SCHED_CLOCK
|
||||
|
||||
config GROUP_SCHED
|
||||
bool "Group CPU scheduler"
|
||||
depends on EXPERIMENTAL && !SCHED_BFS
|
||||
depends on EXPERIMENTAL
|
||||
default n
|
||||
help
|
||||
This feature lets CPU scheduler recognize task groups and control CPU
|
||||
@@ -572,7 +559,7 @@ config PROC_PID_CPUSET
|
||||
|
||||
config CGROUP_CPUACCT
|
||||
bool "Simple CPU accounting cgroup subsystem"
|
||||
depends on CGROUPS && !SCHED_BFS
|
||||
depends on CGROUPS
|
||||
help
|
||||
Provides a simple Resource Controller for monitoring the
|
||||
total CPU consumed by the tasks in a cgroup.
|
||||
|
||||
@@ -840,8 +840,6 @@ static noinline int init_post(void)
|
||||
system_state = SYSTEM_RUNNING;
|
||||
numa_default_policy();
|
||||
|
||||
print_scheduler_version();
|
||||
|
||||
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
|
||||
printk(KERN_WARNING "Warning: unable to open an initial console.\n");
|
||||
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
# Makefile for the linux kernel.
|
||||
#
|
||||
|
||||
obj-y = sched_bfs.o fork.o exec_domain.o panic.o printk.o \
|
||||
obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
|
||||
cpu.o exit.o itimer.o time.o softirq.o resource.o \
|
||||
sysctl.o capability.o ptrace.o timer.o user.o \
|
||||
signal.o sys.o kmod.o workqueue.o pid.o \
|
||||
@@ -107,7 +107,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
|
||||
# me. I suspect most platforms don't need this, but until we know that for sure
|
||||
# I turn this off for IA-64 only. Andreas Schwab says it's also needed on m68k
|
||||
# to get a correct value for the wait-channel (WCHAN in ps). --davidm
|
||||
CFLAGS_sched_bfs.o := $(PROFILING) -fno-omit-frame-pointer
|
||||
CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
|
||||
endif
|
||||
|
||||
$(obj)/configs.o: $(obj)/config_data.h
|
||||
|
||||
@@ -127,7 +127,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
|
||||
*/
|
||||
t1 = tsk->sched_info.pcount;
|
||||
t2 = tsk->sched_info.run_delay;
|
||||
t3 = tsk_seruntime(tsk);
|
||||
t3 = tsk->se.sum_exec_runtime;
|
||||
|
||||
d->cpu_count += t1;
|
||||
|
||||
|
||||
@@ -120,7 +120,7 @@ static void __exit_signal(struct task_struct *tsk)
|
||||
sig->inblock += task_io_get_inblock(tsk);
|
||||
sig->oublock += task_io_get_oublock(tsk);
|
||||
task_io_accounting_add(&sig->ioac, &tsk->ioac);
|
||||
sig->sum_sched_runtime += tsk_seruntime(tsk);
|
||||
sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
|
||||
sig = NULL; /* Marker for below. */
|
||||
}
|
||||
|
||||
|
||||
@@ -1199,7 +1199,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
|
||||
* parent's CPU). This avoids alot of nasty races.
|
||||
*/
|
||||
p->cpus_allowed = current->cpus_allowed;
|
||||
tsk_cpus_current(p);
|
||||
p->rt.nr_cpus_allowed = current->rt.nr_cpus_allowed;
|
||||
if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) ||
|
||||
!cpu_online(task_cpu(p))))
|
||||
set_task_cpu(p, smp_processor_id());
|
||||
|
||||
@@ -16,7 +16,7 @@
|
||||
#include <linux/mutex.h>
|
||||
#include <trace/events/sched.h>
|
||||
|
||||
#define KTHREAD_NICE_LEVEL (0)
|
||||
#define KTHREAD_NICE_LEVEL (-5)
|
||||
|
||||
static DEFINE_SPINLOCK(kthread_create_lock);
|
||||
static LIST_HEAD(kthread_create_list);
|
||||
@@ -170,6 +170,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
|
||||
}
|
||||
set_task_cpu(k, cpu);
|
||||
k->cpus_allowed = cpumask_of_cpu(cpu);
|
||||
k->rt.nr_cpus_allowed = 1;
|
||||
k->flags |= PF_THREAD_BOUND;
|
||||
}
|
||||
EXPORT_SYMBOL(kthread_bind);
|
||||
|
||||
@@ -249,7 +249,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
|
||||
do {
|
||||
times->utime = cputime_add(times->utime, t->utime);
|
||||
times->stime = cputime_add(times->stime, t->stime);
|
||||
times->sum_exec_runtime += tsk_seruntime(t);
|
||||
times->sum_exec_runtime += t->se.sum_exec_runtime;
|
||||
|
||||
t = next_thread(t);
|
||||
} while (t != tsk);
|
||||
@@ -516,7 +516,7 @@ static void cleanup_timers(struct list_head *head,
|
||||
void posix_cpu_timers_exit(struct task_struct *tsk)
|
||||
{
|
||||
cleanup_timers(tsk->cpu_timers,
|
||||
tsk->utime, tsk->stime, tsk_seruntime(tsk));
|
||||
tsk->utime, tsk->stime, tsk->se.sum_exec_runtime);
|
||||
|
||||
}
|
||||
void posix_cpu_timers_exit_group(struct task_struct *tsk)
|
||||
@@ -526,7 +526,7 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
|
||||
cleanup_timers(tsk->signal->cpu_timers,
|
||||
cputime_add(tsk->utime, sig->utime),
|
||||
cputime_add(tsk->stime, sig->stime),
|
||||
tsk_seruntime(tsk) + sig->sum_sched_runtime);
|
||||
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
|
||||
}
|
||||
|
||||
static void clear_dead_task(struct k_itimer *timer, union cpu_time_count now)
|
||||
@@ -1017,7 +1017,7 @@ static void check_thread_timers(struct task_struct *tsk,
|
||||
struct cpu_timer_list *t = list_first_entry(timers,
|
||||
struct cpu_timer_list,
|
||||
entry);
|
||||
if (!--maxfire || tsk_seruntime(tsk) < t->expires.sched) {
|
||||
if (!--maxfire || tsk->se.sum_exec_runtime < t->expires.sched) {
|
||||
tsk->cputime_expires.sched_exp = t->expires.sched;
|
||||
break;
|
||||
}
|
||||
@@ -1033,7 +1033,7 @@ static void check_thread_timers(struct task_struct *tsk,
|
||||
unsigned long *soft = &sig->rlim[RLIMIT_RTTIME].rlim_cur;
|
||||
|
||||
if (hard != RLIM_INFINITY &&
|
||||
tsk_rttimeout(tsk) > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
|
||||
tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
|
||||
/*
|
||||
* At the hard limit, we just die.
|
||||
* No need to calculate anything else now.
|
||||
@@ -1041,7 +1041,7 @@ static void check_thread_timers(struct task_struct *tsk,
|
||||
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
|
||||
return;
|
||||
}
|
||||
if (tsk_rttimeout(tsk) > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
|
||||
if (tsk->rt.timeout > DIV_ROUND_UP(*soft, USEC_PER_SEC/HZ)) {
|
||||
/*
|
||||
* At the soft limit, send a SIGXCPU every second.
|
||||
*/
|
||||
@@ -1357,7 +1357,7 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
|
||||
struct task_cputime task_sample = {
|
||||
.utime = tsk->utime,
|
||||
.stime = tsk->stime,
|
||||
.sum_exec_runtime = tsk_seruntime(tsk)
|
||||
.sum_exec_runtime = tsk->se.sum_exec_runtime
|
||||
};
|
||||
|
||||
if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
|
||||
|
||||
@@ -1,6 +1,3 @@
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
#include "sched_bfs.c"
|
||||
#else
|
||||
/*
|
||||
* kernel/sched.c
|
||||
*
|
||||
@@ -10801,4 +10798,3 @@ struct cgroup_subsys cpuacct_subsys = {
|
||||
.subsys_id = cpuacct_subsys_id,
|
||||
};
|
||||
#endif /* CONFIG_CGROUP_CPUACCT */
|
||||
#endif /* CONFIG_SCHED_BFS */
|
||||
|
||||
kernel/sched_bfs.c (6737 lines changed; file diff suppressed because it is too large)
@@ -100,15 +100,10 @@ static int neg_one = -1;
|
||||
#endif
|
||||
|
||||
static int zero;
|
||||
static int __maybe_unused one = 1;
|
||||
static int __maybe_unused two = 2;
|
||||
static unsigned long one_ul = 1;
|
||||
static int __read_mostly one = 1;
|
||||
static int __read_mostly one_hundred = 100;
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
extern int rr_interval;
|
||||
extern int sched_iso_cpu;
|
||||
static int __read_mostly one_thousand = 1000;
|
||||
#endif
|
||||
static int one_hundred = 100;
|
||||
|
||||
/* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
|
||||
static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
|
||||
@@ -243,7 +238,7 @@ static struct ctl_table root_table[] = {
|
||||
{ .ctl_name = 0 }
|
||||
};
|
||||
|
||||
#if defined(CONFIG_SCHED_DEBUG) && !defined(CONFIG_SCHED_BFS)
|
||||
#ifdef CONFIG_SCHED_DEBUG
|
||||
static int min_sched_granularity_ns = 100000; /* 100 usecs */
|
||||
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
||||
static int min_wakeup_granularity_ns; /* 0 usecs */
|
||||
@@ -251,15 +246,6 @@ static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
|
||||
#endif
|
||||
|
||||
static struct ctl_table kern_table[] = {
|
||||
#ifndef CONFIG_SCHED_BFS
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_child_runs_first",
|
||||
.data = &sysctl_sched_child_runs_first,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
#ifdef CONFIG_SCHED_DEBUG
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
@@ -312,6 +298,14 @@ static struct ctl_table kern_table[] = {
|
||||
.strategy = &sysctl_intvec,
|
||||
.extra1 = &zero,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_child_runs_first",
|
||||
.data = &sysctl_sched_child_runs_first,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_features",
|
||||
@@ -336,14 +330,6 @@ static struct ctl_table kern_table[] = {
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "sched_time_avg",
|
||||
.data = &sysctl_sched_time_avg,
|
||||
.maxlen = sizeof(unsigned int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "timer_migration",
|
||||
@@ -380,7 +366,6 @@ static struct ctl_table kern_table[] = {
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
#endif /* !CONFIG_SCHED_BFS */
|
||||
#ifdef CONFIG_PROVE_LOCKING
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
@@ -813,30 +798,6 @@ static struct ctl_table kern_table[] = {
|
||||
.proc_handler = &proc_dointvec,
|
||||
},
|
||||
#endif
|
||||
#ifdef CONFIG_SCHED_BFS
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "rr_interval",
|
||||
.data = &rr_interval,
|
||||
.maxlen = sizeof (int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec_minmax,
|
||||
.strategy = &sysctl_intvec,
|
||||
.extra1 = &one,
|
||||
.extra2 = &one_thousand,
|
||||
},
|
||||
{
|
||||
.ctl_name = CTL_UNNUMBERED,
|
||||
.procname = "iso_cpu",
|
||||
.data = &sched_iso_cpu,
|
||||
.maxlen = sizeof (int),
|
||||
.mode = 0644,
|
||||
.proc_handler = &proc_dointvec_minmax,
|
||||
.strategy = &sysctl_intvec,
|
||||
.extra1 = &zero,
|
||||
.extra2 = &one_hundred,
|
||||
},
|
||||
#endif
|
||||
#if defined(CONFIG_S390) && defined(CONFIG_SMP)
|
||||
{
|
||||
.ctl_name = KERN_SPIN_RETRY,
|
||||
|
||||
@@ -1153,7 +1153,8 @@ void update_process_times(int user_tick)
|
||||
struct task_struct *p = current;
|
||||
int cpu = smp_processor_id();
|
||||
|
||||
/* Accounting is done within sched_bfs.c */
|
||||
/* Note: this timer irq context must be accounted for as well. */
|
||||
account_process_tick(p, user_tick);
|
||||
run_local_timers();
|
||||
if (rcu_pending(cpu))
|
||||
rcu_check_callbacks(cpu, user_tick);
|
||||
@@ -1197,7 +1198,7 @@ void do_timer(unsigned long ticks)
|
||||
{
|
||||
jiffies_64 += ticks;
|
||||
update_wall_time();
|
||||
calc_global_load();
|
||||
calc_global_load(ticks);
|
||||
}
|
||||
|
||||
#ifdef __ARCH_WANT_SYS_ALARM
|
||||
|
||||
@@ -275,10 +275,10 @@ unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
|
||||
void trace_wake_up(void)
|
||||
{
|
||||
/*
|
||||
* The grunqueue_is_locked() can fail, but this is the best we
|
||||
* The runqueue_is_locked() can fail, but this is the best we
|
||||
* have for now:
|
||||
*/
|
||||
if (!(trace_flags & TRACE_ITER_BLOCK) && !grunqueue_is_locked())
|
||||
if (!(trace_flags & TRACE_ITER_BLOCK) && !runqueue_is_locked())
|
||||
wake_up(&trace_wait);
|
||||
}
|
||||
|
||||
|
||||
@@ -317,6 +317,8 @@ static int worker_thread(void *__cwq)
|
||||
if (cwq->wq->freezeable)
|
||||
set_freezable();
|
||||
|
||||
set_user_nice(current, -5);
|
||||
|
||||
for (;;) {
|
||||
prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
|
||||
if (!freezing(current) &&
|
||||
|
||||
@@ -723,37 +723,6 @@ config RCU_TORTURE_TEST_RUNNABLE
|
||||
Say N here if you want the RCU torture tests to start only
|
||||
after being manually enabled via /proc.
|
||||
|
||||
config RCU_TORTURE_TEST
|
||||
tristate "torture tests for RCU"
|
||||
depends on DEBUG_KERNEL && !SCHED_BFS
|
||||
default n
|
||||
help
|
||||
This option provides a kernel module that runs torture tests
|
||||
on the RCU infrastructure. The kernel module may be built
|
||||
after the fact on the running kernel to be tested, if desired.
|
||||
|
||||
Say Y here if you want RCU torture tests to be built into
|
||||
the kernel.
|
||||
Say M if you want the RCU torture tests to build as a module.
|
||||
Say N if you are unsure.
|
||||
|
||||
config RCU_TORTURE_TEST_RUNNABLE
|
||||
bool "torture tests for RCU runnable by default"
|
||||
depends on RCU_TORTURE_TEST = y
|
||||
default n
|
||||
help
|
||||
This option provides a way to build the RCU torture tests
|
||||
directly into the kernel without them starting up at boot
|
||||
time. You can use /proc/sys/kernel/rcutorture_runnable
|
||||
to manually override this setting. This /proc file is
|
||||
available only when the RCU torture tests have been built
|
||||
into the kernel.
|
||||
|
||||
Say Y here if you want the RCU torture tests to start during
|
||||
boot (you probably don't).
|
||||
Say N here if you want the RCU torture tests to start only
|
||||
after being manually enabled via /proc.
|
||||
|
||||
config RCU_CPU_STALL_DETECTOR
|
||||
bool "Check for stalled CPUs delaying RCU grace periods"
|
||||
depends on CLASSIC_RCU || TREE_RCU
|
||||
|
||||
@@ -338,7 +338,7 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
|
||||
* all the memory it needs. That way it should be able to
|
||||
* exit() and clear out its resources quickly...
|
||||
*/
|
||||
p->time_slice = HZ;
|
||||
p->rt.time_slice = HZ;
|
||||
set_tsk_thread_flag(p, TIF_MEMDIE);
|
||||
|
||||
force_sig(SIGKILL, p);
|
||||
|
||||