Scheduling, Priority Calculation and the nice value.

Scheduling, Priority Calculation and the nice value.
	Chapter 7. System Tuning

The scheduling policy in Linux is an important part of the theory behind Performance Tuning. I know that we have spoken about scheduling previously, in connection with time slicing, the run queue and sleep queue, parent and child processes etcetera

Two of the most critical parts of a kernel are the memory subsystem and the scheduler. This is because they influence the design and affect the performance of almost every other part of the kernel and the Operating System.

The scheduler watches processes and re-calculates process position on run-queue and CPU all the time. Generally those that are CPU deprived are given a higher priority and those that have been using a lot of CPU time and penalised for that and therefore probably end up with a lower priority.

If we group types of processes we could say that there are just three types:

No user interaction required - usually a lengthy process that should run in the background (parent does not wait().) Could be a database update or search scientific number sorting and crunching or even a compilation of a program. These processes are given a low priority by the scheduler and penalised heavily for being left in the background and considered not urgent at all. **Remember that, if you schedule a process to run in the background and you choose the default priority (not override with nice or BSD version renice) then your process will run extremely slowly especially on a busy machine a all other processes will have a higher priority by default.^[6]
The processes that are waiting for IO are an interesting bunch and we can really get our teeth into looking at these (almost literally - See below for Process Table wait states). If a user response is required the process will wait() for the response, but once the response is received the user "customer" demands good response times and therefore this process is given a higher priority and put back on the run queue for the next bout of re-calculation of position. It is said that the delay from user response to result should be only between 50 and 150 milliseconds.
The last section are called real-time processes and for these the scheduler drops everything else, to the point that if a process is already on the processor, the scheduler switch the process out to TASK_RUNNING state in which it waits to run again on the CPU - this in order to allow for the real-time process to run.

Generally these are system processes (daemons) but they could also be processes that interact with hardware devices.

Linux and Unix schedulers generally favour IO bound processes over the CPU bound processes.

The scheduler recalculates the priorities of the processes on the run queue using an algorithm which we will discuss after this, however in the older Unix variants, the scheduler would run a recalculation on a regular basis or at the end of each time slice.

On the Linux run queue the priority is only calculated once all the processes have used up there allocated time slices (called a quantum) and then the scheduler recalculates their position or their new position on the run queue.

The algorithm

The scheduler is implemented in the 'main kernel file' kernel/sched.c.

The corresponding header file include/linux/sched.h is included (either explicitly or indirectly) by virtually every kernel source file.

The following is an extract from the man pages:

&quot;The fields of task structure relevant to scheduler include: 
p-&gt;need_resched: this field is set if schedule() should be invoked at 
the &apos;next opportunity&apos;. 

p-&gt;counter: number of clock ticks left to run in this scheduling slice,
decremented by a timer. When this field becomes lower than or equal to zero, 
it is reset to 0 and p-&gt;need_resched is set. This is also sometimes called
&apos;dynamic priority&apos; of a process because it can change by itself.

p-&gt;priority: the process&apos; static priority, only changed through 
well-known system calls like nice(2), POSIX.1b sched_setparam(2) or 
.4BSD/SVR4 setpriority(2).

p-&gt;rt_priority: real-time priority

p-&gt;policy: the scheduling policy, specifies which scheduling class the task belongs to. &quot;

Scheduler Classes:
SCHED_OTHER (traditional UNIX process), 
SCHED_FIFO (POSIX.1b FIFO real-time process) 
SCHED_RR (POSIX round-robin real-time process).

The scheduler's algorithm is simple, despite the great apparent complexity of the schedule() function.

A Simple Explanation:

Each process gets a time slice that's 1/100th of a second long.

At the end of each time slice, the do_timer() function is called and priorities are recalculated.

Each time a system call returns to user mode, do_timer() is also called to update the times.

Scheduling processes is not as easy as just finding the process that has been waiting the longest. Some operating systems do this kind of scheduling, which is referred to as "round-robin." The processes could be thought of as sitting in a circle. The scheduling algorithm could then be though of as a pointer that moves around the circle, getting to each process in turn.

The Linux scheduler does a modified version of round-robin scheduling, however, processes with a higher priority get to run more often and longer.

The Nice Value

The nice value is a property that exists for every process. It is not, as is often misunderstood, the priority of a process. It is a number that influences the priority property of a process.

The nice value and the priority of a process are two different things as reported by the "ps -el" (See Screen Shot below). You will see a column labelled or titled "NI" for the nice value and "PRI" for the priority.

riaan@linux:~&gt; ps -el
  UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT  TT       TIME COMMAND
 1001   650   649   0   8  0  1324 1184 wait   SLs   p0    0:00.01 zsh
 1001   664   650   0  76  0  3764 3056 select SL+   p0    0:02.53 ssh -v -C 
 1001  1113  1112   0   8  0  1332 1192 wait   SLs   p1    0:00.03 zsh

The nice number can be one of 40 different values on System V the range for nice values is 0 through 39 and on BSD the range is -20 through 19, Linux uses the BSD style.

A higher nice value makes a process run at a lower priority and therefore slower and conversely a lower nice value gives a process a higher priority and it therefore runs faster.

The default nice value of a process is the middle value (0 on Linux) running a process in the background of some shells will result with the shell modifying the process's nice value to make it run a bit slower.

Normal users can change their process's nice values higher from the default middle value (lower priority; therefore being nice to other users) and only root can change processes to run with a lower nice value than the default middle value (higher priority).

Linux also allows you to be nice to your fellow processes. If you feel that your work is not as important as someone else's, you might want to consider being nice to him or her. This is done with the nice command, the syntax of which is:

nice &gt;nice_value&lt; &gt;command&gt;

For example, if you wanted to run the date command with a lower priority, you could run it like this:

nice -10 date

This decreases the start priority of the date command by 10.

To change the nice value of an existing process use the renice command. You can lookup its man page.

The nice value only affects running processes, but child processes inherit the nice value of their parent.

The nice value has incredible power with the scheduler and if used correctly e.g. for a report for the financial director to get there as fast as possible, can be of great benefit to the system administrator and the end-user.

The numeric value calculated for the priority is the opposite of what we normally think of as priority the lower the number the higher the priority.

Scheduling code - From the process table perspective

It is the scheduler that must select the most deserving process to run out of all of the runnable processes in the system. A runnable process is one, which is waiting only for a CPU to run on.

Linux uses a reasonably simple priority based scheduling algorithm to choose between the current processes in the system. When it has chosen a new process to run it saves the state of the current process, the processor specific registers and other context being saved in the processes task_struct data structure.

Interlude: task_struct

This structure represents the states of all tasks running in the systems.

All executing processes have an entry in the process table. The first entry in the process table is the special init process, which is the first process started at boot time.

There is a field that represents the process state, a field that indicates the processes priority
a field, which holds the number of clock ticks (counter), which the process can continue executing without forced rescheduling.
It also contains the schedule policy (SCHED_OTHER, SCHED_FIFO, SCHED_RR) to determine how to schedule the process.

In order to keep track of all executing processes through from the parent to child processes etcetera,

next_task and prev_task
There is a nested structure, mm_struct, which contains a process's memory management information, (such as start and end address of the code segment
Process ID information is also kept within the task_struct.
The process and group id are stored.
File specific process data is located in a fs_struct substructure.
Finally, there are fields that hold timing information; for example, the amount of time the process has spent in user mode
other information less crucial to scheduling.

It is this information saved in the task_struct that is used by the scheduler to restore the state of the new process to run and then gives control of the system to that process.

For the scheduler to fairly allocate CPU time between the runnable processes in the system it keeps information in the task_struct for each process:

Policies

There are the three scheduling policies that influence the process. Remember that the Real time processes have a higher priority than all of the other processes. In round robin scheduling, each runnable real time process is run in turn and in first in, first out scheduling each runnable process is run in the order that it is in on the run queue and that order is never changed.

The priority that the scheduler will give to this process is the value used for recalculation when all runnable processes have a counter value of 0. You can alter the priority of a process by means of system calls and the renice command.

This field allows the scheduler to give each real time process a relative priority. The priority of a real time processes can be altered using system calls.

The counter variable holds the amount of time (in jiffies) that this process is allowed to run for. It is set to priority when the process is first run and is decremented each clock tick.

More detail on scheduling

System Call	Description
nice( )	Change the priority of a conventional process.
getpriority( )	Get the max priority of a group of conventional processes.
setpriority( )	Set the priority of a group of conventional processes.
sched_getscheduler( )	Get the scheduling policy of a process.
sched_setscheduler( )	Set the scheduling policy and priority of a process.
sched_getparam( )	Get the scheduling priority of a process.
sched_setparam( )	Set the priority of a process.
sched_yield( )	Relinquish the processor voluntarily without blocking
sched_get_ priority_min( )	Get the minimum priority value for a policy.
sched_get_ priority_max( )	Get the maximum priority value for a policy.
sched_rr_get_interval( )	Get the time quantum value for the Round Robin policy.

sched_priority

Can have a value in the range 0 to 99. In order to determine the process that runs next, the Linux scheduler looks for the non-empty list with the highest static priority and takes the process at the head of this list. The scheduling policy determines for each process, where it will be inserted into the list of processes with equal static priority and how it will move inside this list.

SCHED_FIFO

For real time processes with a sched_priority of 0.

SCHED_FIFO is a simple scheduling algorithm without time slicing. For processes scheduled under the SCHED_FIFO policy, the following rules are applied:

A SCHED_FIFO process that has been preempted by another process of higher priority will stay at the head of the list for its priority and will resume execution as soon as all processes of higher priority are blocked again.
When a SCHED_FIFO process becomes runnable, it will be inserted at the end of the list for its priority.
A SCHED_FIFO process runs until either it is blocked by an I/O request, it is preempted by a higher priority process, or it calls sched_yield.

SCHED_RR

SCHED_RR is a simple enhancement of SCHED_FIFO. Everything described above for SCHED_FIFO also applies to SCHED_RR, except that each process is only allowed to run for a maximum time quantum.

If a SCHED_RR process has been running for a time period equal to or longer than the time quantum, it will be put at the end of the list for its priority. The length of the time quantum can be retrieved by sched_rr_get_interval.

SCHED_OTHER

Default Linux time-sharing scheduling

SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the standard Linux time-sharing scheduler that is intended for all processes that do not require special static priority real-time mechanisms.

The process to run is chosen from the static priority 0 list based on a dynamic priority that is determined only inside this list. The dynamic priority is based on the nice level (set by the nice or setpriority system call) and increased for each time quantum the process is ready to run, but denied to run by the scheduler.

^[6]When putting a process into the background, it is automatically penalised with a lower nice value than the default user process nice value.