Linux 3.2 – CFS CPU bandwidth (English version)

Published by cpb
Jan 07 2012

(Original version in French)

Linus Torvalds released the Linux 3.2 kernel two days ago. As usual, it contains many additions and improvements. One of them caught my attention and I wanted to see how it works: the CPU bandwidth controller for the CFS scheduler.

There were already several ways to adjust the percentage of CPU that a job could use relative to other jobs under the time-sharing scheduler (among others the setpriority() system call, the nice command, or the parameters in the /sys/fs/cgroup/cpu filesystem). We could easily assign 25% of the CPU time to one task and 75% to another. But, until now, if the second one exited, the first one got 100% of the CPU time.

In other words, the “CPU bandwidth” of a task running alone was always 100%. It is now possible to change this value in order to assign a smaller share of CPU time, even when the task runs alone.

What is the point of reducing the instantaneous CPU consumption of a process? On everyday systems, admittedly, the benefit is limited. But it is different on some large servers, where the user is billed not only for the processor time consumed but also for the instantaneous CPU power made available. In these environments it is very useful to limit the CPU power available to a process regardless of the overall system load.

Another advantage of this limitation is that it smooths the availability of the CPU, avoiding surges and sudden drops. One can imagine a hosting environment in which it is necessary to run multiple – say 4 – emulators (Qemu for example). By limiting each of them to 25% of the CPU power, the behavior of a virtual system will not depend on the presence or absence of the other emulators on the host.

Using the CPU bandwidth controller

There is a new compile-time option, found in the kernel configuration menu under “General Setup” – “Control Group support” – “Group CPU scheduler” – “CPU bandwidth provisioning for FAIR_GROUP_SCHED“. It is necessary to enable all these options.

Here is a generic configuration file for Linux 3.2 on a PC (from an Ubuntu 11.10 distribution) with the “CPU bandwidth provisioning” option enabled.

After compiling and rebooting into the new kernel, we see a new /proc/sys/kernel/sched_cfs_bandwidth_slice_us file, which indicates the size of the time slices handed out from a group's global quota when its tasks consume CPU power spread across several processors. The default value is 5 milliseconds.

# cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us
5000
#

To access the settings of the CPU power given to a group or a task we must go through the cgroup filesystem:

# mount none /sys/fs/cgroup -t tmpfs
# mkdir /sys/fs/cgroup/cpu
# mount none /sys/fs/cgroup/cpu/ -t cgroup -o cpu
# ls /sys/fs/cgroup/cpu/
cgroup.clone_children  cgroup.procs       cpu.cfs_quota_us  cpu.rt_runtime_us  cpu.stat           release_agent
cgroup.event_control   cpu.cfs_period_us  cpu.rt_period_us  cpu.shares         notify_on_release  tasks
#

Test

Single task

To check the CPU power given to a task we will create a small application that loops indefinitely, and displays every second on its standard output the number of loops that could be achieved during the last second.

consomme-cpu.c:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    long long int compteur;
    time_t debut = time(NULL);
    // Wait for the next second
    while (time(NULL) == debut)
        ;
    while (1) {
        compteur = 0;
        time (& debut);
        while (time(NULL) == debut)
            compteur ++;
        fprintf(stdout, "[%d]%lld\n", (int) getpid(), compteur);
    }
    return 0;
}

Start this process in a terminal and let it run.

$ ./consomme-cpu 
          [3040]8046202
          [3040]8051003
          [3040]8038645
          [3040]8049329
          [3040]8378210
          [3040]8419106
          [3040]8416285
          [3040]8418075
          [3040]8415878
          [3040]8419727
          [3040]8416073
          [3040]8417343
          [3040]8414809
          ...

It performs about eight million loops per second. In another console, let's create a dedicated control group and insert our process into it.

# cd /sys/fs/cgroup/cpu/
# mkdir group-1
# echo 3040 > group-1/tasks
#

The bandwidth control parameters have the following default values.

# cat group-1/cpu.cfs_period_us
100000
# cat group-1/cpu.cfs_quota_us
-1
#

The bandwidth regulation period is 100 milliseconds, and the CPU quota granted to the task is -1, a negative value meaning that no constraint is applied to the process.

We will modify this value to give the task a quota of 25 milliseconds per 100-millisecond period.

# echo 25000 > group-1/cpu.cfs_quota_us
#

The behavior of the process then changes:

          [3040]8416132
          [3040]8417509
          [3040]4302933
          [3040]1766917
          [3040]1763016
          [3040]1798414
          [3040]1740740
          ...

The number of loops drops to about 1.8 million per second. Let's raise the quota to 50%.

# echo 50000 > group-1/cpu.cfs_quota_us
#

So we see the number of loops per second increase again.

          [3040]1672509
          [3040]2776452
          [3040]3662061
          [3040]3777860
          [3040]3694039
          [3040]3768352
          [3040]3745732
          [3040]3822385
          ...

Restore the value to 100,000 microseconds and our process resumes its initial behavior.

# echo 100000 > group-1/cpu.cfs_quota_us
#
          [3040]3708664
          [3040]3779417
          [3040]5944852
          [3040]8051275
          [3040]8049984
          [3040]8050405
          ...

Group of tasks

Now let's observe the behavior when we start two instances of the program and insert them into the same control group.

          $ ./consomme-cpu & ./consomme-cpu
          [6105]8051374
          [6106]8050150
          [6106]8047698
          [6105]8046993
          [6106]8046275
          [6105]8046629
          [6106]8050943
          [6105]8044749
          [6106]8049421
          [6105]8047196
          ...

Our tasks were placed on two distinct processors (or cores) and each can achieve 8 million loops per second. However, we can limit them with a lower CPU-time quota. For example, a total of 100% of one CPU for the whole group.

# echo 6105 > group-1/tasks
# echo 6106 > group-1/tasks
# echo 100000 > group-1/cpu.cfs_quota_us
#

Each manages to do some 4 million loops per second.

          [6105]8055582
          [6106]8023743
          [6105]7876233
          [6106]7598323
          [6106]3678658
          [6105]3910164
          [6105]3800836
          [6106]3753870
          [6105]3776468
          [6106]3659704
          ...

Or a quota of 150% CPU (on two processors, of course).

# echo 150000 > group-1/cpu.cfs_quota_us
#
          [6105]3667414
          [6106]3831150
          [6106]3839798
          [6105]3714979
          [6105]5004119
          [6106]5851544
          [6106]5801272
          [6105]6028022
          [6105]5810675
          [6106]5749970
          [6106]5784018
          [6105]5769202
          ...

Conclusion

The values seen above are not perfectly stable and accurate, given the time-sharing scheduling, which takes into account all the runnable tasks and the CPU-consumption behavior of each of them.

Controlling the CPU power available to each group of tasks is, in my opinion, mainly interesting for application servers and hosts of virtual machines. In my particular case, I plan to use it when I run test programs on multiple instances of Qemu, to isolate each of them from the activity on the host server.
