Parallelizing Compilations

Published by cpb
Jan 14 2012

(Original version in French)

I compile Linux kernels very frequently, often during training sessions or engineering services (mainly in the field of embedded systems or driver development), and sometimes while writing articles or books.

Compilation time varies greatly depending on the amount of code involved (drivers, filesystems, protocols, etc.) and on the CPU power of the host machine. On a mid-range PC, compiling a kernel tailored for an embedded system (with very few drivers) takes about three minutes. On an entry-level (or slightly older) machine, compiling a generic PC kernel (with hundreds of drivers built as modules) can take an hour.

To take advantage of the parallelism offered by current processors (multiprocessor, multicore or hyper-threaded), the make command allows us to run several jobs simultaneously. For example,

$ make -j 4

ensures that four compilation jobs are always active.

I have long repeated that "if you have N processors (or cores, or virtual CPUs) available, you will save time by starting 2N compilation jobs in parallel". The idea is that, for each processor, one job performs a compilation (consuming CPU time) while another one saves the results of the previous compilation and loads the source file for the next. But wait… is this true?
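For the record, this 2N rule translates into a one-liner like the following (nproc comes from GNU coreutils; counting the processor lines of /proc/cpuinfo, as the script below does, works as well):

$ make -j $((2 * $(nproc)))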

Test script

I wrote the following script, which downloads and unpacks the required kernel sources, then performs several compilations with a varying number of jobs. For example, if we start

$ ./test-make-j.sh 3 5 8

it performs three complete compilations: one with three jobs in parallel, one with five, and the last with eight. The results are recorded in a text file. The script follows.

test-make-j.sh 
#! /bin/sh

KERNEL_VERSION="linux-3.2"
KERNEL_URL_PATH="www.kernel.org/pub/linux/kernel/v3.0/"
RESULT_FILE="compilation-timing.txt"

if [ "$#" -eq 0 ]
then
  echo "usage: $@ jobs_number..." >& 2
  exit 0
fi

if [ ! -d "${KERNEL_VERSION}" ]
then
  if [ ! -f "${KERNEL_VERSION}.tar.bz2" ]
  then
    wget "${KERNEL_URL_PATH}/${KERNEL_VERSION}.tar.bz2"
    if [ $? -ne 0 ] || [ ! -f "${KERNEL_VERSION}.tar.bz2" ]
    then
      echo "unable to obtain ${KERNEL_VERSION} archive" >&2
      exit 1
    fi
  fi
  tar xjf "${KERNEL_VERSION}.tar.bz2"
  if [ $? -ne 0 ]
  then
    echo "Error while uncompressing kernel archive" >&2
    exit 1
  fi
fi

cd "${KERNEL_VERSION}"

echo "# Timings of ${KERNEL_VERSION} compilations" >> "${RESULT_FILE}"
nb_cpu=$(grep "^processor" /proc/cpuinfo | wc -l)

echo "# Processors: ${nb_cpu}" >> "${RESULT_FILE}"
affinity=$(taskset -p $$ | sed -e 's/^.*://')

echo "# Affinity mask: ${affinity}" >> "${RESULT_FILE}"
for nb in "$@"
do
  echo "# Compiling with $nb simultaneous jobs" >> "${RESULT_FILE}"
  make mrproper
  make i386_defconfig
  sync
  sleep 10 # Let's all calm down
  start=$(date "+%s")
  make -j $nb
  sync
  end=$(date "+%s")
  # This script will fail in January 2038 ;-)
  echo "$nb     $((end - start))" >> "${RESULT_FILE}"
done
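
To reproduce the full sweep used below in one go (seq generates the list of job counts):

$ chmod +x test-make-j.sh
$ ./test-make-j.sh $(seq 1 16)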

Results

Here are the results of a run on an Intel Q6600 Quad-Core (file: Intel-Q6600-1.txt)

# Timings of linux-3.2 compilations
# Processors: 4
# Affinity mask:  f
# Compiling with 1 simultaneous jobs
1     675
# Compiling with 2 simultaneous jobs
2     346
# Compiling with 3 simultaneous jobs
3     241
# Compiling with 4 simultaneous jobs
4     197
# Compiling with 5 simultaneous jobs
5     198
# Compiling with 6 simultaneous jobs
6     194
# Compiling with 7 simultaneous jobs
7     195
# Compiling with 8 simultaneous jobs
8     196
# Compiling with 9 simultaneous jobs
9     197
# Compiling with 10 simultaneous jobs
10     198
# Compiling with 11 simultaneous jobs
11     198
# Compiling with 12 simultaneous jobs
12     198
# Compiling with 13 simultaneous jobs
13     200
# Compiling with 14 simultaneous jobs
14     201
# Compiling with 15 simultaneous jobs
15     201
# Compiling with 16 simultaneous jobs
16     200

Let's plot them with this little gnuplot command line. The horizontal axis shows the number of concurrent jobs, the vertical axis the compilation time (in seconds).

$ echo "set terminal png size 640,480 ; set output './Intel-Q6600-1.png'; plot 'Intel-Q6600-1.txt' with linespoints" | gnuplot

(Figure: Parallel Compilations on 4 CPU)

Apparently, the best results are achieved (with some fluctuations) with make -j 4. Let's try to confirm this. Before running the script again, we restrict it to two processors with the following command, which binds all jobs launched from the current shell to processors 2 and 3:

$ taskset -pc 2-3 $$
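
We can verify the binding afterwards; taskset should report something along these lines (the pid will differ):

$ taskset -p $$
pid 12345's current affinity mask: c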

Here are the results (file: Intel-Q6600-2.txt).

# Timings of linux-3.2 compilations
# Processors: 4
# Affinity mask:  c
# Compiling with 1 simultaneous jobs
1     684
# Compiling with 2 simultaneous jobs
2     360
# Compiling with 3 simultaneous jobs
3     362
# Compiling with 4 simultaneous jobs
4     366
# Compiling with 8 simultaneous jobs
8     370
# Compiling with 16 simultaneous jobs
16     376
# Compiling with 32 simultaneous jobs
32     377
# Compiling with 64 simultaneous jobs
64     378

(Figure: Parallel Compilations on 2 CPU)

This time, it is clear that the minimum time is achieved with make -j 2. Let's repeat the experiment on a single CPU.
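The affinity mask 8 (binary 1000) in the results below corresponds to processor 3 alone; the binding is done the same way, with a command like:

$ taskset -pc 3 $$

We then obtain the following values (file: Intel-Q6600-3.txt).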

# Timings of linux-3.2 compilations
# Processors: 4
# Affinity mask:  8
# Compiling with 1 simultaneous jobs
1     683
# Compiling with 2 simultaneous jobs
2     698
# Compiling with 3 simultaneous jobs
3     708
# Compiling with 4 simultaneous jobs
4     709
# Compiling with 5 simultaneous jobs
5     719
# Compiling with 6 simultaneous jobs
6     719
# Compiling with 7 simultaneous jobs
7     720
# Compiling with 8 simultaneous jobs
8     724

Represented on the graph below:

(Figure: Parallel Compilations on a single CPU)

We can group these three curves on a single graph to compare their scales (I did not extend the single-CPU curve, but we can imagine that it continues with a slight increase).
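
The combined graph can be produced with the same kind of gnuplot one-liner (the output file name here is my own choice):

$ echo "set terminal png size 640,480 ; set output './Intel-Q6600-all.png'; plot 'Intel-Q6600-1.txt' with linespoints, 'Intel-Q6600-2.txt' with linespoints, 'Intel-Q6600-3.txt' with linespoints" | gnuplot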

(Figure: Parallel Compilations on Q6600)

To double-check, let's repeat the experiment on a different two-core processor (an AMD QL66). The results are as follows (file: AMD-QL66-1.txt).

# Timings of linux-3.2 compilations
# Processors: 2
# Affinity mask:  3
# Compiling with 1 simultaneous jobs
1     1113
# Compiling with 2 simultaneous jobs
2     844
# Compiling with 3 simultaneous jobs
3     875
# Compiling with 4 simultaneous jobs
4     863
# Compiling with 5 simultaneous jobs
5     840
# Compiling with 6 simultaneous jobs
6     844
# Compiling with 7 simultaneous jobs
7     844
# Compiling with 8 simultaneous jobs
8     851

(Figure: Parallel Compilations on two CPU)

Let's try one last experiment on the same machine (two CPUs), disabling two mechanisms (a sketch of the commands follows the list):

  • read-ahead of the next disk blocks (which can speed up localized reads), with echo 0 > /sys/block/sda/queue/read_ahead_kb
  • delayed block writes (by about 30 seconds), which avoid repeated disk accesses when the same block is modified again, with mount -o remount,sync /
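
Here is a minimal sketch of the sequence (run as root; /dev/sda holding the build tree and the 128 KB default read-ahead value are assumptions about this machine):

echo 0 > /sys/block/sda/queue/read_ahead_kb    # disable block read-ahead
mount -o remount,sync /                        # force synchronous writes
# ... run the compilations ...
mount -o remount,async /                       # restore delayed writes
echo 128 > /sys/block/sda/queue/read_ahead_kb  # restore the default read-ahead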

This time, the results are very different (file: AMD-QL66-2.txt). The times are much longer than before, because each write to disk now blocks the process until the data has actually been transmitted to the device.

# Timings of linux-3.2 compilations
# Processors: 2
# Affinity mask:  3
# Compiling with 1 simultaneous jobs
1     3487
# Compiling with 2 simultaneous jobs
2     2562
# Compiling with 3 simultaneous jobs
3     2198
# Compiling with 4 simultaneous jobs
4     1963
# Compiling with 5 simultaneous jobs
5     1779
# Compiling with 6 simultaneous jobs
6     1646
# Compiling with 7 simultaneous jobs
7     1636
# Compiling with 8 simultaneous jobs
8     1602
# Compiling with 9 simultaneous jobs
9     1738
# Compiling with 10 simultaneous jobs
10     1577

(Figure: Parallel Compilations on 2 CPU without disk optimizations)

Here, the curve is much closer to the one I had imagined at first. Running more than one job per CPU lets the build exploit the wait times due to disk accesses to make progress on another compilation. Let's group the two curves to compare the respective durations.

(Figure: Parallel Compilations on QL66)

Conclusion

We see that, thanks to the quality of the Linux I/O scheduler and its optimized management of block devices, the best compilation times are obtained as soon as we launch one job per processor.

So I will amend my recommendation from now on: "If you have N processors available, compile your kernel with make -j N to get the best execution time."
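
In practice, N does not need to be hard-coded; the processor count used in the test script translates directly into:

$ make -j $(grep -c "^processor" /proc/cpuinfo)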

PS: If you have the opportunity to run this script on different architectures (8 processors, 16 processors, etc.), I would be very interested in your results.
