{"id":1425,"date":"2012-01-14T10:04:32","date_gmt":"2012-01-14T09:04:32","guid":{"rendered":"http:\/\/www.blaess.fr\/christophe\/?p=1425"},"modified":"2012-01-14T10:04:32","modified_gmt":"2012-01-14T09:04:32","slug":"parallelizing-compilations","status":"publish","type":"post","link":"https:\/\/www.blaess.fr\/christophe\/2012\/01\/14\/parallelizing-compilations\/","title":{"rendered":"Parallelizing Compilations"},"content":{"rendered":"<p style=\"text-align: justify;\">(<a title=\"Parall\u00e9lisation de compilations\" href=\"http:\/\/www.blaess.fr\/christophe\/2012\/01\/14\/parallelisation-de-compilations\/\">Original version in French<\/a>)<\/p>\n<p style=\"text-align: justify;\">I compile Linux kernels very frequently, often during training sessions or engineering services (mainly in the field of embedded systems or driver development), sometimes while writing articles or books.<\/p>\n<p style=\"text-align: justify;\">Compilation time varies greatly depending on the amount of code (drivers, filesystems, protocols, etc.) and on the CPU power of the host machine. On a mid-range PC, compiling a kernel trimmed down for an embedded system (with very few drivers) takes about three minutes. On an entry-level (or somewhat older) machine, compiling a generic PC kernel (with hundreds of drivers built as modules) can take an hour.<\/p>\n<p>\n<!--more-->\n<\/p>\n<p style=\"text-align: justify;\">To take advantage of the parallelism offered by current processors (multiprocessor, multicore or hyper-threading), the <code>make<\/code> command can run several jobs at once. So<\/p>\n<pre>$ <strong>make -j 4<\/strong><\/pre>\n<p style=\"text-align: justify;\">keeps up to four compilation jobs running at the same time.<\/p>\n<p style=\"text-align: justify;\">I have long repeated that \u00ab\u00a0<em>if you have N processors (or cores, or virtual CPUs) available, you will save time by starting 2N compilation jobs in parallel<\/em>\u00a0\u00bb. 
The idea is that for every processor we have one job that performs the compilation (consuming CPU time) while another job is saving the results of the previous compilation and loading the source file of the next task. But wait&#8230; is this true?<\/p>\n<h1>Test script<\/h1>\n<p style=\"text-align: justify;\">I wrote the following script, which downloads and unpacks the required kernel sources, then runs several compilations using a variable number of jobs. For example, if we run<\/p>\n<pre>$ <strong>.\/test-make-j.sh 3 5 8<\/strong><\/pre>\n<p style=\"text-align: justify;\">it performs three complete compilations: one with three jobs in parallel, one with five jobs and the last with eight jobs. The results are recorded in a text file. The script follows.<\/p>\n<pre><a title=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/test-make-j.sh\" href=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/test-make-j.sh\" target=\"_blank\"><strong>test-make-j.sh <\/strong><\/a>\n#! \/bin\/sh\n\nKERNEL_VERSION=\"linux-3.2\"\nKERNEL_URL_PATH=\"www.kernel.org\/pub\/linux\/kernel\/v3.0\/\"\nRESULT_FILE=\"compilation-timing.txt\"\n\n# At least one job count is required on the command line\nif [ \"$#\" -eq 0 ]\nthen\n  echo \"usage: $0 jobs_number...\" &gt;&amp;2\n  exit 1\nfi\n\n# Download and unpack the kernel sources only if they are missing\nif [ ! -d \"${KERNEL_VERSION}\" ]\nthen\n  if [ ! -f \"${KERNEL_VERSION}.tar.bz2\" ]\n  then\n    wget \"${KERNEL_URL_PATH}\/${KERNEL_VERSION}.tar.bz2\"\n    if [ $? -ne 0 ] || [ ! -f \"${KERNEL_VERSION}.tar.bz2\" ]\n    then\n      echo \"unable to obtain ${KERNEL_VERSION} archive\" &gt;&amp;2\n      exit 1\n    fi\n  fi\n  tar xjf \"${KERNEL_VERSION}.tar.bz2\"\n  if [ $? 
-ne 0 ]\n  then\n    echo \"Error while uncompressing kernel archive\" &gt;&amp;2\n    exit 1\n  fi\nfi\n\ncd \"${KERNEL_VERSION}\" || exit 1\n\necho \"# Timings of ${KERNEL_VERSION} compilations\" &gt;&gt; \"${RESULT_FILE}\"\nnb_cpu=$(grep -c \"^processor\" \/proc\/cpuinfo)\n\necho \"# Processors: ${nb_cpu}\" &gt;&gt; \"${RESULT_FILE}\"\n# Keep only the mask printed after the colon by taskset\naffinity=$(taskset -p $$ | sed -e 's\/^.*:\/\/')\n\necho \"# Affinity mask: ${affinity}\" &gt;&gt; \"${RESULT_FILE}\"\nfor nb in \"$@\"\ndo\n  echo \"# Compiling with $nb simultaneous jobs\" &gt;&gt; \"${RESULT_FILE}\"\n  <strong>make mrproper<\/strong>\n  <strong>make i386_defconfig<\/strong>\n  sync\n  sleep 10 # Let the system settle down\n  start=$(date \"+%s\")\n  <strong>make -j \"$nb\"<\/strong>\n  sync\n  end=$(date \"+%s\")\n  # With a 32-bit time_t, this will fail in January 2038 ;-)\n  echo \"$nb     $((end - start))\" &gt;&gt; \"${RESULT_FILE}\"\ndone<\/pre>\n<h1>Results<\/h1>\n<p style=\"text-align: justify;\">Here are the results of a run on an Intel Q6600 Quad-Core (file: <a title=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/Intel-Q6600-1.txt\" href=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/Intel-Q6600-1.txt\" target=\"_blank\">Intel-Q6600-1.txt<\/a>).<\/p>\n<pre># Timings of linux-3.2 compilations\n# Processors: 4\n# Affinity mask:  f\n# Compiling with 1 simultaneous jobs\n1     675\n# Compiling with 2 simultaneous jobs\n2     346\n# Compiling with 3 simultaneous jobs\n3     241\n# Compiling with 4 simultaneous jobs\n4     197\n# Compiling with 5 simultaneous jobs\n5     198\n# Compiling with 6 simultaneous jobs\n6     194\n# Compiling with 7 simultaneous jobs\n7     195\n# Compiling with 8 simultaneous jobs\n8     196\n# Compiling with 9 simultaneous jobs\n9     197\n# Compiling with 10 simultaneous jobs\n10     198\n# Compiling with 11 simultaneous jobs\n11     198\n# Compiling with 12 simultaneous jobs\n12     198\n# Compiling with 13 simultaneous jobs\n13     200\n# 
Compiling with 14 simultaneous jobs\n14     201\n# Compiling with 15 simultaneous jobs\n15     201\n# Compiling with 16 simultaneous jobs\n16     200<\/pre>\n<p style=\"text-align: justify;\">Let&rsquo;s plot them with this short Gnuplot command line. The horizontal axis shows the number of concurrent jobs and the vertical axis the compilation time (in seconds).<\/p>\n<pre>$ <strong>echo \"set terminal png size 640,480 ; set output '.\/Intel-Q6600-1.png'; plot 'Intel-Q6600-1.txt' with linespoints\" | gnuplot<\/strong><\/pre>\n<div id=\"attachment_1402\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600-1.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1402\" class=\"size-medium wp-image-1402\" title=\"Intel-Q6600-1\" src=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600-1-300x225.png\" alt=\"Parallel Compilations on 4 CPU\" width=\"300\" height=\"225\" \/><\/a><p id=\"caption-attachment-1402\" class=\"wp-caption-text\">Parallel Compilations on 4 CPU<\/p><\/div>\n<p style=\"text-align: justify;\">Apparently, the best results are reached (give or take small fluctuations) as soon as we use <strong><code>make -j 4<\/code><\/strong>. Let&rsquo;s try to confirm this. 
Before running the script again, we restrict it to two processors with the following command, which binds every job launched from the current shell to processors 2 and 3.<\/p>\n<pre>$ <strong>taskset -pc 2-3 $$<\/strong><\/pre>\n<p style=\"text-align: justify;\">Here are the results (file: <a title=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/Intel-Q6600-2.txt\" href=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/Intel-Q6600-2.txt\" target=\"_blank\">Intel-Q6600-2.txt<\/a>).<\/p>\n<pre># Timings of linux-3.2 compilations\n# Processors: 4\n# Affinity mask:  c\n# Compiling with 1 simultaneous jobs\n1     684\n# Compiling with 2 simultaneous jobs\n2     360\n# Compiling with 3 simultaneous jobs\n3     362\n# Compiling with 4 simultaneous jobs\n4     366\n# Compiling with 8 simultaneous jobs\n8     370\n# Compiling with 16 simultaneous jobs\n16     376\n# Compiling with 32 simultaneous jobs\n32     377\n# Compiling with 64 simultaneous jobs\n64     378<\/pre>\n<div id=\"attachment_1405\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600-21.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1405\" class=\"size-medium wp-image-1405\" title=\"Intel-Q6600-2\" src=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600-21-300x225.png\" alt=\"Parallel Compilations on 2 CPU\" width=\"300\" height=\"225\" \/><\/a><p id=\"caption-attachment-1405\" class=\"wp-caption-text\">Parallel Compilations on 2 CPU<\/p><\/div>\n<p style=\"text-align: justify;\">This time, it is clear that the minimum time is achieved with <strong><code>make -j 2<\/code><\/strong>. 
If we repeat the experiment on a single CPU, we obtain the following values (file: <a title=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/Intel-Q6600-3.txt\" href=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/Intel-Q6600-3.txt\" target=\"_blank\">Intel-Q6600-3.txt<\/a>).<\/p>\n<pre># Timings of linux-3.2 compilations\n# Processors: 4\n# Affinity mask:  8\n# Compiling with 1 simultaneous jobs\n1     683\n# Compiling with 2 simultaneous jobs\n2     698\n# Compiling with 3 simultaneous jobs\n3     708\n# Compiling with 4 simultaneous jobs\n4     709\n# Compiling with 5 simultaneous jobs\n5     719\n# Compiling with 6 simultaneous jobs\n6     719\n# Compiling with 7 simultaneous jobs\n7     720\n# Compiling with 8 simultaneous jobs\n8     724<\/pre>\n<p style=\"text-align: justify;\">Represented on the graph below:<\/p>\n<div id=\"attachment_1406\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600-3.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1406\" class=\"size-medium wp-image-1406\" title=\"Intel-Q6600-3\" src=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600-3-300x225.png\" alt=\"Parallel Compilations on a single CPU\" width=\"300\" height=\"225\" \/><\/a><p id=\"caption-attachment-1406\" class=\"wp-caption-text\">Parallel Compilations on a single CPU<\/p><\/div>\n<p style=\"text-align: justify;\">We can group these three curves on a single graph to better see their scales (I have not extended the curve of the compilation on a single CPU, but we can imagine that it continues with a slight increase).<\/p>\n<div id=\"attachment_1448\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1448\" 
class=\"size-medium wp-image-1448\" title=\"Parallel Compilations on Q6600\" src=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/Intel-Q6600-300x225.png\" alt=\"Parallel Compilations on Q6600\" width=\"300\" height=\"225\" \/><\/a><p id=\"caption-attachment-1448\" class=\"wp-caption-text\">Parallel Compilations on Q6600<\/p><\/div>\n<p style=\"text-align: justify;\">To be sure, we can repeat the experiment on a different processor with two cores (AMD QL66). The results are as follows (file: <a title=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/AMD-QL66-1.txt\" href=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/AMD-QL66-1.txt\" target=\"_blank\">AMD-QL66-1.txt<\/a>).<\/p>\n<pre># Timings of linux-3.2 compilations\n# Processors: 2\n# Affinity mask:  3\n# Compiling with 1 simultaneous jobs\n1     1113\n# Compiling with 2 simultaneous jobs\n2     844\n# Compiling with 3 simultaneous jobs\n3     875\n# Compiling with 4 simultaneous jobs\n4     863\n# Compiling with 5 simultaneous jobs\n5     840\n# Compiling with 6 simultaneous jobs\n6     844\n# Compiling with 7 simultaneous jobs\n7     844\n# Compiling with 8 simultaneous jobs\n8     851<\/pre>\n<div id=\"attachment_1407\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/AMD-QL66-1.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1407\" class=\"size-medium wp-image-1407 \" title=\"AMD-QL66-1\" src=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/AMD-QL66-1-300x225.png\" alt=\"Parallel Compilations on two CPU\" width=\"300\" height=\"225\" \/><\/a><p id=\"caption-attachment-1407\" class=\"wp-caption-text\">Parallel Compilations on two CPU<\/p><\/div>\n<p style=\"text-align: justify;\">Let&rsquo;s try one last experiment on the same machine (two CPU), by disabling two elements:<\/p>\n<ul>\n<li style=\"text-align: 
justify;\">prefetching of the next disk blocks (which speeds up sequential reads), with <code><strong>echo 0 &gt; \/sys\/block\/sda\/read_ahead_kb<\/strong><\/code><\/li>\n<\/ul>\n<ul>\n<li style=\"text-align: justify;\">delayed block writes (held back for about 30 seconds to avoid repeated disk accesses when the same block is modified again), with <code><strong>mount \/ -o sync,remount<\/strong><\/code>.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">This time the results are very different (file: <a title=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/AMD-QL66-2.txt\" href=\"http:\/\/www.blaess.fr\/christophe\/files\/article-2012-01-14\/AMD-QL66-2.txt\" target=\"_blank\">AMD-QL66-2.txt<\/a>). The times are much longer than before because, for each write to disk, the process has to wait until the data has been transmitted to the device before it can continue its work.<\/p>\n<pre># Timings of linux-3.2 compilations\n# Processors: 2\n# Affinity mask:  3\n# Compiling with 1 simultaneous jobs\n1     3487\n# Compiling with 2 simultaneous jobs\n2     2562\n# Compiling with 3 simultaneous jobs\n3     2198\n# Compiling with 4 simultaneous jobs\n4     1963\n# Compiling with 5 simultaneous jobs\n5     1779\n# Compiling with 6 simultaneous jobs\n6     1646\n# Compiling with 7 simultaneous jobs\n7     1636\n# Compiling with 8 simultaneous jobs\n8     1602\n# Compiling with 9 simultaneous jobs\n9     1738\n# Compiling with 10 simultaneous jobs\n10     1577<\/pre>\n<div id=\"attachment_1408\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/AMD-QL66-2.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1408\" class=\"size-medium wp-image-1408\" title=\"AMD-QL66-2\" src=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/AMD-QL66-2-300x225.png\" alt=\"Parallel Compilations on 2 CPU without disk optimizations\" 
width=\"300\" height=\"225\" \/><\/a><p id=\"caption-attachment-1408\" class=\"wp-caption-text\">Parallel Compilations on 2 CPU without disk optimizations<\/p><\/div>\n<p style=\"text-align: justify;\">Here, the curve is closer to what I imagined at first. Running several jobs per CPU lets one compilation make progress while another is waiting for the disk. Let&rsquo;s group the two curves to compare their respective durations.<\/p>\n<div id=\"attachment_1449\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/AMD-QL66.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-1449\" class=\"size-medium wp-image-1449\" title=\"Parallel Compilations on QL66\" src=\"http:\/\/www.blaess.fr\/christophe\/wp-content\/uploads\/2012\/01\/AMD-QL66-300x225.png\" alt=\"Parallel Compilations on QL66\" width=\"300\" height=\"225\" \/><\/a><p id=\"caption-attachment-1449\" class=\"wp-caption-text\">Parallel Compilations on QL66<\/p><\/div>\n<h1>Conclusion<\/h1>\n<p style=\"text-align: justify;\">We see that, thanks to the quality of the Linux I\/O scheduler and its optimized management of block devices, the best compilation times are obtained as soon as we launch <strong>one job per processor<\/strong>.<\/p>\n<p style=\"text-align: justify;\">So from now on I will change my recommendation to \u00ab\u00a0<em>If you have N processors available, compile your kernel with<\/em> <code> make -j N <\/code> <em>to get the best execution time<\/em>.\u00a0\u00bb<\/p>\n<p style=\"text-align: justify;\">PS: If you have the opportunity to run this script on different architectures (8 processors, 16 processors, etc.), 
I am very interested in your results.<\/p>","protected":false},"excerpt":{"rendered":"<p>(Version originale en fran&ccedil;ais) I very frequently compile Linux kernels, often during training sessions or engineering services (mainly in the field of embedded systems or drivers development), sometimes while writing articles or books. Compilation time varies greatly depending on the amount of code (drivers, filesystems, protocols, etc.) and on the CPU power of the host [&hellip;]<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8,10],"tags":[],"class_list":["post-1425","post","type-post","status-publish","format-standard","hentry","category-linux-2","category-microprocesseur"],"_links":{"self":[{"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/posts\/1425","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/comments?post=1425"}],"version-history":[{"count":0,"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/posts\/1425\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/media?parent=1425"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/categories?post=1425"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blaess.fr\/christophe\/wp-json\/wp\/v2\/tags?post=1425"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}