Loop unrolling and vectorization
Loop unrolling is a technique for replacing loop control code with a repetition of the loop body, as shown in the following example:
// not unrolled
for (i = 0 ; i < 100 ; ++i) dest[i] = src[i];
// unrolled
for (i = 0 ; i < 100 ; i += 4)
{ dest[i] = src[i];
dest[i + 1] = src[i + 1];
dest[i + 2] = src[i + 2];
dest[i + 3] = src[i + 3];
}
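The unrolled loop above only works because the element count (100) is a multiple of the unroll factor (4). For an arbitrary count, a manually unrolled loop needs a cleanup loop for the leftover elements. The following is a minimal sketch of that idea; the function name copy_unrolled4 and its signature are our own assumptions for illustration and do not come from any library:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name and signature are assumptions for illustration):
// copies n ints from src to dest, unrolled four times by hand.
void copy_unrolled4(int* dest, const int* src, std::size_t n)
{
    assert(dest != nullptr && src != nullptr);

    std::size_t i = 0;

    // Main loop: four elements per iteration while at least four remain.
    for (; i + 4 <= n; i += 4)
    {
        dest[i]     = src[i];
        dest[i + 1] = src[i + 1];
        dest[i + 2] = src[i + 2];
        dest[i + 3] = src[i + 3];
    }

    // Cleanup loop: copies the remaining 0 to 3 elements one by one.
    for (; i < n; ++i)
        dest[i] = src[i];
}
```

With n equal to 10, for example, the main loop handles elements 0 to 7 in two iterations and the cleanup loop handles elements 8 and 9.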
As with inlining, the gains achieved by loop unrolling can lie in the double-digit percentage range and be much higher for a well-placed, very hot, unrolled loop. And as with inlining, overdoing it results in code bloat, hurting performance instead of improving it.
For loop unrolling, we do not have as much control as for inlining. Standard C++ gives us no way to request unrolling for a single loop. You can enable or disable it globally using compiler options, and some compilers additionally accept nonportable per-loop pragmas (for example, #pragma GCC unroll in GCC 8 and later). So, in portable code, we either trust the compiler and enable loop unrolling (-funroll-loops for GCC) or disable it and unroll the hot loops manually. One technique that you might want to look up in that context is Duff's device.
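To give an impression of what Duff's device looks like, here is a minimal sketch of a commonly shown copying variant. It interleaves a switch statement with a do-while loop so that the leftover elements are handled by jumping into the middle of the unrolled body. The function name copy_duff and the unroll factor of 4 are our own assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name and signature are assumptions for illustration):
// copies count ints from src to dest using a variant of Duff's device.
void copy_duff(int* dest, const int* src, std::size_t count)
{
    if (count == 0)
        return; // the do-while below would otherwise copy at least one element

    assert(dest != nullptr && src != nullptr);

    // Number of trips through the do-while loop, rounding up.
    std::size_t passes = (count + 3) / 4;

    // Jump into the unrolled body so that the first, partial pass copies
    // count % 4 elements (or 4 if the count is a multiple of 4).
    // The case labels fall through to the next statement on purpose.
    switch (count % 4)
    {
    case 0: do { *dest++ = *src++;
    case 3:      *dest++ = *src++;
    case 2:      *dest++ = *src++;
    case 1:      *dest++ = *src++;
            } while (--passes > 0);
    }
}
```

Code like this is hard to read, and a modern optimizer may well do better on a plain loop, so it is worth measuring before adopting it.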
One could argue that vectorization too is a form of loop unrolling: the compiler replaces several scalar calculations with one vector (that is, SIMD) instruction. A SIMD instruction can address several elements of an array and use them as input for a single vector operation. Assuming that v1 and v2 are 16-byte-aligned arrays of single-precision floats, the following SSE assembler code will take the first four elements of v1 and add to them the first four elements of v2:
movaps xmm0, [v1] # load v1[0 .. 3] to xmm0 register
addps xmm0, [v2] # add v2[0 .. 3] to xmm0 register
movaps [v1], xmm0 # copy results back to v1
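The same three steps can also be expressed in C++ with the SSE intrinsics from the xmmintrin.h header. The sketch below is only an illustration: the function name add4 is our own assumption, both arrays are assumed to be 16-byte aligned, and a plain scalar loop is used as a fallback when the compiler does not report SSE support:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h> // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical helper (name and signature are assumptions for illustration):
// adds v2[0 .. 3] to v1[0 .. 3] in place.
// Assumption: both pointers are 16-byte aligned, as the aligned loads require.
void add4(float* v1, const float* v2)
{
    assert(reinterpret_cast<std::uintptr_t>(v1) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(v2) % 16 == 0);

#ifdef SKETCH_HAVE_SSE
    __m128 a = _mm_load_ps(v1); // aligned load of v1[0 .. 3]
    __m128 b = _mm_load_ps(v2); // aligned load of v2[0 .. 3]
    a = _mm_add_ps(a, b);       // four single-precision additions at once
    _mm_store_ps(v1, a);        // aligned store of the results back to v1
#else
    // Scalar fallback for targets without SSE.
    for (std::size_t i = 0; i < 4; ++i)
        v1[i] += v2[i];
#endif
}
```

A caller could declare its arrays as alignas(16) float a[4] to satisfy the alignment assumption.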
Admittedly, a compiler may sometimes miss an opportunity for vectorization, and we could then speed up the program by vectorizing the code by hand (do not write assembler code though, use compiler intrinsics instead!). But, as this is an intermediate-level book, we won't discuss that further. However, looking up your compiler's documentation for hints about writing code that will get vectorized easily is always worthwhile.
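As a rough illustration of what such hints tend to recommend, the following sketch shows a loop in a shape that compilers usually find easy to vectorize: a simple counted loop over contiguous arrays, with no function calls or early exits in the body, and with pointers marked as non-overlapping. The function name add_arrays is our own assumption, and __restrict is a compiler extension (accepted by GCC, Clang, and MSVC) rather than standard C++. Whether the loop really gets vectorized depends on the compiler and its options:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (name and signature are assumptions for illustration):
// computes dest[i] = a[i] + b[i] for i in [0, n).
// __restrict is a nonstandard extension that promises the compiler
// that the three arrays do not overlap.
void add_arrays(float* __restrict dest,
                const float* __restrict a,
                const float* __restrict b,
                std::size_t n)
{
    assert(dest != nullptr && a != nullptr && b != nullptr);

    // Simple counted loop, unit stride, no branches or calls in the body.
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = a[i] + b[i];
}
```

With GCC, the -fopt-info-vec option reports which loops were vectorized, which makes it easy to check the effect of such changes.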