It is well known that modern CPUs use instruction pipelines that let them work on multiple instructions in parallel. With conditional branches in the program code, however, the CPU does not know in advance which instructions will actually be executed. As a solution, speculative execution and branch prediction are used to speed things up further by guessing the outcome and executing one branch ahead of time. The problem is that when the guess is wrong, the results of the speculative execution have to be discarded, and the correct instructions have to be fetched and executed while the pipeline stalls.
An application developer should use the macros likely and unlikely, which are shortcuts for the gcc __builtin_expect built-in function. The purpose of these macros is to tell the compiler which path will be taken more often so that it can lay out the code to favor that path, which reduces the percentage of branch mispredictions.
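As a minimal sketch of how such hints look in practice, the snippet below defines local stand-ins for the two macros (DPDK provides its own in rte_branch_prediction.h) and marks a rare error path as unlikely. The function sum_valid_lengths and its arguments are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the likely/unlikely macros. The #ifndef guards
 * keep them from clashing with definitions pulled in from elsewhere. */
#ifndef likely
#define likely(x)   __builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif

/* Hypothetical helper: sums the non-negative lengths and counts the
 * negative ones as errors. Errors are assumed to be rare, so that
 * branch is marked unlikely. */
uint64_t sum_valid_lengths(const int32_t *len, size_t n, size_t *errors)
{
    uint64_t total = 0;

    *errors = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(len[i] < 0)) {   /* rare error path */
            (*errors)++;
            continue;
        }
        total += (uint64_t)len[i];    /* common fast path */
    }
    return total;
}
```

The hint only pays off when the assumption about frequency is true, so it is worth confirming with a profiler before annotating a branch.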
It is convenient to store thread-specific data, for instance, statistics, inside an array of structures. The size of the array is equal to the number of threads.
The only thing you need to be careful about is so-called false sharing. It is the performance penalty you pay when data that is read and written by different threads ends up in the same cache line, so every write by one thread invalidates that line in the caches of the other cores.
Align the structure accessed by each thread to the cache line size (64 bytes) using the macro __rte_cache_aligned, which is a shortcut for __attribute__((__aligned__(64))).
typedef struct counter_s
Define an array of the structures with one element per thread.
Note that if the structure is smaller than the cache line, padding is required. Otherwise, gcc will complain with the following error.
error: alignment of array elements is greater than element size
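The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original listing: the members of counter_t, the MAX_THREADS value, the CACHE_ALIGNED macro (a local stand-in for __rte_cache_aligned) and the count_packet helper are all hypothetical, and the cache line is assumed to be 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */
#define MAX_THREADS     8    /* hypothetical number of threads */

/* Local stand-in for DPDK's __rte_cache_aligned. */
#define CACHE_ALIGNED __attribute__((__aligned__(CACHE_LINE_SIZE)))

typedef struct counter_s {
    uint64_t packets;        /* hypothetical per-thread statistics */
    uint64_t bytes;
    uint64_t drops;
    /* Explicit padding so that one element fills exactly one cache line. */
    uint8_t  pad[CACHE_LINE_SIZE - 3 * sizeof(uint64_t)];
} CACHE_ALIGNED counter_t;

/* Compile-time checks: each element occupies its own cache line. */
_Static_assert(sizeof(counter_t) == CACHE_LINE_SIZE,
               "counter_t must be exactly one cache line");
_Static_assert(_Alignof(counter_t) == CACHE_LINE_SIZE,
               "counter_t must be cache-line aligned");

/* One element per thread; thread i only ever touches counters[i]. */
counter_t counters[MAX_THREADS];

/* Hypothetical helper called by thread thread_id for each packet. */
void count_packet(unsigned thread_id, uint64_t len)
{
    counters[thread_id].packets++;
    counters[thread_id].bytes += len;
}
```

Because every element starts on its own cache line, a write by one thread never invalidates the line that holds another thread's counters.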
Given the orders-of-magnitude difference in access time between the cache levels and RAM itself, it is advisable to carefully analyze frequently used C data structures for cache friendliness. The idea is to keep the most often accessed (“hot”) data in a higher-level cache for as long as possible. The following techniques are used.
- Group “hot” members together at the beginning and push “cold” ones to the end;
- Minimize structure size by avoiding padding;
- Align data to cache line size.
You can find a great description of why and how the data structures are laid out by compilers here.
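To make the techniques listed above concrete, here is a small sketch that compares two layouts of the same members. The structures flow_padded and flow_reordered and their fields are hypothetical. The sizes are left for the compiler to report because they depend on the ABI; on a typical x86-64 ABI the first layout is expected to take 64 bytes and the second 56.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Members in an arbitrary order: hot and cold fields are interleaved,
 * and the compiler has to insert padding after state and port. */
struct flow_padded {
    uint8_t  state;      /* hot */
    uint64_t last_seen;  /* hot */
    char     name[32];   /* cold */
    uint16_t port;       /* hot */
    uint64_t created;    /* cold */
};

/* Same members: hot ones grouped first and sorted by decreasing
 * alignment, cold ones pushed to the end. */
struct flow_reordered {
    uint64_t last_seen;  /* hot */
    uint16_t port;       /* hot */
    uint8_t  state;      /* hot */
    uint64_t created;    /* cold */
    char     name[32];   /* cold */
};

/* Prints the size of each layout and where the cold part begins. */
void print_layouts(void)
{
    printf("padded:    %zu bytes\n", sizeof(struct flow_padded));
    printf("reordered: %zu bytes, cold part starts at offset %zu\n",
           sizeof(struct flow_reordered),
           offsetof(struct flow_reordered, created));
}
```

In the reordered layout all hot members sit within the first few bytes of the structure, so a single cache line fetch brings them in together.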
Poke-a-hole (pahole) analyzes an object file built with debug information and outputs a detailed description of the layout of every structure as created by the compiler.
Analyze the file.
Analyze one structure.
pahole a.out -C structure
Get suggestions for improvements.
pahole --show_reorg_steps --reorganize -C structure a.out