To get maximum performance on a NUMA system, the underlying architecture has to be taken into account.
To spot problems in your data design, there is a handy tool called “perf c2c”, where c2c stands for Cache To Cache. The tool's output provides statistics about accesses to data on a remote NUMA socket.
Record PMU counters.
perf c2c record -F 99 -g -- binary
Analyze in interactive mode.
perf c2c report
Analyze in text mode.
perf c2c report --stdio
For example, the summary in text mode could look as follows.
Trace Event Information
Total records : 5621889
Locked Load/Store Operations : 10032
Load Operations : 741529
Loads - uncacheable : 7
Loads - IO : 0
Loads - Miss : 8299
Loads - no mapping : 18
Load Fill Buffer Hit : 533018
Load L1D hit : 109495
Load L2D hit : 4337
Load LLC hit : 61245
Load Local HITM : 9673
Load Remote HITM : 12528
Load Remote HIT : 780
Load Local DRAM : 4593
Load Remote DRAM : 7209
Load MESI State Exclusive : 11802
Load MESI State Shared : 0
Load LLC Misses : 25110
LLC Misses to Local DRAM : 18.3%
LLC Misses to Remote DRAM : 28.7%
LLC Misses to Remote cache (HIT) : 3.1%
LLC Misses to Remote cache (HITM) : 49.9%
Store Operations : 4880360
Store - uncacheable : 0
Store - no mapping : 178126
Store L1D Hit : 4696772
Store L1D Miss : 5462
No Page Map Rejects : 1095
Unable to parse data source : 0
Global Shared Cache Line Event Information
Total Shared Cache Lines : 10898
Load HITs on shared lines : 88830
Fill Buffer Hits on shared lines : 39884
L1D hits on shared lines : 8717
L2D hits on shared lines : 86
LLC hits on shared lines : 25798
Locked Access on shared lines : 5336
Store HITs on shared lines : 5953
Store L1D hits on shared lines : 5633
Total Merged records : 28154
Inlining can help to mitigate the following:
- Function call overhead;
- Pipeline stalls.
It is advised to apply this method to the following types of routines:
- Trivial and small functions used as accessors to data or as wrappers around another function;
- Big functions called quite regularly, but not from many places.
A modern compiler uses heuristics to decide which functions should be inlined, but it is always better to give it a hint using the inline keyword. Moreover, to make the decision instead of the gcc compiler, the always_inline function attribute should be used.
It is well known that modern CPUs are built around instruction pipelines that enable them to execute multiple instructions in parallel. But in the case of conditional branches within the program code, not all instructions are executed each time. As a solution, speculative execution and branch prediction mechanisms are used to further improve performance by guessing and executing one branch ahead of time. The problem is that, in the case of a wrong guess, the results of the execution have to be discarded, and the correct instructions have to be loaded into the instruction cache and executed on the spot.
An application developer should use the likely and unlikely macros, which are shortcuts for the gcc __builtin_expect built-in. The purpose of these macros is to give the compiler a hint about which path will be taken more often, thereby decreasing the percentage of branch prediction misses.
It is convenient to store thread-specific data, for instance statistics, inside an array of structures, with the size of the array equal to the number of threads.
The only thing you need to be careful about is avoiding so-called false sharing: a performance penalty that you pay when read-write data shares the same cache line and is accessed from multiple threads.
Align a structure accessed by each thread to the cache line size (64 bytes) using the macro __rte_cache_aligned, which is actually a shortcut for __attribute__((__aligned__(64))).
typedef struct counter_s {
    uint64_t value;
    uint8_t pad[56];    /* pad to a full 64-byte cache line */
} __rte_cache_aligned counter_t;
Define an array of the structures with one element per thread.
Note that if the structure size is smaller than the cache line size, padding is required; otherwise, the gcc compiler will complain with the following error.
error: alignment of array elements is greater than element size