Learning DPDK: NUMA optimization

NUMA

Overview

To get the maximum performance on a NUMA system, the underlying architecture has to be taken into account.
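For example, in a DPDK application, per-worker data can be allocated on the NUMA socket of the lcore that uses it. Below is a minimal sketch of this idea; the worker_ctx structure and worker_ctx_create() helper are made up for illustration, while rte_socket_id() and rte_malloc_socket() are the usual DPDK calls for socket-local allocation.

#include <stdint.h>
#include <rte_lcore.h>  /* rte_socket_id() */
#include <rte_malloc.h> /* rte_malloc_socket() */
#include <rte_memory.h> /* RTE_CACHE_LINE_SIZE */

/* Made-up per-worker state, used only to illustrate the idea. */
struct worker_ctx {
    uint64_t rx_packets;
    uint64_t tx_packets;
};

/* Allocate the context on the NUMA socket of the calling lcore,
 * so the worker's loads and stores stay in local memory. */
static struct worker_ctx *
worker_ctx_create(void)
{
    return rte_malloc_socket("worker_ctx", sizeof(struct worker_ctx),
                             RTE_CACHE_LINE_SIZE, rte_socket_id());
}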

To spot problems in your data design, there is a handy tool called “perf c2c”, where C2C stands for Cache To Cache. The output of the tool provides statistics about accesses to data located on a remote NUMA socket.

Run

Record PMU counters.

perf c2c record -F 99 -g -- binary

Analyze in interactive mode.
perf c2c report

Analyze in text mode.
perf c2c report --stdio

For example, the summary in text mode could look as follows. A high share of LLC misses going to a remote cache (HITM) indicates cache lines that are modified on one socket and accessed from another, which is exactly the cross-socket contention to look for.
=================================================
Trace Event Information
=================================================
Total records : 5621889
Locked Load/Store Operations : 10032
Load Operations : 741529
Loads - uncacheable : 7
Loads - IO : 0
Loads - Miss : 8299
Loads - no mapping : 18
Load Fill Buffer Hit : 533018
Load L1D hit : 109495
Load L2D hit : 4337
Load LLC hit : 61245
Load Local HITM : 9673
Load Remote HITM : 12528
Load Remote HIT : 780
Load Local DRAM : 4593
Load Remote DRAM : 7209
Load MESI State Exclusive : 11802
Load MESI State Shared : 0
Load LLC Misses : 25110
LLC Misses to Local DRAM : 18.3%
LLC Misses to Remote DRAM : 28.7%
LLC Misses to Remote cache (HIT) : 3.1%
LLC Misses to Remote cache (HITM) : 49.9%
Store Operations : 4880360
Store - uncacheable : 0
Store - no mapping : 178126
Store L1D Hit : 4696772
Store L1D Miss : 5462
No Page Map Rejects : 1095
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 10898
Load HITs on shared lines : 88830
Fill Buffer Hits on shared lines : 39884
L1D hits on shared lines : 8717
L2D hits on shared lines : 86
LLC hits on shared lines : 25798
Locked Access on shared lines : 5336
Store HITs on shared lines : 5953
Store L1D hits on shared lines : 5633
Total Merged records : 28154


Learning DPDK: Inlining

Overview

The inlining method can help to mitigate the following:

  1. Function call overhead;
  2. Pipeline stalls.

It is advised to apply this method to the following types of routines:

  1. Trivial and small functions used as accessors to data or wrappers around another function;
  2. Big functions called quite regularly but not from many places.

Solution

A modern compiler uses heuristics to decide which functions need to be inlined. But it is always better to give it a hint using the following keywords.

static inline

Moreover, to force the decision instead of leaving it to the gcc compiler, the following attribute should be used.

__attribute__((always_inline))
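
As an illustration of both hints, here is a minimal sketch; the pkt_meta structure and its accessors are made up for the example and are not DPDK code.

#include <stdint.h>

/* Made-up packet metadata, used only to illustrate the two hints. */
struct pkt_meta {
    uint32_t len;
};

/* Small accessor: static inline is a hint, the compiler still decides. */
static inline uint32_t
pkt_meta_len(const struct pkt_meta *m)
{
    return m->len;
}

/* The attribute forces inlining regardless of the compiler's heuristics. */
static inline __attribute__((always_inline)) uint32_t
pkt_meta_len_forced(const struct pkt_meta *m)
{
    return m->len;
}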


Learning DPDK: Branch Prediction


Overview

It is well known that modern CPUs are built around instruction pipelines that enable them to execute multiple instructions in parallel. But in the case of conditional branches within the program code, not all instructions are executed each time. As a solution, speculative execution and branch prediction mechanisms are used to further speed up performance by guessing and executing one branch ahead of time. The problem is that in the case of a wrong guess, the results of the execution have to be discarded, and the correct instructions have to be loaded into the instruction cache and executed on the spot.

Solution

An application developer should use the likely and unlikely macros, which are shortcuts for the gcc __builtin_expect directive. The purpose of these macros is to give the compiler a hint about which path will be taken more often and, as a result, to decrease the percentage of branch prediction misses.
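
For example, DPDK ships these macros in rte_branch_prediction.h. The sketch below is a made-up header check (not DPDK code) that marks the rare error path as unlikely so the hot path stays on the predicted branch.

#include <stdint.h>
#include <rte_branch_prediction.h> /* likely() and unlikely() */

/* Made-up example: a minimal Ethernet header length check.
 * Truncated frames are rare, so the error path is marked unlikely. */
static int
parse_ether_type(const uint8_t *data, uint16_t len, uint64_t *drops)
{
    if (unlikely(len < 14)) { /* shorter than an Ethernet header */
        (*drops)++;
        return -1;
    }

    /* Common, predicted path: read the EtherType field. */
    return (data[12] << 8) | data[13];
}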


Learning DPDK: Avoid False Sharing


Overview

It is convenient to store thread-specific data, for instance, statistics, inside an array of structures. The size of the array is equal to the number of threads.

The only thing that you need to be careful about is avoiding so-called false sharing: the performance penalty that you pay when read-write data shares the same cache line and is accessed from multiple threads.

Solution

Align the structure accessed by each thread to the cache line size (64 bytes) using the macro __rte_cache_aligned, which is actually a shortcut for __attribute__((__aligned__(64))).

#include <stdint.h>
#include <rte_memory.h> /* provides __rte_cache_aligned */

typedef struct counter_s
{
    uint64_t packets;
    uint64_t bytes;
    uint64_t failed_packets;
    uint64_t failed_bytes;
    uint64_t pad[4]; /* pad the structure up to the 64-byte cache line */
} counter_t __rte_cache_aligned;

Define an array of the structures with one element per thread.
counter_t stats[THREADS_NUM];
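
As a usage sketch, assuming the array is indexed by the DPDK lcore id (so THREADS_NUM has to cover all lcore ids in use), each worker then touches only its own cache-line-aligned element:

#include <stdint.h>
#include <rte_lcore.h> /* rte_lcore_id() */

/* Each lcore updates only its own element; since every element is
 * aligned to a cache line, no two lcores write to the same line. */
static inline void
stats_update(counter_t *stats, uint32_t pkt_len)
{
    counter_t *c = &stats[rte_lcore_id()];

    c->packets++;
    c->bytes += pkt_len;
}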

Note that if the structure size is smaller than the cache line size, padding is required. Otherwise, the gcc compiler will complain with the following error.

error: alignment of array elements is greater than element size
