Learning DPDK: NUMA optimization

NUMA

Overview

To get maximum performance on a NUMA system, the underlying architecture has to be taken into account: accessing memory or cache lines that belong to a remote socket costs more than accessing local ones.

A handy tool for spotting problems in your data design is “perf c2c”, where C2C stands for Cache To Cache. Its output provides statistics about accesses to data on the remote NUMA socket.
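One data-design problem that this kind of tool tends to surface is several cores writing to fields that share a cache line. The following is a minimal sketch of the usual remedy, not code from DPDK itself: the struct names, the `same_cache_line` helper, and the 64-byte line size are assumptions made for illustration. DPDK provides the `__rte_cache_aligned` attribute macro for the same purpose.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on most x86-64 CPUs. */
#define CACHE_LINE_SIZE 64

/* Hypothetical per-core counters packed back to back. In an array of
 * these, four entries fit in one cache line, so writes from different
 * cores (or sockets) keep bouncing that line between caches. */
struct packed_stats {
    uint64_t rx_packets;
    uint64_t tx_packets;
};

/* The same counters, with every instance aligned to its own cache line. */
struct padded_stats {
    alignas(CACHE_LINE_SIZE) uint64_t rx_packets;
    uint64_t tx_packets;
};

/* Compile-time check that the padding really fills one cache line. */
static_assert(sizeof(struct padded_stats) == CACHE_LINE_SIZE,
              "padded_stats must occupy exactly one cache line");

/* Returns nonzero if elements i and j of an array start in the same
 * cache line. The array is assumed to begin on a cache-line boundary. */
static inline int same_cache_line(size_t elem_size, size_t i, size_t j)
{
    return (i * elem_size) / CACHE_LINE_SIZE ==
           (j * elem_size) / CACHE_LINE_SIZE;
}
```

With the packed layout, the entries for cores 0 to 3 start in the same line. With the padded layout, every core gets a line of its own at the cost of 48 unused bytes per entry.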

Run

Record PMU counters.

perf c2c record -F 99 -g -- binary

Analyze in interactive mode.

perf c2c report

Analyze in text mode.

perf c2c report --stdio

For example, the summary in text mode could look as follows.

=================================================
Trace Event Information
=================================================
Total records : 5621889
Locked Load/Store Operations : 10032
Load Operations : 741529
Loads - uncacheable : 7
Loads - IO : 0
Loads - Miss : 8299
Loads - no mapping : 18
Load Fill Buffer Hit : 533018
Load L1D hit : 109495
Load L2D hit : 4337
Load LLC hit : 61245
Load Local HITM : 9673
Load Remote HITM : 12528
Load Remote HIT : 780
Load Local DRAM : 4593
Load Remote DRAM : 7209
Load MESI State Exclusive : 11802
Load MESI State Shared : 0
Load LLC Misses : 25110
LLC Misses to Local DRAM : 18.3%
LLC Misses to Remote DRAM : 28.7%
LLC Misses to Remote cache (HIT) : 3.1%
LLC Misses to Remote cache (HITM) : 49.9%
Store Operations : 4880360
Store - uncacheable : 0
Store - no mapping : 178126
Store L1D Hit : 4696772
Store L1D Miss : 5462
No Page Map Rejects : 1095
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 10898
Load HITs on shared lines : 88830
Fill Buffer Hits on shared lines : 39884
L1D hits on shared lines : 8717
L2D hits on shared lines : 86
LLC hits on shared lines : 25798
Locked Access on shared lines : 5336
Store HITs on shared lines : 5953
Store L1D hits on shared lines : 5633
Total Merged records : 28154
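In this sample, the four "LLC Misses to ..." percentages appear to be each load category's share of the "Load LLC Misses" count, which here equals the sum of Local DRAM, Remote DRAM, Remote HIT, and Remote HITM loads. The sketch below reproduces that arithmetic. The struct and function names are hypothetical, and the formula is inferred from the numbers above rather than taken from the perf documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Load counters copied by hand from the "Trace Event Information"
 * summary. The struct is a hypothetical helper, not a perf data type. */
struct c2c_llc_miss {
    uint64_t local_dram;   /* Load Local DRAM  */
    uint64_t remote_dram;  /* Load Remote DRAM */
    uint64_t remote_hit;   /* Load Remote HIT  */
    uint64_t remote_hitm;  /* Load Remote HITM */
};

/* Sum of the four categories; matches "Load LLC Misses" in the sample. */
static inline uint64_t llc_miss_total(const struct c2c_llc_miss *m)
{
    return m->local_dram + m->remote_dram + m->remote_hit + m->remote_hitm;
}

/* Share of one category in the total, as a percentage. */
static inline double llc_miss_share(uint64_t part, uint64_t total)
{
    assert(total > 0); /* avoid dividing by zero on an empty trace */
    return 100.0 * (double)part / (double)total;
}
```

Feeding in the sample values (4593, 7209, 780, 12528) gives a total of 25110 and a Remote HITM share of roughly 49.9%, in line with the report. About half of the LLC misses hitting a modified line in a remote cache is the kind of figure that suggests looking at which data is written from both sockets.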
