About Denys Haryachyy

Contractor and senior C/C++ embedded developer with over 10 years of experience. Proven ability to deliver high-quality software through effective team management and hands-on work, moving projects through the analysis, design, implementation, and maintenance stages of the software development life cycle. Available to work on DPDK and network-related projects.

Learning VPP: Building

logo_fdio-300x184

Overview

The following steps build and run VPP in interactive mode, without DPDK, on Ubuntu 16.04.

This immediately enables a developer to make changes and verify that the build is sane.

Though it is not enough to verify packet processing, it is perfectly fine for testing other functionality through the CLI.

Run

1. Pull the code.
git clone https://github.com/FDio/vpp

2. Build.
make install-dep
make bootstrap
make build

3. Create VPP group.
groupadd vpp
usermod -aG vpp root

4. Create startup.conf like the following.


unix {
  nodaemon
  log /tmp/vpp.log
  full-coredump
  gid vpp
  interactive
  cli-listen /run/vpp/cli.sock
}
api-trace {
  on
}
api-segment {
  gid vpp
}
plugins {
  plugin dpdk_plugin.so { disable }
}


5. Run.
make run STARTUP_CONF=startup.conf

References

Learning VPP: Intro

logo_fdio-300x184

Overview

VPP (Vector Packet Processing) is a software virtual switch and a framework for high-speed packet processing.

It is highly scalable, production-ready piece of software that together with DPDK enables anybody to build their own packet processing products based on commodity servers.

Run

1. Create a directory on your laptop.
mkdir fdio-tutorial
cd fdio-tutorial

2. Create a Vagrantfile containing the following.


# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure(2) do |config|
  config.vm.box = "puppetlabs/ubuntu-16.04-64-nocm"
  config.vm.box_check_update = false
  vmcpu = (ENV['VPP_VAGRANT_VMCPU'] || 2)
  vmram = (ENV['VPP_VAGRANT_VMRAM'] || 4096)
  config.ssh.forward_agent = true
  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--ioapic", "on"]
    vb.memory = "#{vmram}"
    vb.cpus = "#{vmcpu}"
    # Support for the SSE4.x instructions is required in some versions of VB.
    vb.customize ["setextradata", :id, "VBoxInternal/CPUM/SSE4.1", "1"]
    vb.customize ["setextradata", :id, "VBoxInternal/CPUM/SSE4.2", "1"]
  end
end


3. Bring up your Vagrant VM.
vagrant up
vagrant ssh

4. Install VPP from binary packages.
export UBUNTU="xenial"
export RELEASE=".stable.1807"
sudo rm -f /etc/apt/sources.list.d/99fd.io.list
echo "deb [trusted=yes] https://nexus.fd.io/content/repositories/fd.io$RELEASE.ubuntu.$UBUNTU.main/ ./" | sudo tee -a /etc/apt/sources.list.d/99fd.io.list
sudo apt-get update
sudo apt-get install vpp vpp-lib

5. Open VPP CLI
sudo vppctl
    _______    _        _   _____  ___
 __/ __/ _ \  (_)__    | | / / _ \/ _ \
 _/ _// // / / / _ \   | |/ / ___/ ___/
 /_/ /____(_)_/\___/   |___/_/  /_/
vpp# show ver
vpp v18.07-release built by root on c469eba2a593 at Mon Jul 30 23:27:03 UTC 2018

References

Preprocessor tricks: foreach macro

GNU_Compiler_Collection_logo

Overview

In C-based projects we regularly use enumerations for different purposes, and it is often useful to be able to convert an enum value into a string to produce human-readable output in a message. A straightforward solution is to define an array of strings that accompanies the enum. But this solution has a drawback during the maintenance phase: a developer has to be careful to update both the enum and the array simultaneously.

Solution

The aforementioned inconvenience can be solved with the preprocessor. The idea behind the trick is to call one macro inside another, so that both the enumeration and the array are built from one common macro.

First, define the enum values and strings.
#define foreach_test_error \
_ (NONE, "no error") \
_ (UNKNOWN_PROTOCOL, "unknown") \
_ (UNKNOWN_CONTROL, "control")

Second, define the error messages array.
static char *error_strings[] = {
#define _(f,s) s,
foreach_test_error
#undef _
};

Third, define the enum of errors.
typedef enum
{
#define _(f,s) TEST_ERROR_##f,
foreach_test_error
#undef _
TEST_N_ERROR,
} error_t;

References

Learning DPDK: NUMA optimization

NUMA

Overview

To get maximum performance on a NUMA system, the underlying architecture has to be taken into account.

To spot problems in your data design, there is a handy tool called “perf c2c”, where C2C stands for Cache To Cache. Its output provides statistics about accesses to data on a remote NUMA socket.

Run

Record PMU counters.

perf c2c record -F 99 -g -- binary

Analyze in interactive mode.
perf c2c report

Analyze in text mode.
perf c2c report --stdio

For example, a summary in text mode could look as follows.
=================================================
Trace Event Information
=================================================
Total records : 5621889
Locked Load/Store Operations : 10032
Load Operations : 741529
Loads - uncacheable : 7
Loads - IO : 0
Loads - Miss : 8299
Loads - no mapping : 18
Load Fill Buffer Hit : 533018
Load L1D hit : 109495
Load L2D hit : 4337
Load LLC hit : 61245
Load Local HITM : 9673
Load Remote HITM : 12528
Load Remote HIT : 780
Load Local DRAM : 4593
Load Remote DRAM : 7209
Load MESI State Exclusive : 11802
Load MESI State Shared : 0
Load LLC Misses : 25110
LLC Misses to Local DRAM : 18.3%
LLC Misses to Remote DRAM : 28.7%
LLC Misses to Remote cache (HIT) : 3.1%
LLC Misses to Remote cache (HITM) : 49.9%
Store Operations : 4880360
Store - uncacheable : 0
Store - no mapping : 178126
Store L1D Hit : 4696772
Store L1D Miss : 5462
No Page Map Rejects : 1095
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 10898
Load HITs on shared lines : 88830
Fill Buffer Hits on shared lines : 39884
L1D hits on shared lines : 8717
L2D hits on shared lines : 86
LLC hits on shared lines : 25798
Locked Access on shared lines : 5336
Store HITs on shared lines : 5953
Store L1D hits on shared lines : 5633
Total Merged records : 28154

References

Learning DPDK: Inlining

To_Inline_or_not_to_Inline_Dia_01

Overview

Inlining can help to mitigate the following:

  1. Function call overhead;
  2. Pipeline stall.

It is advised to apply the method for the following types of routines:

  1. Trivial and small functions used as accessors to data or wrappers around another function;
  2. Big functions called quite regularly but not from many places.

Solution

A modern compiler uses heuristics to decide which functions should be inlined, but it is always better to give it a hint using the following keywords.

static inline

Moreover, to take the decision away from the gcc compiler entirely, the following attribute should be used.

__attribute__((always_inline))

References

Learning DPDK: Cloud support

DPDK in cloud

Overview

DPDK-based products fit perfectly into the NFV paradigm. DPDK provides drivers for cloud NICs that can be run in AWS and VMware environments.

Limitations

The following nuances were discovered when using DPDK on VMware and Amazon platforms.

VMXNET 3 driver

Both RX and TX queues have to be configured on the device. Otherwise, DPDK initialization crashes.

ENA driver

The maximum number of buffer descriptors for RX queue is 512.

References

Learning DPDK: Branch Prediction

Pipeline,_4_stage.svg

Overview

It is well known that modern CPUs are built around instruction pipelines that enable them to execute multiple instructions in parallel. But when the program code contains conditional branches, not all instructions are executed every time. As a solution, speculative execution and branch prediction mechanisms are used to further speed up performance by guessing and executing one branch ahead of time. The problem is that on a wrong guess the results of the execution have to be discarded, and the correct instructions have to be loaded into the instruction cache and executed on the spot.

Solution

An application developer should use the macros likely and unlikely, which are shortcuts for the gcc __builtin_expect directive. These macros give the compiler a hint about which path will be taken more often, decreasing the percentage of branch prediction misses.

References

 

Learning DPDK: Avoid False Sharing

false-sharing-illustration

Overview

It is convenient to store thread-specific data, for instance, statistics, inside an array of structures. The size of the array is equal to the number of threads.

The only thing that you need to be careful about is to avoid so-called false sharing. It is a performance penalty that you pay when RW-data shares the same cache line and is accessed from multiple threads.

Solution

Align the structure accessed by each thread to the cache line size (64 bytes) using the macro __rte_cache_aligned, which is actually a shortcut for __attribute__((__aligned__(64))).

typedef struct counter_s
{
  uint64_t packets;
  uint64_t bytes;
  uint64_t failed_packets;
  uint64_t failed_bytes;
  uint64_t pad[4];
} counter_t __rte_cache_aligned;

Define an array of the structures with one element per thread.
counter_t stats[THREADS_NUM];

Note that if the structure size is smaller than the cache line size, padding is required. Otherwise, the gcc compiler will complain with the following error.

error: alignment of array elements is greater than element size
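A self-contained sketch of the layout, assuming a 64-byte cache line; the raw attribute stands in for DPDK's __rte_cache_aligned so the example compiles without DPDK headers.

```c
#include <stdint.h>

#define CACHE_LINE_SIZE 64
/* Stand-in for DPDK's __rte_cache_aligned macro. */
#define cache_aligned __attribute__ ((__aligned__ (CACHE_LINE_SIZE)))

typedef struct counter_s
{
  uint64_t packets;
  uint64_t bytes;
  uint64_t failed_packets;
  uint64_t failed_bytes;
  uint64_t pad[4]; /* pad 32 bytes up to a full 64-byte line */
} counter_t cache_aligned;

/* One element per thread: each thread writes only to its own line. */
#define THREADS_NUM 4
static counter_t stats[THREADS_NUM];

/* Each element fills exactly one cache line, so neighbouring
   threads never share a line. */
_Static_assert (sizeof (counter_t) == CACHE_LINE_SIZE,
                "counter_t must fill one cache line");
```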

References

Learning DPDK: make your data cache friendly with pahole tool

CacheHierarchy

Overview

Given the orders of magnitude between access speeds of the different cache levels and RAM itself, it is advisable to carefully analyze frequently used C data structures for cache friendliness. The idea is to keep the most often accessed (“hot”) data in a higher-level cache as long as possible. The following techniques are used.

  1. Group “hot” members together in the beginning and push “cold” to the end;
  2. Minimize structure size by avoiding padding;
  3. Align data to cache line size.

You can find a great description of why and how the data structures are laid out by compilers here.

Poke-a-hole (pahole) analyzes the object file and outputs detailed description of each and every structure layout created by a compiler.
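To illustrate the kind of holes pahole reports, consider a hypothetical structure before and after applying technique 2; sizes assume a 64-bit target where uint64_t needs 8-byte alignment.

```c
#include <stdint.h>

/* Poor layout: pahole reports a 7-byte hole after each uint8_t
   member, because the following uint64_t must be 8-byte aligned. */
struct conn_bad
{
  uint8_t  state;    /* 1 byte, then a 7-byte hole */
  uint64_t rx_bytes; /* 8 bytes */
  uint8_t  flags;    /* 1 byte, then a 7-byte hole */
  uint64_t tx_bytes; /* 8 bytes */
};                   /* 32 bytes total */

/* Reorganized: hot 64-bit counters first, small members packed
   together; only 6 bytes of trailing padding remain. */
struct conn_good
{
  uint64_t rx_bytes;
  uint64_t tx_bytes;
  uint8_t  state;
  uint8_t  flags;
};                   /* 24 bytes total */

_Static_assert (sizeof (struct conn_bad) == 32, "holes add up");
_Static_assert (sizeof (struct conn_good) == 24, "8 bytes saved");
```

Running pahole --show_reorg_steps --reorganize on the first layout would suggest essentially this reordering.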

Run

Analyze the file.
pahole a.out
Analyze one structure.
pahole a.out -C structure
Get suggestion on improvements.
pahole --show_reorg_steps --reorganize -C structure a.out

References

Learning DPDK: Profiling with Flame Graphs

flumegraph

Overview

perf is a great tool for profiling an application. The problem is that it generates an enormous amount of text output that is difficult to analyze. Brendan Gregg developed a set of handy scripts to visualize perf results.

These tools generate a graph that represents call stacks and a relative execution time of each function.

Generate a graph

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -ag -- sleep 60
perf script | ./stackcollapse-perf.pl > out.perf-folded
cat out.perf-folded | ./flamegraph.pl > perf.svg

Analyze

  1. Open the graph in a browser;
  2. Point at a bar to see execution statistics;
  3. Click on a bar to zoom;
  4. Use search “Ctrl-F”.

References