Learning DPDK: Traffic generator TRex

Overview

TRex is a stateful and stateless traffic generator based on DPDK. Its TCP stack implementation is based on the original BSD 4.4 code.

Setup

Ubuntu 18.04 server installed on a VirtualBox VM with two interfaces connected in a loopback, 4 CPUs, and 4 GB of RAM.

Install

Download and build the latest TRex on Ubuntu 18.04.

sudo apt -y install zlib1g-dev build-essential python python3-distutils
git clone https://github.com/cisco-system-traffic-generator/trex-core.git
cd trex-core/
cd linux_dpdk
./b configure
./b build
cd ..
sudo cp scripts/cfg/simple_cfg.yaml /etc/trex_cfg.yaml

Find out the PCI IDs of the interfaces to be used by TRex.

lspci | grep Eth
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:08.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:09.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

trex_cfg.yaml

Edit the TRex config file by changing the PCI IDs.

- port_limit : 2
  version : 2
  #List of interfaces. Change to suit your setup. Use ./dpdk_setup_ports.py -s to see available options
  interfaces : ["00:08.0","00:09.0"]
  port_info : # Port IPs. Change to suit your needs. In case of loopback, you can leave as is.
      - ip : 1.1.1.1
        default_gw : 2.2.2.2
      - ip : 2.2.2.2
        default_gw : 1.1.1.1

Run server

Run TRex in the advanced stateful (ASTF) mode.

cd scripts/
sudo ./t-rex-64 -i --astf

Run console

Generate HTTP flows.

cd scripts/
./trex-console
trex> start -f astf/http_simple.py -m 1000 -d 1000 -l 1000
trex> tui

Traffic profile (http_simple.py)

from trex_astf_lib.api import *
class Prof1():
    def get_profile(self):
        # ip generator
        ip_gen_c = ASTFIPGenDist(ip_range=["10.10.10.0", "10.10.10.255"],
                                 distribution="seq")
        ip_gen_s = ASTFIPGenDist(ip_range=["20.20.20.0", "20.20.20.255"],
                                  distribution="seq")
        ip_gen = ASTFIPGen(glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
                           dist_client=ip_gen_c,
                           dist_server=ip_gen_s)

        return ASTFProfile(default_ip_gen=ip_gen,
                           cap_list=[ASTFCapInfo(
                                     file="../avl/delay_10_http_browsing_0.pcap",
                                     cps=1)
                                    ])

def register():
    return Prof1()

Results

Monitor flow statistics by pressing the “Esc” and “t” keys in the “tui” mode.

Learning DPDK : Huge pages

Intro

Modern CPUs support different page sizes, e.g. 4 KB, 2 MB and 1 GB. In Linux, all page sizes except 4 KB are called “huge pages”. The naming convention is historical and stems from the fact that originally Linux supported only the 4 KB page size.

Larger page sizes are beneficial for performance because fewer translations between virtual and physical addresses are needed, and the Translation Lookaside Buffer (TLB) cache is a scarce resource.

To check the size of the TLB, the following utility can be used.

cpuid | grep -i tlb
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb6: instruction TLB: 4K, 8-way, 128 entries
0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries

To check the number of allocated huge pages, the following command can be used.

cat /proc/meminfo | grep Huge
AnonHugePages: 4409344 kB
HugePages_Total: 32
HugePages_Free: 32
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB

There are two types of huge pages available in Linux.

  • Transparent (Anonymous) huge pages
  • Persistent huge pages

Transparent huge pages

Transparent huge pages are an abstraction layer that automates most aspects of creating, managing and using huge pages. Because of past performance and stability issues with this mechanism, DPDK does not rely on it and uses persistent huge pages instead.

Persistent huge pages

Persistent huge pages have to be configured manually and are never swapped out by the Linux kernel.

The following management interfaces exist in Linux to allocate persistent huge pages.

  • Shared memory using shmget()
  • HugeTLBFS, a RAM-based filesystem whose files can be accessed with mmap(), read() or memfd_create()
  • Anonymous mmap() with the MAP_ANONYMOUS and MAP_HUGETLB flags
  • libhugetlbfs APIs
  • Automatic backing of memory regions

Persistent huge pages are used in DPDK by default: mount points are discovered automatically and pages are released once the application exits. But in case a user needs to tune something manually, the following EAL command line parameters can be used (a short sketch of how they are passed follows the list).

  • --huge-dir Use specified hugetlbfs directory instead of autodetected ones.
  • --huge-unlink Unlink huge page files after creating them (implies no secondary process support).
  • --in-memory Do not rely on hugetlbfs at all (an option added in recent DPDK versions)
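
For illustration, here is a minimal sketch of how these options reach DPDK: they are EAL arguments consumed by rte_eal_init() before the application parses its own options (the application name and core list below are just examples).

#include <stdlib.h>
#include <rte_eal.h>

/* Example invocation: ./app --in-memory -l 0-3 -- <application options> */
int main(int argc, char **argv)
{
    int ret = rte_eal_init(argc, argv); /* parses --huge-dir, --in-memory, ... */
    if (ret < 0)
        exit(EXIT_FAILURE);

    argc -= ret; /* skip the EAL part of the command line */
    argv += ret;

    /* ... application set-up continues here ... */
    return 0;
}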

There are multiple ways to set up persistent huge pages.

  • At boot time
  • At runtime

At boot time

Modify the Linux boot-time parameters in /etc/default/grub. Huge pages will be spread equally between all NUMA sockets.
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=32"

Update the grub configuration file and reboot.

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

Create a folder for a permanent mount point of hugetlbfs 

mkdir /mnt/huge

Add the following line to the /etc/fstab file:
nodev /mnt/huge hugetlbfs defaults 0 0

At runtime

Update the number of huge pages for each NUMA node. The default huge page size cannot be modified at runtime.
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Create a mount point.

mkdir /mnt/huge

Mount hugetlbfs 
mount -t hugetlbfs nodev /mnt/huge

Memory allocation

While there are many ways to allocate persistent huge pages, DPDK uses the following (a minimal sketch of one of them follows the list).

  • mmap() call with hugetlbfs mount point
  • mmap() call with MAP_HUGETLB flag
  • memfd_create() call with MFD_HUGETLB flag
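
For illustration, a minimal sketch of the second method from the list above, an anonymous mmap() with the MAP_HUGETLB flag; it assumes 2 MB huge pages have already been allocated on the system.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024) /* one 2 MB huge page */

int main(void)
{
    /* back the mapping with a huge page instead of regular 4 KB pages */
    void *addr = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    printf("huge page mapped at %p\n", addr);
    munmap(addr, HUGE_PAGE_SIZE);
    return 0;
}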

Learning DPDK : Capture to Kafka

Requirements

The requirements are as follows.

  • Capture 10 Gbps of 128-byte packets into Apache Kafka
  • Implement basic filtering based on IP addresses
  • Save the traffic into one Kafka topic

Hardware

Software

Solution

The following key design ideas helped to achieve a 5 Gbps capture speed per server.

  • Use the XFS filesystem
  • Combine small packets into big Kafka messages, 500 KB each (see the sketch after this list)
  • Run 4 Kafka brokers simultaneously on one physical server
  • Allocate 20 partitions per topic
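
A minimal sketch of the batching idea from the list above; kafka_send() is a hypothetical placeholder for the real producer call (for example, rd_kafka_produce() from librdkafka).

#include <stdint.h>
#include <string.h>

#define BATCH_SIZE (500 * 1024) /* one Kafka message of 500 KB */

struct batch {
    uint8_t  data[BATCH_SIZE];
    uint32_t used;
};

/* placeholder for the real producer call */
void kafka_send(const uint8_t *buf, uint32_t len);

/* append a captured packet; flush the batch to Kafka only when it is full */
static void batch_packet(struct batch *b, const uint8_t *pkt, uint32_t len)
{
    if (b->used + len > BATCH_SIZE) {
        kafka_send(b->data, b->used);
        b->used = 0;
    }
    memcpy(b->data + b->used, pkt, len); /* packets are much smaller than the batch */
    b->used += len;
}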

Conclusion

The decision was made to use two servers in order to capture the full 10 Gbps of traffic.

Learning DPDK : Symmetric RSS

Overview

Receive side scaling (RSS) is a technology that distributes received packets between multiple RX queues using a predefined hash function. It enables a multicore CPU to process packets from different queues on different cores.

The promise of symmetric RSS is that packets from both directions of the same TCP connection land in the same RX queue. As a result, per-connection statistics can be stored in per-queue data structures, avoiding any need for locking.
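
A minimal sketch of the pattern this enables, assuming one worker core polls exactly one RX queue and keeps its own counters (names and sizes are illustrative):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define MAX_QUEUES 16
#define BURST_SIZE 32

/* one private entry per RX queue, so no locking is required */
static struct queue_stats {
    uint64_t packets;
    uint64_t bytes;
} stats[MAX_QUEUES];

static void poll_queue(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);

    for (uint16_t i = 0; i < nb_rx; i++) {
        stats[queue_id].packets++;
        stats[queue_id].bytes += rte_pktmbuf_pkt_len(pkts[i]);
        rte_pktmbuf_free(pkts[i]);
    }
}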

System

Recently I had a chance to test symmetric RSS on two Intel NICs, namely the XL710 40G and the 82599 10G.

The approach to configuring symmetric RSS on the XL710 differs from the standard DPDK approach: the i40e driver offers a specific API for this purpose.

DPDK 18.05.1 was used for testing.

82599 solution

To make symmetric RSS work, the default hash key has to be replaced with a custom one.

#define RSS_HASH_KEY_LENGTH 40
static uint8_t hash_key[RSS_HASH_KEY_LENGTH] = {
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
};

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = hash_key,
            .rss_key_len = RSS_HASH_KEY_LENGTH,
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

XL710 solution

To enable symmetric RSS, the i40e driver provides an API that sets up hardware registers.

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

int sym_hash_enable(int port_id, uint32_t ftype, enum rte_eth_hash_function function)
{
    struct rte_eth_hash_filter_info info;
    int ret = 0;
    uint32_t idx = 0;
    uint32_t offset = 0;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_GLOBAL_CONFIG;
    info.info.global_conf.hash_func = function;

    idx = ftype / UINT64_BIT;
    offset = ftype % UINT64_BIT;
    info.info.global_conf.valid_bit_mask[idx] |= (1ULL << offset);
    info.info.global_conf.sym_hash_enable_mask[idx] |= (1ULL << offset);

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set global hash configurations "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

int sym_hash_set(int port_id, int enable)
{
    int ret = 0;
    struct rte_eth_hash_filter_info info;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_SYM_HASH_ENA_PER_PORT;
    info.info.enable = enable;

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set symmetric hash enable per port "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_TCP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_UDP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_FRAG_IPV4, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_SCTP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_OTHER, RTE_ETH_HASH_FUNCTION_TOEPLITZ);

sym_hash_set(port_id, 1);

Learning DPDK: Inlining

Overview

The inlining method can help to mitigate the following:

  1. Function call overhead;
  2. Pipeline stall.

It is advised to apply the method to the following types of routines:

  1. Trivial and small functions used as accessors to data or wrappers around another function;
  2. Big functions called quite regularly but not from many places.

Solution

A modern compiler uses heuristics to decide which functions should be inlined, but it is always better to give it a hint using the following keywords.

static inline

Moreover, to take the decision away from the gcc compiler and force inlining, the following attribute should be used.

__attribute__((always_inline))
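
A minimal sketch of both hints applied to a trivial accessor (the structure and function names are illustrative):

#include <stdint.h>

struct port_stats {
    uint64_t rx_packets;
};

/* static inline asks for inlining; always_inline forces gcc to do it
   at every call site even when optimizations are disabled */
static inline __attribute__((always_inline))
uint64_t stats_rx(const struct port_stats *s)
{
    return s->rx_packets;
}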

Learning DPDK: Cloud support

Overview

DPDK-based products fit perfectly into the NFV paradigm. DPDK provides drivers for cloud NICs, so DPDK applications can run in AWS and VMware environments.

Limitations

The following nuances were discovered when using DPDK on VMware and Amazon platforms.

VMXNET3 driver

Both RX and TX queues have to be configured on the device. Otherwise, DPDK initialization crashes.
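
A minimal sketch of a device initialization that satisfies this, configuring one RX and one TX queue even for a receive-only application (the helper name and descriptor counts are illustrative):

#include <rte_ethdev.h>
#include <rte_lcore.h>

static int setup_port(uint16_t port_id, struct rte_mempool *mbuf_pool)
{
    struct rte_eth_conf port_conf = {
        .rxmode = { .mq_mode = ETH_MQ_RX_NONE, },
    };
    int ret;

    /* one RX and one TX queue, even if the application only receives */
    ret = rte_eth_dev_configure(port_id, 1, 1, &port_conf);
    if (ret < 0)
        return ret;

    ret = rte_eth_rx_queue_setup(port_id, 0, 512, rte_socket_id(),
                                 NULL, mbuf_pool);
    if (ret < 0)
        return ret;

    ret = rte_eth_tx_queue_setup(port_id, 0, 512, rte_socket_id(), NULL);
    if (ret < 0)
        return ret;

    return rte_eth_dev_start(port_id);
}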

ENA driver

The maximum number of buffer descriptors for an RX queue is 512.

Learning DPDK : DPI with Hyperscan

Why

To know which application generates the monitored traffic, it is not enough to know the IP address and TCP port; a look inside the HTTP header is required.

How

The HTTP header is matched against a collection of strings. Each string is associated with some protocol or service, like Facebook, Google Chat, etc.

Complications

String search is a slow operation; to make it fast, smart algorithms and hardware optimization techniques have to be leveraged.

Solution

A regex library called Hyperscan. A recorded introduction to the library and an evaluation of its speed are available online.

Integration

Install binary prerequisites

yum install ragel libstdc++-static

Download Hyperscan sources

wget https://github.com/intel/hyperscan/archive/v4.7.0.tar.gz
tar -xf v4.7.0.tar.gz

Download boost headers

wget https://dl.bintray.com/boostorg/release/1.67.0/source/boost_1_67_0.tar.gz
tar -xf boost_1_67_0.tar.gz
cp -r boost_1_67_0/boost hyperscan-4.7.0/include

Build and install Hyperscan shared library

Just follow the official build instructions.
cd hyperscan-4.7.0
mkdir build
cd build
cmake -DBUILD_SHARED_LIBS=true ..
make
make install

Link DPDK app against Hyperscan

Modify the Makefile as follows.
CFLAGS += -I/usr/local/include/hs/
LDFLAGS += -lhs

Build a database from a list of strings

Use hs_compile_multi() with the array of strings that you need to grep. To treat a string as a literal, escape it with the \Q and \E constructs from the PCRE syntax.
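
A minimal sketch of building such a database with two hypothetical literal patterns (the pattern strings and ids are illustrative; the Makefile above already adds the hs include path):

#include <hs.h>

static hs_database_t *build_db(void)
{
    /* literal strings escaped with \Q ... \E, one id per protocol */
    const char *patterns[]  = { "\\Qfacebook.com\\E", "\\Qhangouts.google.com\\E" };
    unsigned int flags[]    = { HS_FLAG_CASELESS, HS_FLAG_CASELESS };
    unsigned int ids[]      = { 1, 2 };
    hs_database_t *db = NULL;
    hs_compile_error_t *err = NULL;

    if (hs_compile_multi(patterns, flags, ids, 2, HS_MODE_BLOCK,
                         NULL, &db, &err) != HS_SUCCESS) {
        hs_free_compile_error(err);
        return NULL;
    }
    return db;
}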

Search

Use the hs_scan() API. Check the simplegrep example for more details.
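
A minimal sketch of the scan step, assuming the database from the previous sketch and a scratch space allocated with hs_alloc_scratch(); the match callback simply records the id of the matched pattern.

#include <hs.h>

/* called by Hyperscan for every match; returning 0 continues the scan */
static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx)
{
    (void)from; (void)to; (void)flags;
    *(unsigned int *)ctx = id; /* remember which protocol matched */
    return 0;
}

/* returns the id of the last matched pattern, 0 if nothing matched */
static unsigned int classify(const hs_database_t *db, hs_scratch_t *scratch,
                             const char *hdr, unsigned int len)
{
    unsigned int matched_id = 0;

    if (hs_scan(db, hdr, len, 0, scratch, on_match, &matched_id) != HS_SUCCESS)
        return 0;
    return matched_id;
}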

Learning DPDK : Java support

Overview

The DPDK framework is written in C in order to be fast and to leverage hardware optimization techniques. But software is written in many languages, and one of the most popular is Java.

So we had a project whose goal was to develop a packet-capturing Java library. To marry DPDK with Java, we chose JNI.

Building blocks

We have chosen the following approach to create a library that can be linked to a Java application.

  1. Build DPDK as a set of dynamic libraries.
    You need to enable CONFIG_RTE_BUILD_SHARED_LIB in the configuration.
  2. Generate C headers using JNI.
  3. Build your own dynamic library using the DPDK build system.
    You need to include rte.extshared.mk in the library Makefile.

Communication between DPDK and Java

There are two directions of communication, i.e. from the Java application to DPDK and the opposite.

Here you need to follow JNI guidelines with the following exceptions.

  1. Do not use a DPDK native thread for communication with Java; create a dedicated thread using pthread instead (see the sketch below). Otherwise, we observed a crash.
  2. Use Java static methods. It is not clear why, but we could not use regular Java methods.
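
A minimal sketch of the first two points (the Java class, method name and signature are hypothetical): a dedicated pthread attaches itself to the JVM and calls a static Java method.

#include <jni.h>
#include <pthread.h>

static JavaVM *jvm; /* cached in JNI_OnLoad() */

static void *deliver_loop(void *arg)
{
    JNIEnv *env;
    (void)arg;

    /* attach the pthread to the JVM to obtain a valid JNIEnv */
    (*jvm)->AttachCurrentThread(jvm, (void **)&env, NULL);

    jclass cls = (*env)->FindClass(env, "com/example/Capture");
    jmethodID mid = (*env)->GetStaticMethodID(env, cls, "onPacket", "([B)V");

    /* ... build a byte[] from each captured packet and call ...
       (*env)->CallStaticVoidMethod(env, cls, mid, byte_array); */
    (void)mid;

    (*jvm)->DetachCurrentThread(jvm);
    return NULL;
}

/* started from the DPDK side with a plain pthread, not an lcore thread */
static void start_delivery(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, deliver_loop, NULL);
}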

Learning DPDK : Packet capturing

A new project has the goal of capturing traffic at a 40G rate on a specified schedule.

Why

To analyze

  • security breaches,
  • misbehaviours or
  • faulty appliances

it is extremely useful to have the traffic fully recorded.

What

  • You can record the whole Ethernet packet.
  • You can trim its payload in case only headers are important for later analysis.
  • You can filter the traffic based on IP address and TCP/UDP port.

How

  • First, capture the traffic into the RAM.
  • Second, store it on disk (see the sketch below).
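
A minimal sketch of this two-stage pipeline using a DPDK ring as the RAM buffer; capture_ring is assumed to be created with rte_ring_create() and write_to_disk() is a hypothetical helper for the second stage.

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST_SIZE 32

static struct rte_ring *capture_ring; /* created with rte_ring_create() */

void write_to_disk(struct rte_mbuf **pkts, unsigned int n); /* hypothetical */

/* stage 1: receive packets and park them in RAM */
static void rx_stage(uint16_t port_id)
{
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
    unsigned int nb_q = rte_ring_enqueue_burst(capture_ring, (void **)pkts,
                                               nb_rx, NULL);
    while (nb_q < nb_rx)
        rte_pktmbuf_free(pkts[nb_q++]); /* ring full: drop the excess */
}

/* stage 2: drain the ring and store the packets on disk */
static void disk_stage(void)
{
    struct rte_mbuf *pkts[BURST_SIZE];
    unsigned int nb_d = rte_ring_dequeue_burst(capture_ring, (void **)pkts,
                                               BURST_SIZE, NULL);
    if (nb_d > 0)
        write_to_disk(pkts, nb_d);
}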

Complications

  • The average SSD disk speed is about 500 MB/s
  • The SATA 3.0 speed is 6 Gb/s

Solution

Capturing 40 Gbps means writing roughly 5 GB/s to disk, far beyond what a single SATA SSD can sustain, so it looks like the solution could be one of the following, or both:

  • RAID
  • PCIe + high-speed SSD disk

Learning DPDK : KNI interface

KNI (Kernel Network Interface) is an approach used in DPDK to connect user space applications with the kernel network stack.

The following slides present the concept using a number of functional block diagrams.

The code that we are interested in is located in the following places.

  • Sample KNI application
    • examples/kni
  • KNI kernel module
    • lib/librte_eal/linuxapp/kni
  • KNI library
    • lib/librte_kni
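
At its heart, the sample application runs a simple forwarding loop between the NIC and the KNI device. A minimal sketch of such a loop with the librte_kni API, assuming the port and the KNI context are already initialized as in examples/kni:

#include <rte_ethdev.h>
#include <rte_kni.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void kni_loop(uint16_t port_id, struct rte_kni *kni)
{
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        /* NIC -> kernel: pass received packets to the KNI device */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
        unsigned int nb_tx = rte_kni_tx_burst(kni, pkts, nb_rx);
        while (nb_tx < nb_rx)
            rte_pktmbuf_free(pkts[nb_tx++]); /* kernel side did not accept them */

        /* kernel -> NIC: transmit packets coming from the KNI device */
        unsigned int nb_krx = rte_kni_rx_burst(kni, pkts, BURST_SIZE);
        uint16_t nb_etx = rte_eth_tx_burst(port_id, 0, pkts, nb_krx);
        while (nb_etx < nb_krx)
            rte_pktmbuf_free(pkts[nb_etx++]); /* NIC queue full: drop */

        /* handle ifconfig-style requests coming from the kernel */
        rte_kni_handle_request(kni);
    }
}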

To begin testing KNI, we need to build the DPDK libraries first.

git clone git://dpdk.org/dpdk
export RTE_SDK=~/dpdk/
make config T=x86_64-native-linuxapp-gcc O=x86_64-native-linuxapp-gcc
cd x86_64-native-linuxapp-gcc
make

Then we need to compile KNI sample application

cd ${RTE_SDK}/examples/kni
export RTE_TARGET=x86_64-native-linuxapp-gcc
make

To run the above application we need to load KNI kernel module

insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko

The following kernel module options are available in case a loopback mode is required.

  • kthread_mode=single/multiple – number of kernel threads
  • lo_mode=lo_mode_fifo/lo_mode_fifo_skb – loopback mode

Enable enough huge pages

mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

Load the UIO kernel module and bind the network interfaces to it. Note that you will not be able to bind an interface if there is any route associated with it.

modprobe uio_pci_generic
${RTE_SDK}/tools/dpdk_nic_bind.py --status
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth1
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth2
${RTE_SDK}/tools/dpdk_nic_bind.py --status

In the case of a PC/VM with four cores, we can run the KNI application using the following commands.

export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 -- -P -p 0x3 --config="(0,0,1),(1,2,3)"

Where:

  • -c = core bitmask
  • -P = promiscuous mode
  • -p = port hex bitmask
  • --config="(port, lcore_rx, lcore_tx [,lcore_kthread, ...]) ..."

Note that each core can do either TX or RX for one port only.

You can use the following script to setup and run KNI test application.

#!/bin/sh
#setup path to DPDK
export RTE_SDK=/home/dpdk
export RTE_TARGET=x86_64-native-linuxapp-gcc
#setup 512 huge pages
mkdir -p /mnt/huge
umount -t hugetlbfs nodev /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
#bind eth1 and eth2 to Linux generic UIO
modprobe uio_pci_generic
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth1
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth2
#insert KNI kernel driver
insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko
#start KNI sample application
export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 -- -P -p 0x3 --config="(0,0,1),(1,2,3)"

Let’s set IP addresses on the KNI interfaces.

sudo ifconfig vEth0 192.168.56.100
sudo ifconfig vEth1 192.168.56.101

Now we are set to test the application. To see statistics we need to send the SIGUSR1 signal.

watch -n 10 sudo pkill -10 kni
