Learning DPDK: Traffic generator TRex


Overview

TRex is a stateful and stateless traffic generator based on DPDK. Its TCP stack implementation is derived from the original BSD 4.4 code.

Setup

Ubuntu 18.04 server installed on a VirtualBox VM with two interfaces connected in a loopback, 4 CPUs, and 4 GB RAM.

Install

Download and build the latest TRex on Ubuntu 18.04.

sudo apt -y install zlib1g-dev build-essential python python3-distutils
git clone https://github.com/cisco-system-traffic-generator/trex-core.git
cd trex-core/
cd linux_dpdk
./b configure
./b build
cd ..
sudo cp scripts/cfg/simple_cfg.yaml /etc/trex_cfg.yaml

Find out the PCI IDs of the interfaces to be used by TRex.

lspci | grep Eth
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:08.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)
00:09.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

trex_cfg.yaml

Edit the TRex config file, replacing the PCI IDs with those of your interfaces.

- port_limit : 2
  version : 2
  # List of interfaces. Change to suit your setup. Use ./dpdk_setup_ports.py -s to see available options
  interfaces : ["00:08.0","00:09.0"]
  port_info : # Port IPs. Change to suit your needs. In case of loopback, you can leave as is.
    - ip : 1.1.1.1
      default_gw : 2.2.2.2
    - ip : 2.2.2.2
      default_gw : 1.1.1.1

Run server

Run TRex in advanced stateful (ASTF) mode.

cd scripts/
sudo ./t-rex-64 -i --astf

Run console

Generate HTTP flows.

cd scripts/
./trex-console
trex> start -f astf/http_simple.py -m 1000 -d 1000 -l 1000
trex> tui

Traffic profile (http_simple.py)

from trex_astf_lib.api import *
class Prof1():
    def get_profile(self):
        # ip generator
        ip_gen_c = ASTFIPGenDist(ip_range=["10.10.10.0", "10.10.10.255"],
                                 distribution="seq")
        ip_gen_s = ASTFIPGenDist(ip_range=["20.20.20.0", "20.20.20.255"],
                                  distribution="seq")
        ip_gen = ASTFIPGen(glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
                           dist_client=ip_gen_c,
                           dist_server=ip_gen_s)

        return ASTFProfile(default_ip_gen=ip_gen,
                           cap_list=[ASTFCapInfo(
                                     file="../avl/delay_10_http_browsing_0.pcap",
                                     cps=1)
                                    ])

def register():
    return Prof1()

Results

Monitor flow statistics by pressing the “Esc” and “t” keys in “tui” mode.



Learning DPDK : Huge pages


Intro

Modern CPUs support several page sizes, e.g. 4 KB, 2 MB, and 1 GB. In Linux, all page sizes except 4 KB are called “huge pages”. The naming convention is historical: originally Linux supported only the 4 KB page size.

Larger page sizes benefit performance because fewer translations between virtual and physical addresses are needed, and the Translation Lookaside Buffer (TLB) cache is a scarce resource.


The TLB sizes can be checked with the cpuid utility.

cpuid | grep -i tlb
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb6: instruction TLB: 4K, 8-way, 128 entries
0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries

To check the number of allocated huge pages, use the following command.

cat /proc/meminfo | grep Huge
AnonHugePages: 4409344 kB
HugePages_Total: 32
HugePages_Free: 32
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB

There are two types of huge pages available in Linux.

  • Transparent (Anonymous) huge pages
  • Persistent huge pages

Transparent huge pages

Transparent huge pages are an abstraction layer that automates most aspects of creating, managing, and using huge pages. Because this mechanism has had performance and stability issues, DPDK does not rely on it and uses persistent huge pages instead.

Persistent huge pages

Persistent huge pages have to be configured manually and are never swapped out by the Linux kernel.

Linux offers the following management interfaces for allocating persistent huge pages.

  • Shared memory using shmget()
  • HugeTLBFS is a RAM-based filesystem; mmap(), read() or memfd_create() can be used to access its files
  • Anonymous mmap() with the MAP_ANONYMOUS and MAP_HUGETLB flags
  • libhugetlbfs APIs
  • Automatic backing of memory regions

DPDK uses persistent huge pages by default: mount points are discovered automatically and the pages are released when the application exits. If manual tuning is needed, the following EAL command-line parameters can be used.

  • --huge-dir Use specified hugetlbfs directory instead of autodetected ones.
  • --huge-unlink Unlink huge page files after creating them (implies no secondary process support).
  • --in-memory Do not rely on hugetlbfs at all (available in recent DPDK versions)

There are multiple ways to set up persistent huge pages.

  • At boot time
  • At run time

At boot time

Modify the Linux boot parameters in /etc/default/grub. The huge pages will be spread equally across all NUMA sockets.
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=32"

Update the grub configuration file and reboot.

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

Create a directory for a permanent hugetlbfs mount point.

mkdir /mnt/huge

Add the following line to the /etc/fstab file:
nodev /mnt/huge hugetlbfs defaults 0 0

At run time

Update the number of huge pages for each NUMA node. The default huge page size cannot be changed at run time.
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Create a mount point.

mkdir /mnt/huge

Mount hugetlbfs 
mount -t hugetlbfs nodev /mnt/huge

Memory allocation

While there are many ways to allocate persistent huge pages, DPDK uses the ones listed below (a minimal sketch follows the list).

  • mmap() call with hugetlbfs mount point
  • mmap() call with MAP_HUGETLB flag
  • memfd_create() call with MFD_HUGETLB flag
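
For illustration, a minimal sketch of the MAP_HUGETLB approach is shown below; it assumes 2 MB huge pages are configured on the system, and the mapping length must be a multiple of the huge page size.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* assumes 2 MB huge pages are configured */

int main(void)
{
    /* Ask the kernel for one anonymous huge page. */
    void *addr = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* fails if no free huge pages are available */
        return EXIT_FAILURE;
    }

    memset(addr, 0, HUGE_PAGE_SIZE);   /* touch the page so it is actually backed */
    printf("huge page mapped at %p\n", addr);

    munmap(addr, HUGE_PAGE_SIZE);
    return EXIT_SUCCESS;
}

The hugetlbfs variant is similar, except that a file descriptor of a file created under the mount point (e.g. /mnt/huge) is passed to mmap() instead of -1.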


Learning DPDK : Capture to Kafka


Requirements

The requirements are as follows.

  • Capture 10 Gbps of 128-byte packets into Apache Kafka
  • Implement basic filtering using IP addresses
  • Save traffic into one Kafka topic

Hardware

Software

Solution

The following key design ideas helped to achieve a 5 Gbps capture rate.

  • Use XFS filesystem
  • Combine small packets into big Kafka messages, roughly 500 KB each (see the sketch after this list)
  • Run 4 Kafka brokers on one physical server simultaneously
  • Allocate 20 partitions per topic
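
As an illustration of the batching idea, a minimal sketch using librdkafka is shown below; the broker address, topic name, and buffer handling are assumptions for the example, not details of the original setup.

#include <string.h>
#include <librdkafka/rdkafka.h>

#define BATCH_SIZE (500 * 1024)              /* assumed ~500 KB Kafka message */

static char   batch[BATCH_SIZE];
static size_t batch_len;

/* Append one captured packet to the current batch and emit the batch as a
 * single Kafka message once it would overflow. */
static void enqueue_packet(rd_kafka_t *rk, rd_kafka_topic_t *rkt,
                           const void *pkt, size_t len)
{
    if (batch_len + len > BATCH_SIZE && batch_len > 0) {
        rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_COPY,
                         batch, batch_len, NULL, 0, NULL);
        rd_kafka_poll(rk, 0);                /* serve delivery callbacks */
        batch_len = 0;
    }
    memcpy(batch + batch_len, pkt, len);
    batch_len += len;
}

int main(void)
{
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    /* Hypothetical broker address and topic name. */
    rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                      errstr, sizeof(errstr));
    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                  errstr, sizeof(errstr));
    rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "captured-traffic", NULL);

    const char pkt[128] = {0};               /* stand-in for a captured packet */
    enqueue_packet(rk, rkt, pkt, sizeof(pkt));

    rd_kafka_flush(rk, 5000);                /* wait for outstanding messages */
    rd_kafka_topic_destroy(rkt);
    rd_kafka_destroy(rk);
    return 0;
}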

Conclusion

The decision was made to use two servers in order to capture the full 10 Gbps of traffic.


Learning DPDK : Symmetric RSS


Overview

Receive side scaling (RSS) is a technology that distributes received packets between multiple RX queues using a predefined hash function. It enables a multicore CPU to process packets from different queues on different cores.

The promise of symmetric RSS is to place packets from both directions of the same TCP connection into the same RX queue. As a result, per-connection statistics can be stored in per-queue data structures, avoiding any need for locking.

System

Recently I had a chance to test symmetric RSS on two NICs from Intel, namely XL710 40G and 82599 10G. 

The approach to configuring symmetric RSS on the XL710 differs from the standard DPDK approach: the i40e driver offers a specific API for this purpose.

DPDK 18.05.1 was used for testing.

82599 solution

To make symmetric RSS work, the default hash key has to be replaced with a custom one.

#define RSS_HASH_KEY_LENGTH 40

static uint8_t hash_key[RSS_HASH_KEY_LENGTH] = {
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
};

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = hash_key,
            .rss_key_len = RSS_HASH_KEY_LENGTH,
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

XL710 solution

To enable symmetric RSS, the i40e driver provides an API to set up hardware registers.

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

int sym_hash_enable(int port_id, uint32_t ftype, enum rte_eth_hash_function function)
{
    struct rte_eth_hash_filter_info info;
    int ret = 0;
    uint32_t idx = 0;
    uint32_t offset = 0;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_GLOBAL_CONFIG;
    info.info.global_conf.hash_func = function;

    idx = ftype / UINT64_BIT;
    offset = ftype % UINT64_BIT;
    info.info.global_conf.valid_bit_mask[idx] |= (1ULL << offset);
    info.info.global_conf.sym_hash_enable_mask[idx] |= (1ULL << offset);

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set global hash configurations "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

int sym_hash_set(int port_id, int enable)
{
    int ret = 0;
    struct rte_eth_hash_filter_info info;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_SYM_HASH_ENA_PER_PORT;
    info.info.enable = enable;

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set symmetric hash enable per port "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_TCP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_UDP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_FRAG_IPV4, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_SCTP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_OTHER, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_set(port_id, 1);


Learning DPDK: Cloud support


Overview

DPDK-based products fit perfectly into the NFV paradigm. DPDK provides drivers for cloud-based NICs, so applications can run in AWS and VMware environments.

Limitations

The following nuances were discovered when using DPDK on VMware and Amazon platforms.

VMXNET 3 driver

Both RX and TX queues have to be configured on the device. Otherwise, DPDK initialization crashes.

ENA driver

The maximum number of buffer descriptors for an RX queue is 512.
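
For illustration only, a minimal port-initialization sketch that satisfies both constraints (one RX and one TX queue configured, and no more than 512 RX descriptors) could look like this; mempool creation and error reporting are left out.

#include <string.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>

#define RX_DESC_NUM 512   /* ENA: an RX ring must not exceed 512 descriptors */
#define TX_DESC_NUM 512

/* Configure one RX and one TX queue even for a receive-only application:
 * the vmxnet3 PMD fails to initialize when either queue count is zero. */
static int port_init(uint16_t port_id, struct rte_mempool *mbuf_pool)
{
    struct rte_eth_conf port_conf;
    int ret;

    memset(&port_conf, 0, sizeof(port_conf));

    ret = rte_eth_dev_configure(port_id, 1 /* rx */, 1 /* tx */, &port_conf);
    if (ret < 0)
        return ret;

    ret = rte_eth_rx_queue_setup(port_id, 0, RX_DESC_NUM,
                                 rte_eth_dev_socket_id(port_id),
                                 NULL, mbuf_pool);
    if (ret < 0)
        return ret;

    ret = rte_eth_tx_queue_setup(port_id, 0, TX_DESC_NUM,
                                 rte_eth_dev_socket_id(port_id), NULL);
    if (ret < 0)
        return ret;

    return rte_eth_dev_start(port_id);
}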


Learning DPDK: Branch Prediction


Overview

It is well known that modern CPUs are built around instruction pipelines that let them execute multiple instructions in parallel. But with conditional branches in the program code, not all instructions are executed every time. To compensate, speculative execution and branch prediction mechanisms are used to speed things up further by guessing which branch will be taken and executing it ahead of time. The problem is that on a wrong guess the speculative results have to be discarded, and the correct instructions have to be fetched into the instruction cache and executed on the spot.

Solution

An application developer can use the likely() and unlikely() macros, which are shortcuts for the GCC __builtin_expect() builtin. These macros hint to the compiler which path will be taken more often, thereby reducing the share of branch prediction misses.
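
A minimal illustration of unlikely() (declared in rte_branch_prediction.h) in a hypothetical receive loop; whether an empty poll is really the rare case depends on the deployment.

#include <rte_branch_prediction.h>   /* likely() / unlikely() */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);

        /* Assume a loaded link, so an empty burst is the uncommon path. */
        if (unlikely(nb_rx == 0))
            continue;

        for (uint16_t i = 0; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);   /* stand-in for real packet processing */
    }
}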


 

Learning DPDK : DPI with Hyperscan


Why

To know which application generates the monitored traffic, it is not enough to look at the IP address and TCP port; a look inside the HTTP header is required.

How

The HTTP header is matched against a collection of strings. Each string is associated with a particular service, such as Facebook or Google Chat.

Complications

String search is a slow operation; to make it fast, smart algorithms and hardware optimization techniques have to be leveraged.

Solution

A regex library called Hyperscan. You can listen to an introduction of the library here; its speed was evaluated here.

Integration

Install binary prerequisites

yum install ragel libstdc++-static

Download Hyperscan sources

wget https://github.com/intel/hyperscan/archive/v4.7.0.tar.gz
tar -xf v4.7.0.tar.gz

Download boost headers

wget https://dl.bintray.com/boostorg/release/1.67.0/source/boost_1_67_0.tar.gz
tar -xf boost_1_67_0.tar.gz
cp -r boost_1_67_0/boost hyperscan-4.7.0/include

Build and install Hyperscan shared library

Just follow the instructions from here.
cd hyperscan-4.7.0
mkdir build
cd build
cmake -DBUILD_SHARED_LIBS=true ..
make
make install

Link DPDK app against Hyperscan

Modify the Makefile as follows.
CFLAGS += -I/usr/local/include/hs/
LDFLAGS += -lhs

Build a database from a list of strings

Use hs_compile_multi() with the array of strings you need to grep for. To treat a string as a literal, wrap it in the \Q and \E markers from the PCRE syntax.

Search

Use the hs_scan() API.
Check the simplegrep example for more details; a minimal sketch is shown below.
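
A minimal block-mode sketch, assuming two hypothetical literal signatures escaped with \Q and \E:

#include <stdio.h>
#include <hs/hs.h>

/* Hypothetical literal signatures, escaped with \Q ... \E. */
static const char *const patterns[] = { "\\Qfacebook.com\\E",
                                        "\\Qhangouts.google.com\\E" };
static const unsigned int pattern_flags[] = { HS_FLAG_CASELESS, HS_FLAG_CASELESS };
static const unsigned int pattern_ids[]   = { 1, 2 };

static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx)
{
    printf("matched pattern id %u ending at offset %llu\n", id, to);
    return 0;   /* 0 = continue scanning */
}

int main(void)
{
    hs_database_t *db = NULL;
    hs_compile_error_t *err = NULL;
    hs_scratch_t *scratch = NULL;

    /* Build one database from the whole list of strings. */
    if (hs_compile_multi(patterns, pattern_flags, pattern_ids, 2,
                         HS_MODE_BLOCK, NULL, &db, &err) != HS_SUCCESS) {
        fprintf(stderr, "compile failed: %s\n", err->message);
        hs_free_compile_error(err);
        return 1;
    }
    hs_alloc_scratch(db, &scratch);

    const char payload[] = "GET / HTTP/1.1\r\nHost: hangouts.google.com\r\n";
    hs_scan(db, payload, sizeof(payload) - 1, 0, scratch, on_match, NULL);

    hs_free_scratch(scratch);
    hs_free_database(db);
    return 0;
}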


Learning DPDK : Java support


Overview

The DPDK framework is written in C so that it can be fast and take advantage of hardware optimization techniques. But software is written in many languages, and one of the most popular is Java.

We had a project whose goal was to develop a packet-capturing Java library. To marry DPDK with Java, we chose JNI.

Building blocks

We chose the following approach to create a library that can be linked to a Java application.

  1. Build DPDK as a set of dynamic libraries.
    You need to enable CONFIG_RTE_BUILD_SHARED_LIB in the configuration.
  2. Generate C headers using JNI.
  3. Build own dynamic library using DPDK build system.
    You need to include rte.extshared.mk in the library Makefile.

Communication between DPDK and JAVA

There are two directions of communication: from the Java application to DPDK and the opposite.

Here you need to follow the JNI guidelines, with the following exceptions.

  1. Do not use a DPDK native thread for communication with Java; create a dedicated thread using pthread instead. Otherwise, we observed a crash.
  2. Use Java static methods. It is not clear why, but we could not use regular Java methods. (A sketch covering both points follows this list.)
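
A hedged sketch of the native-to-Java direction, with a dedicated pthread attached to the JVM and a static Java method being called; the class and method names (com.example.Capture, onPacket) are hypothetical.

#include <pthread.h>
#include <jni.h>

static JavaVM *g_jvm;    /* cached when the library is loaded */

JNIEXPORT jint JNICALL JNI_OnLoad(JavaVM *vm, void *reserved)
{
    g_jvm = vm;
    return JNI_VERSION_1_8;
}

/* Dedicated reporting thread: attach it to the JVM and call a static
 * Java method instead of calling back from a DPDK lcore thread. */
static void *report_thread(void *arg)
{
    JNIEnv *env;
    (*g_jvm)->AttachCurrentThread(g_jvm, (void **)&env, NULL);

    jclass cls = (*env)->FindClass(env, "com/example/Capture");
    jmethodID mid = (*env)->GetStaticMethodID(env, cls, "onPacket", "(I)V");

    for (int i = 0; i < 10; i++)
        (*env)->CallStaticVoidMethod(env, cls, mid, (jint)i);

    (*g_jvm)->DetachCurrentThread(g_jvm);
    return NULL;
}

/* Called from Java to start the native reporting thread. */
JNIEXPORT void JNICALL Java_com_example_Capture_start(JNIEnv *env, jclass cls)
{
    pthread_t tid;
    pthread_create(&tid, NULL, report_thread, NULL);
}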

 

Learning DPDK : Packet capturing

A new project has the goal of capturing traffic at a 40G rate on a specified schedule.

Why

To analyze

  • security breaches,
  • misbehaviours or
  • faulty appliances

it is extremely useful to have full traffic traces recorded.

What

  • You can record the whole Ethernet packet.
  • You can trim its payload in case only headers are important for later analysis.
  • You can filter the traffic based on IP address and TCP/UDP port.

How

  • First, capture the traffic into the RAM.
  • Second, store it on disk.

Complications

  • Average SSD disk speed is about 500 MB/s
  • SATA 3.0 speed is 6 Gb/s

Solution

It looks like the solution could be one of the following, or both (see the rough estimate below):

  • RAID
  • PCIe + high-speed SSD disk
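
As a rough estimate against the 40G target above: 40 Gbps is about 5 GB/s of data, while a single SATA 3.0 link tops out at 6 Gb/s (roughly 600 MB/s) and an average SSD at about 500 MB/s, so on the order of ten drives striped in a RAID array, or PCIe-attached SSDs, are needed to sustain the write rate.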

Learning DPDK: 10G traffic generator

DPDK was successfully used in a project to create an Ostinato-based traffic generator.

The following performance was achieved on 1G NIC.
Ostinato performance report for 1G link

The following performance was achieved on 10G NIC.
Ostinato performance report for 10G link

The code for this project is available on Github:
https://github.com/PLVision/ostinato-dpdk

The library responsible for the DPDK integration is on GitHub as well:
https://github.com/PLVision/dpdkadapter