Learning DPDK : DPI with Hyperscan


Why

To know which application generates the monitored traffic, it is not enough to know the IP addresses and TCP/UDP ports; a look inside the HTTP header is required.

How

The HTTP header is matched against a collection of strings. Each string is associated with a protocol, e.g. Facebook, Google Chat, etc.

Complications

String search is a slow operation; to make it fast, smart algorithms and hardware optimization techniques have to be leveraged.

Solution

A regex library called Hyperscan. You can listen to an introduction to the library here. The speed of the library was evaluated here.

Integration

Install binary prerequisites

yum install ragel libstdc++-static

Download Hyperscan sources

wget https://github.com/intel/hyperscan/archive/v4.7.0.tar.gz
tar -xf v4.7.0.tar.gz

Download boost headers

wget https://dl.bintray.com/boostorg/release/1.67.0/source/boost_1_67_0.tar.gz
tar -xf boost_1_67_0.tar.gz
cp -r boost_1_67_0/boost hyperscan-4.7.0/include

Build and install Hyperscan shared library

Just follow the instructions from here.
cd hyperscan-4.7.0
mkdir build
cd build
cmake -DBUILD_SHARED_LIBS=true ..
make
make install

Link DPDK app against Hyperscan

Modify Makefile as follows.
CFLAGS += -I/usr/local/include/hs/
LDFLAGS += -lhs

Build a database from a list of strings

Use hs_compile_multi() with an array of the strings that you need to grep. To escape a string, use the \Q and \E symbols from PCRE syntax.
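
For illustration, a minimal sketch of building such a database (the literal patterns and protocol ids below are hypothetical placeholders; a real deployment would load them from a configuration file):

#include <hs.h>
#include <cstdio>

/* Hypothetical protocol signatures, escaped as literals with \Q ... \E. */
static const char *patterns[] = {
    "\\Qfacebook.com\\E",    /* id 1 */
    "\\Qchat.google.com\\E"  /* id 2 */
};
static const unsigned int ids[]   = { 1, 2 };
static const unsigned int flags[] = { HS_FLAG_CASELESS, HS_FLAG_CASELESS };

static hs_database_t *build_database(void)
{
    hs_database_t *db = NULL;
    hs_compile_error_t *compile_err = NULL;

    /* Compile all literals into one block-mode database. */
    if (hs_compile_multi(patterns, flags, ids, 2, HS_MODE_BLOCK,
                         NULL /* current platform */, &db, &compile_err) != HS_SUCCESS) {
        fprintf(stderr, "hs_compile_multi failed: %s\n", compile_err->message);
        hs_free_compile_error(compile_err);
        return NULL;
    }
    return db;
}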

Search

Use the hs_scan() API.
Check the simplegrep example for more details.
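
A minimal scanning sketch, assuming a database built as above (the onMatch handler and buffer names are illustrative):

#include <hs.h>
#include <cstdio>

/* Called by Hyperscan for every match; id is the protocol id passed to hs_compile_multi(). */
static int onMatch(unsigned int id, unsigned long long from,
                   unsigned long long to, unsigned int flags, void *ctx)
{
    (void)from; (void)flags; (void)ctx;
    printf("matched protocol id %u at offset %llu\n", id, to);
    return 0; /* 0 = continue scanning, non-zero = stop */
}

static void classify(hs_database_t *db, const char *http_hdr, unsigned int len)
{
    hs_scratch_t *scratch = NULL;

    /* Scratch space is per thread; in a real application allocate it once, not per packet. */
    if (hs_alloc_scratch(db, &scratch) != HS_SUCCESS)
        return;

    if (hs_scan(db, http_hdr, len, 0, scratch, onMatch, NULL) != HS_SUCCESS)
        fprintf(stderr, "hs_scan failed\n");

    hs_free_scratch(scratch);
}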


Learning DPDK : Packet capturing

A new project has the goal of capturing traffic at a 40G rate on a specified schedule.

Why

To analyze

  • security breaches,
  • misbehaviours or
  • faulty appliances

it is extremely useful to have the traffic traces fully recorded.

What

  • You can record the whole Ethernet packet.
  • You can trim its payload in case only headers are important for later analysis.
  • You can filter the traffic based on IP address and TCP/UDP port.

How

  • First, capture the traffic into RAM.
  • Second, store it on disk (a rough sketch of this pipeline follows below).
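
A rough sketch of such a two-stage pipeline, assuming an rte_ring (created elsewhere with rte_ring_create()) decouples the RX core from the disk-writer core; BURST and SNAPLEN are placeholder values:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>
#include <cstdio>

#define BURST   32
#define SNAPLEN 128  /* placeholder: keep only the headers */

/* Stage 1: the RX core drains the NIC and parks mbufs in a ring living in RAM. */
static void capture_loop(uint16_t port, struct rte_ring *ring)
{
    struct rte_mbuf *bufs[BURST];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST);
        for (uint16_t i = 0; i < n; i++) {
            if (rte_ring_enqueue(ring, bufs[i]) != 0)
                rte_pktmbuf_free(bufs[i]);  /* ring full: drop */
        }
    }
}

/* Stage 2: the writer core dequeues packets and appends (optionally trimmed) data to a file. */
static void writer_loop(struct rte_ring *ring, FILE *out)
{
    void *obj;

    for (;;) {
        if (rte_ring_dequeue(ring, &obj) != 0)
            continue;
        struct rte_mbuf *m = static_cast<struct rte_mbuf *>(obj);
        uint16_t len = m->data_len < SNAPLEN ? m->data_len : SNAPLEN;
        fwrite(rte_pktmbuf_mtod(m, const char *), 1, len, out);
        rte_pktmbuf_free(m);
    }
}

A real implementation would write pcap/pcapng records and use bulk enqueue/dequeue, but the split between a capture core and a writer core is the essential part.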

Complications

  • Average SSD disk speed is about 500 MB/s
  • SATA 3.0 speed is 6 Gb/s

Solution

Since 40 Gb/s of traffic is roughly 5 GB/s, an order of magnitude above a single SATA SSD, it looks like the solution could be one of the following, or both:

  • RAID
  • PCIe-attached high-speed SSD disk

Learning DPDK : Deep Packet Inspection 40/100G

Currently, I am working on a project whose goal is to analyze application protocols. The result is functionality similar to a stateful firewall.

The main challenges:

  1. Parsing at line rate
  2. Storing state

The following hardware parts were chosen:

  1. FPGA based smart NIC supported by DPDK
  2. Powerful multicore Intel server with lots of RAM

To sustain high speed and enable scalability, the main DPDK application design principles have to be respected:

  1. Avoid thread synchronization
  2. Avoid heavy operations such as memcpy, etc.
  3. Take NUMA into consideration

The design of the resulting application is presented in the diagram below. Its cornerstone is the ability of the NIC to distribute bidirectional flows between multiple CPU cores, i.e. Receive Side Scaling (RSS).

As a result, packets from one flow are always handled by the same CPU core. Furthermore, each thread uses its own dedicated data structures, e.g. a hash table, without any need for synchronization.
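
A hedged sketch of enabling RSS during port configuration (the queue count is a parameter; landing both directions of a flow on the same queue typically requires a symmetric RSS key, which is NIC specific and not shown here):

#include <rte_ethdev.h>
#include <cstring>

/* Configure the port to spread incoming flows across nb_queues RX queues via RSS. */
static int configure_rss(uint16_t port, uint16_t nb_queues)
{
    struct rte_eth_conf conf;
    memset(&conf, 0, sizeof(conf));

    conf.rxmode.mq_mode = ETH_MQ_RX_RSS;
    /* Hash on IP addresses and TCP/UDP ports so every packet of a flow hits the same queue. */
    conf.rx_adv_conf.rss_conf.rss_key = NULL;  /* default key; replace with a symmetric key for bidirectional flows */
    conf.rx_adv_conf.rss_conf.rss_hf  = ETH_RSS_IP | ETH_RSS_TCP | ETH_RSS_UDP;

    return rte_eth_dev_configure(port, nb_queues, nb_queues, &conf);
}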

XDR design

Each flow entry is allocated from a memory pool, and a hash table is used to quickly locate the flow entry by a key tuple consisting of the source/destination IP addresses and TCP/UDP ports.
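
A per-lcore sketch of that lookup path, assuming DPDK's rte_hash and rte_mempool (the key layout, the table size and the slots array mapping hash indexes to entries are assumptions of this sketch):

#include <rte_hash.h>
#include <rte_jhash.h>
#include <rte_mempool.h>
#include <cstdio>
#include <cstring>

/* Hypothetical flow key: src/dst IP plus src/dst port. */
struct flow_key {
    uint32_t ip_src;
    uint32_t ip_dst;
    uint16_t port_src;
    uint16_t port_dst;
};

struct flow_entry {
    uint64_t packets;
    uint64_t bytes;
    /* ... per-flow protocol state ... */
};

/* One hash table per lcore, so no locking is required. */
static struct rte_hash *create_flow_table(unsigned lcore)
{
    char name[32];
    snprintf(name, sizeof(name), "flow_tbl_%u", lcore);

    struct rte_hash_parameters params;
    memset(&params, 0, sizeof(params));
    params.name      = name;
    params.entries   = 1 << 20;                /* assumed table size */
    params.key_len   = sizeof(struct flow_key);
    params.hash_func = rte_jhash;
    params.socket_id = rte_socket_id();

    return rte_hash_create(&params);
}

static struct flow_entry *lookup_or_create(struct rte_hash *tbl,
                                           struct rte_mempool *pool,
                                           const struct flow_key *key,
                                           struct flow_entry **slots)
{
    int32_t idx = rte_hash_lookup(tbl, key);
    if (idx >= 0)
        return slots[idx];              /* existing flow */

    void *obj;
    if (rte_mempool_get(pool, &obj) != 0)
        return NULL;                    /* pool exhausted */

    idx = rte_hash_add_key(tbl, key);
    if (idx < 0) {
        rte_mempool_put(pool, obj);
        return NULL;
    }
    slots[idx] = static_cast<struct flow_entry *>(obj);
    return slots[idx];
}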

In the next article, I am going to describe the specifics of the cards that I plan to test, i.e. Napatech and Netcope.

Learning DPDK : Algorithmics

The DPDK library contains a lot of things that help an application leveraging its API to achieve line-speed switching/routing. To name a few:
1. Poll-mode drivers (recently enhanced with interrupt support)
2. Hardware optimizations (SSE, AVX2, etc.)
3. OS optimizations (huge pages, etc.)
4. Lookup algorithms (Hash, LPM, etc.)

The following slides are an incomplete overview of the packet lookup algorithms supported in the current version of DPDK.
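
For a concrete taste of the lookup APIs, a small LPM sketch is shown below (written against the rte_lpm API of DPDK 16.04 and later, where next hops are 32-bit values; it assumes the EAL has already been initialized):

#include <rte_lcore.h>
#include <rte_lpm.h>
#include <cstdio>

/* Build a tiny IPv4 LPM table and look up one address (addresses are in host byte order). */
static void lpm_demo(void)
{
    struct rte_lpm_config cfg;
    cfg.max_rules    = 1024;
    cfg.number_tbl8s = 256;
    cfg.flags        = 0;

    struct rte_lpm *lpm = rte_lpm_create("lpm_demo", rte_socket_id(), &cfg);
    if (lpm == NULL)
        return;

    /* 10.0.0.0/8 -> next hop 1 */
    rte_lpm_add(lpm, 10u << 24, 8, 1);

    uint32_t next_hop = 0;
    uint32_t ip = (10u << 24) | (1u << 16) | (2u << 8) | 3u;  /* 10.1.2.3 */
    if (rte_lpm_lookup(lpm, ip, &next_hop) == 0)
        printf("next hop %u\n", next_hop);

    rte_lpm_free(lpm);
}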

Good books on the topic are:
High Performance Switches and Routers

Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices

Learning DPDK : vagrant VM

It is very convenient to develop a DPDK-based solution using one or multiple VMs. My choice of tools for this task is a combination of Vagrant and VirtualBox. This pair makes the deployment of a VM really simple and easily reproducible.

The following slides give a short overview of the vagrant usage scenario.

My vagrant configuration file looks as follows.


# -*- mode: ruby -*-
# vi: set ft=ruby :
# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "trusty64"
  config.vm.provision :shell, path: "../bootstrap.sh"
  config.vm.network "public_network", bridge: "br-1", ip: "192.168.56.1"
  config.vm.network "public_network", bridge: "br-2", ip: "192.168.57.1"
  config.vm.network "public_network", bridge: "br-3", ip: "192.168.58.1"
  config.vm.provider :virtualbox do |vb|
    # # Don't boot with headless mode
    # vb.gui = true
    #
    # # Use VBoxManage to customize the VM. For example to change memory:
    vb.customize ["modifyvm", :id, "--memory", "1024"]
    vb.customize ["modifyvm", :id, "--cpuexecutioncap", "50"]
    vb.customize ["modifyvm", :id, "--ioapic", "on"]
    vb.customize ["modifyvm", :id, "--cpus", "4"]
  end
end


My bootstrap.sh file contains the following.

#!/bin/bash
apt-get -y install git libtool autoconf g++ gdb

Learning DPDK : KNI interface

KNI (Kernel Network Interface) is an approach that is used in DPDK to connect user space applications with the kernel network stack.

The following slides present the concept using a number of functional block diagrams.

The code that we are interested in is located in the following places.

  • Sample KNI application
    • examples/kni
  • KNI kernel module
    • lib/librte_eal/linuxapp/kni
  • KNI library
    • lib/librte_kni

To begin testing KNI, we need to build the DPDK libraries first.

git clone git://dpdk.org/dpdk
export RTE_SDK=~/dpdk/
make config T=x86_64-native-linuxapp-gcc O=x86_64-native-linuxapp-gcc
cd x86_64-native-linuxapp-gcc
make

Then we need to compile the KNI sample application.

cd ${RTE_SDK}/examples/kni
export RTE_TARGET=x86_64-native-linuxapp-gcc
make

To run the above application, we need to load the KNI kernel module.

insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko

The following kernel module options are available in case a loopback mode is required.

  • kthread_mode=single/multiple – number of kernel threads
  • lo_mode=lo_mode_fifo/lo_mode_fifo_skb – loopback mode

Enable enough huge pages

mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

Load the UIO kernel module and bind network interfaces to it. Note that you will not be able to bind an interface if there is any route associated with it.

modprobe uio_pci_generic
${RTE_SDK}/tools/dpdk_nic_bind.py --status
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth1
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth2
${RTE_SDK}/tools/dpdk_nic_bind.py --status

In the case of a PC/VM with four cores, we can run the KNI application using the following commands.

export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 -- -P -p 0x3 --config="(0,0,1),(1,2,3)"

Where:

  • -c = core bitmask
  • -P = promiscuous mode
  • -p = port hex bitmask
  • --config="(port, lcore_rx, lcore_tx [,lcore_kthread, ...]) ..."

Note that each core can do either TX or RX for one port only.

You can use the following script to setup and run KNI test application.


#!/bin/sh
# setup path to DPDK
export RTE_SDK=/home/dpdk
export RTE_TARGET=x86_64-native-linuxapp-gcc
# setup 512 huge pages
mkdir -p /mnt/huge
umount /mnt/huge
mount -t hugetlbfs nodev /mnt/huge
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# bind eth1 and eth2 to Linux generic UIO
modprobe uio_pci_generic
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth1
${RTE_SDK}/tools/dpdk_nic_bind.py --bind=uio_pci_generic eth2
# insert KNI kernel driver
insmod ${RTE_SDK}/${RTE_TARGET}/kmod/rte_kni.ko
# start KNI sample application
export LD_LIBRARY_PATH=${RTE_SDK}/${RTE_TARGET}/lib/
${RTE_SDK}/examples/kni/build/kni -c 0x0f -n 4 -- -P -p 0x3 --config="(0,0,1),(1,2,3)"


Let's set IP addresses on the KNI interfaces.

sudo ifconfig vEth0 192.168.56.100
sudo ifconfig vEth1 192.168.56.101

Now we are set to test the application. To see statistics we need to send the SIGUSR1 signal.

watch -n 10 sudo pkill -10 kni


Custom priority queue in C++

A priority queue container is the basis for some algorithms on graphs, for instance Dijkstra's Shortest Path.

An implementation has to be efficient in terms of speed while allowing random access to (retrieving and changing) queue elements.

The STL implementation of the priority queue does not support random access to its elements.

Presented here is a custom implementation of a priority queue with the random-access property. To make access operations reasonably fast, the bucket sort algorithm is leveraged.

The idea of bucket sorting is to have a sorted container of lists of elements with equal values. It can be implemented as an array with pointers to lists of elements. We follow another approach that may slightly impact the speed of the algorithm while removing the restrictions caused by the static nature of an array: we use a tree (STL map) to store the lists. Another option is a dynamic array (STL vector), which could be more efficient on big data sets.

Our PriorityQueue class is built upon two STL maps of the custom types queue_t and id2ValueMap_t. The first map is a collection of STL set containers, which are, in fact, the buckets, while the second map is used for instant access to the value of any element.

Priority queue

The aforementioned class has the following declaration.

#include <map>
#include <set>

typedef unsigned int uint;

typedef std::set<uint> bucket_t;
typedef std::map<uint, bucket_t> queue_t;
typedef std::map<uint, uint> id2ValueMap_t;

class PriorityQueue
{
public:
   PriorityQueue();

   void addElement(uint id, uint value);
   bool removeElement(uint id);

   bool getValue(uint id, uint* pValue) const;
   bool modifyValue(uint id, uint oldValue, uint newValue);

   bool getMinimalElement(uint* pId) const;

   bool isEmpty() const;
   bool contains(uint id) const;
   void clear();

private:
   queue_t       _queue;
   id2ValueMap_t _id2ValueMap;
};

It is critical that both maps contain relevant and consistent information; thus, inserting into the queue looks as follows. Error checking is omitted in all method implementations for the sake of clarity.

void PriorityQueue::addElement(uint id, uint value)
{
   bucket_t& bucket = _queue[value];
   bucket.insert(id);

   _id2ValueMap[id] = value;
}

Retrieving the element with the minimal value looks like the following code.

bool PriorityQueue::getMinimalElement(uint* pId) const
{
   queue_t::const_iterator queue_iter = _queue.begin();

   if (queue_iter == _queue.end())
   {
       return false; /* queue is empty */
   }

   const bucket_t& bucket = queue_iter->second;

   bucket_t::const_iterator bucket_iter = bucket.begin();

   if (bucket_iter == bucket.end())
   {
       return false; /* list is empty */
   }

   *pId = *bucket_iter;

   return true;
}

The following method is used to change the value of an element. Basically, it removes the element from its bucket in the first map, adds it to another bucket, and updates the value in the map that is used to quickly get the value from the identifier.

bool PriorityQueue::modifyValue(uint id, uint oldValue, uint newValue)
{
   queue_t::iterator queue_iter;

   queue_iter = _queue.find(oldValue);

   if (queue_iter == _queue.end())
   {
       return false; /* no element found */
   }

   bucket_t& bucket = queue_iter->second;

   bucket.erase(id);

   if (bucket.empty())
   {
       _queue.erase(queue_iter); /* drop empty bucket so it does not block getMinimalElement() */
   }

   addElement(id, newValue);

   return true;
}

References
http://www.cplusplus.com/reference/queue/priority_queue/
https://en.wikipedia.org/wiki/Bucket_sort

Bucket-based priority queue

A priority queue is a widely used container that appears in many algorithms operating on graphs, such as Shortest Path First (SPF).

Since the SPF algorithm lies at the core of the routing protocols running as part of router software, the queue implementation has to be efficient in terms of both speed and memory consumption.

To meet this goal, bucket-based sorting is an ideal solution. An array with capacity equal to the product of the network diameter and the maximal possible link cost is used, and each cell in the array is a pointer to the list of nodes whose distance equals the cell position.

Bucket sorting

The operation of retrieving the minimal value from the queue is made very fast by use of a head pointer that is moved forward as soon as a minimal value is removed from the queue.

This is possible because in the SPF algorithm larger distances always stay in the queue while smaller distances are removed and added to the tree. This property also allows the array to be used in a round-robin fashion, as sketched below.
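
A compact C++ sketch of such a circular bucket queue (types, sizes and the decrease-key path are simplified for illustration):

#include <cstddef>
#include <list>
#include <vector>

// Circular bucket queue: bucket index == distance modulo the array size.
// Works because SPF only ever inserts keys within max_link_cost of the current minimum.
class BucketQueue {
public:
    explicit BucketQueue(unsigned max_link_cost)
        : _buckets(max_link_cost + 1), _head(0), _size(0) {}

    void push(unsigned node, unsigned distance) {
        _buckets[distance % _buckets.size()].push_back(node);
        ++_size;
    }

    // Advance the head pointer until a non-empty bucket is found and pop from it.
    bool pop_min(unsigned *node) {
        if (_size == 0)
            return false;
        while (_buckets[_head].empty())
            _head = (_head + 1) % _buckets.size();
        *node = _buckets[_head].front();
        _buckets[_head].pop_front();
        --_size;
        return true;
    }

private:
    std::vector<std::list<unsigned>> _buckets;
    std::size_t _head;
    std::size_t _size;
};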

References

Dijkstra’s shortest path algorithm

Network Algorithmics

Shortest Path First

Overview

The Shortest Path First algorithm, developed by Dijkstra, is used to calculate forwarding tables in link-state routing protocols, namely OSPF and IS-IS.

The key element in the algorithm implementation is a priority queue used to sort distances from the root node.
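
For reference, a condensed Dijkstra sketch driven by a priority queue (this one uses std::priority_queue with lazy deletion instead of the bucket queue from the previous post; the adjacency-list graph type is illustrative):

#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

typedef std::pair<unsigned, unsigned> Edge;      // (neighbour, link cost)
typedef std::vector<std::vector<Edge>> Graph;    // adjacency list

// Returns the shortest distance from `root` to every node.
std::vector<unsigned> shortest_paths(const Graph &g, unsigned root)
{
    const unsigned INF = std::numeric_limits<unsigned>::max();
    std::vector<unsigned> dist(g.size(), INF);

    // Min-heap ordered by distance; stale entries are skipped when popped.
    typedef std::pair<unsigned, unsigned> QItem;  // (distance, node)
    std::priority_queue<QItem, std::vector<QItem>, std::greater<QItem>> pq;

    dist[root] = 0;
    pq.push(std::make_pair(0u, root));

    while (!pq.empty()) {
        QItem top = pq.top();
        pq.pop();
        unsigned d = top.first, u = top.second;
        if (d > dist[u])
            continue;                              // stale entry
        for (std::size_t i = 0; i < g[u].size(); ++i) {
            unsigned v = g[u][i].first, w = g[u][i].second;
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                pq.push(std::make_pair(dist[v], v));
            }
        }
    }
    return dist;
}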

Flowchart

Shortest path

References
http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm