Here I will present some nuances of using DPDK libraries that I learned while implementing a custom DPDK-based application.
These tips could save some time for developers building independent applications on top of the DPDK libraries.
Linking DPDK library
By default the DPDK sources are built into a set of separate libraries, but there is an option to pack all of them into one dynamic or static library. To achieve this, the file “config/common_linuxapp” has to be modified by assigning “y” to the “CONFIG_RTE_BUILD_COMBINE_LIBS” option.
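The change amounts to a one-line edit of the build configuration (the rest of the file stays as-is):

```
# config/common_linuxapp
CONFIG_RTE_BUILD_COMBINE_LIBS=y
```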
Besides that, the following options are required in the application's Makefile (the `-D` prefixes were lost in the original formatting; RTE_COMPILE_TIME_CPUFLAGS takes a comma-separated list of flags).
-include /.../rte_config.h -D__STDC_LIMIT_MACROS -DRTE_MAX_LCORE=64 -DRTE_PKTMBUF_HEADROOM=128 -DRTE_MAX_ETHPORTS=32 -DRTE_MACHINE_CPUFLAG_SSE -DRTE_MACHINE_CPUFLAG_SSE2 -DRTE_MACHINE_CPUFLAG_SSE3 -DRTE_MACHINE_CPUFLAG_SSSE3 -DRTE_COMPILE_TIME_CPUFLAGS=RTE_CPUFLAG_SSE,RTE_CPUFLAG_SSE2,RTE_CPUFLAG_SSE3,RTE_CPUFLAG_SSSE3
Drop unprocessed traffic
As I discovered, it is important to set up DPDK RX queues to drop received packets if they are not processed by the application in time. Otherwise both RX and TX performance will be compromised.
To enable such behavior, rte_eth_rx_queue_setup has to be provided with an rte_eth_rxconf structure whose rx_drop_en field is set to 1.
This issue may not be present on all NICs, but only on some of them.
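As a sketch, the setup could look like this (port id, queue id, descriptor count and the mempool are illustrative; the snippet assumes a DPDK build environment and is not runnable on its own):

```c
#include <string.h>
#include <rte_ethdev.h>

/* Ask the NIC to drop packets itself when the RX queue runs out of
 * free descriptors, instead of stalling the port. */
static int setup_drop_rx(uint8_t port, struct rte_mempool *pool)
{
    struct rte_eth_rxconf rx_conf;

    memset(&rx_conf, 0, sizeof(rx_conf));
    rx_conf.rx_drop_en = 1;  /* drop when no RX descriptors are available */

    return rte_eth_rx_queue_setup(port, 0 /* queue id */,
                                  128 /* nb RX descriptors */,
                                  rte_eth_dev_socket_id(port),
                                  &rx_conf, pool);
}
```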
Mbuf reference count
If a specific packet has to be sent many times to the same interface, it is impractical and quite inefficient to copy its contents before every transmit. Such a copy operation dramatically reduces datapath throughput: not only does it cost memory cycles, but, more importantly, it prevents efficient use of the CPU cache.
To allow a zero-copy mechanism, DPDK provides a reference counter for each memory buffer (mbuf) used to store a packet.
The function rte_pktmbuf_refcnt_update can be used to increment the reference counter before each send invocation. In this scenario the memory buffer is not released back to its memory pool after the packet has been sent out of the port, so the same buffer can be used again later.
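A sketch of the idea (port and queue ids are illustrative; requires a DPDK environment). Each successful transmit decrements the reference counter by one, so adding n extra references up front keeps the buffer alive across n sends:

```c
#include <rte_mbuf.h>
#include <rte_ethdev.h>

/* Transmit the same mbuf n times without copying its contents. */
static void send_n_times(uint8_t port, struct rte_mbuf *m, unsigned n)
{
    unsigned i;

    rte_pktmbuf_refcnt_update(m, (int16_t)n);  /* n extra references */
    for (i = 0; i < n; i++) {
        struct rte_mbuf *pkt = m;
        while (rte_eth_tx_burst(port, 0, &pkt, 1) == 0)
            ;  /* retry until the TX ring accepts the packet */
    }
    rte_pktmbuf_free(m);  /* drop our own, original reference */
}
```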
Jumbo frames
Jumbo frames are frames whose size exceeds the 1500-byte payload of a regular Ethernet frame. Usually their size does not exceed 9000 bytes.
To enable a DPDK application to receive such big frames, the “rte_eth_conf.rxmode.jumbo_frame” field has to be set to 1 and “rte_eth_conf.rxmode.max_rx_pkt_len” has to be set to the maximum supported frame size.
Afterwards rte_eth_conf structure has to be provided as a parameter to the function rte_eth_dev_configure that is used to configure a specific interface.
Besides that, it has to be noted that such frames are handled using multi-segment memory buffers (mbufs). The following mbuf fields support segmentation:
- nb_segs – in the first segment, the overall number of segments in the chain
- data_len – the payload length of a particular segment
- pkt_len – the total length of the payload stored across all segments in the chain
- next – the pointer to the next segment, or NULL in the last one
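As a sketch, the configuration and the segment fields above fit together like this (port-specific values are illustrative; assumes a DPDK era where rxmode still has the jumbo_frame bit):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Port configuration accepting jumbo frames up to 9000 bytes. */
static struct rte_eth_conf port_conf = {
    .rxmode = {
        .jumbo_frame    = 1,     /* accept frames bigger than the default */
        .max_rx_pkt_len = 9000,  /* largest frame the port should accept */
    },
};

/* Walk a multi-segment chain: summing data_len over all segments
 * yields the same value as pkt_len in the first segment. */
static uint32_t chain_payload_len(const struct rte_mbuf *m)
{
    uint32_t total = 0;
    const struct rte_mbuf *seg;

    for (seg = m; seg != NULL; seg = seg->next)  /* NULL ends the chain */
        total += seg->data_len;                  /* per-segment payload */
    return total;                                /* equals m->pkt_len */
}
```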
Barriers for synchronization
When working on multiple cores it is important to be able to send a synchronization signal from one core to another. For this purpose a simple flag (or an array of flags) can be used, accompanied by a read/write memory barrier, i.e. “rte_mb”, which ensures that all threads have an up-to-date view of memory.
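A minimal sketch of this pattern (flag array, function names and the loop body are illustrative; requires DPDK headers):

```c
#include <rte_atomic.h>  /* rte_mb() */
#include <rte_lcore.h>   /* rte_lcore_id(), RTE_MAX_LCORE */

/* One stop flag per lcore, written by the master, polled by workers. */
static volatile uint8_t stop_flag[RTE_MAX_LCORE];

static void signal_stop(unsigned lcore_id)
{
    stop_flag[lcore_id] = 1;
    rte_mb();               /* publish the store to all cores */
}

/* Worker body, suitable for rte_eal_remote_launch. */
static int worker_loop(void *arg)
{
    unsigned self = rte_lcore_id();
    (void)arg;

    while (!stop_flag[self]) {
        /* ... process packets ... */
        rte_mb();           /* refresh the memory view before re-checking */
    }
    return 0;
}
```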
The following guide describes how to run the l2fwd app under the KVM hypervisor on an Intel chipset. The approach relies on the IOMMU feature of modern chipsets, which allows DMA between VM guests and hardware. The hypervisor's PCI pass-through feature is then used to bind a NIC interface directly to the VM.
Enable virtualization in BIOS
First of all, the Intel VT (VT-x) and VT-d features have to be enabled in the BIOS.
Enable virtualization in kernel
Then IOMMU has to be enabled in the Linux kernel by adding the “intel_iommu=on” option to the GRUB_CMDLINE_LINUX_DEFAULT variable in the /etc/default/grub file.
After that, the “update-grub” command has to be run to make the changes effective. Obviously a reboot is required as well.
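The resulting line could look like this (the “quiet splash” part is whatever your distribution already has there; only intel_iommu=on is added):

```
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on"
```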
Check Kernel support
IOMMU-related messages should appear in the system log:
dmesg | grep -e DMAR -e IOMMU
Install KVM
To install KVM and the utilities to manage it, execute the following steps.
- sudo apt-get install qemu-kvm libvirt-bin bridge-utils
- sudo apt-get install virt-install vncdisplay
- sudo adduser `id -un` kvm
- sudo adduser `id -un` libvirtd
- Logout and login again in UI and/or shell
Install VM guest
Download Ubuntu server image from the official site.
Install VM by issuing the following command.
sudo virt-install --connect qemu:///system -n vm2 -r 1024 --vcpus=3 --disk path=/var/lib/libvirt/images/vm2.img,size=12 -c /home/ubuntu/images/ubuntu-14.04.1-server-amd64.iso --noautoconsole --os-type linux --accelerate --network=bridge:virbr0 --hvm --graphics vnc,listen=0.0.0.0 --cpu host
Step through the visual installation by connecting with a VNC viewer on port 5900.
Passthrough of DPDK interfaces
On the host, find the PCI details (domain, bus, slot and function) of the Ethernet interfaces in the form XXXX:XX:XX.X using the “lspci -D | grep Eth” command. Then add “hostdev” entries under the “devices” section of the VM configuration.
$ virsh edit vm1
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1' slot='0x00' function='0x1'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x1' slot='0x00' function='0x2'/>
  </source>
</hostdev>
After starting the VM with “virsh start vm1”, update the network interface configuration in the file “/etc/network/interfaces” by enabling the newly created interfaces.
iface eth0 inet dhcp
iface eth1 inet dhcp
iface eth2 inet dhcp
As a result NIC interfaces will appear in VM and disappear from the host.
This application is called l2fwd and can be found in the examples folder of the DPDK distribution.
The idea behind this sample application is to capture traffic on one port and, after modifying its source and destination MAC addresses, send it out of the adjacent port. Each core is assigned to receive traffic on exactly one port.
The DPDK API calls used in the app can be grouped by purpose as follows:
- Global initialization
- Port (device) initialization
- Start threads
- RX/TX and modify the packets
It has to be noted that no RX/TX operations can be performed on a port before it is initialized.
First of all, initialization of the whole library has to be requested using rte_eal_init, provided with command-line parameters where the user can specify the number of CPU cores to be used and other requirements.
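In outline, that step looks like this (a sketch, not runnable outside a DPDK environment; rte_eal_init consumes its own arguments, such as the core mask, and returns how many it parsed):

```c
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>  /* rte_exit() */

int main(int argc, char **argv)
{
    int ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_exit(EXIT_FAILURE, "EAL initialization failed\n");

    /* the remaining argc/argv belong to the application itself */
    argc -= ret;
    argv += ret;

    /* ... application-specific setup ... */
    return 0;
}
```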
Secondly, memory pools and port queues have to be set up. TX and RX pools are created using rte_mempool_create, which is configured with the buffer size, ring depth, cache size and an optional NUMA socket parameter.
RX and TX queues are configured using rte_eth_rx_queue_setup and rte_eth_tx_queue_setup respectively.
An RX queue is provided with a pointer to the appropriate memory pool, from which it takes buffers for received packets, and a depth (number of RX descriptors) is chosen. Both queue types are configured with ring threshold registers.
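A condensed sketch of that setup (pool and queue sizes are illustrative, NULL queue configs mean driver defaults; assumes the older rte_mempool_create-style API and a DPDK build environment):

```c
#include <rte_ethdev.h>
#include <rte_mempool.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>  /* rte_socket_id() */

#define NB_MBUF   8192
#define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)
#define RX_DESC   128
#define TX_DESC   512

/* One mbuf pool feeding one RX and one TX queue on a port. */
static int setup_port(uint8_t port)
{
    struct rte_mempool *pool = rte_mempool_create(
        "mbuf_pool", NB_MBUF, MBUF_SIZE,
        32,                                   /* per-lcore cache size */
        sizeof(struct rte_pktmbuf_pool_private),
        rte_pktmbuf_pool_init, NULL,          /* pool-level init */
        rte_pktmbuf_init, NULL,               /* per-mbuf init */
        rte_socket_id(), 0);
    if (pool == NULL)
        return -1;

    if (rte_eth_rx_queue_setup(port, 0, RX_DESC,
                               rte_eth_dev_socket_id(port), NULL, pool) < 0)
        return -1;
    return rte_eth_tx_queue_setup(port, 0, TX_DESC,
                                  rte_eth_dev_socket_id(port), NULL);
}
```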
Thirdly, DPDK-enabled ports, called devices, are enumerated by probing the PCI bus using rte_eal_pci_probe. Afterwards rte_eth_dev_count returns the number of discovered devices. Ports are configured using rte_eth_dev_configure and started using rte_eth_dev_start.
The infinite loop responsible for forwarding traffic is started using rte_eal_remote_launch on every slave core, while the master core is responsible for gathering statistics.
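The launch step can be sketched like this (main_loop is a stand-in for the app's forwarding loop; requires DPDK headers):

```c
#include <rte_launch.h>  /* rte_eal_remote_launch, rte_eal_mp_wait_lcore */
#include <rte_lcore.h>   /* RTE_LCORE_FOREACH_SLAVE */

/* Per-lcore forwarding loop; runs until told to stop. */
static int main_loop(void *arg)
{
    (void)arg;
    /* ... receive, modify and transmit packets ... */
    return 0;
}

static void launch_workers(void)
{
    unsigned lcore_id;

    /* start the loop on every slave lcore, keep the master free */
    RTE_LCORE_FOREACH_SLAVE(lcore_id)
        rte_eal_remote_launch(main_loop, NULL, lcore_id);

    /* ... master core gathers statistics here ... */

    rte_eal_mp_wait_lcore();  /* join all slave lcores */
}
```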
RX/TX and modify the packets
Packets are captured not individually but in groups, called bursts, using rte_eth_rx_burst. Those packets are modified and combined in other bursts waiting for transmission.
As soon as a TX burst of packets is full it is sent out of the port using rte_eth_tx_burst.
Note that packet buffers do not have to be released after a successful send, while packets that could not be sent have to be released back to a memory pool.
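Put together, the forwarding step could look like this (queue ids and burst size are illustrative; a sketch for a DPDK environment, not the l2fwd source itself):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Receive a burst on one port and transmit it on the adjacent port. */
static void forward_burst(uint8_t rx_port, uint8_t tx_port)
{
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb_rx, nb_tx;

    nb_rx = rte_eth_rx_burst(rx_port, 0, pkts, BURST_SIZE);
    if (nb_rx == 0)
        return;

    /* ... rewrite source/destination MAC addresses here ... */

    nb_tx = rte_eth_tx_burst(tx_port, 0, pkts, nb_rx);

    /* sent mbufs are freed by the driver; unsent ones are ours to free */
    while (nb_tx < nb_rx)
        rte_pktmbuf_free(pkts[nb_tx++]);
}
```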
DPDK (Data Plane Development Kit) is a set of user-space libraries that improves packet processing speed on x86 platforms. It was created by Intel and made available as open source. Only a limited number of NICs is supported.
To achieve line rate with the smallest, 64-byte, packets, about 1.488 million packets per second have to be sent or received on a 1G interface and about 14.88 million on a 10G interface.
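These figures follow from simple arithmetic: on the wire every frame carries 20 extra bytes of overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), so a 64-byte frame occupies 84 bytes, i.e. 672 bits. A minimal check:

```c
#include <assert.h>

/* Packets per second at line rate for a given link speed and frame size.
 * Overhead per frame: 7-byte preamble + 1-byte SFD + 12-byte gap. */
static double wire_speed_pps(double link_bps, unsigned frame_bytes)
{
    const unsigned overhead = 7 + 1 + 12;
    return link_bps / ((frame_bytes + overhead) * 8.0);
}
/* wire_speed_pps(1e9, 64)  -> ~1.488 million pps (1G line rate)
 * wire_speed_pps(1e10, 64) -> ~14.88 million pps (10G line rate) */
```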
To boost the packet processing performance of a commodity server, the decision was taken to bypass the Linux kernel: all packet processing activity happens in user space.
DPDK utilizes the following techniques to achieve maximum throughput and minimal per-packet processing time.
- DMA directly to user-space shared memory;
- Polling instead of handling interrupts for each arrived packet;
- SSE instructions to copy big amounts of data efficiently;
- Hugepages to reduce TLB pressure (far fewer TLB entries are needed), which results in much faster virtual-to-physical address translation;
- Thread affinity to bind threads to a specific core to improve cache utilization;
- Lock-free user-space multi-core synchronization using rings;
- NUMA awareness to avoid expensive data transfers between sockets;
Besides that, the application developer is advised to avoid inter-core communication where possible and to utilize zero-copy methods to avoid thrashing CPU caches.
- DPDK overview from Intel
- Documentation on dpdk.org
- Wind River Network Acceleration Platform
- How many Packets per Second per port are needed to achieve Wire-Speed?