About Denys Haryachyy

Contractor and senior C/C++ embedded developer with over 10 years of experience. Proven ability to deliver high-quality software through effective team management and hands-on work, moving projects through the analysis, design, implementation, and maintenance stages of the software development life cycle. Available to work on DPDK and network-related projects.

Learning VPP: VXLAN tunnel

Overview

A VXLAN tunnel is an L2 overlay on top of an L3 network underlay. It uses UDP to traverse the network. The VXLAN frame looks as follows.

[Figure: VXLAN frame format]
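
For reference, the 8 bytes of VXLAN encapsulation that follow the outer UDP header (destination port 4789) can be sketched in C as below. The struct is illustrative, following RFC 7348; it is not the definition used in the VPP source.

/* Illustrative VXLAN header layout (RFC 7348), not the VPP definition. */
#include <stdint.h>

struct vxlan_header {
    uint8_t flags;        /* 0x08 when a valid VNI is present (I flag) */
    uint8_t reserved1[3];
    uint8_t vni[3];       /* 24-bit VXLAN Network Identifier, e.g. 13 */
    uint8_t reserved2;
};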

Setup

Two Ubuntu VMs running VPP 19.01 act as routers, and two more Ubuntu VMs represent the hosts.

[Figure: VXLAN setup]

VPP configuration

On each router, bridge domain 13 bridges the VXLAN tunnel (VNI 13) with a loopback interface that acts as the bridge virtual interface (BVI).

Router1

loopback create mac 1a:2b:3c:4d:5e:8f
create bridge-domain 13 learn 1 forward 1 uu-flood 1 flood 1 arp-term 0
create vxlan tunnel src 192.168.31.47 dst 192.168.31.76 vni 13
set interface l2 bridge vxlan_tunnel0 13 1
set interface l2 bridge loop0 13 bvi
set interface ip table loop0 0

Router2

loopback create mac 1a:2b:3c:4d:5e:7f
create bridge-domain 13 learn 1 forward 1 uu-flood 1 flood 1 arp-term 0
create vxlan tunnel src 192.168.31.76 dst 192.168.31.47 vni 13
set interface l2 bridge vxlan_tunnel0 13 1
set interface l2 bridge loop0 13 bvi
set interface ip table loop0 0
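
The resulting setup can then be inspected with VPP show commands, for example:

show vxlan tunnel
show bridge-domain 13 detail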

Results

Packet trace

00:03:43:444347: dpdk-input
  GigabitEthernet0/3/0 rx queue 0
  buffer 0x9d811: current data 0, length 148, buffer-pool 0, ref-count 1, totlen-nifb 0, trace handle 0x8
                  ext-hdr-valid
                  l4-cksum-computed l4-cksum-correct
  PKT MBUF: port 0, nb_segs 1, pkt_len 148
    buf_len 2176, data_len 148, ol_flags 0x0, data_off 128, phys_addr 0x8c1604c0
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  IP4: 08:00:27:68:d1:1e -> 08:00:27:5a:18:a5
  UDP: 192.168.31.47 -> 192.168.31.76
    tos 0x00, ttl 253, length 134, checksum 0xfd9a
    fragment id 0x0000
  UDP: 28591 -> 4789
    length 114, checksum 0x0000
00:03:43:444389: ethernet-input
  frame: flags 0x3, hw-if-index 1, sw-if-index 1
  IP4: 08:00:27:68:d1:1e -> 08:00:27:5a:18:a5
00:03:43:444399: ip4-input-no-checksum
  UDP: 192.168.31.47 -> 192.168.31.76
    tos 0x00, ttl 253, length 134, checksum 0xfd9a
    fragment id 0x0000
  UDP: 28591 -> 4789
    length 114, checksum 0x0000
00:03:43:444406: ip4-lookup
  fib 0 dpo-idx 5 flow hash: 0x00000000
  UDP: 192.168.31.47 -> 192.168.31.76
    tos 0x00, ttl 253, length 134, checksum 0xfd9a
    fragment id 0x0000
  UDP: 28591 -> 4789
    length 114, checksum 0x0000
00:03:43:444414: ip4-local
    UDP: 192.168.31.47 -> 192.168.31.76
      tos 0x00, ttl 253, length 134, checksum 0xfd9a
      fragment id 0x0000
    UDP: 28591 -> 4789
      length 114, checksum 0x0000
00:03:43:444418: ip4-udp-lookup
  UDP: src-port 28591 dst-port 4789
00:03:43:444423: vxlan4-input
  VXLAN decap from vxlan_tunnel0 vni 13 next 1 error 0
00:03:43:444429: l2-input
  l2-input: sw_if_index 4 dst 1a:2b:3c:4d:5e:7f src 1a:2b:3c:4d:5e:8f
00:03:43:444436: l2-learn
  l2-learn: sw_if_index 4 dst 1a:2b:3c:4d:5e:7f src 1a:2b:3c:4d:5e:8f bd_index 1
00:03:43:444441: l2-fwd
  l2-fwd: sw_if_index 4 dst 1a:2b:3c:4d:5e:7f src 1a:2b:3c:4d:5e:8f bd_index 1 result [0x700000003, 3] static age-not bvi
00:03:43:444446: ip4-input
  ICMP: 10.100.0.6 -> 20.20.20.1
    tos 0x00, ttl 64, length 84, checksum 0xabb7
    fragment id 0x5c73, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xd8d2
00:03:43:444449: ip4-lookup
  fib 0 dpo-idx 4 flow hash: 0x00000000
  ICMP: 10.100.0.6 -> 20.20.20.1
    tos 0x00, ttl 64, length 84, checksum 0xabb7
    fragment id 0x5c73, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xd8d2
00:03:43:444451: ip4-local
    ICMP: 10.100.0.6 -> 20.20.20.1
      tos 0x00, ttl 64, length 84, checksum 0xabb7
      fragment id 0x5c73, flags DONT_FRAGMENT
    ICMP echo_request checksum 0xd8d2

Node counters

Count  Node                         Reason
31     null-node                    blackholed packets
24     dpdk-input                   no error
9      ip4-udp-lookup               no error
4      ip4-input                    ip4 source lookup miss
267    ip4-input                    Multicast RPF check failed
1      ip4-arp                      ARP requests sent
281    vxlan4-input                 good packets decapsulated
357    vxlan4-encap                 good packets encapsulated
357    l2-output                    L2 output packets
281    l2-learn                     L2 learn packets
1      l2-learn                     L2 learn misses
638    l2-input                     L2 input packets
81     l2-flood                     L2 flood packets
33     GigabitEthernet0/3/0-output  interface is down
35     GigabitEthernet0/8/0-output  interface is down

Learning DPDK : Huge pages

Intro

Modern CPUs support multiple page sizes, e.g. 4K, 2M, and 1G. In Linux, every page size except the base 4K is called a “huge page”. The naming is historical and stems from the fact that Linux originally supported only the 4K page size.

Larger pages benefit performance because fewer virtual-to-physical address translations are needed, and the Translation Lookaside Buffer (TLB), which caches those translations, is a scarce resource. For example, 1536 L2 TLB entries cover only 6 MB of memory with 4K pages, but 3 GB with 2M pages.

To check the TLB sizes, the cpuid utility can be used.

cpuid | grep -i tlb
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb6: instruction TLB: 4K, 8-way, 128 entries
0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries

To check the number of allocated huge pages the following command can be used.

cat /proc/meminfo | grep Huge
AnonHugePages: 4409344 kB
HugePages_Total: 32
HugePages_Free: 32
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB

There are two types of huge pages available in Linux.

  • Transparent (Anonymous) huge pages
  • Persistent huge pages

Transparent huge pages

Transparent huge pages are an abstraction layer that automates most aspects of creating, managing, and using huge pages. Because of past performance and stability issues with this mechanism, DPDK does not rely on it and uses persistent huge pages instead.

Persistent huge pages

Persistent huge pages have to be configured manually. Persistent huge pages are never swapped by the Linux kernel.

The following management interfaces exist in Linux for allocating persistent huge pages.

  • Shared memory using shmget()
  • HugeTLBFS is a RAM-based filesystem; mmap(), read(), or memfd_create() can be used to access its files
  • Anonymous mmap() with the MAP_ANONYMOUS and MAP_HUGETLB flags
  • libhugetlbfs APIs
  • Automatic backing of memory regions

Persistent huge pages are used by DPDK by default: mount points are discovered automatically, and pages are released once the application exits. If a user needs to tune something manually, the following EAL command-line parameters can be used.

  • --huge-dir Use specified hugetlbfs directory instead of autodetected ones.
  • --huge-unlink Unlink huge page files after creating them (implies no secondary process support).
  • --in-memory Do not rely on hugetlbfs at all; recent DPDK versions added this option

There are multiple ways to set up persistent huge pages.

  • At boot time
  • At runtime

At boot time

Modify the Linux boot parameters in /etc/default/grub. Huge pages will be spread equally across all NUMA nodes.
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=32"

Update the grub configuration file and reboot.

grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

Create a directory for a permanent hugetlbfs mount point.

mkdir /mnt/huge

Add the following line to the /etc/fstab file:
nodev /mnt/huge hugetlbfs defaults 0 0

At runtime

Update the number of huge pages for each NUMA node. The default huge page size cannot be changed at runtime.
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Create a mount point.

mkdir /mnt/huge

Mount hugetlbfs 
mount -t hugetlbfs nodev /mnt/huge

Memory allocation

While there are many ways to allocate persistent huge pages, DPDK uses the following (a sketch of the anonymous-mmap() variant follows the list).

  • mmap() call with hugetlbfs mount point
  • mmap() call with MAP_HUGETLB flag
  • memfd_create() call with MFD_HUGETLB flag
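
As an illustration, a minimal sketch of the anonymous-mmap() variant might look as follows; it assumes 2M huge pages have already been reserved via nr_hugepages.

/* Minimal sketch: map one 2M huge page via anonymous mmap().
 * Assumes nr_hugepages for the 2M size is already configured. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE_SZ (2UL * 1024 * 1024)

int main(void)
{
    void *p = mmap(NULL, HUGE_PAGE_SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    *(volatile char *)p = 0; /* touch the page so it is actually faulted in */
    munmap(p, HUGE_PAGE_SZ);
    return 0;
}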

Learning DPDK : Capture to Kafka

Requirements

The requirements are as follows.

  • Capture 10 Gbps of 128 B packets into Apache Kafka
  • Implement basic filtering using IP addresses
  • Save the traffic into one Kafka topic

Hardware

Software

Solution

The following key design ideas helped to achieve a 5 Gbps capture speed per server (a sketch of the batching idea follows the list).

  • Use XFS filesystem
  • Combine small packets into big Kafka messages, 500 KB each
  • Run 4 Kafka brokers on one physical server simultaneously
  • Allocate 20 partitions per topic
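
A minimal sketch of the batching idea, assuming librdkafka as the producer library; the broker address and topic name below are placeholders.

/* Sketch: accumulate small packets into a ~500 KB buffer and produce the
 * buffer to Kafka as a single message. */
#include <string.h>
#include <librdkafka/rdkafka.h>

#define BATCH_SZ (500 * 1024)

static char batch[BATCH_SZ];
static size_t batch_len;

static void flush_batch(rd_kafka_topic_t *rkt)
{
    if (batch_len == 0)
        return;
    /* RD_KAFKA_MSG_F_COPY: librdkafka copies the payload, so the
     * batch buffer can be reused immediately. */
    rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_COPY,
                     batch, batch_len, NULL, 0, NULL);
    batch_len = 0;
}

static void on_packet(rd_kafka_topic_t *rkt, const void *pkt, size_t len)
{
    if (batch_len + len > BATCH_SZ)
        flush_batch(rkt);
    memcpy(batch + batch_len, pkt, len);
    batch_len += len;
}

int main(void)
{
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();
    rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                      errstr, sizeof(errstr));
    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                  errstr, sizeof(errstr));
    rd_kafka_topic_t *rkt = rd_kafka_topic_new(rk, "capture", NULL);

    char pkt[128] = { 0 };
    for (int i = 0; i < 10000; i++) /* stand-in for the capture loop */
        on_packet(rkt, pkt, sizeof(pkt));

    flush_batch(rkt);
    rd_kafka_flush(rk, 5000);
    rd_kafka_topic_destroy(rkt);
    rd_kafka_destroy(rk);
    return 0;
}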

Conclusion

The decision was made to use two servers in order to capture the full 10 Gbps of traffic.

Learning DPDK : Symmetric RSS

[Figure: RSS diagram]

Overview

Receive side scaling (RSS) is a technology that distributes received packets between multiple RX queues using a predefined hash function. It enables a multicore CPU to process packets from different queues on different cores.

Symmetric RSS promises to deliver packets from both directions of the same TCP connection to the same RX queue. As a result, per-connection statistics can be kept in per-queue data structures, avoiding any need for locking, as sketched below.
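
A minimal sketch of that design (not taken from any particular project): each worker core owns one RX queue and a private statistics table. update_stats() is a hypothetical helper.

/* With symmetric RSS, both directions of a connection arrive on the same
 * queue, so the per-queue statistics need no locking. */
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SZ 32

static void update_stats(uint16_t queue_id, const struct rte_mbuf *m)
{
    /* hypothetical: parse the 5-tuple and bump per-queue counters */
    (void)queue_id; (void)m;
}

static int worker_main(void *arg)
{
    const uint16_t port_id = 0;
    const uint16_t queue_id = (uint16_t)(uintptr_t)arg;
    struct rte_mbuf *pkts[BURST_SZ];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SZ);
        for (uint16_t i = 0; i < n; i++) {
            update_stats(queue_id, pkts[i]);
            rte_pktmbuf_free(pkts[i]);
        }
    }
    return 0;
}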

System

Recently I had a chance to test symmetric RSS on two Intel NICs, namely the XL710 40G and the 82599 10G.

The approach to configuring symmetric RSS on the XL710 differs from the standard DPDK approach: the i40e driver offers a specific API for this purpose.

DPDK 18.05.1 was used for testing.

82599 solution

To make symmetric RSS work, the default hash key has to be replaced with a custom one. Repeating the 16-bit pattern 0x6D5A makes the Toeplitz hash produce the same value when source and destination are swapped.


#define RSS_HASH_KEY_LENGTH 40

static uint8_t hash_key[RSS_HASH_KEY_LENGTH] = {
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
    0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A, 0x6D, 0x5A,
};

struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = hash_key,
            .rss_key_len = RSS_HASH_KEY_LENGTH,
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

XL710 solution

To enable symmetric RSS, the i40e driver provides an API that sets up hardware registers.


struct rte_eth_conf port_conf = {
    .rxmode = {
        .mq_mode = ETH_MQ_RX_RSS,
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_hf = ETH_RSS_IP |
                      ETH_RSS_TCP |
                      ETH_RSS_UDP |
                      ETH_RSS_SCTP,
        }
    },
};

rte_eth_dev_configure(port_id, rx_queue_num, tx_queue_num, &port_conf);

int sym_hash_enable(int port_id, uint32_t ftype, enum rte_eth_hash_function function)
{
    struct rte_eth_hash_filter_info info;
    int ret = 0;
    uint32_t idx = 0;
    uint32_t offset = 0;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_GLOBAL_CONFIG;
    info.info.global_conf.hash_func = function;

    idx = ftype / UINT64_BIT;
    offset = ftype % UINT64_BIT;
    info.info.global_conf.valid_bit_mask[idx] |= (1ULL << offset);
    info.info.global_conf.sym_hash_enable_mask[idx] |= (1ULL << offset);

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set global hash configurations "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

int sym_hash_set(int port_id, int enable)
{
    int ret = 0;
    struct rte_eth_hash_filter_info info;

    memset(&info, 0, sizeof(info));

    ret = rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_HASH);
    if (ret < 0) {
        DPDK_ERROR("RTE_ETH_FILTER_HASH not supported on port: %d",
                   port_id);
        return ret;
    }

    info.info_type = RTE_ETH_HASH_FILTER_SYM_HASH_ENA_PER_PORT;
    info.info.enable = enable;

    ret = rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
                                  RTE_ETH_FILTER_SET, &info);
    if (ret < 0) {
        DPDK_ERROR("Cannot set symmetric hash enable per port "
                   "on port %u", port_id);
        return ret;
    }

    return 0;
}

sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_TCP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_UDP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_FRAG_IPV4, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_SCTP, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_enable(port_id, RTE_ETH_FLOW_NONFRAG_IPV4_OTHER, RTE_ETH_HASH_FUNCTION_TOEPLITZ);
sym_hash_set(port_id, 1);

Learning VPP: Internet access

Overview

The goal is to provide internet access for a network namespace through VPP.

To achieve this, we set up routing and NAT on the host. In addition, we use VPP's proxy ARP feature.

Build and run

First, build and run VPP as described in a previous post.

make run STARTUP_CONF=startup.conf

Setup

To set up the network namespace, routing, NAT, and proxy ARP, run the following script.


#!/bin/bash
PATH=$PATH:./build-root/build-vpp-native/vpp/bin/
if [ $USER != "root" ] ; then
    echo "Restarting script with sudo…"
    sudo $0 ${*}
    exit
fi
# delete previous incarnations if they exist
ip link del dev vpp1
ip link del dev vpp2
ip netns del vpp1
#create namespaces
ip netns add vpp1
# create and configure 1st veth pair
ip link add name veth_vpp1 type veth peer name vpp1
ip link set dev vpp1 up
ip link set dev veth_vpp1 up netns vpp1
ip netns exec vpp1 \
    bash -c "
        ip link set dev lo up
        ip addr add 172.16.1.2/24 dev veth_vpp1
        ip route add 172.16.2.0/24 via 172.16.1.1
        ip route add default via 172.16.1.1
    "
# create and configure 2nd veth pair
ip link add name veth_vpp2 type veth peer name vpp2
ip link set dev vpp2 up
ip addr add 172.16.2.2/24 dev veth_vpp2
ip link set dev veth_vpp2 up
ip route add 172.16.1.0/24 via 172.16.2.2
# configure VPP
vppctl create host-interface name vpp1
vppctl create host-interface name vpp2
vppctl set int state host-vpp1 up
vppctl set int state host-vpp2 up
vppctl set int ip address host-vpp1 172.16.1.1/24
vppctl set int ip address host-vpp2 172.16.2.1/24
vppctl ip route add 172.16.1.0/24 via 172.16.1.1 host-vpp1
vppctl ip route add 172.16.2.0/24 via 172.16.2.1 host-vpp2
vppctl ip route add 0.0.0.0/0 via 172.16.2.2 host-vpp2
vppctl set interface proxy-arp host-vpp2 enable
vppctl set ip arp proxy 172.16.1.1 - 172.16.1.2
# Enable IP-forwarding.
echo 1 > /proc/sys/net/ipv4/ip_forward
# Flush forward rules.
iptables -P FORWARD DROP
iptables -F FORWARD
# Flush nat rules.
iptables -t nat -F
# Enable NAT masquerading
iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
iptables -A FORWARD -i wlan0 -o veth_vpp2 -j ACCEPT
iptables -A FORWARD -o wlan0 -i veth_vpp2 -j ACCEPT

Results

Now we can access the internet from the vpp1 network namespace.

sudo ip netns exec vpp1 ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=115 time=73.5 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=115 time=139 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=115 time=35.3 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=115 time=36.6 ms

Also, VPP itself has access to the internet.

DBGvpp# ping 8.8.8.8
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=53.7913 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=118 time=35.3645 ms
Aborted due to a keypress.

Statistics: 2 sent, 2 received, 0% packet loss

We can also reach the Google web site.

sudo ip netns exec vpp1 curl www.google.com

And trace HTTP packets inside VPP.

DBGvpp# trace add af-packet-input 1000
DBGvpp# show trace
...
Packet 37

00:18:08:629788: af-packet-input
  af_packet: hw_if_index 1 next-index 4
    tpacket2_hdr:
      status 0x9 len 54 snaplen 54 mac 66 net 80
      sec 0x5b7e7b58 nsec 0x157af968 vlan 0 vlan_tpid 0
00:18:08:629839: ethernet-input
  IP4: 6e:25:1b:a7:11:05 -> 02:fe:13:61:29:4b
00:18:08:629865: ip4-input
  TCP: 172.16.1.2 -> 173.194.221.103
    tos 0x00, ttl 64, length 40, checksum 0xd927
    fragment id 0x296c, flags DONT_FRAGMENT
  TCP: 51480 -> 80
    seq. 0xeec90cb7 ack 0x5287d28f
    flags 0x10 ACK, tcp header: 20 bytes
    window 457, checksum 0x0000
00:18:08:629889: ip4-lookup
  fib 0 dpo-idx 4 flow hash: 0x00000000
  TCP: 172.16.1.2 -> 173.194.221.103
    tos 0x00, ttl 64, length 40, checksum 0xd927
    fragment id 0x296c, flags DONT_FRAGMENT
  TCP: 51480 -> 80
    seq. 0xeec90cb7 ack 0x5287d28f
    flags 0x10 ACK, tcp header: 20 bytes
    window 457, checksum 0x0000
00:18:08:629913: ip4-rewrite
  tx_sw_if_index 2 dpo-idx 4 : ipv4 via 172.16.2.2 host-vpp2: mtu:9000 a60ae99593be02fe094ec8700800 flow hash: 0x00000000
  00000000: a60ae99593be02fe094ec870080045000028296c40003f06da27ac100102adc2
  00000020: dd67c9180050eec90cb75287d28f501001c900000000000000000000
00:18:08:629940: host-vpp2-output
  host-vpp2
  IP4: 02:fe:09:4e:c8:70 -> a6:0a:e9:95:93:be
  TCP: 172.16.1.2 -> 173.194.221.103
    tos 0x00, ttl 63, length 40, checksum 0xda27
    fragment id 0x296c, flags DONT_FRAGMENT
  TCP: 51480 -> 80
    seq. 0xeec90cb7 ack 0x5287d28f
    flags 0x10 ACK, tcp header: 20 bytes
    window 457, checksum 0x0000

Learning VPP: Flow table

Overview

The task is to implement stateful machinery to track TCP connections. To achieve this goal, the following functional pieces are required.

  • A bidirectional hash table to match packets against a flow using 5-tuple;
  • A pool with flow entries data structures;
  • A timer wheel to expire stale flows;
  • TCP state machine tracking to clean up closed TCP connections.

Example

One possible implementation can be found here.

1. A bidirectional hash table.

The requirement is to match packets of one flow to the same flow entry regardless of their direction. By ordering the source and destination addresses before hashing, we always compute the same hash for both directions, as in the snippet below.

if (ip4_address_compare(&ip4->src_address, &ip4->dst_address) < 0)
{
    ip4_sig->src = ip4->src_address;
    ip4_sig->dst = ip4->dst_address;
    *is_reverse = 1;
}
else
{
    ip4_sig->src = ip4->dst_address;
    ip4_sig->dst = ip4->src_address;
}

2. A flow table cache.

To speed up flow entry allocation, a flow table cache is used. The idea is to batch allocations: 256 flow entries are allocated from the pool at once and their pool indices are stored in a vector. When a new flow entry is needed, it is taken from this preallocated cache.

always_inline void
flow_entry_cache_fill(flowtable_main_t * fm, flowtable_main_per_cpu_t * fmt)
{
    int i;
    flow_entry_t * f;

    if (pthread_spin_lock(&fm->flows_lock) == 0)
    {
        if (PREDICT_FALSE(fm->flows_cpt > fm->flows_max)) {
            pthread_spin_unlock(&fm->flows_lock);
            return;
        }

        for (i = 0; i < FLOW_CACHE_SZ; i++) {
            pool_get_aligned(fm->flows, f, CLIB_CACHE_LINE_BYTES);
            vec_add1(fmt->flow_cache, f - fm->flows);
        }
        fm->flows_cpt += FLOW_CACHE_SZ;

        pthread_spin_unlock(&fm->flows_lock);
    }
}

3. The timer wheel.

To expire stale flow entries, timers organized in a so-called timer wheel are used.

static u64
flowtable_timer_expire(flowtable_main_t * fm, flowtable_main_per_cpu_t * fmt,
    u32 now)
{
    u64 expire_cpt;
    flow_entry_t * f;
    u32 * time_slot_curr_index;
    dlist_elt_t * time_slot_curr;
    u32 index;

    time_slot_curr_index = vec_elt_at_index(fmt->timer_wheel, fmt->time_index);

    if (PREDICT_FALSE(dlist_is_empty(fmt->timers, *time_slot_curr_index)))
        return 0;

    expire_cpt = 0;
    time_slot_curr = pool_elt_at_index(fmt->timers, *time_slot_curr_index);

    index = time_slot_curr->next;
    while (index != *time_slot_curr_index && expire_cpt < TIMER_MAX_EXPIRE)
    {
        dlist_elt_t * e = pool_elt_at_index(fmt->timers, index);
        f = pool_elt_at_index(fm->flows, e->value);

        index = e->next;
        expire_single_flow(fm, fmt, f, e);
        expire_cpt++;
    }

    return expire_cpt;
}

4. Flow entries recycling.

When there are no more available flow entries in the pool, a mechanism recycles the oldest entries.

static void
recycle_flow(flowtable_main_t * fm, flowtable_main_per_cpu_t * fmt, u32 now)
{
    u32 next;

    next = (now + 1) % TIMER_MAX_LIFETIME;
    while (PREDICT_FALSE(next != now))
    {
        flow_entry_t * f;
        u32 * slot_index = vec_elt_at_index(fmt->timer_wheel, next);

        if (PREDICT_FALSE(dlist_is_empty(fmt->timers, *slot_index))) {
            next = (next + 1) % TIMER_MAX_LIFETIME;
            continue;
        }
        dlist_elt_t * head = pool_elt_at_index(fmt->timers, *slot_index);
        dlist_elt_t * e = pool_elt_at_index(fmt->timers, head->next);

        f = pool_elt_at_index(fm->flows, e->value);
        return expire_single_flow(fm, fmt, f, e);
    }

    /*
     * unreachable:
     * this should be called if there is no free flows, so we're bound to have
     * at least *one* flow within the timer wheel (cpu cache is filled at init).
     */
    clib_error("recycle_flow did not find any flow to recycle !");
}

5. Bucket list.

The bidirectional hash algorithm used to look up a 5-tuple can produce the same signature for different tuples. To overcome this, a list of entries is attached to each hash bucket.

clib_dlist_addhead(fmt->ht_lines, ht_line_head_index, f->ht_index);

6. TCP state machine.

TCP connection states are tracked in order to control flow lifetime.

static const tcp_state_t tcp_trans[TCP_STATE_MAX][TCP_EV_MAX] =
{
    [TCP_STATE_START] = {
        [TCP_EV_SYN]    = TCP_STATE_SYN,
        [TCP_EV_SYNACK] = TCP_STATE_SYNACK,
        [TCP_EV_FIN]    = TCP_STATE_FIN,
        [TCP_EV_FINACK] = TCP_STATE_FINACK,
        [TCP_EV_RST]    = TCP_STATE_RST,
        [TCP_EV_NONE]   = TCP_STATE_ESTABLISHED,
    },
    [TCP_STATE_SYN] = {
        [TCP_EV_SYNACK] = TCP_STATE_SYNACK,
        [TCP_EV_PSHACK] = TCP_STATE_ESTABLISHED,
        [TCP_EV_FIN]    = TCP_STATE_FIN,
        [TCP_EV_FINACK] = TCP_STATE_FINACK,
        [TCP_EV_RST]    = TCP_STATE_RST,
    },
    [TCP_STATE_SYNACK] = {
        [TCP_EV_PSHACK] = TCP_STATE_ESTABLISHED,
        [TCP_EV_FIN]    = TCP_STATE_FIN,
        [TCP_EV_FINACK] = TCP_STATE_FINACK,
        [TCP_EV_RST]    = TCP_STATE_RST,
    },
    [TCP_STATE_ESTABLISHED] = {
        [TCP_EV_FIN]    = TCP_STATE_FIN,
        [TCP_EV_FINACK] = TCP_STATE_FINACK,
        [TCP_EV_RST]    = TCP_STATE_RST,
    },
    [TCP_STATE_FIN] = {
        [TCP_EV_FINACK] = TCP_STATE_FINACK,
        [TCP_EV_RST]    = TCP_STATE_RST,
    },
    [TCP_STATE_FINACK] = {
        [TCP_EV_RST]    = TCP_STATE_RST,
    },
};

7. TCP lifetime.

A TCP flow entry has a different expiration time depending on the connection's life stage.

static const int tcp_lifetime[TCP_STATE_MAX] =
{
    [TCP_STATE_START]       = 60,
    [TCP_STATE_SYN]         = 15,
    [TCP_STATE_SYNACK]      = 60,
    [TCP_STATE_ESTABLISHED] = 299,
    [TCP_STATE_FIN]         = 15,
    [TCP_STATE_FINACK]      = 3,
    [TCP_STATE_RST]         = 6
};
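
A hedged sketch of how the two tables combine; the names follow the earlier snippets, except f->expire, which is a hypothetical field.

/* Illustrative only: advance the TCP state on an event and reschedule
 * the flow's expiration according to the new state's lifetime. */
f->tcp_state = tcp_trans[f->tcp_state][ev];
f->lifetime = tcp_lifetime[f->tcp_state];
/* place the flow into the wheel slot `lifetime` ticks ahead of now */
f->expire = (now + f->lifetime) % TIMER_MAX_LIFETIME;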

Learning VPP: Packets tracing

Overview

There are multiple ways to run VPP on your laptop: on the host Linux, in a VM, or in a Docker container.

Besides DPDK interfaces, VPP supports lower-performance but very handy interface types that can connect to network namespaces: veth (a host interface in VPP) and TAP interfaces.

Build and run

To test traffic through VPP installed on the host Linux, two network namespaces are created to emulate external host machines. Packets then enter and leave VPP through either TAP or veth interfaces.

Now, build and run VPP as described in a previous post.

make run STARTUP_CONF=startup.conf

Virtual network over TAPs

To set up the namespaces, TAPs, and a bridge, run the following script.


#!/bin/bash
./build-root/build-vpp-native/vpp/bin/vppctl tap connect vpp1
./build-root/build-vpp-native/vpp/bin/vppctl tap connect vpp2
./build-root/build-vpp-native/vpp/bin/vppctl set interface state tapcli-0 up
./build-root/build-vpp-native/vpp/bin/vppctl set interface state tapcli-1 up
ip netns delete vpp1
ip netns delete vpp2
ip netns add vpp1
ip netns add vpp2
ip link set dev vpp1 netns vpp1
ip link set dev vpp2 netns vpp2
ip netns exec vpp1 ip link set vpp1 up
ip netns exec vpp2 ip link set vpp2 up
ip netns exec vpp1 ip addr add 192.168.0.1/24 dev vpp1
ip netns exec vpp2 ip addr add 192.168.0.2/24 dev vpp2
./build-root/build-vpp-native/vpp/bin/vppctl set interface l2 bridge tapcli-0 23
./build-root/build-vpp-native/vpp/bin/vppctl set interface l2 bridge tapcli-1 23

Tracing packets

The commands below can be used to test the VPP-based bridge.

ip netns exec vpp1 ping -c1 192.168.0.2
ip netns exec vpp2 ping -c1 192.168.0.1

To see packets inside VPP, the trace feature has to be enabled beforehand.

DBGvpp# trace add tapcli-rx 8

Then, to see how a packet traversed the VPP graph, use the following command.

DBGvpp# show trace

------------------- Start of thread 0 vpp_main -------------------
Packet 1

00:50:54:290610: tapcli-rx
  tapcli-0
00:50:54:377068: ethernet-input
  IP4: 12:77:2b:e0:b9:81 -> c2:12:c9:0d:80:23
00:50:54:406116: l2-input
  l2-input: sw_if_index 1 dst c2:12:c9:0d:80:23 src 12:77:2b:e0:b9:81
00:50:54:414204: l2-learn
  l2-learn: sw_if_index 1 dst c2:12:c9:0d:80:23 src 12:77:2b:e0:b9:81 bd_index 1
00:50:54:414940: l2-fwd
  l2-fwd: sw_if_index 1 dst c2:12:c9:0d:80:23 src 12:77:2b:e0:b9:81 bd_index 1
00:50:54:415656: l2-output
  l2-output: sw_if_index 2 dst c2:12:c9:0d:80:23 src 12:77:2b:e0:b9:81 data 08 00 45 00 00 54 2a 1a 40 00 40 01
00:50:54:415697: tapcli-1-output
  tapcli-1
  IP4: 12:77:2b:e0:b9:81 -> c2:12:c9:0d:80:23
  ICMP: 192.168.0.1 -> 192.168.0.2
    tos 0x00, ttl 64, length 84, checksum 0x8f3b
    fragment id 0x2a1a, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xde15

Virtual network over veth pair

To set up the namespaces and veth pairs, run the following script.


#!/bin/bash
PATH=$PATH:./build-root/build-vpp-native/vpp/bin/
if [ $USER != "root" ] ; then
    echo "Restarting script with sudo…"
    sudo $0 ${*}
    exit
fi
# delete previous incarnations if they exist
ip link del dev vpp1
ip link del dev vpp2
ip netns del vpp1
ip netns del vpp2
#create namespaces
ip netns add vpp1
ip netns add vpp2
# create and configure 1st veth pair
ip link add name veth_vpp1 type veth peer name vpp1
ip link set dev vpp1 up
ip link set dev veth_vpp1 up netns vpp1
ip netns exec vpp1 \
    bash -c "
        ip link set dev lo up
        ip addr add 172.16.1.2/24 dev veth_vpp1
        ip route add 172.16.2.0/24 via 172.16.1.1
    "
# create and configure 2nd veth pair
ip link add name veth_vpp2 type veth peer name vpp2
ip link set dev vpp2 up
ip link set dev veth_vpp2 up netns vpp2
ip netns exec vpp2 \
    bash -c "
        ip link set dev lo up
        ip addr add 172.16.2.2/24 dev veth_vpp2
        ip route add 172.16.1.0/24 via 172.16.2.1
    "
vppctl create host-interface name vpp1
vppctl create host-interface name vpp2
vppctl set int state host-vpp1 up
vppctl set int state host-vpp2 up
vppctl set int ip address host-vpp1 172.16.1.1/24
vppctl set int ip address host-vpp2 172.16.2.1/24
vppctl ip route add 172.16.1.0/24 via 172.16.1.1 host-vpp1
vppctl ip route add 172.16.2.0/24 via 172.16.2.1 host-vpp2

Tracing packets

The command below can be used to test the VPP-based router.

ip netns exec vpp1 ping 172.16.2.1 -c 1

To see packets inside VPP, the trace feature has to be enabled beforehand.

DBGvpp# trace add af-packet-input 8

Then, to see how a packet traversed the VPP graph, use the following command.

DBGvpp# show trace
------------------- Start of thread 0 vpp_main -------------------
Packet 1

00:02:26:500404: af-packet-input
  af_packet: hw_if_index 1 next-index 4
    tpacket2_hdr:
      status 0x20000001 len 98 snaplen 98 mac 66 net 80
      sec 0x5b7a7435 nsec 0x2ed6d440 vlan 0 vlan_tpid 0
00:02:26:500486: ethernet-input
  IP4: b6:7b:f1:64:fe:8c -> 02:fe:9e:f6:c1:8f
00:02:26:500501: ip4-input
  ICMP: 172.16.1.2 -> 172.16.2.1
    tos 0x00, ttl 64, length 84, checksum 0xeaf8
    fragment id 0xf48c, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xdbe0
00:02:26:500509: ip4-lookup
  fib 0 dpo-idx 8 flow hash: 0x00000000
  ICMP: 172.16.1.2 -> 172.16.2.1
    tos 0x00, ttl 64, length 84, checksum 0xeaf8
    fragment id 0xf48c, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xdbe0
00:02:26:500523: ip4-local
    ICMP: 172.16.1.2 -> 172.16.2.1
      tos 0x00, ttl 64, length 84, checksum 0xeaf8
      fragment id 0xf48c, flags DONT_FRAGMENT
    ICMP echo_request checksum 0xdbe0
00:02:26:500529: ip4-icmp-input
  ICMP: 172.16.1.2 -> 172.16.2.1
    tos 0x00, ttl 64, length 84, checksum 0xeaf8
    fragment id 0xf48c, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xdbe0
00:02:26:500533: ip4-icmp-echo-request
  ICMP: 172.16.1.2 -> 172.16.2.1
    tos 0x00, ttl 64, length 84, checksum 0xeaf8
    fragment id 0xf48c, flags DONT_FRAGMENT
  ICMP echo_request checksum 0xdbe0
00:02:26:500540: ip4-load-balance
  fib 0 dpo-idx 17 flow hash: 0x00000000
  ICMP: 172.16.2.1 -> 172.16.1.2
    tos 0x00, ttl 64, length 84, checksum 0x8e73
    fragment id 0x5112, flags DONT_FRAGMENT
  ICMP echo_reply checksum 0xe3e0
00:02:26:500543: ip4-rewrite
  tx_sw_if_index 1 dpo-idx 2 : ipv4 via 172.16.1.2 host-vpp1: mtu:9000 b67bf164fe8c02fe9ef6c18f0800 flow hash: 0x00000000
  00000000: b67bf164fe8c02fe9ef6c18f0800450000545112400040018e73ac100201ac10
  00000020: 01020000e3e0167e000135747a5b000000008bfd0b00000000001011
00:02:26:500550: host-vpp1-output
  host-vpp1
  IP4: 02:fe:9e:f6:c1:8f -> b6:7b:f1:64:fe:8c
  ICMP: 172.16.2.1 -> 172.16.1.2
    tos 0x00, ttl 64, length 84, checksum 0x8e73
    fragment id 0x5112, flags DONT_FRAGMENT
  ICMP echo_reply checksum 0xe3e0

Learning VPP: Code style

Overview

It is important that an open-source or proprietary product follows strict code style guidelines. It helps people involved in the project to understand, extend, and maintain the codebase far more comfortably. It also makes the code listings look neat and clean.

VPP is a good example of this: it has a strict, well-defined code style based on the GNU Coding Standards.

Hints

For example, indentation looks as follows.

if (1)
  {
  }

A routine declaration looks as follows.

static int
vnet_test_add_del (u8 * name, u32 index, u8 add);

A function call looks as follows.

vnet_feature_enable_disable ("device-input", "test",
			     sw_if_index, enable_disable, 0, 0);

To verify that your code follows the style, run the following command.

make checkstyle

VPP developers also provide the rules as a clang-format file that different tools and IDEs can use to enforce the formatting.

Learning VPP: Hash and pool

Overview

VPP implements a pool for fixed-size objects by combining two data structures, namely a vector and a bitmap.

VPP has multiple hash implementations. The most basic one is defined in the hash.h file. It is mostly used in the control plane; a string can serve as the key and a pool index as the value.

The following example illustrates how to use the aforementioned data structures together to build a hash table.

Example

1. Definition

typedef struct {
    u8 *name;
} test_t;

test_t *pool;
uword *hash;

2. Initialization.

hash = hash_create_vec (32, sizeof (u8), sizeof (uword));

3. Add element

test_t *test = NULL;
pool_get (pool, test);
memset(test, 0, sizeof(*test));
hash_set_mem (hash, name, test - pool);

4. Get element

uword *p = NULL;
test_t *test = NULL;
p = hash_get_mem (hash, name);
if (p) {
    test = pool_elt_at_index (pool, p[0]);
}

5. Delete element

uword *p = NULL;
test_t *test = NULL;
p = hash_get_mem (hash, name);
if (p) {
    hash_unset_mem (hash, name);
    test = pool_elt_at_index (pool, p[0]);
    pool_put (pool, test);
}

6. Iteration

u8 *name = NULL;
u32 index = 0;
/* *INDENT-OFF* */
hash_foreach(name, index, hash,
({
   test_t *test = NULL;
   test = pool_elt_at_index(pool, index);
}));
/* *INDENT-ON* */
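
7. Cleanup

When the table is no longer needed, both structures can be freed. A minimal sketch using the standard VPP macros:

hash_free (hash);
pool_free (pool);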

Learning VPP: CLI

Overview

VPP's preferred interface is a binary API, used by northbound control plane applications like Honeycomb.

But for debugging and related purposes, VPP includes a CLI engine that is very convenient from both the user and the developer perspective.

To add a new command, it is required to register its path, help string, and handler routine.

Example

1. A registration.

VLIB_CLI_COMMAND (test_create_command, static) =
{
    .path = "create test",
    .short_help = "create test name <string>",
    .function = test_create_command_fn,
};

2. A handler.

static clib_error_t *
test_create_command_fn (vlib_main_t * vm,
                        unformat_input_t * input,
                        vlib_cli_command_t * cmd)
{
  unformat_input_t _line_input, *line_input = &_line_input;
  u8 *name = NULL;
  if (unformat_user (input, unformat_line_input, line_input))
  {
    while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
    {
      if (unformat (line_input, "name %s", &name))
        ; /* matched and consumed: name now holds the argument */
      else
      {
        unformat_free (line_input);
        return clib_error_return (0, "unknown input `%U'",
        format_unformat_error, input);
      }
    }
    unformat_free (line_input);
  }

  return NULL;
}
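
With this registration in place, the command can be invoked from the VPP CLI; a hypothetical session:

DBGvpp# create test name foo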
