Analyzing Open vSwitch* with DPDK Bottlenecks Using Intel® VTune™ Amplifier
Overview
This article is primarily aimed at development engineers working on high-performance computing applications. We will show an example of how we used Intel® VTune™ Amplifier to detect a performance bottleneck in Open vSwitch* (OvS) with Data Plane Development Kit (DPDK), also known as OvS-DPDK. We will also describe how we addressed this performance issue. If you are relatively new to design principles of OvS-DPDK packet processing, we highly recommend reading our previous introductory article for a description of the datapath classifier fundamentals as well as our second article, where we emphasized the overall packet processing pipeline with detailed call graphs.
The primary focus of this article is on Intel® microarchitecture and particularly on the top-down analysis approach.
Introduction
To optimize an application for performance, a developer need not be a performance expert, but should know their own application well. Many aspects come into play when trying to improve application performance, ranging from hardware platform considerations and code design changes to code fine-tuning that leverages microarchitecture features. A deep understanding of the code design and implementation is essential for an application developer to understand how the application is utilizing the available hardware resources. This can be achieved by acquiring a deeper knowledge of hardware microarchitecture and by using specialized profiling tools like VTune™ Amplifier.
Getting Started: The Top-down Approach
One of the prominent performance tuning methodologies is the top-down approach. This approach has three stages: system tuning at the top, application tuning in the middle, and microarchitecture tuning at the bottom. System tuning involves hardware and operating system tuning, while application tuning includes better design, parallelization, and improving the efficiency of libraries. Microarchitecture tuning is the last stage and involves careful selection of compiler flags, vectorization, and code refactoring with respect to memory/cache optimizations, as well as an understanding of CPU pitfalls, as depicted in Figure 1.
Figure 1. Top-down performance tuning approach.
Microarchitecture and Micro-operations

Figure 2. Intel® Core™ microarchitecture.

Figure 3. Front-end and back-end processor pipeline responsibilities.
At a given point in time, a pipeline slot can be empty or filled with a μOp. Each pipeline slot is classified into one of four categories, as depicted in Figure 4. More information on pipeline slot classification can be found in this article.

Figure 4. Pipeline slot classifications.
VTune™ Amplifier Tool
The VTune Amplifier collects measurements by leveraging performance monitoring units (PMUs) that are part of the CPU core. These specialized monitoring counters collect and expose information on hardware resource consumption. With the help of PMUs, metrics on the efficiency of the processed instructions and on cache usage can be measured and retrieved. It’s important to become familiar with the wide range of available metrics, such as retired instructions; clock ticks; L2 and L3 cache statistics; branch mispredicts; and so on. “Uncore” PMU metrics, such as bytes read from or written to the memory controller and data traffic transferred by Intel® QuickPath Interconnect (Intel® QPI), can also be measured.
Below is a brief description of the different pipeline slot categories and the formulas used to calculate their metrics. An application is considered front-end bound if the front end is delivering fewer than 4 μOps per cycle while the back end is ready to accept more μOps. This is likely caused by delays in fetching code (caching/ITLB issues) or in decoding instructions. The front-end bound category can be classified into sub-categories as depicted in Figure 5.
Formula: IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clockticks)

Figure 5. Front-end bound sub-categories.
Formula: 1 - (Front-end Bound + Bad speculation + Retiring)

Figure 6. Back-end bound sub-categories.
Formula: (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4* INT_MISC.RECOVERY_CYCLES) / (4* Clockticks)

Figure 7. Bad speculation.
Formula: UOPS_RETIRED.RETIRE_SLOTS / (4 * Clockticks)

Figure 8. Retiring.
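The four formulas above can be combined into one small calculation. The following is a minimal sketch in Python, not VTune code: the function name and the counter values in the usage example are made up for illustration, while the event names in the docstring are the ones used in the formulas.

```python
PIPELINE_WIDTH = 4  # pipeline slots per cycle, the "4 *" factor in the formulas


def top_down_level1(clockticks, uops_not_delivered, uops_issued,
                    uops_retired_slots, recovery_cycles):
    """Return the share of pipeline slots in each top-level category.

    Arguments are raw PMU event counts:
      uops_not_delivered  -> IDQ_UOPS_NOT_DELIVERED.CORE
      uops_issued         -> UOPS_ISSUED.ANY
      uops_retired_slots  -> UOPS_RETIRED.RETIRE_SLOTS
      recovery_cycles     -> INT_MISC.RECOVERY_CYCLES
    """
    total_slots = PIPELINE_WIDTH * clockticks
    front_end = uops_not_delivered / total_slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + PIPELINE_WIDTH * recovery_cycles) / total_slots
    retiring = uops_retired_slots / total_slots
    # Back-end bound is whatever remains once the other three are accounted for.
    back_end = 1.0 - (front_end + bad_speculation + retiring)
    return {
        "front_end_bound": front_end,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "back_end_bound": back_end,
    }


# Hypothetical counter values, chosen only to make the arithmetic easy to follow.
shares = top_down_level1(
    clockticks=1_000_000,
    uops_not_delivered=400_000,
    uops_issued=1_300_000,
    uops_retired_slots=1_200_000,
    recovery_cycles=25_000,
)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Because back-end bound is computed as the remainder, the four shares always add up to 100% of the available pipeline slots.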
Analyzing OvS-DPDK Bottlenecks
Now that we have discussed the microarchitecture fundamentals, let’s apply the knowledge in analyzing the bottlenecks in OvS-DPDK.
Batching packets by flows
One of the important packet pipeline stages in OvS-DPDK is the flow batching. Each incoming packet is first classified and then batched depending on its matching flow, as depicted in Figure 9.
Figure 9. Packets are grouped depending on the matching flow.
Occasionally, a batch may contain only a few packets. In the worst case, each of the fetched packets matches a different flow, so each batch contains a single packet. When the corresponding action for the flow is to forward packets to a certain physical port, transmitting only a few packets at a time can be very inefficient, as packet transmission over a DPDK interface incurs expensive memory-mapped I/O (MMIO) writes.
Figure 10 shows the packet processing call graph. In exact match cache (EMC) processing, a lookup is performed in the EMC for every input packet to retrieve the matching flow. In case of an EMC hit, the packets are queued into batches matching the flow (see struct packet_batch_per_flow in the OvS-DPDK source code) using dp_netdev_queue_batches(). Thereafter, the packets are processed in batches for faster packet processing using packet_batch_per_flow_execute(). If the corresponding action of the flow is to forward the packets to a DPDK port, netdev_send() will be invoked, as depicted in Figure 10.

Figure 10. OvS-DPDK call graph for classification, flow batching and forwarding.
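The effect of flow batching on the number of transmit calls can be illustrated with a toy model. This is a sketch in Python rather than the OvS-DPDK C code; the function name, the flow identifiers, and the assumption of a 32-packet receive burst are all illustrative.

```python
def batch_by_flow(packets):
    """Group (flow_id, payload) tuples into per-flow batches, keeping arrival order."""
    batches = {}
    for flow_id, payload in packets:
        batches.setdefault(flow_id, []).append(payload)
    return batches


BURST = 32  # assumed number of packets fetched from the receive queue at once

# Best case: every packet matches the same flow, giving one batch of 32.
same_flow = [("flow-0", i) for i in range(BURST)]
# Worst case: every packet matches a different flow, giving 32 batches of one.
distinct_flows = [(f"flow-{i}", i) for i in range(BURST)]

for label, burst in (("same flow", same_flow), ("distinct flows", distinct_flows)):
    batches = batch_by_flow(burst)
    # Each batch is forwarded separately, so batches == transmit calls.
    print(f"{label}: {len(batches)} batches, {len(batches)} transmit calls")
```

In this model each batch results in its own transmit call, so the worst case pays the transmit overhead 32 times for the same 32 packets.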
Benchmark with 64-byte UDP packets
A benchmark was set up with an IXIA Traffic Generator sending a few dozen unique streams comprising 64-byte UDP packets. A significant performance drop was observed with tens of different flows in the “PHY2PHY” test case. Note that the flow rules are unique and are set up to match on the source IP address of the packets for the corresponding stream. Example flow rules are shown below; these create four batches, and the packets are queued to the corresponding batches.
$ ovs-ofctl add-flow br0 in_port=4,dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
$ ovs-ofctl add-flow br0 in_port=4,dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
$ ovs-ofctl add-flow br0 in_port=4,dl_type=0x0800,nw_src=6.6.6.1,actions=output:2
$ ovs-ofctl add-flow br0 in_port=4,dl_type=0x0800,nw_src=8.8.8.1,actions=output:2
With the above flow rules, VTune General Exploration analysis revealed interesting insights into transmission bottlenecks in OvS-DPDK.
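For tests with more streams, rules of this shape can be generated rather than typed by hand. The helper below is a hypothetical convenience sketch in Python; it only builds command strings in the same form as the example rules, and the default bridge and port values are simply copied from them.

```python
def flow_rule_commands(source_ips, bridge="br0", in_port=4, out_port=2):
    """Build one 'ovs-ofctl add-flow' command per source IP address."""
    return [
        f"ovs-ofctl add-flow {bridge} in_port={in_port},dl_type=0x0800,"
        f"nw_src={ip},actions=output:{out_port}"
        for ip in source_ips
    ]


# The four source addresses used in the example rules.
for command in flow_rule_commands(["2.2.2.1", "4.4.4.1", "6.6.6.1", "8.8.8.1"]):
    print(command)
```

The printed lines can be reviewed first and then pasted into a shell on the host running OvS-DPDK.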
VTune Amplifier summary
Figure 11 shows a snapshot of the VTune Amplifier summary, taken from a 60-second General Exploration analysis, of how the pipeline slots are occupied. Note that the slots highlighted in pink need attention; VTune highlights them automatically based on the default thresholds for the application category.
Figure 11. Summary of the General Exploration analysis provided by VTune™ Amplifier.

Table 1. Expected vs. measured ranges of pipeline slots.
- netdev_dpdk_eth_send() consumed 17% of the total cycles.
- The cycles per instruction (CPI) rate for the above function stands at 4.921, far above the theoretical limit of 0.25 and the acceptable value of 1.0 for HPC applications.
- This function is entirely back-end bound and is hardly retiring any instructions (<6%).
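To put the CPI figure above in context, CPI is the number of clock ticks divided by the number of instructions retired, and the 0.25 limit corresponds to retiring four instructions every cycle. The short Python sketch below shows the arithmetic; the event counts are hypothetical and were chosen only to reproduce the reported CPI of 4.921.

```python
PIPELINE_WIDTH = 4  # instructions that can retire per cycle in the best case


def cycles_per_instruction(clockticks, instructions_retired):
    """CPI: average number of clock ticks spent per retired instruction."""
    return clockticks / instructions_retired


best_case_cpi = 1 / PIPELINE_WIDTH  # 0.25, the theoretical limit

# Hypothetical counts that yield the CPI reported for netdev_dpdk_eth_send().
measured_cpi = cycles_per_instruction(clockticks=4_921_000,
                                      instructions_retired=1_000_000)

print(f"best-case CPI: {best_case_cpi}")
print(f"measured CPI:  {measured_cpi:.3f}")
print(f"slowdown vs. best case: {measured_cpi / best_case_cpi:.1f}x")
```

A CPI this far above 1.0 means the core spends most of its cycles waiting rather than retiring instructions, which is consistent with the back-end bound classification.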

Figure 12. Summary of the General Exploration analysis provided by VTune™ Amplifier.

Figure 13. Back-end bound summary.
Solution to Mitigate MMIO Cost
In a real scenario, packets coming from a physical port may hit a large number of flows, resulting in very few packets being grouped into each flow batch. This becomes very inefficient when packets are transmitted over DPDK interfaces. To amortize the cost of MMIO writes, an intermediate queue can be used to queue “NETDEV_MAX_BURST” (i.e., 32) packets and transmit them in a burst with rte_eth_tx_burst(). The intermediate queue is implemented using the ‘netdev_dpdk_eth_tx_queue()’ function, which queues the packets. Packets are transmitted when one of the conditions below is met.
- If the packet count in txq (txq->count) >= NETDEV_MAX_BURST, invoke netdev_dpdk_eth_tx_burst() to burst packets.
- After a timeout elapses, any packets waiting in the queue must be flushed, regardless of their number.
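The two flush conditions above can be sketched as a toy model. This is illustrative Python, not the OvS-DPDK patch: the class and method names are invented, the timeout value is an assumption (the article does not state one), and the timeout is measured from the last flush for simplicity. Each flush stands in for one burst transmission and its MMIO write.

```python
NETDEV_MAX_BURST = 32    # burst size named in the text
FLUSH_TIMEOUT_US = 100   # hypothetical timeout; the article gives no value


class IntermediateTxQueue:
    """Toy transmit queue that holds packets until a burst is worth sending."""

    def __init__(self):
        self.pending = []
        self.last_flush_us = 0
        self.bursts_sent = 0  # stands in for the number of costly MMIO writes

    def enqueue(self, packet, now_us):
        """Queue a packet; flush as soon as a full burst has accumulated."""
        self.pending.append(packet)
        if len(self.pending) >= NETDEV_MAX_BURST:
            self.flush(now_us)

    def poll_timeout(self, now_us):
        """Flush leftover packets once the timeout has elapsed."""
        if self.pending and now_us - self.last_flush_us >= FLUSH_TIMEOUT_US:
            self.flush(now_us)

    def flush(self, now_us):
        """Send everything pending as a single burst."""
        self.bursts_sent += 1
        self.pending.clear()
        self.last_flush_us = now_us


# 70 packets arriving one per microsecond, then a quiet period.
txq = IntermediateTxQueue()
for t in range(70):
    txq.enqueue(f"pkt-{t}", now_us=t)
txq.poll_timeout(now_us=200)

print(f"bursts sent with the queue: {txq.bursts_sent}")
print("bursts sent one packet at a time: 70")
```

In this model, 70 packets leave in two full bursts plus one timeout-driven flush of the remaining six. The timeout bounds how long a packet can wait when traffic is too sparse to fill a burst.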

Table 2. Measured ranges of pipeline slots with and without the patch.

Figure 14. Effect of MMIO mitigation by the intermediate queue.
Conclusions
In this article we described the top-down methodology for analyzing and detecting performance bottlenecks. We also explained the VTune Amplifier metrics in the context of the microarchitecture. We used an example application, OvS-DPDK, to identify the bottlenecks, find their root cause, and show the steps taken to improve performance.
Additional Information
For any questions, feel free to follow up with the query on the Open vSwitch discussion mailing thread.
Articles
Intel VTune Amplifier: https://software.intel.com/en-us/intel-vtune-amplifier-xe/
OvS-DPDK Datapath Classifier: https://software.intel.com/en-us/articles/ovs-dpdk-datapath-classifier
OvS-DPDK Datapath Classifier – part 2: https://software.intel.com/en-us/articles/ovs-dpdk-datapath-classifier-part-2