Network Device Driver
A driver registers an initialization function which is called by the kernel when the driver is loaded. This function is registered by using the
igb initialization function (
igb_init_module) and its registration with
module_init can be found in drivers/net/ethernet/intel/igb/igb_main.c.
Both are fairly straightforward:
The bulk of the work to initialize the device happens with the call to
pci_register_driver as we’ll see next.
The Intel I350 network card is a PCI express device.
PCI devices identify themselves with a series of registers in the PCI Configuration Space.
When a device driver is compiled, a macro named
include/module.h) is used to export a table of PCI device IDs identifying devices that the device driver can control. The table is also registered as part of a structure, as we’ll see shortly.
The kernel uses this table to determine which device driver to load to control the device.
That’s how the OS can figure out which devices are connected to the system and which driver should be used to talk to the device.
As seen in the previous section,
pci_register_driver is called in the driver’s initialization function.
This function registers a structure of pointers. Most of the pointers are function pointers, but the PCI device ID table is also registered. The kernel uses the functions registered by the driver to bring the PCI device up.
Once a device has been identified by its PCI IDs, the kernel can then select the proper driver to use to control the device. Each PCI driver registers a probe function with the PCI system in the kernel. The kernel calls this function for devices which have not yet been claimed by a device driver. Once a device is claimed, other drivers will not be asked about the device. Most drivers have a lot of code that runs to get the device ready for use. The exact things done vary from driver to driver.
Some typical operations to perform include:
- Enabling the PCI device.
- Requesting memory ranges and IO ports.
- Setting the DMA mask.
- The ethtool (described more below) functions the driver supports are registered.
- Any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung).
- Other device specific stuff like workarounds or dealing with hardware specific quirks or similar.
- The creation, initialization, and registration of a
struct net_device_opsstructure. This structure contains function pointers to the various functions needed for opening the device, sending data to the network, setting the MAC address, and more.
- The creation, initialization, and registration of a high level
struct net_devicewhich represents a network device.
Let’s take a quick look at some of these operations in the
igb driver in the function
A peek into PCI initialization
The following code from the
igb_probe function does some basic PCI configuration. From drivers/net/ethernet/intel/igb/igb_main.c:
First, the device is initialized with
pci_enable_device_mem. This will wake up the device if it is suspended, enable memory resources, and more.
Next, the DMA mask will be set. This device can read and write to 64bit memory addresses, so
dma_set_mask_and_coherent is called with
Memory regions will be reserved with a call to
pci_request_selected_regions, PCI Express Advanced Error Reporting is enabled (if the PCI AER driver is loaded), DMA is enabled with a call to
pci_set_master, and the PCI configuration space is saved with a call to
More Linux PCI driver information
Network device initialization
igb_probe function does some important network device initialization. In addition to the PCI specific work, it will do more general networking and network device work:
struct net_device_opsis registered.
ethtooloperations are registered.
- The default MAC address is obtained from the NIC.
net_devicefeature flags are set.
- And lots more.
Let’s take a look at each of these as they will be interesting later.
struct net_device_ops contains function pointers to lots of important operations that the network subsystem needs to control the device. We’ll be mentioning this structure many times throughout the rest of this post.
net_device_ops structure is attached to a
struct net_device in
igb_probe. From drivers/net/ethernet/intel/igb/igb_main.c)
And the functions that this
net_device_ops structure holds pointers to are set in the same file. From drivers/net/ethernet/intel/igb/igb_main.c:
As you can see, there are several interesting fields in this
ndo_get_stats64 which hold the addresses of functions implemented by the
We’ll be looking at some of these in more detail later.
ethtool is a command line program you can use to get and set various driver and hardware options. You can install it on Ubuntu by running
apt-get install ethtool.
A common use of
ethtool is to gather detailed statistics from network devices. Other
ethtool settings of interest will be described later.
ethtool program talks to device drivers by using the
ioctl system call. The device drivers register a series of functions that run for the
ethtool operations and the kernel provides the glue.
ioctl call is made from
ethtool, the kernel finds the
ethtool structure registered by the appropriate driver and executes the functions registered. The driver’s
ethtool function implementation can do anything from change a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.
igb driver registers its
ethtool operations in
igb_probe by calling
All of the
ethtool code can be found in the file
drivers/net/ethernet/intel/igb/igb_ethtool.c along with the
Above that, you can find the
igb_ethtool_ops structure with the
ethtool functions the
igb driver supports set to the appropriate fields.
It is up to the individual drivers to determine which
ethtool functions are relevant and which should be implemented. Not all drivers implement all
ethtool functions, unfortunately.
ethtool function is
get_ethtool_stats, which (if implemented) produces detailed statistics counters that are tracked either in software in the driver or via the device itself.
The monitoring section below will show how to use
ethtool to access these detailed statistics.
When a data frame is written to RAM via DMA, how does the NIC tell the rest of the system that data is ready to be processed?
Traditionally, a NIC would generate an interrupt request (IRQ) indicating data had arrived. There are three common types of IRQs: MSI-X, MSI, and legacy IRQs. These will be touched upon shortly. A device generating an IRQ when data has been written to RAM via DMA is simple enough, but if large numbers of data frames arrive this can lead to a large number of IRQs being generated. The more IRQs that are generated, the less CPU time is available for higher level tasks like user processes.
The New Api (NAPI) was created as a mechanism for reducing the number of IRQs generated by network devices on packet arrival. While NAPI reduces the number of IRQs, it cannot eliminate them completely.
We’ll see why that is, exactly, in later sections.
NAPI differs from the legacy method of harvesting data in several important ways. NAPI allows a device driver to register a
poll function that the NAPI subsystem will call to harvest data frames.
The intended use of NAPI in network device drivers is as follows:
- NAPI is enabled by the driver, but is in the off position initially.
- A packet arrives and is DMA’d to memory by the NIC.
- An IRQ is generated by the NIC which triggers the IRQ handler in the driver.
- The driver wakes up the NAPI subsystem using a softirq (more on these later). This will begin harvesting packets by calling the driver’s registered
pollfunction in a separate thread of execution.
- The driver should disable further IRQs from the NIC. This is done to allow the NAPI subsystem to process packets without interruption from the device.
- Once there is no more work to do, the NAPI subsystem is disabled and IRQs from the device are re-enabled.
- The process starts back at step 2.
This method of gathering data frames has reduced overhead compared to the legacy method because many data frames can be consumed at a time without having to deal with processing each of them one IRQ at a time.
The device driver implements a
poll function and registers it with NAPI by calling
netif_napi_add. When registering a NAPI
poll function with
netif_napi_add, the driver will also specify the
weight. Most of the drivers hardcode a value of
64. This value and its meaning will be described in more detail below.
Typically, drivers register their NAPI
poll functions during driver initialization.
NAPI initialization in the
igb driver does this via a long call chain:
This call trace results in a few high level things happening:
- If MSI-X is supported, it will be enabled with a call to
- Various settings are computed and initialized; most notably the number of transmit and receive queues that the device and driver will use for sending and receiving packets.
igb_alloc_q_vectoris called once for every transmit and receive queue that will be created.
- Each call to
netif_napi_addto register a
pollfunction for that queue and an instance of
struct napi_structthat will be passed to
pollwhen called to harvest packets.
Let’s take a look at
igb_alloc_q_vector to see how the
poll callback and its private data are registered.
The above code is allocation memory for a receive queue and registering the function
igb_poll with the NAPI subsystem. It provides a reference to the
struct napi_struct associated with this newly created RX queue (
&q_vector->napi above). This will be passed into
igb_poll when called by the NAPI subsystem when it comes time to harvest packets from this RX queue.
This will be important later when we examine the flow of data from drivers up the network stack.
Bringing a network device up
net_device_ops structure we saw earlier which registered a set of functions for bringing the network device up, transmitting packets, setting the MAC address, etc.
When a network device is brought up (for example, with
ifconfig eth0 up), the function attached to the
ndo_open field of the
net_device_ops structure is called.
ndo_open function will typically do things like:
- Allocate RX and TX queue memory
- Enable NAPI
- Register an interrupt handler
- Enable hardware interrupts
- And more.
In the case of the
igb driver, the function attached to the
ndo_open field of the
net_device_ops structure is called
Preparing to receive data from the network
Most NICs you’ll find today will use DMA to write data directly into RAM where the OS can retrieve the data for processing. The data structure most NICs use for this purpose resembles a queue built on circular buffer (or a ring buffer).
In order to do this, the device driver must work with the OS to reserve a region of memory that the NIC hardware can use. Once this region is reserved, the hardware is informed of its location and incoming data will be written to RAM where it will later be picked up and processed by the networking subsystem.
This seems simple enough, but what if the packet rate was high enough that a single CPU was not able to properly process all incoming packets? The data structure is built on a fixed length region of memory, so incoming packets would be dropped.
This is where something known as known as Receive Side Scaling (RSS) or multiqueue can help.
Some devices have the ability to write incoming packets to several different regions of RAM simultaneously; each region is a separate queue. This allows the OS to use multiple CPUs to process incoming data in parallel, starting at the hardware level. This feature is not supported by all NICs.
The Intel I350 NIC does support multiple queues. We can see evidence of this in the
igb driver. One of the first things the
igb driver does when it is brought up is call a function named
igb_setup_all_rx_resources. This function calls another function,
igb_setup_rx_resources, once for each RX queue to arrange for DMA-able memory where the device will write incoming data.
If you are curious how exactly this works, please see the Linux kernel’s DMA API HOWTO.
It turns out the number and size of the RX queues can be tuned by using
ethtool. Tuning these values can have a noticeable impact on the number of frames which are processed vs the number of frames which are dropped.
The NIC uses a hash function on the packet header fields (like source, destination, port, etc) to determine which RX queue the data should be directed to.
Some NICs let you adjust the weight of the RX queues, so you can send more traffic to specific queues.
Fewer NICs let you adjust this hash function itself. If you can adjust the hash function, you can send certain flows to specific RX queues for processing or even drop the packets at the hardware level, if desired.
We’ll take a look at how to tune these settings shortly.
When a network device is brought up, a driver will usually enable NAPI.
We saw earlier how drivers register
poll functions with NAPI, but NAPI is not usually enabled until the device is brought up.
Enabling NAPI is relatively straight forward. A call to
napi_enable will flip a bit in the
struct napi_struct to indicate that it is now enabled. As mentioned above, while NAPI will be enabled it will be in the off position.
In the case of the
igb driver, NAPI is enabled for each
q_vector that was initialized when the driver was loaded or when the queue count or size are changed with
Register an interrupt handler
After enabling NAPI, the next step is to register an interrupt handler. There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts. As such, the code differs from device to device depending on what the supported interrupt methods are for a particular piece of hardware.
The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.
Some drivers, like the
igb driver, will try to register an interrupt handler with each method, falling back to the next untested method on failure.
MSI-X interrupts are the preferred method, especially for NICs that support multiple RX queues. This is because each RX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with
irqbalance or by modifying
/proc/irq/IRQ_NUMBER/smp_affinity). As we’ll see shortly, the CPU that handles the interrupt will be the CPU that processes the packet. In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level up through the networking stack.
If MSI-X is unavailable, MSI still presents advantages over legacy interrupts and will be used by the driver if the device supports it. Read this useful wiki page for more information about MSI and MSI-X.
igb driver, the functions
igb_intr are the interrupt handler methods for the MSI-X, MSI, and legacy interrupt modes, respectively.
You can find the code in the driver which attempts each interrupt method in drivers/net/ethernet/intel/igb/igb_main.c:
As you can see in the abbreviated code above, the driver first attempts to set an MSI-X interrupt handler with
igb_request_msix, falling back to MSI on failure. Next,
request_irq is used to register
igb_intr_msi, the MSI interrupt handler. If this fails, the driver falls back to legacy interrupts.
request_irq is used again to register the legacy interrupt handler
And this is how the
igb driver registers a function that will be executed when the NIC raises an interrupt signaling that data has arrived and is ready for processing.
At this point, almost everything is setup. The only thing left is to enable interrupts from the NIC and wait for data to arrive. Enabling interrupts is hardware specific, but the
igb driver does this in
__igb_open by calling a helper function named
Interrupts are enabled for this device by writing to registers:
The network device is now up
Drivers may do a few more things like start timers, work queues, or other hardware-specific setup. Once that is completed. the network device is up and ready for use.
Let’s take a look at monitoring and tuning settings for network device drivers.
Monitoring network devices
There are several different ways to monitor your network devices offering different levels of granularity and complexity. Let’s start with most granular and move to least granular.
You can install
ethtool on an Ubuntu system by running:
sudo apt-get install ethtool.
Once it is installed, you can access the statistics by passing the
-S flag along with the name of the network device you want statistics about.
Monitor detailed NIC device statistics (e.g., packet drops) with `ethtool -S`.
$ sudo ethtool -S eth0 NIC statistics: rx_packets: 597028087 tx_packets: 5924278060 rx_bytes: 112643393747 tx_bytes: 990080156714 rx_broadcast: 96 tx_broadcast: 116 rx_multicast: 20294528 ....
Monitoring this data can be difficult. It is easy to obtain, but there is no standardization of the field values. Different drivers, or even different versions of the same driver might produce different field names that have the same meaning.
You should look for values with “drop”, “buffer”, “miss”, etc in the label. Next, you will have to read your driver source. You’ll be able to determine which values are accounted for totally in software (e.g., incremented when there is no memory) and which values come directly from hardware via a register read. In the case of a register value, you should consult the data sheet for your hardware to determine what the meaning of the counter really is; many of the labels given via
ethtool can be misleading.
sysfs also provides a lot of statistics values, but they are slightly higher level than the direct NIC level stats provided.
You can find the number of dropped incoming network data frames for, e.g. eth0 by using
cat on a file.
Monitor higher level NIC statistics with sysfs.
$ cat /sys/class/net/eth0/statistics/rx_dropped 2
The counter values will be split into files like
Unfortunately, it is up to the drivers to decide what the meaning of each field is, and thus, when to increment them and where the values come from. You may notice that some drivers count a certain type of error condition as a drop, but other drivers may count the same as a miss.
If these values are critical to you, you will need to read your driver source to understand exactly what your driver thinks each of these values means.
An even higher level file is
/proc/net/dev which provides high-level summary-esque information for each network adapter on the system.
Monitor high level NIC statistics by reading
$ cat /proc/net/dev Inter-| Receive | Transmit face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed eth0: 110346752214 597737500 0 2 0 0 0 20963860 990024805984 6066582604 0 0 0 0 0 0 lo: 428349463836 1579868535 0 0 0 0 0 0 428349463836 1579868535 0 0 0 0 0 0
This file shows a subset of the values you’ll find in the sysfs files mentioned above, but it may serve as a useful general reference.
The caveat mentioned above applies here, as well: if these values are important to you, you will still need to read your driver source to understand exactly when, where, and why they are incremented to ensure your understanding of an error, drop, or fifo are the same as your driver.
Tuning network devices
Check the number of RX queues being used
If your NIC and the device driver loaded on your system support RSS / multiqueue, you can usually adjust the number of RX queues (also called RX channels), by using
Check the number of NIC receive queues with
$ sudo ethtool -l eth0 Channel parameters for eth0: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 8 Current hardware settings: RX: 0 TX: 0 Other: 0 Combined: 4
This output is displaying the pre-set maximums (enforced by the driver and the hardware) and the current settings.
Note: not all device drivers will have support for this operation.
Error seen if your NIC doesn’t support this operation.
$ sudo ethtool -l eth0 Channel parameters for eth0: Cannot get device channel parameters : Operation not supported
This means that your driver has not implemented the ethtool
get_channels operation. This could be because the NIC doesn’t support adjusting the number of queues, doesn’t support RSS / multiqueue, or your driver has not been updated to handle this feature.
Adjusting the number of RX queues
Once you’ve found the current and maximum queue count, you can adjust the values by using
sudo ethtool -L.
Note: some devices and their drivers only support combined queues that are paired for transmit and receive, as in the example in the above section.
Set combined NIC transmit and receive queues to 8 with
$ sudo ethtool -L eth0 combined 8
If your device and driver support individual settings for RX and TX and you’d like to change only the RX queue count to 8, you would run:
Set the number of NIC receive queues to 8 with
$ sudo ethtool -L eth0 rx 8
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
Adjusting the size of the RX queues
Some NICs and their drivers also support adjusting the size of the RX queue. Exactly how this works is hardware specific, but luckily
ethtool provides a generic way for users to adjust the size. Increasing the size of the RX queue can help prevent network data drops at the NIC during periods where large numbers of data frames are received. Data may still be dropped in software, though, and other tuning is required to reduce or eliminate drops completely.
Check current NIC queue sizes with
$ sudo ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 512 RX Mini: 0 RX Jumbo: 0 TX: 512
the above output indicates that the hardware supports up to 4096 receive and transmit descriptors, but it is currently only using 512.
Increase size of each RX queue to 4096 with
$ sudo ethtool -G eth0 rx 4096
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
Adjusting the processing weight of RX queues
Some NICs support the ability to adjust the distribution of network data among the RX queues by setting a weight.
You can configure this if:
- Your NIC supports flow indirection.
- Your driver implements the
- You are running a new enough version of
ethtoolthat has support for the command line options
-Xto show and set the indirection table, respectively.
Check the RX flow indirection table with
$ sudo ethtool -x eth0 RX flow hash indirection table for eth3 with 2 RX ring(s): 0: 0 1 0 1 0 1 0 1 8: 0 1 0 1 0 1 0 1 16: 0 1 0 1 0 1 0 1 24: 0 1 0 1 0 1 0 1
This output shows packet hash values on the left, with receive queue 0 and 1 listed. So, a packet which hashes to 2 will be delivered to receive queue 0, while a packet which hashes to 3 will be delivered to receive queue 1.
Example: spread processing evenly between first 2 RX queues
$ sudo ethtool -X eth0 equal 2
If you want to set custom weights to alter the number of packets which hit certain receive queues (and thus CPUs), you can specify those on the command line, as well:
Set custom RX queue weights with
$ sudo ethtool -X eth0 weight 6 2
The above command specifies a weight of 6 for rx queue 0 and 2 for rx queue 1, pushing much more data to be processed on queue 0.
Some NICs will also let you adjust the fields which be used in the hash algorithm, as we’ll see now.
Adjusting the rx hash fields for network flows
You can use
ethtool to adjust the fields that will be used when computing a hash for use with RSS.
Check which fields are used for UDP RX flow hash with
$ sudo ethtool -n eth0 rx-flow-hash udp4 UDP over IPV4 flows use these fields for computing Hash flow key: IP SA IP DA
For eth0, the fields that are used for computing a hash on UDP flows is the IPv4 source and destination addresses. Let’s include the source and destination ports:
Set UDP RX flow hash fields with
$ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn
sdfn string is a bit cryptic; check the
ethtool man page for an explanation of each letter.
Adjusting the fields to take a hash on is useful, but
ntuple filtering is even more useful for finer grained control over which flows will be handled by which RX queue.
ntuple filtering for steering network flows
Some NICs support a feature known as “ntuple filtering.” This feature allows the user to specify (via
ethtool) a set of parameters to use to filter incoming network data in hardware and queue it to a particular RX queue. For example, the user can specify that TCP packets destined to a particular port should be sent to RX queue 1.
On Intel NICs this feature is commonly known as Intel Ethernet Flow Director. Other NIC vendors may have other marketing names for this feature.
As we’ll see later, ntuple filtering is a crucial component of another feature called Accelerated Receive Flow Steering (aRFS), which makes using ntuple much easier if your NIC supports it. aRFS will be covered later.
This feature can be useful if the operational requirements of the system involve maximizing data locality with the hope of increasing CPU cache hit rates when processing network data. For example consider the following configuration for a webserver running on port 80:
- A webserver running on port 80 is pinned to run on CPU 2.
- IRQs for an RX queue are assigned to be processed by CPU 2.
- TCP traffic destined to port 80 is ‘filtered’ with ntuple to CPU 2.
- All incoming traffic to port 80 is then processed by CPU 2 starting at data arrival to the userland program.
- Careful monitoring of the system including cache hit rates and networking stack latency will be needed to determine effectiveness.
As mentioned, ntuple filtering can be configured with
ethtool, but first, you’ll need to ensure that this feature is enabled on your device.
Check if ntuple filters are enabled with
$ sudo ethtool -k eth0 Offload parameters for eth0: ... ntuple-filters: off receive-hashing: on
As you can see,
ntuple-filters are set to off on this device.
Enable ntuple filters with
$ sudo ethtool -K eth0 ntuple on
Once you’ve enabled ntuple filters, or verified that it is enabled, you can check the existing ntuple rules by using
Check existing ntuple filters with
$ sudo ethtool -u eth0 40 RX rings available Total 0 rules
As you can see, this device has no ntuple filter rules. You can add a rule by specifying it on the command line to
ethtool. Let’s add a rule to direct all TCP traffic with a destination port of 80 to RX queue 2:
Add ntuple filter to send TCP flows with destination port 80 to RX queue 2
$ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2
You can also use ntuple filtering to drop packets for particular flows at the hardware level. This can be useful for mitigating heavy incoming traffic from specific IP addresses. For more information about configuring ntuple filter rules, see the
ethtool man page.
You can usually get statistics about the success (or failure) of your ntuple rules by checking values output from
ethtool -S [device name]. For example, on Intel NICs, the statistics
fdir_miss calculate the number of matches and misses for your ntuple filtering rules. Consult your device driver source and device data sheet for tracking down statistics counters (if available).
Before examining the network stack, we’ll need to take a short detour to examine something in the Linux kernel called SoftIRQs.
What is a softirq?
The softirq system in the Linux kernel is a mechanism for executing code outside of the context of an interrupt handler implemented in a driver. This system is important because hardware interrupts may be disabled during all or part of the execution of an interrupt handler. The longer interrupts are disabled, the greater chance that events may be missed. So, it is important to defer any long running actions outside of the interrupt handler so that it can complete as quickly as possible and re-enable interrupts from the device.
There are other mechanisms that can be used for deferring work in the kernel, but for the purposes of the networking stack, we’ll be looking at softirqs.
The softirq system can be imagined as a series of kernel threads (one per CPU) that run handler functions which have been registered for different softirq events. If you’ve ever looked at top and seen
ksoftirqd/0 in the list of kernel threads, you were looking at the softirq kernel thread running on CPU 0.
Kernel subsystems (like networking) can register a softirq handler by executing the
open_softirq function. We’ll see later how the networking system registers its softirq handlers. For now, let’s learn a bit more about how softirqs work.
Since softirqs are so important for deferring the work of device drivers, you might imagine that the
ksoftirqd process is spawned pretty early in the life cycle of the kernel and you’d be correct.
Looking at the code found in kernel/softirq.c reveals how the
ksoftirqd system is initialized:
As you can see from the
struct smp_hotplug_thread definition above, there are two function pointers being registered:
Both of these functions are called from kernel/smpboot.c as part of something which resembles an event loop.
The code in
kernel/smpboot.c first calls
ksoftirqd_should_run which determines if there are any pending softirqs and, if there are pending softirqs,
run_ksoftirqd is executed. The
run_ksoftirqd does some minor bookkeeping before it calls
__do_softirq function does a few interesting things:
- determines which softirq is pending
- softirq time is accounted for statistics purposes
- softirq execution statistics are incremented
- the softirq handler for the pending softirq (which was registered with a call to
open_softirq) is executed.
So, when you look at graphs of CPU usage and see
si you now know that this is measuring the amount of CPU usage happening in a deferred work context.
softirq system increments statistic counters which can be read from
/proc/softirqs Monitoring these statistics can give you a sense for the rate at which softirqs for various events are being generated.
Check softIRQ stats by reading
$ cat /proc/softirqs CPU0 CPU1 CPU2 CPU3 HI: 0 0 0 0 TIMER: 2831512516 1337085411 1103326083 1423923272 NET_TX: 15774435 779806 733217 749512 NET_RX: 1671622615 1257853535 2088429526 2674732223 BLOCK: 1800253852 1466177 1791366 634534 BLOCK_IOPOLL: 0 0 0 0 TASKLET: 25 0 0 0 SCHED: 2642378225 1711756029 629040543 682215771 HRTIMER: 2547911 2046898 1558136 1521176 RCU: 2056528783 4231862865 3545088730 844379888
This file can give you an idea of how your network receive (
NET_RX) processing is currently distributed across your CPUs. If it is distributed unevenly, you will see a larger count value for some CPUs than others. This is one indicator that you might be able to benefit from Receive Packet Steering / Receive Flow Steering described below. Be careful using just this file when monitoring your performance: during periods of high network activity you would expect to see the rate
NET_RX increments increase, but this isn’t necessarily the case. It turns out that this is a bit nuanced, because there are additional tuning knobs in the network stack that can affect the rate at which
NET_RX softirqs will fire, which we’ll see soon.
You should be aware of this, however, so that if you adjust the other tuning knobs you will know to examine
/proc/softirqs and expect to see a change.
Now, let’s move on to the networking stack and trace how network data is received from top to bottom.
Linux network device subsystem
Now that we’ve taken a look in to how network drivers and softirqs work, let’s see how the Linux network device subsystem is initialized. Then, we can follow the path of a packet starting with its arrival.
Initialization of network device subsystem
The network device (netdev) subsystem is initialized in the function
net_dev_init. Lots of interesting things happen in this initialization function.
struct softnet_data structures
net_dev_init creates a set of
struct softnet_data structures for each CPU on the system. These structures will hold pointers to several important things for processing network data:
- List for NAPI structures to be registered to this CPU.
- A backlog for data processing.
- The processing
- The receive offload structure list.
- Receive packet steering settings.
- And more.
Each of these will be examined in greater detail later as we progress up the stack.
Initialization of softirq handlers
net_dev_init registers a transmit and receive softirq handler which will be used to process incoming or outgoing network data. The code for this is pretty straight forward:
We’ll see soon how the driver’s interrupt handler will “raise” (or trigger) the
net_rx_action function registered to the
At long last; network data arrives!
Assuming that the RX queue has enough available descriptors, the packet is written to RAM via DMA. The device then raises the interrupt that is assigned to it (or in the case of MSI-X, the interrupt tied to the rx queue the packet arrived on).
In general, the interrupt handler which runs when an interrupt is raised should try to defer as much processing as possible to happen outside the interrupt context. This is crucial because while an interrupt is being processed, other interrupts may be blocked.
Let’s take a look at the source for the MSI-X interrupt handler; it will really help illustrate the idea that the interrupt handler does as little work as possible.
This interrupt handler is very short and performs 2 very quick operations before returning.
First, this function calls
igb_write_itr which simply updates a hardware specific register. In this case, the register that is updated is one which is used to track the rate hardware interrupts are arriving.
This register is used in conjunction with a hardware feature called “Interrupt Throttling” (also called “Interrupt Coalescing”) which can be used to to pace the delivery of interrupts to the CPU. We’ll see soon how
ethtool provides a mechanism for adjusting the rate at which IRQs fire.
napi_schedule is called which wakes up the NAPI processing loop if it was not already active. Note that the NAPI processing loop executes in a softirq; the NAPI processing loop does not execute from the interrupt handler. The interrupt handler simply causes it to start executing if it was not already.
The actual code showing exactly how this works is important; it will guide our understanding of how network data is processed on multi-CPU systems.
Let’s figure out how the
napi_schedule call from the hardware interrupt handler works.
Remember, NAPI exists specifically to harvest network data without needing interrupts from the NIC to signal that data is ready for processing. As mentioned earlier, the NAPI
poll loop is bootstrapped by receiving a hardware interrupt. In other words: NAPI is enabled, but off, until the first packet arrives at which point the NIC raises an IRQ and NAPI is started. There are a few other cases, as we’ll see soon, where NAPI can be disabled and will need a hardware interrupt to be raised before it will be started again.
The NAPI poll loop is started when the interrupt handler in the driver calls
napi_schedule is actually just a wrapper function defined in a header file which calls down to
This code is using
__get_cpu_var to get the
softnet_data structure that is registered to the current CPU. This
softnet_data structure and the
struct napi_struct structure handed up from the driver are passed into
____napi_schedule. Wow, that’s a lot of underscores 😉
Let’s take a look at
____napi_schedule, from net/core/dev.c:
This code does two important things:
struct napi_structhanded up from the device driver’s interrupt handler code is added to the
poll_listattached to the
softnet_datastructure associated with the current CPU.
__raise_softirq_irqoffis used to “raise” (or trigger) a NET_RX_SOFTIRQ softirq. This will cause the
net_rx_actionregistered during the network device subsystem initialization to be executed, if it’s not currently being executed.
As we’ll see shortly, the softirq handler function
net_rx_action will call the NAPI
poll function to harvest packets.
A note about CPU and network data processing
Note that all the code we’ve seen so far to defer work from a hardware interrupt handler to a softirq has been using structures associated with the current CPU.
While the driver’s IRQ handler itself does very little work itself, the softirq handler will execute on the same CPU as the driver’s IRQ handler.
This why setting the CPU a particular IRQ will be handled by is important: that CPU will be used not only to execute the interrupt handler in the driver, but the same CPU will also be used when harvesting packets in a softirq via NAPI.
As we’ll see later, things like Receive Packet Steering can distribute some of this work to other CPUs further up the network stack.
Monitoring network data arrival
Hardware interrupt requests
Note: monitoring hardware IRQs does not give a complete picture of packet processing health. Many drivers turn off hardware IRQs while NAPI is running, as we’ll see later. It is one important part of your whole monitoring solution.
Check hardware interrupt stats by reading
$ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 46 0 0 0 IR-IO-APIC-edge timer 1: 3 0 0 0 IR-IO-APIC-edge i8042 30: 3361234770 0 0 0 IR-IO-APIC-fasteoi aacraid 64: 0 0 0 0 DMAR_MSI-edge dmar0 65: 1 0 0 0 IR-PCI-MSI-edge eth0 66: 863649703 0 0 0 IR-PCI-MSI-edge eth0-TxRx-0 67: 986285573 0 0 0 IR-PCI-MSI-edge eth0-TxRx-1 68: 45 0 0 0 IR-PCI-MSI-edge eth0-TxRx-2 69: 394 0 0 0 IR-PCI-MSI-edge eth0-TxRx-3 NMI: 9729927 4008190 3068645 3375402 Non-maskable interrupts LOC: 2913290785 1585321306 1495872829 1803524526 Local timer interrupts
You can monitor the statistics in
/proc/interrupts to see how the number and rate of hardware interrupts change as packets arrive and to ensure that each RX queue for your NIC is being handled by an appropriate CPU. As we’ll see shortly, this number only tells us how many hardware interrupts have happened, but it is not necessarily a good metric for understanding how much data has been received or processed as many drivers will disable NIC IRQs as part of their contract with the NAPI subsystem. Further, using interrupt coalescing will also affect the statistics gathered from this file. Monitoring this file can help you determine if the interrupt coalescing settings you select are actually working.
To get a more complete picture of your network processing health, you’ll need to monitor
/proc/softirqs (as mentioned above) and additional files in
/proc that we’ll cover below.
Tuning network data arrival
Interrupt coalescing is a method of preventing interrupts from being raised by a device to a CPU until a specific amount of work or number of events are pending.
This can help prevent interrupt storms and can help increase throughput or latency, depending on the settings used. Fewer interrupts generated result in higher throughput, increased latency, and lower CPU usage. More interrupts generated result in the opposite: lower latency, lower throughput, but also increased CPU usage.
Historically, earlier versions of the
e1000, and other drivers included support for a parameter called
InterruptThrottleRate. This parameter has been replaced in more recent drivers with a generic
Get the current IRQ coalescing settings with
$ sudo ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 ...
ethtool provides a generic interface for setting various coalescing settings. Keep in mind, however, that not every device or driver will support every setting. You should check your driver documentation or driver source code to determine what is, or is not, supported. As per the ethtool documentation: “Anything not implemented by the driver causes these values to be silently ignored.”
One interesting option that some drivers support is “adaptive RX/TX IRQ coalescing.” This option is typically implemented in hardware. The driver usually needs to do some work to inform the NIC that this feature is enabled and some bookkeeping as well (as seen in the
igb driver code above).
The result of enabling adaptive RX/TX IRQ coalescing is that interrupt delivery will be adjusted to improve latency when packet rate is low and also improve throughput when packet rate is high.
Enable adaptive RX IRQ coalescing with
$ sudo ethtool -C eth0 adaptive-rx on
You can also use
ethtool -C to set several options. Some of the more common options to set are:
rx-usecs: How many usecs to delay an RX interrupt after a packet arrives.
rx-frames: Maximum number of data frames to receive before an RX interrupt.
rx-usecs-irq: How many usecs to delay an RX interrupt while an interrupt is being serviced by the host.
rx-frames-irq: Maximum number of data frames to receive before an RX interrupt is generated while the system is servicing an interrupt.
And many, many more.
Reminder that your hardware and driver may only support a subset of the options listed above. You should consult your driver source code and your hardware data sheet for more information on supported coalescing options.
Unfortunately, the options you can set aren’t well documented anywhere except in a header file. Check the source of include/uapi/linux/ethtool.h to find an explanation of each option supported by
ethtool (but not necessarily your driver and NIC).
Note: while interrupt coalescing seems to be a very useful optimization at first glance, the rest of the networking stack internals also come into the fold when attempting to optimize. Interrupt coalescing can be useful in some cases, but you should ensure that the rest of your networking stack is also tuned properly. Simply modifying your coalescing settings alone will likely provide minimal benefit in and of itself.
Adjusting IRQ affinities
If your NIC supports RSS / multiqueue or if you are attempting to optimize for data locality, you may wish to use a specific set of CPUs for handling interrupts generated by your NIC.
Setting specific CPUs allows you to segment which CPUs will be used for processing which IRQs. These changes may affect how upper layers operate, as we’ve seen for the networking stack.
If you do decide to adjust your IRQ affinities, you should first check if you running the
irqbalance daemon. This daemon tries to automatically balance IRQs to CPUs and it may overwrite your settings. If you are running
irqbalance, you should either disable
irqbalance or use the
--banirq in conjunction with
IRQBALANCE_BANNED_CPUS to let
irqbalance know that it shouldn’t touch a set of IRQs and CPUs that you want to assign yourself.
Next, you should check the file
/proc/interrupts for a list of the IRQ numbers for each network RX queue for your NIC.
Finally, you can adjust the which CPUs each of those IRQs will be handled by modifying
/proc/irq/IRQ_NUMBER/smp_affinity for each IRQ number.
You simply write a hexadecimal bitmask to this file to instruct the kernel which CPUs it should use for handling the IRQ.
Example: Set the IRQ affinity for IRQ 8 to CPU 0
$ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'
Network data processing begins
Once the softirq code determines that a softirq is pending, begins processing, and executes
net_rx_action, network data processing begins.
Let’s take a look at portions of the
net_rx_action processing loop to understand how it works, which pieces are tunable, and what can be monitored.
net_rx_action processing loop
net_rx_action begins the processing of packets from the memory the packets were DMA’d into by the device.
The function iterates through the list of NAPI structures that are queued for the current CPU, dequeuing each structure, and operating on it.
The processing loop bounds the amount of work and execution time that can be consumed by the registered NAPI
poll functions. It does this in two ways:
- By keeping track of a work
budget(which can be adjusted), and
- Checking the elapsed time
This is how the kernel prevents packet processing from consuming the entire CPU. The
budget above is the total available budget that will be spent among each of the available NAPI structures registered to this CPU.
This is another reason why multiqueue NICs should have the IRQ affinity carefully tuned. Recall that the CPU which handles the IRQ from the device will be the CPU where the softirq handler will execute and, as a result, will also be the CPU where the above loop and budget computation runs.
Systems with multiple NICs each with multiple queues can end up in a situation where multiple NAPI structs are registered to the same CPU. Data processing for all NAPI structs on the same CPU spend from the same
If you don’t have enough CPUs to distribute your NIC’s IRQs, you can consider increasing the
budget to allow for more packet processing for each CPU. Increasing the budget will increase CPU usage (specifically
top or other programs), but should reduce latency as data will be processed more promptly.
Note: the CPU will still be bounded by a time limit of 2 jiffies, regardless of the assigned budget.
poll function and
Recall that network device drivers use
netif_napi_add for registering
poll function. As we saw earlier in this post, the
igb driver has a piece of code like this:
This registers a NAPI structure with a hardcoded weight of 64. We’ll see now how this is used in the
net_rx_action processing loop.
This code obtains the weight which was registered to the NAPI struct (
64 in the above driver code) and passes it into the
poll function which was also registered to the NAPI struct (
igb_poll in the above code).
poll function returns the number of data frames that were processed. This amount is saved above as
work, which is then subtracted from the overall
- You are using a weight of
64from your driver (all drivers were hardcoded with this value in Linux 3.13.0), and
- You have your
budgetset to the default of
Your system would stop processing data when either:
igb_pollfunction was called at most 5 times (less if no data to process as we’ll see next), OR
- At least 2 jiffies of time have elapsed.
The NAPI / network device driver contract
One important piece of information about the contract between the NAPI subsystem and device drivers which has not been mentioned yet are the requirements around shutting down NAPI.
This part of the contract is as follows:
- If a driver’s
pollfunction consumes its entire weight (which is hardcoded to
64) it must NOT modify NAPI state. The
net_rx_actionloop will take over.
- If a driver’s
pollfunction does NOT consume its entire weight, it must disable NAPI. NAPI will be re-enabled next time an IRQ is received and the driver’s IRQ handler calls
We’ll see how
net_rx_action deals with the first part of that contract now. Next, the
poll function is examined, we’ll see how the second part of that contract is handled.
net_rx_action processing loop finishes up with one last section of code that deals with the first part of the NAPI contract explained in the previous section. From net/core/dev.c:
If the entire work is consumed, there are two cases that
- The network device should be shutdown (e.g. because the user ran
ifconfig eth0 down),
- If the device is not being shutdown, check if there’s a generic receive offload (GRO) list. If the timer tick rate is >= 1000, all GRO’d network flows that were recently updated will be flushed. We’ll dig into GRO in detail later. Move the NAPI structure to the end of the list for this CPU so the next iteration of the loop will get the next NAPI structure registered.
And that is how the packet processing loop invokes the driver’s registered
poll function to process packets. As we’ll see shortly, the
poll function will harvest network data and send it up the stack to be processed.
Exiting the loop when limits are reached
net_rx_action loop will exit when either:
- The poll list registered for this CPU has no more NAPI structures (
- The remaining budget is <= 0, or
- The time limit of 2 jiffies has been reached
Here’s this code we saw earlier again:
If you follow the
softnet_break label you stumble upon something interesting. From net/core/dev.c:
struct softnet_data structure has some statistics incremented and the softirq
NET_RX_SOFTIRQ is shut down. The
time_squeeze field is a measure of the number of times
net_rx_action had more work to do but either the budget was exhausted or the time limit was reached before it could be completed. This is a tremendously useful counter for understanding bottlenecks in network processing. We’ll see shortly how to monitor this value. The
NET_RX_SOFTIRQ is disabled to free up processing time for other tasks. This makes sense as this small stub of code is only executed when more work could have been done, but we don’t want to monopolize the CPU.
Execution is then transferred to the
out label. Execution can also make it to the
out label if there were no more NAPI structures to process, in other words, there is more budget than there is network activity and all the drivers have shut NAPI off and there is nothing left for
net_rx_action to do.
out section does one important thing before returning from
net_rx_action: it calls
net_rps_action_and_irq_enable. This function serves an important purpose if Receive Packet Steering is enabled; it wakes up remote CPUs to start processing network data.
We’ll see more about how RPS works later. For now, let’s see how to monitor the health of the
net_rx_action processing loop and move on to the inner working of NAPI
poll functions so we can progress up the network stack.
Recall in previous sections that device drivers allocate a region of memory for the device to perform DMA to incoming packets. Just as it is the responsibility of the driver to allocate those regions, it is also the responsibility of the driver to unmap those regions, harvest the data, and send it up the network stack.
Let’s take a look at how the
igb driver does this to get an idea of how this works in practice.
At long last, we can finally examine our friend
igb_poll. It turns out the code for
igb_poll is deceptively simple. Let’s take a look. From drivers/net/ethernet/intel/igb/igb_main.c:
This code does a few interesting things:
- If Direct Cache Access (DCA) support is enabled in the kernel, the CPU cache is warmed so that accesses to the RX ring will hit CPU cache. You can read more about DCA in the Extras section at the end of this blog post.
igb_clean_rx_irqis called which does the heavy lifting, as we’ll see next.
clean_completeis checked to determine if there was still more work that could have been done. If so, the
budget(remember, this was hardcoded to
64) is returned. As we saw earlier,
net_rx_actionwill move this NAPI structure to the end of the poll list.
- Otherwise, the driver turns off NAPI by calling
napi_completeand re-enables interrupts by calling
igb_ring_irq_enable. The next interrupt that arrives will re-enable NAPI.
Let’s see how
igb_clean_rx_irq sends network data up the stack.
igb_clean_rx_irq function is a loop which processes one packet at a time until the
budget is reached or no additional data is left to process.
The loop in this function does a few important things:
- Allocates additional buffers for receiving data as used buffers are cleaned out. Additional buffers are added
IGB_RX_BUFFER_WRITE(16) at a time.
- Fetch a buffer from the RX queue and store it in an
- Check if the buffer is an “End of Packet” buffer. If so, continue processing. Otherwise, continue fetching additional buffers from the RX queue, adding them to the
skb. This is necessary if a received data frame is larger than the buffer size.
- Verify that the layout and headers of the data are correct.
- The number of bytes processed statistic counter is increased by
- Set the hash, checksum, timestamp, VLAN id, and protocol fields of the skb. The hash, checksum, timestamp, and VLAN id are provided by the hardware. If the hardware is signaling a checksum error, the
csum_errorstatistic is incremented. If the checksum succeeded and the data is UDP or TCP data, the
skbis marked as
CHECKSUM_UNNECESSARY. If the checksum failed, the protocol stacks are left to deal with this packet. The protocol is computed with a call to
eth_type_transand stored in the
- The constructed
skbis handed up the network stack with a call to
- The number of packets processed statistics counter is incremented.
- The loop continues until the number of packets processed reaches the budget.
Once the loop terminates, the function assigns statistics counters for rx packets and bytes processed.
Now it’s time to take two detours prior to proceeding up the network stack. First, let’s see how to monitor and tune the network subsystem’s softirqs. Next, let’s talk about Generic Receive Offloading (GRO). After that, the rest of the networking stack will make more sense as we enter
Monitoring network data processing
As seen in the previous section,
net_rx_action increments a statistic when exiting the
net_rx_action loop and when additional work could have been done, but either the
budget or the time limit for the softirq was hit. This statistic is tracked as part of the
struct softnet_data associated with the CPU.
These statistics are output to a file in proc:
/proc/net/softnet_stat for which there is, unfortunately, very little documentation. The fields in the file in proc are not labeled and could change between kernel releases.
In Linux 3.13.0, you can find which values map to which field in
/proc/net/softnet_stat by reading the kernel source. From net/core/net-procfs.c:
Many of these statistics have confusing names and are incremented in places where you might not expect. An explanation of when and where each of these is incremented will be provided as the network stack is examined. Since the
squeeze_time statistic was seen in
net_rx_action, I thought it made sense to document this file now.
Monitor network data processing statistics by reading
$ cat /proc/net/softnet_stat 6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000 660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000 61c99331 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6794b1b3 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6488cb92 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Important details about
- Each line of
/proc/net/softnet_statcorresponds to a
struct softnet_datastructure, of which there is 1 per CPU.
- The values are separated by a single space and are displayed in hexadecimal
- The first value,
sd->processed, is the number of network frames processed. This can be more than the total number of network frames received if you are using ethernet bonding. There are cases where the ethernet bonding driver will trigger network data to be re-processed, which would increment the
sd->processedcount more than once for the same packet.
- The second value,
sd->dropped, is the number of network frames dropped because there was no room on the processing queue. More on this later.
- The third value,
sd->time_squeeze, is (as we saw) the number of times the
net_rx_actionloop terminated because the budget was consumed or the time limit was reached, but more work could have been. Increasing the
budgetas explained earlier can help reduce this.
- The next 5 values are always 0.
- The ninth value,
sd->cpu_collision, is a count of the number of times a collision occurred when trying to obtain a device lock when transmitting packets. This article is about receive, so this statistic will not be seen below.
- The tenth value,
sd->received_rps, is a count of the number of times this CPU has been woken up to process packets via an Inter-processor Interrupt
- The last value,
flow_limit_count, is a count of the number of times the flow limit has been reached. Flow limiting is an optional Receive Packet Steering feature that will be examined shortly.
If you decide to monitor this file and graph the results, you must be extremely careful that the ordering of these fields hasn’t changed and that the meaning of each field has been preserved. You will need to read the kernel source to verify this.
Tuning network data processing
You can adjust the
net_rx_action budget, which determines how much packet processing can be spent among all NAPI structures registered to a CPU by setting a sysctl value named
Example: set the overall packet processing budget to 600.
$ sudo sysctl -w net.core.netdev_budget=600
You may also want to write this setting to your
/etc/sysctl.conf file so that changes persist between reboots.
The default value on Linux 3.13.0 is 300.
Generic Receive Offloading (GRO)
Generic Receive Offloading (GRO) is a software implementation of a hardware optimization that is known as Large Receive Offloading (LRO).
The main idea behind both methods is that reducing the number of packets passed up the network stack by combining “similar enough” packets together can reduce CPU usage. For example, imagine a case where a large file transfer is occurring and most of the packets contain chunks of data in the file. Instead of sending small packets up the stack one at a time, the incoming packets can be combined into one packet with a huge payload. That packet can then be passed up the stack. This allows the protocol layers to process a single packet’s headers while delivering bigger chunks of data to the user program.
The problem with this sort of optimization is, of course, information loss. If a packet had some important option or flag set, that option or flag could be lost if the packet is coalesced into another. And this is exactly why most people don’t use or encourage the use of LRO. LRO implementations, generally speaking, had very lax rules for coalescing packets.
GRO was introduced as an implementation of LRO in software, but with more strict rules around which packets can be coalesced.
By the way: if you have ever used
tcpdump and seen unrealistically large incoming packet sizes, it is most likely because your system has GRO enabled. As you’ll see soon, packet capture taps are inserted further up the stack, after GRO has already happened.
Tuning: Adjusting GRO settings with
You can use
ethtool to check if GRO is enabled and also to adjust the setting.
ethtool -k to check your GRO settings.
$ ethtool -k eth0 | grep generic-receive-offload generic-receive-offload: on
As you can see, on this system I have
generic-receive-offload set to on.
ethtool -K to enable (or disable) GRO.
$ sudo ethtool -K eth0 gro on
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
napi_gro_receive deals processing network data for GRO (if GRO is enabled for the system) and sending the data up the stack toward the protocol layers. Much of this logic is handled in a function called
This function begins by checking if GRO is enabled and, if so, preparing to do GRO. In the case where GRO is enabled, a list of GRO offload filters is traversed to allow the higher level protocol stacks to act on a piece of data which is being considered for GRO. This is done so that the protocol layers can let the network device layer know if this packet is part of a network flow that is currently being receive offloaded and handle anything protocol specific that should happen for GRO. For example, the TCP protocol will need to decide if/when to ACK a packet that is being coalesced into an existing packet.
Here’s the code from
net/core/dev.c which does this:
If the protocol layers indicated that it is time to flush the GRO’d packet, that is taken care of next. This happens with a call to
napi_gro_complete, which calls a
gro_complete callback for the protocol layers and then passes the packet up the stack by calling
Here’s the code from
net/core/dev.c which does this:
Next, if the protocol layers merged this packet to an existing flow,
napi_gro_receive simply returns as there’s nothing else to do.
If the packet was not merged and there are fewer than
MAX_GRO_SKBS (8) GRO flows on the system, a new entry is added to the
gro_list on the NAPI structure for this CPU.
Here’s the code from
net/core/dev.c which does this:
And that is how the GRO system in the Linux networking stack works.
napi_skb_finish is called which either frees unneeded data structures because a packet has been merged, or calls
netif_receive_skb to pass the data up the network stack (because there were already
MAX_GRO_SKBS flows being GRO’d).
Next, it’s time for
netif_receive_skb to see how data is handed off to the protocol layers. Before this can be examined, we’ll need to take a look at Receive Packet Steering (RPS) first.
Receive Packet Steering (RPS)
Recall earlier how we discussed that network device drivers register a NAPI
poll function. Each
NAPI poller instance is executed in the context of a softirq of which there is one per CPU. Further recall that the CPU which the driver’s IRQ handler runs on will wake its softirq processing loop to process packets.
In other words: a single CPU processes the hardware interrupt and polls for packets to process incoming data.
Some NICs (like the Intel I350) support multiple queues at the hardware level. This means incoming packets can be DMA’d to a separate memory region for each queue, with a separate NAPI structure to manage polling this region, as well. Thus multiple CPUs will handle interrupts from the device and also process packets.
This feature is typically called Receive Side Scaling (RSS).
Receive Packet Steering (RPS) is a software implementation of RSS. Since it is implemented in software, this means it can be enabled for any NIC, even NICs which have only a single RX queue. However, since it is in software, this means that RPS can only enter into the flow after a packet has been harvested from the DMA memory region.
This means that you wouldn’t notice a decrease in CPU time spent handling IRQs or the NAPI
poll loop, but you can distribute the load for processing the packet after it’s been harvested and reduce CPU time from there up the network stack.
RPS works by generating a hash for incoming data to determine which CPU should process the data. The data is then enqueued to the per-CPU receive network backlog to be processed. An Inter-processor Interrupt (IPI) is delivered to the CPU owning the backlog. This helps to kick-start backlog processing if it is not currently processing data on the backlog. The
/proc/net/softnet_stat contains a count of the number of times each
softnet_data struct has received an IPI (the
netif_receive_skb will either continue sending network data up the networking stack, or hand it over to RPS for processing on a different CPU.
Tuning: Enabling RPS
For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for kernel 3.13.0), and a bitmask describing which CPUs should process packets for a given interface and RX queue.
You can find some documentation about these bitmasks in the kernel documentation.
In short, the bitmasks to modify are found in:
eth0 and receive queue 0, you would modify the file:
/sys/class/net/eth0/queues/rx-0/rps_cpus with a hexadecimal number indicating which CPUs should process packets from
eth0’s receive queue 0. As the documentation points out, RPS may be unnecessary in certain configurations.
Note: enabling RPS to distribute packet processing to CPUs which were previously not processing packets will cause the number of `NET_RX` softirqs to increase for that CPU, as well as the `si` or `sitime` in the CPU usage graph. You can compare before and after of your softirq and CPU usage graphs to confirm that RPS is configured properly to your liking.
Receive Flow Steering (RFS)
Receive flow steering (RFS) is used in conjunction with RPS. RPS attempts to distribute incoming packet load amongst multiple CPUs, but does not take into account any data locality issues for maximizing CPU cache hit rates. You can use RFS to help increase cache hit rates by directing packets for the same flow to the same CPU for processing.
Tuning: Enabling RFS
For RFS to work, you must have RPS enabled and configured.
RFS keeps track of a global hash table of all flows and the size of this hash table can be adjusted by setting the
Increase the size of the RFS socket flow hash by setting a
$ sudo sysctl -w net.core.rps_sock_flow_entries=32768
Next, you can also set the number of flows per RX queue by writing this value to the sysfs file named
rps_flow_cnt for each RX queue.
Example: increase the number of flows for RX queue 0 on eth0 to 2048.
$ sudo bash -c 'echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt'
Hardware accelerated Receive Flow Steering (aRFS)
RFS can be sped up with the use of hardware acceleration; the NIC and the kernel can work together to determine which flows should be processed on which CPUs. To use this feature, it must be supported by the NIC and your driver.
Consult your NIC’s data sheet to determine if this feature is supported. If your NIC’s driver exposes a function called
ndo_rx_flow_steer, then the driver has support for accelerated RFS.
Tuning: Enabling accelerated RFS (aRFS)
Assuming that your NIC and driver support it, you can enable accelerated RFS by enabling and configuring a set of things:
- Have RPS enabled and configured.
- Have RFS enabled and configured.
- Your kernel has
CONFIG_RFS_ACCELenabled at compile time. The Ubuntu kernel 3.13.0 does.
- Have ntuple support enabled for the device, as described previously. You can use
ethtoolto verify that ntuple support is enabled for the device.
- Configure your IRQ settings to ensure each RX queue is handled by one of your desired network processing CPUs.
Once the above is configured, accelerated RFS will be used to automatically move data to the RX queue tied to a CPU core that is processing data for that flow and you won’t need to specify an ntuple filter rule manually for each flow.
Moving up the network stack with
Picking up where we left off with
netif_receive_skb, which is called from a few places. The two most common (and also the two we’ve already looked at):
napi_skb_finishif the packet is not going to be merged to an existing GRO’d flow, OR
napi_gro_completeif the protocol layers indicated that it’s time to flush the flow, OR
netif_receive_skb and its descendants are operating in the context of a the softirq processing loop and you’ll see the time spent here accounted for as
si with tools like
netif_receive_skb begins by first checking a
sysctl value to determine if the user has requested receive timestamping before or after a packet hits the backlog queue. If this setting is enabled, the data is timestamped now, prior to it hitting RPS (and the CPU’s associated backlog queue). If this setting is disabled, it will be timestamped after it hits the queue. This can be used to distribute the load of timestamping amongst multiple CPUs if RPS is enabled, but will introduce some delay as a result.
Tuning: RX packet timestamping
You can tune when packets will be timestamped after they are received by adjusting a sysctl named
Disable timestamping for RX packets by adjusting a
$ sudo sysctl -w net.core.netdev_tstamp_prequeue=0
The default value is 1. Please see the previous section for an explanation as to what this setting means, exactly.
After the timestamping is dealt with,
netif_receive_skb operates differently depending on whether or not RPS is enabled. Let’s start with the simpler path first: RPS disabled.
Without RPS (default setting)
If RPS is not enabled,
__netif_receive_skb is called which does some bookkeeping and then calls
__netif_receive_skb_core to move data closer to the protocol stacks.
We’ll see precisely how
__netif_receive_skb_core works, but first let’s see how the RPS enabled code path works, as that code will also call
With RPS enabled
If RPS is enabled, after the timestamping options mentioned above are dealt with,
netif_receive_skb will perform some computations to determine which CPU’s backlog queue should be used. This is done by using the function
get_rps_cpu. From net/core/dev.c:
get_rps_cpu will take into account RFS and aRFS settings as described above to ensure the the data gets queued to the desired CPU’s backlog with a call to
This function begins by getting a pointer to the remote CPU’s
softnet_data structure, which contains a pointer to the
input_pkt_queue. Next, the queue length of the
input_pkt_queue of the remote CPU is checked. From net/core/dev.c:
The length of
input_pkt_queue is first compared to
netdev_max_backlog. If the queue is longer than this value, the data is dropped. Similarly, the flow limit is checked and if it has been exceeded, the data is dropped. In both cases the drop count on the
softnet_data structure is incremented. Note that this is the
softnet_data structure of the CPU the data was going to be queued to. Read the section above about
/proc/net/softnet_stat to learn how to get the drop count for monitoring purposes.
enqueue_to_backlog is not called in many places. It is called for RPS-enabled packet processing and also from
netif_rx. Most drivers should not be using
netif_rx and should instead be using
netif_receive_skb. If you are not using RPS and your driver is not using
netif_rx, increasing the backlog won’t produce any noticeable effect on your system as it is not used.
Note: You need to check the driver you are using. If it calls
netif_receive_skb and you are not using RPS, increasing the
netdev_max_backlog will not yield any performance improvement because no data will ever make it to the
Assuming that the
input_pkt_queue is small enough and the flow limit (more about this, next) hasn’t been reached (or is disabled), the data can be queued. The logic here is a bit funny, but can be summarized as:
- If the queue is empty: check if NAPI has been started on the remote CPU. If not, check if an IPI is queued to be sent. If not, queue one and start the NAPI processing loop by calling
____napi_schedule. Proceed to queuing the data.
- If the queue is not empty, or the previously described operation has completed, enqueue the data.
The code is a bit tricky with its use of
goto, so read it carefully. From net/core/dev.c:
RPS distributes packet processing load amongst multiple CPUs, but a single large flow can monopolize CPU processing time and starve smaller flows. Flow limits are a feature that can be used to limit the number of packets queued to the backlog for each flow to a certain amount. This can help ensure that smaller flows are processed even though much larger flows are pushing packets in.
The if statement above from net/core/dev.c checks the flow limit with a call to
This code is checking that there is still room in the queue and that the flow limit has not been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS’ bitmap).
Monitoring: Monitor drops due to full
input_pkt_queue or flow limit
See the section above about monitoring
dropped field is a counter that gets incremented each time data is dropped instead of queued to a CPU’s
netdev_max_backlog to prevent drops
Before adjusting this tuning value, see the note in the previous section.
You can help prevent drops in
enqueue_to_backlog by increasing the
netdev_max_backlog if you are using RPS or if your driver calls
Example: increase backlog to 3000 with
$ sudo sysctl -w net.core.netdev_max_backlog=3000
The default value is 1000.
Tuning: Adjust the NAPI weight of the backlog
You can adjust the weight of the backlog’s NAPI poller by setting the
net.core.dev_weight sysctl. Adjusting this value determines how much of the overall budget the backlog
poll loop can consume (see the section above about adjusting
Example: increase the NAPI
poll backlog processing loop with
$ sudo sysctl -w net.core.dev_weight=600
The default value is 64.
Remember, backlog processing runs in the softirq context similar to the device driver’s registered
poll function and will be limited by the overall
budget and a time limit, as described in previous sections.
Tuning: Enabling flow limits and tuning flow limit hash table size
Set the size of the flow limit table with a
$ sudo sysctl -w net.core.flow_limit_table_len=8192
The default value is 4096.
This change only affects newly allocated flow hash tables. So, if you’d like to increase the table size, you should do it before you enable flow limits.
To enable flow limits you should specify a bitmask in
/proc/sys/net/core/flow_limit_cpu_bitmap similar to the RPS bitmask which indicates which CPUs have flow limits enabled.
backlog queue NAPI poller
The per-CPU backlog queue plugs into NAPI the same way a device driver does. A
poll function is provided that is used to process packets from the softirq context. A
weight is also provided, just as a device driver would.
This NAPI struct is provided during initialization of the networking system. From
The backlog NAPI structure differs from the device driver NAPI structure in that the
weight parameter is adjustable, where as drivers hardcode their NAPI weight to 64. We’ll see in the tuning section below how to adjust the weight using a
process_backlog function is a loop which runs until its weight (as described in the previous section) has been consumed or no more data remains on the backlog.
Each piece of data on the backlog queue is removed from the backlog queue and passed on to
__netif_receive_skb. The code path once the data hits
__netif_receive_skb is the same as explained above for the RPS disabled case. Namely,
__netif_receive_skb does some bookkeeping prior to calling
__netif_receive_skb_core to pass network data up to the protocol layers.
process_backlog follows the same contract with NAPI that device drivers do, which is: NAPI is disabled if the total weight will not be used. The poller is restarted with the call to
enqueue_to_backlog as described above.
The function returns the amount of work done, which
net_rx_action (described above) will subtract from the budget (which is adjusted with the
net.core.netdev_budget, as described above).
__netif_receive_skb_core delivers data to packet taps and protocol layers
__netif_receive_skb_core performs the heavy lifting of delivering the data to protocol stacks. Before it does this, it checks if any packet taps have been installed which are catching all incoming packets. One example of something that does this is the
AF_PACKET address family, typically used via the libpcap library.
If such a tap exists, the data is delivered there first then to the protocol layers next.
Packet tap delivery
If a packet tap is installed (usually via libpcap), the packet is delivered there with the following code from net/core/dev.c:
If you are curious about how the path of the data through pcap, read net/packet/af_packet.c.
Protocol layer delivery
Once the taps have been satisfied,
__netif_receive_skb_core delivers data to protocol layers. It does this by obtaining the protocol field from the data and iterating across a list of deliver functions registered for that protocol type.
This can be seen in
__netif_receive_skb_core in net/core/dev.c:
ptype_base identifier above is defined as a hash table of lists in net/core/dev.c:
Each protocol layer adds a filter to a list at a given slot in the hash table, computed with a helper function called
Adding a filter to the list is accomplished with a call to
dev_add_pack. That is how protocol layers register themselves for network data delivery for their protocol type.
And now you know how network data gets from the NIC to the protocol layer.
Protocol layer registration
Now that we know how data is delivered to the protocol stacks from the network device subsystem, let’s see how a protocol layer registers itself.
This blog post is going to examine the IP protocol stack as it is a commonly used protocol and will be relevant to most readers.
IP protocol layer
The IP protocol layer plugs itself into the
ptype_base hash table so that data will be delivered to it from the network device layer described in previous sections.
This happens in the function
inet_init from net/ipv4/af_inet.c:
This registers the IP packet type structure defined at net/ipv4/af_inet.c:
deliver_skb (as seen in the above section), which calls
func (in this case,
ip_rcv function is pretty straight-forward at a high level. There are several integrity checks to ensure the data is valid. Statistics counters are bumped, as well.
ip_rcv ends by passing the packet to
ip_rcv_finish by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on.
We can see the code which hands the data over to netfilter at the end of
ip_rcv in net/ipv4/ip_input.c:
netfilter and iptables
In the interest of brevity (and my RSI), I’ve decided to skip my deep dive into netfilter, iptables, and conntrack.
The short version is that
NF_HOOK_THRESH will check if any filters are installed and attempt to return execution back to the IP protocol layer to avoid going deeper into netfilter and anything that hooks in below that like iptables and conntrack.
Keep in mind: if you have numerous or very complex netfilter or iptables rules, those rules will be executed in the softirq context and can lead to latency in your network stack. This may be unavoidable, though, if you need to have a particular set of rules installed.
Once net filter has had a chance to take a look at the data and decide what to do with it,
ip_rcv_finish is called. This only happens if the data is not being dropped by netfilter, of course.
ip_rcv_finish begins with an optimization. In order to deliver the packet to proper place, a
dst_entry from the routing system needs to be in place. In order to obtain one, the code initially attempts to call the
early_demux function from the higher level protocol that this data is destined for.
early_demux routine is an optimization which attempts to find the
dst_entry needed to deliver the packet by checking if a
dst_entry is cached on the socket structure.
Here’s what that looks like from net/ipv4/ip_input.c:
As you can see above, this code is guarded by a sysctl
sysctl_ip_early_demux. By default
early_demux is enabled. The next section includes information about how to disable it and why you might want to.
If the optimization is enabled and there is no cached entry (because this is the first packet arriving), the packet will be handed off to the routing system in the kernel where the
dst_entry will be computed and assigned.
Once the routing layer completes, statistics counters are updated and the function ends by calling
dst_input(skb) which in turn calls the input function pointer on the packet’s
dst_entry structure that was affixed by the routing system.
If the packet’s final destination is the local system, the routing system will attach the function
ip_local_deliver to the input function pointer in the
dst_entry structure on the packet.
Tuning: adjusting IP protocol early demux
early_demux optimization by setting a
$ sudo sysctl -w net.ipv4.ip_early_demux=0
The default value is 1;
early_demux is enabled.
This sysctl was added as some users saw a ~5% decrease in throughput with the
early_demux optimization in some cases.
Recall how we saw the following pattern in the IP protocol layer:
- Calls to
ip_rcvdo some initial bookkeeping.
- Packet is handed off to netfilter for processing, with a pointer to a callback to be executed when processing finishes.
ip_rcv_finishis the callback which finished processing and continued working toward pushing the packet up the networking stack.
ip_local_deliver has the same pattern. From net/ipv4/ip_input.c:
Once netfilter has had a chance to take a look at the data,
ip_local_deliver_finish will be called, assuming the data is not dropped first by netfilter.
ip_local_deliver_finish obtains the protocol from the packet, looks up a
net_protocol structure registered for that protocol, and calls the function pointed to by
handler in the
This hands the packet up to the higher level protocol layer.
Monitoring: IP protocol layer statistics
Monitor detailed IP protocol statistics by reading
$ cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 12987882 51 1 10129840 2196520 1 0 0 0 ...
This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space separate names for each of the corresponding values in the next line.
In the IP protocol layer, you will find statistics counters being bumped. Those counters are referenced by a C enum. All of the valid enum values and the field names they correspond to in
/proc/net/snmp can be found in include/uapi/linux/snmp.h:
Monitor extended IP protocol statistics by reading
$ cat /proc/net/netstat | grep IpExt IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT0Pktsu InCEPkts IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0 0 0 0 0
The format is similar to
/proc/net/snmp, except the lines are prefixed with
Some interesting statistics:
InReceives: The total number of IP packets that reached
ip_rcvbefore any data integrity checks.
InHdrErrors: Total number of IP packets with corrupted headers. The header was too short, too long, non-existent, had the wrong IP protocol version number, etc.
InAddrErrors: Total number of IP packets where the host was unreachable.
ForwDatagrams: Total number of IP packets that have been forwarded.
InUnknownProtos: Total number of IP packets with unknown or unsupported protocol specified in the header.
InDiscards: Total number of IP packets discarded due to memory allocation failure or checksum failure when packets are trimmed.
InDelivers: Total number of IP packets successfully delivered to higher protocol layers. Keep in mind that those protocol layers may drop data even if the IP layer does not.
InCsumErrors: Total number of IP Packets with checksum errors.
Note that each of these is incremented in really specific locations in the IP layer. Code gets moved around from time to time and double counting errors or other accounting bugs can sneak in. If these statistics are important to you, you are strongly encouraged to read the IP protocol layer source code for the metrics that are important to you so you understand when they are (and are not) being incremented.
Higher level protocol registration
This blog post will examine UDP, but the TCP protocol handler is registered the same way and at the same time as the UDP protocol handler.
net/ipv4/af_inet.c, the structure definitions which contain the handler functions for connecting the UDP, TCP , and ICMP protocols to the IP protocol layer can be found. From net/ipv4/af_inet.c:
These structures are registered in the initialization code of the inet address family. From net/ipv4/af_inet.c:
We’re going to be looking at the UDP protocol layer. As seen above, the
handler function for UDP is called
This is the entry point into the UDP layer where the IP layer hands data. Let’s continue our journey there.
UDP protocol layer
The code for the UDP protocol layer can be found in: net/ipv4/udp.c.
The code for the
udp_rcv function is just a single line which calls directly into
__udp4_lib_rcv to handle receiving the datagram.
__udp4_lib_rcv function will check to ensure the packet is valid and obtain the UDP header, UDP datagram length, source address, and destination address. Next, are some additional integrity checks and checksum verification.
Recall that earlier in the IP protocol layer, we saw that an optimization is performed to attach a
dst_entry to the packet before it is handed off to the upper layer protocol (UDP in our case).
If a socket and corresponding
dst_entry were found,
__udp4_lib_rcv will queue the packet to the socket:
If there is no socket attached from the early_demux operation, a receiving socket will now be looked up by calling
In both cases described above, the datagram will be queued to the socket:
If no socket was found, the datagram will be dropped:
The initial parts of this function are as follows:
- Determine if the socket associated with the datagram is an encapsulation socket. If so, pass the packet up to that layer’s handler function before proceeding.
- Determine if the datagram is a UDP-Lite datagram and do some integrity checks.
- Verify the UDP checksum of the datagram and drop it if the checksum fails.
Finally, we arrive at the receive queue logic which begins by checking if the receive queue for the socket is full. From
sk_rcvqueues_full function checks the socket’s backlog length and the socket’s
sk_rmem_alloc to determine if the sum is greater than the
sk_rcvbuf for the socket (
sk->sk_rcvbuf in the above code snippet):
Tuning these values is a bit tricky as there are many things that can be adjusted.
Tuning: Socket receive queue memory
sk->sk_rcvbuf (called limit in
sk_rcvqueues_full above) value can be increased to whatever the sysctl
net.core.rmem_max is set to.
Increase the maximum receive buffer size by setting a
$ sudo sysctl -w net.core.rmem_max=8388608
sk->sk_rcvbuf starts at the
net.core.rmem_default value, which can also be adjusted by setting a sysctl, like so:
Adjust the default initial receive buffer size by setting a
$ sudo sysctl -w net.core.rmem_default=8388608
You can also set the
sk->sk_rcvbuf size by calling
setsockopt from your application and passing
SO_RCVBUF. The maximum you can set with
However, you can override the
net.core.rmem_max limit by calling
setsockopt and passing
SO_RCVBUFFORCE, but the user running the application will need the
sk->sk_rmem_alloc value is incremented by calls to
skb_set_owner_r which set the owner socket of a datagram. We’ll see this called later in the UDP layer.
sk->sk_backlog.len is incremented by calls to
sk_add_backlog, which we’ll see next.
Once it’s been verified that the queue is not full, progress toward queuing the datagram can continue. From net/ipv4/udp.c:
The first step is determine if the socket currently has any system calls against it from a userland program. If it does not, the datagram can be added to the receive queue with a call to
__udp_queue_rcv_skb. If it does, the datagram is queued to the backlog with a call to
The datagrams on the backlog are added to the receive queue when socket system calls release the socket with a call to
release_sock in the kernel.
__udp_queue_rcv_skb function adds datagrams to the receive queue by calling
sock_queue_rcv_skb and bumps statistics counters if the datagram could not be added to the receive queue for the socket.
Monitoring: UDP protocol layer statistics
Two very useful files for getting UDP protocol statistics are:
Monitor detailed UDP protocol statistics by reading
$ cat /proc/net/snmp | grep Udp\: Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors Udp: 16314 0 0 17161 0 0
Much like the detailed statistics found in this file for the IP protocol, you will need to read the protocol layer source to determine exactly when and where these values are incremented.
InDatagrams: Incremented when
recvmsgwas used by a userland program to read datagram. Also incremented when a UDP packet is encapsulated and sent back for processing.
NoPorts: Incremented when UDP packets arrive destined for a port where no program is listening.
InErrors: Incremented in several cases: no memory in the receive queue, when a bad checksum is seen, and if
sk_add_backlogfails to add the datagram.
OutDatagrams: Incremented when a UDP packet is handed down without error to the IP protocol layer to be sent.
RcvbufErrors: Incremented when
sock_queue_rcv_skbreports that no memory is available; this happens if
sk->sk_rmem_allocis greater than or equal to
SndbufErrors: Incremented if the IP protocol layer reported an error when trying to send the packet and no error queue has been setup. Also incremented if no send queue space or kernel memory are available.
InCsumErrors: Incremented when a UDP checksum failure is detected. Note that in all cases I could find,
InCsumErrorsis incrememnted at the same time as
InCsumErrosshould yield the count of memory related errors on the receive side.
Monitor UDP socket statistics by reading
$ cat /proc/net/udp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000 00000000 104 0 7518 2 0000000000000000 0 558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7408 2 0000000000000000 0 588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7511 2 0000000000000000 0 769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7673 2 0000000000000000 0 812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7407 2 0000000000000000 0
The first line describes each of the fields in the lines following:
sl: Kernel hash slot for the socket
local_address: Hexadecimal local address of the socket and port number, separated by
rem_address: Hexadecimal remote address of the socket and port number, separated by
st: The state of the socket. Oddly enough, the UDP protocol layer seems to use some TCP socket states. In the example above,
tx_queue: The amount of memory allocated in the kernel for outgoing UDP datagrams.
rx_queue: The amount of memory allocated in the kernel for incoming UDP datagrams.
retrnsmt: These fields are unused by the UDP protocol layer.
uid: The effective user id of the user who created this socket.
timeout: Unused by the UDP protocol layer.
inode: The inode number corresponding to this socket. You can use this to help you determine which user process has this socket open. Check
/proc/[pid]/fd, which will contain symlinks to
ref: The current reference count for the socket.
pointer: The memory address in the kernel of the
drops: The number of datagram drops associated with this socket. Note that this does not include any drops related to sending datagrams (on corked UDP sockets or otherwise); this is only incremented in receive paths as of the kernel version examined by this blog post.
The code which outputs this can be found in
Queuing data to a socket
Network data is queued to a socket with a call to
sock_queue_rcv. This function does a few things before adding the datagram to the queue:
- The socket’s allocated memory is checked to determine if it has exceeded the receive buffer size. If so, the drop count for the socket is incremented.
sk_filteris used to process any Berkeley Packet Filter filters that have been applied to the socket.
sk_rmem_scheduleis run to ensure sufficient receive buffer space exists to accept this datagram.
- Next the size of the datagram is charged to the socket with a call to
skb_set_owner_r. This increments
- The data is added to the queue with a call to
- Finally, any processes waiting on data to arrive in the socket are notified with a call to the
sk_data_readynotification handler function.
And that is how data arrives at a system and traverses the network stack until it reaches a socket and is ready to be read by a user program.
There are a few extra things worth mentioning that are worth mentioning which didn’t seem quite right anywhere else.
As mentioned in the above blog post, the networking stack can collect timestamps of incoming data. There are sysctl values controlling when/how to collect timestamps when used in conjunction with RPS; see the above post for more information on RPS, timestamping, and where, exactly, in the network stack receive timestamping happens. Some NICs even support timestamping in hardware, too.
This is a useful feature if you’d like to try to determine how much latency the kernel network stack is adding to receiving packets.
Determine which timestamp modes your driver and device support with
$ sudo ethtool -T eth0 Time stamping parameters for eth0: Capabilities: software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE) software-receive (SOF_TIMESTAMPING_RX_SOFTWARE) software-system-clock (SOF_TIMESTAMPING_SOFTWARE) PTP Hardware Clock: none Hardware Transmit Timestamp Modes: none Hardware Receive Filter Modes: none
This NIC, unfortunately, does not support hardware receive timestamping, but software timestamping can still be used on this system to help me determine how much latency the kernel is adding to my packet receive path.
Busy polling for low latency sockets
It is possible to use a socket option called
SO_BUSY_POLL which will cause the kernel to busy poll for new data when a blocking receive is done and there is no data.
IMPORTANT NOTE: For this option to work, your device driver must support it. Linux kernel 3.13.0’s
igb driver does not support this option. The
ixgbe driver, however, does. If your driver has a function set to the
ndo_busy_poll field of its
struct net_device_ops structure (mentioned in the above blog post), it supports
A great paper explaining how this works and how to use it is available from Intel.
When using this socket option for a single socket, you should pass a time value in microseconds as the amount of time to busy poll in the device driver’s receive queue for new data. When you issue a blocking read to this socket after setting this value, the kernel will busy poll for new data.
You can also set the sysctl value
net.core.busy_poll to a time value in microseconds of how long calls with
select should busy poll waiting for new data to arrive, as well.
This option can reduce latency, but will increase CPU usage and power consumption.
Netpoll: support for networking in critical contexts
The Linux kernel provides a way for device drivers to be used to send and receive data on a NIC when the kernel has crashed. The API for this is called Netpoll and it is used by a few things, but most notably: kgdb, netconsole.
Most drivers support Netpoll; your driver needs to implement the
ndo_poll_controller function and attach it to the
struct net_device_ops that is registered during probe (as seen above).
When the networking device subsystem performs operations on incoming or outgoing data, the netpoll system is checked first to determine if the packet is destined for netpoll.
For example, we can see the following code in
The Netpoll checks happen early in most of the Linux network device subsystem code that deals with transmitting or receiving network data.
Consumers of the Netpoll API can register
struct netpoll structures by calling
struct netpoll structure has function pointers for attaching receive hooks, and the API exports a function for sending data.
SO_INCOMING_CPU flag was not added until Linux 3.19, but it is useful enough that it should be included in this blog post.
You can use
getsockopt with the
SO_INCOMING_CPU option to determine which CPU is processing network packets for a particular socket. Your application can then use this information to hand sockets off to threads running on the desired CPU to help increase data locality and CPU cache hits.
The mailing list message introducing
SO_INCOMING_CPU provides a short example architecture where this option is useful.
A DMA engine is a piece of hardware that allows the CPU to offload large copy operations. This frees the CPU to do other tasks while memory copies are done with hardware. Enabling the use of a DMA engine and running code that takes advantage of it, should yield reduced CPU usage.
The Linux kernel has a generic DMA engine interface that DMA engine driver authors can plug into. You can read more about the Linux DMA engine interface in the kernel source Documentation.
While there are a few DMA engines that the kernel supports, we’re going to discuss one in particular that is quite common: the Intel IOAT DMA engine.
Intel’s I/O Acceleration Technology (IOAT)
Many servers include the Intel I/O AT bundle, which is comprised of a series of performance changes.
One of those changes is the inclusion of a hardware DMA engine. You can check your
dmesg output for
ioatdma to determine if the module is being loaded and if it has found supported hardware.
The DMA offload engine is used in a few places, most notably in the TCP stack.
Support for the Intel IOAT DMA engine was included in Linux 2.6.18, but was disabled later in 18.104.22.168 due to some unfortunate data corruption bugs.
Users on kernels before 22.214.171.124 may be using the
ioatdma module on their servers by default. Perhaps this will be fixed in future kernel releases.
Direct cache access (DCA)
Another interesting feature included with the Intel I/O AT bundle is Direct Cache Access (DCA).
This feature allows network devices (via their drivers) to place network data directly in the CPU cache. How this works, exactly, is driver specific. For the
igb driver, you can check the code for the function
igb_update_dca, as well as the code for
igb driver uses DCA by writing a register value to the NIC.
To use DCA, you will need to ensure that DCA is enabled in your BIOS, the
dca module is loaded, and that your network card and driver both support DCA.
Monitoring IOAT DMA engine
If you are using the
ioatdma module despite the risk of data corruption mentioned above, you can monitor it by examining some entries in
Monitor the total number of offloaded
memcpy operations for a DMA channel.
$ cat /sys/class/dma/dma0chan0/memcpy_count 123205655
Similarly, to get the number of bytes offloaded by this DMA channel, you’d run a command like:
Monitor total number of bytes transferred for a DMA channel.
$ cat /sys/class/dma/dma0chan0/bytes_transferred 131791916307
Tuning IOAT DMA engine
The IOAT DMA engine is only used when packet size is above a certain threshold. That threshold is called the
copybreak. This check is in place because for small copies, the overhead of setting up and using the DMA engine is not worth the accelerated transfer.
Adjust the DMA engine copybreak with a
$ sudo sysctl -w net.ipv4.tcp_dma_copybreak=2048
The default value is 4096.
The Linux networking stack is complicated.
It is impossible to monitor or tune it (or any other complex piece of software) without understanding at a deep level exactly what’s going on. Often, out in the wild of the Internet, you may stumble across a sample
sysctl.conf that contains a set of sysctl values that should be copied and pasted on to your computer. This is probably not the best way to optimize your networking stack.
Monitoring the networking stack requires careful accounting of network data at every layer. Starting with the drivers and proceeding up. That way you can determine where exactly drops and errors are occurring and then adjust settings to determine how to reduce the errors you are seeing.
There is, unfortunately, no easy way out.
Help with Linux networking or other systems
Need some extra help navigating the network stack? Have questions about anything in this post or related things not covered? Send us an email and let us know how we can help.
If you enjoyed this post, you may enjoy some of our other low-level technical posts: