Device Polling support for FreeBSD

This page contains material related to some recent (Feb. 2002) work I have done to implement polling device driver support in FreeBSD. This work has been done in cooperation with ICSI in the context of the Xorp project, and in some respects can still be considered work in progress. The code is now in the FreeBSD CVS repository, both for -current and -stable, so you should look there for the most up-to-date version. I may provide patches on this page, but they could be outdated. The column on the left has links to relevant pieces of information, software, references and papers. The rest of the page contains a discussion of what polling is, installation instructions, performance data, etc.
Overview

"Device polling" (polling for brevity) refers to a technique for handling devices that does not rely on them generating interrupts when they need attention, but instead lets the CPU poll devices to service their needs. This might seem inefficient and counterintuitive, but when done properly, polling gives the operating system more control over when and how to handle devices, with a number of advantages in terms of system responsiveness and performance, as described below. There are two main functionalities implemented by this code: the polling mechanism itself, which replaces interrupts for the network devices that support it, and a scheduling mechanism that controls how the CPU is shared between device processing and other tasks.
Install instructions
Polling support must be compiled into the kernel with options DEVICE_POLLING; it is also advisable to raise the clock interrupt frequency (e.g. options HZ=1000) so that devices are polled more often. Once the kernel supports polling, you can turn it on at runtime with

sysctl kern.polling.enable=1

and turn it off by setting the variable to 0. There are a number of sysctl variables which control polling operation; you can see them with sysctl kern.polling, but the only one you should normally worry about is the one that controls how the CPU is shared between the kernel and user level:

sysctl kern.polling.user_frac=50

A value of 50 (the value can vary between 1 and 99) shares the CPU evenly between kernel and user-level processing, whereas increasing or decreasing the value changes the percentage of CPU assigned to each of the two. Of course, if there is not enough work to be done in one of the two, the excess cycles are made available to the other one.

Principles of operation - Polling

In the normal, interrupt-based mode, devices generate an interrupt whenever they need attention. In turn, an interrupt handler is run by the system, which acknowledges the interrupt and takes care of performing whatever processing is needed. For a network card, this generally means processing received packets, either to completion (e.g. forwarding them to the output interface in the case of a router) or partially, and then putting them into some queue for further processing.

Handling an interrupt also implies some overhead, because the CPU has to save and restore its state, switch context, and potentially pollute or flush caches around the invocation of the interrupt handler. This can be particularly expensive in high-traffic situations, when the CPU is frequently interrupted (with all the associated overhead) to process just one packet at a time. It is almost compulsory, in interrupt mode, that all events which generated the interrupt are handled before terminating the handler, because a second interrupt request will not be generated for them. Unfortunately this means that the amount of time spent in the driver is neither predictable nor, in some cases, upper bounded. This has several effects: almost invariably, under heavy load, an interrupt-based system becomes uncontrollable, because most of the CPU time is spent handling interrupts with few or no cycles left for userland tasks. Additionally, if received packets are not processed to completion, the system may not even perform any useful work, because it never finds the time to complete their processing outside the interrupt handlers.

Polling mode works by having the system periodically look at devices to see if they require attention, invoking the handler accordingly. With polling, the context-switch overhead is mostly avoided, because the system can choose to look at devices when it is already in the right context (e.g. handling a timer interrupt, or a trap). Also, unlike interrupts, with polling it is not necessary to process all events at once, because the system will eventually look at the device again. This means that the time spent handling devices can be kept under careful control by the operating system. So the use of polling gives us reduced context-switch overhead, and a chance to keep control over the responsiveness of the system by limiting the amount of time spent in device handling. The drawback is that there might be an increased latency in processing events, because polling occurs when the system likes, rather than when the device wants (e.g. when a new packet arrives). This latency, however, can be reduced to reasonably low values by raising the clock interrupt frequency.
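For completeness, the same knobs shown in the install instructions above can also be set from a small userland program. The sketch below is illustrative only (it simply wraps sysctlbyname(3) around the variables discussed on this page); it is not part of the polling code itself:

/*
 * Illustrative only: enable polling and set the CPU share from C,
 * equivalent to the sysctl commands shown above.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	int enable = 1;		/* kern.polling.enable=1 */
	int user_frac = 50;	/* kern.polling.user_frac=50 */
	int burst;
	size_t len = sizeof(burst);

	if (sysctlbyname("kern.polling.enable", NULL, NULL,
	    &enable, sizeof(enable)) == -1)
		err(1, "kern.polling.enable");
	if (sysctlbyname("kern.polling.user_frac", NULL, NULL,
	    &user_frac, sizeof(user_frac)) == -1)
		err(1, "kern.polling.user_frac");

	/* Read back the burst size currently computed by the kernel. */
	if (sysctlbyname("kern.polling.burst", &burst, &len, NULL, 0) == 0)
		printf("current kern.polling.burst: %d\n", burst);
	return (0);
}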
How does it work in practice

In the FreeBSD implementation, polling works as follows. Device drivers which support polling supply a new method, *_poll(). This method does most of the work that an interrupt handler does, but the amount of work (e.g. the number of packets processed at each invocation) is upper bounded by one of the arguments passed to the method, the "burst" parameter. All devices normally operate in interrupt mode until polling is globally enabled on the system by setting the sysctl variable kern.polling.enable=1. At this point, when the first interrupt for a device fires, the driver itself registers as a polling driver with the system, turns off interrupts on the device, and (optionally) invokes the *_poll() method for the device itself. Drivers registered as polling are then periodically scanned by the polling code to check whether they need attention (by invoking their *_poll() method). The scan is done at each clock tick, and optionally within the idle loop and when entering traps (the latter two cases are enabled by the sysctl variables kern.polling.idle_poll and kern.polling.poll_in_trap). The latency for processing device events is thus upper bounded by one clock tick (as long as the system is not overloaded), and when the system is partially idle, or is doing frequent I/O, this latency is largely reduced.

Principles of operation - Scheduling

One of the key advantages of polling is that it gives us a chance to control the amount of CPU spent on device processing. This is done by adapting the "burst" parameter according to available system resources and user preferences. The polling code monitors the amount of time spent in each clock tick processing device events, and divides it by the duration of the tick. If the result is above the percentage pre-set by the user, the burst parameter is reduced for the next tick; otherwise it is increased (or left unchanged, in the rare case that the workload exactly matches the pre-set fraction). This adjustment, which occurs at every tick, tries to make the device-processing time match the user-specified fraction. Of course, if there is not enough device processing to be performed, all remaining CPU cycles are available for user processing. Conversely, if there is not enough user-level processing to be performed, all remaining CPU cycles are available for device processing (done in the idle loop, which is not accounted for because these cycles would not be used otherwise). Reserving some CPU cycles for non-device processing helps reduce the chance of livelock, but does not completely prevent it if packets are not processed to completion (as is often the case in the network stack). Also, there are some fairness issues when a system has traffic coming in from multiple interfaces, because the "burst" parameter can become very large, and the burst of packets grabbed at once from a single interface might easily congest intermediate queues and prevent other traffic from being processed. All of this is handled by accounting the time spent in network software interrupts as "device handling", and by scheduling software interrupts as described in the next section.
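As a rough illustration of the feedback loop just described, the following sketch shows the kind of per-tick adjustment involved. It is a simplified, hypothetical rendition (the variable and function names are made up for the example), not the actual kernel code:

/*
 * Hypothetical sketch of the per-tick burst adaptation described above.
 * Names are illustrative; the real kernel code differs in the details.
 */
static int poll_burst = 5;		/* cf. kern.polling.burst */
static int poll_burst_max = 150;	/* upper bound on the burst */
static int user_frac = 50;		/* cf. kern.polling.user_frac, percent of
					 * CPU reserved to user-level processing */

static void
adjust_burst(int cycles_spent_polling, int cycles_per_tick)
{
	/* Fraction of the last tick consumed by device processing. */
	int kern_frac = 100 * cycles_spent_polling / cycles_per_tick;

	if (kern_frac > 100 - user_frac) {
		if (poll_burst > 1)
			poll_burst--;	/* over budget: shrink the next burst */
	} else if (kern_frac < 100 - user_frac) {
		if (poll_burst < poll_burst_max)
			poll_burst++;	/* under budget: grow the next burst */
	}
	/* otherwise the workload matches the target: leave it unchanged */
}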
And in practice again...

Deferred processing of network events in BSD is accomplished by using network software interrupts (or NETISRs), which are processed with higher priority than user processes, but (generally) lower priority than hardware interrupts. There are 32 different NETISRs, numbered 0..31, with 0 being the highest priority and 31 the lowest. In our design, the polling loop is implemented as NETISR #0 (NETISR_POLL), which means it runs before all other NETISRs. An additional NETISR_POLLMORE is defined as #31, which runs after all other NETISRs. The burst parameter passed to the poll methods is computed by splitting the value produced by the control loop (and available as the sysctl variable kern.polling.burst) into chunks of maximum size kern.polling.each_burst, whose value is chosen sufficiently small to avoid overflowing intermediate queues in the system. As a consequence, near the beginning of each clock tick we have the following sequence of NETISRs being run:

[hardclock]...[POLL]...[POLLMORE][POLL]...[POLLMORE][POLL]...[POLLMORE]

We account the time from the first POLL to the last POLLMORE as spent in device processing, so deferred processing is included in the computation. Splitting the big burst into small chunks also helps reduce the chance that one device monopolizes intermediate queues to the detriment of other devices.
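The sketch below is a small user-space mockup (not the kernel code; all names are invented for illustration) of how a per-tick burst might be handed out to the registered *_poll() handlers in chunks of each_burst, mimicking the POLL/POLLMORE sequence above:

/*
 * User-space mockup of the POLL/POLLMORE chunking described above.
 * Everything here is illustrative; it is not the actual kernel code.
 */
#include <stdio.h>

#define MAX_HANDLERS	8

typedef void (*poll_handler_t)(int count);	/* simplified *_poll() shape */

static poll_handler_t handlers[MAX_HANDLERS];
static int nhandlers;

static int poll_burst = 150;	/* per-tick budget, cf. kern.polling.burst */
static int poll_each_burst = 5;	/* chunk size, cf. kern.polling.each_burst */

static void
dummy_poll(int count)
{
	printf("polling a device for at most %d packets\n", count);
}

int
main(void)
{
	int done = 0;

	handlers[nhandlers++] = dummy_poll;	/* a driver "registers" itself */

	/* One clock tick: POLL passes repeat (POLLMORE reschedules them)
	 * until the whole burst has been offered to the drivers. */
	while (done < poll_burst) {
		int chunk = poll_burst - done;
		int i;

		if (chunk > poll_each_burst)
			chunk = poll_each_burst;
		for (i = 0; i < nhandlers; i++)
			handlers[i](chunk);	/* each driver gets at most "chunk" */
		done += chunk;
	}
	return (0);
}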
Summary of controlling variables

There are a few sysctl variables that control the operation of the polling code; they are summarised below. In most cases you will only need to touch kern.polling.enable and kern.polling.user_frac.

kern.polling.enable        set to 1 to enable polling globally, 0 to disable it
kern.polling.user_frac     percentage of CPU reserved to user-level processing (1..99)
kern.polling.burst         burst size computed by the control loop, i.e. the maximum amount of work done per tick
kern.polling.each_burst    maximum size of the chunks into which the burst is split
kern.polling.idle_poll     if set, devices are also polled in the idle loop
kern.polling.poll_in_trap  if set, devices are also polled when entering a trap
Performance data

Improved performance is not the main goal of this work; I consider the improved responsiveness to be by far more important. Nevertheless, polling does give improved performance, whose magnitude depends on the system and the offered load. Here are some data points from our testbed machines, which have a 750MHz Athlon processor and a single 33-MHz PCI bus with a 4-port 21143-based card and four other 100Mbit cards. In all tests, we try to use a full-speed stream of minimum-size ethernet packets (corresponding to 148,000 packets per second, or 148Kpps). Unless otherwise specified, we enabled fastforwarding, which means that packets are processed within the interrupt/polling handler context instead of being queued for later processing. This slightly reduces the load on the system, and tries to reduce the chance of livelock (which, however, can still occur, as shown by the last two tests).
At least with basic forwarding (no ipfw rules), our system is limited by the PCI bus. But the interesting number from these tests is that even when bombarded with almost 600Kpps (full-speed streams on 4 interfaces) the system is still able to forward traffic at its peak performance.

Q & A on device polling