Device Polling support for FreeBSD

This page contains material related to some recent (Feb. 2002) work I have done to implement polling device driver support in FreeBSD. This work has been done in cooperation with ICSI in the context of the Xorp project, and in some respects can still be considered work in progress. The code is now in the FreeBSD CVS repository, both for -current and -stable, so you should look there for the most up-to-date version. I may provide patches on this page, but they could be outdated. The column on the left has links to relevant pieces of information, software, references and papers. The rest of the page contains a discussion of what polling is, installation instructions, performance data, etc.
Overview

"Device polling" (polling for brevity) refers to a technique for handling devices that does not rely on them generating interrupts when they need attention, but instead lets the CPU poll devices to service their needs. This might seem inefficient and counterintuitive, but when done properly, polling gives the operating system more control over when and how to handle devices, with a number of advantages in terms of system responsiveness and performance, as described below. There are two main functionalities implemented by this code: the polling mechanism itself, which replaces interrupts for the network devices that support it, and a scheduling mechanism that controls how the CPU is shared between device processing and other tasks.
Install instructions
Polling support must be compiled into the kernel with options DEVICE_POLLING; it is also advisable to raise the clock interrupt frequency (e.g. options HZ=1000) so that devices are polled more often. Once the kernel supports polling, you can turn it on at runtime with

sysctl kern.polling.enable=1

and turn it off by setting the variable to 0. There are a number of sysctl variables which control polling operation; you can see them with sysctl kern.polling, but the only one you should normally worry about is the one that controls how the CPU is shared between the kernel and user level:

sysctl kern.polling.user_frac=50

A value of 50 (the value can vary between 1 and 99) shares the CPU evenly between kernel and user-level processing, whereas increasing or decreasing the value changes the percentage of CPU assigned to each of the two. Of course, if there is not enough work to be done in one of the two, the excess cycles are made available to the other one.

Principles of operation - Polling

In the normal, interrupt-based mode, devices generate an interrupt whenever they need attention. In turn, an interrupt handler is run by the system, which acknowledges the interrupt and takes care of performing whatever processing is needed. For a network card, this generally means processing received packets, either to completion (e.g. forwarding them to the output interface in the case of a router) or partially, and then putting them into some queue for further processing.

Handling an interrupt also implies some overhead, because the CPU has to save and restore its state, switch context, and potentially pollute or flush caches around the invocation of the interrupt handler. This can be particularly expensive in high-traffic situations, when the CPU is frequently interrupted (with all the associated overhead) to process just one packet at a time. It is almost compulsory, in interrupt mode, that all events which generated the interrupt are handled before terminating the handler, because a second interrupt request will not be generated for them. Unfortunately this means that the amount of time spent in the driver is neither predictable nor, in some cases, upper bounded. This has several effects: almost invariably, under heavy load, an interrupt-based system becomes uncontrollable, because most of the CPU time is spent handling interrupts with few or no cycles left for userland tasks. Additionally, if received packets are not processed to completion, the system may not even perform any useful work, because it never finds the time to complete their processing outside the interrupt handlers.

Polling mode works by having the system periodically look at devices to see if they require attention, invoking the handler accordingly. With polling, the context-switch overhead is mostly avoided, because the system can choose to look at devices when it is already in the right context (e.g. handling a timer interrupt, or a trap). Also, unlike interrupts, with polling it is not necessary to process all events at once, because the system will eventually look at the device again. This means that the time spent handling devices can be kept under careful control by the operating system. So the use of polling gives us reduced context-switch overhead, and a chance to keep control over the responsiveness of the system by limiting the amount of time spent in device handling. The drawback is that there might be an increased latency in processing events, because polling occurs when the system likes, rather than when the device wants (e.g. when a new packet arrives). This latency, however, can be reduced to reasonably low values by raising the clock interrupt frequency.
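For completeness, the same knobs shown in the install instructions above can also be set from a small userland program. The sketch below is illustrative only (it simply wraps sysctlbyname(3) around the variables discussed on this page); it is not part of the polling code itself:

/*
 * Illustrative only: enable polling and set the CPU share from C,
 * equivalent to the sysctl commands shown above.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	int enable = 1;		/* kern.polling.enable=1 */
	int user_frac = 50;	/* kern.polling.user_frac=50 */
	int burst;
	size_t len = sizeof(burst);

	if (sysctlbyname("kern.polling.enable", NULL, NULL,
	    &enable, sizeof(enable)) == -1)
		err(1, "kern.polling.enable");
	if (sysctlbyname("kern.polling.user_frac", NULL, NULL,
	    &user_frac, sizeof(user_frac)) == -1)
		err(1, "kern.polling.user_frac");

	/* Read back the burst size currently computed by the kernel. */
	if (sysctlbyname("kern.polling.burst", &burst, &len, NULL, 0) == 0)
		printf("current kern.polling.burst: %d\n", burst);
	return (0);
}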
How does it work in practice

In the FreeBSD implementation, polling works as follows. Device drivers which support polling supply a new method, *_poll(). This method does most of the work that an interrupt handler does, but the amount of work (e.g. the number of packets processed at each invocation) is upper bounded by one of the arguments passed to the method, the "burst" parameter. All devices normally operate in interrupt mode until polling is globally enabled on the system by setting the sysctl variable kern.polling.enable=1. At this point, when the first interrupt for a device fires, the driver itself registers as a polling driver with the system, turns off interrupts on the device, and (optionally) invokes the *_poll() method for the device itself. Drivers registered as polling are then periodically scanned by the polling code to check whether they need attention (by invoking their *_poll() method). The scan is done at each clock tick, and optionally within the idle loop and when entering traps (the latter two cases are enabled by the sysctl variables kern.polling.idle_poll and kern.polling.poll_in_trap). The latency for processing device events is thus upper bounded by one clock tick (as long as the system is not overloaded), and when the system is partially idle, or is doing frequent I/O, this latency is largely reduced.

Principles of operation - Scheduling

One of the key advantages of polling is that it gives us a chance to control the amount of CPU spent on device processing. This is done by adapting the "burst" parameter according to available system resources and user preferences. The polling code monitors the amount of time spent in each clock tick processing device events, and divides it by the duration of the tick. If the result is above the percentage pre-set by the user, the burst parameter is reduced for the next tick; otherwise it is increased (or left unchanged, in the rare case that the workload exactly matches the pre-set fraction). This adjustment, which occurs at every tick, tries to make the device-processing time match the user-specified fraction. Of course, if there is not enough device processing to be performed, all remaining CPU cycles are available for user processing. Conversely, if there is not enough user-level processing to be performed, all remaining CPU cycles are available for device processing (done in the idle loop, which is not accounted for because these cycles would not be used otherwise). Reserving some CPU cycles for non-device processing helps reduce the chance of livelock, but does not completely prevent it if packets are not processed to completion (as is often the case in the network stack). Also, there are some fairness issues when a system has traffic coming in from multiple interfaces, because the "burst" parameter can become very large, and the burst of packets grabbed at once from a single interface might easily congest intermediate queues and prevent other traffic from being processed. All of this is handled by accounting the time spent in network software interrupts as "device handling", and by scheduling software interrupts as described in the next section.
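As a rough illustration of the feedback loop just described, the following sketch shows the kind of per-tick adjustment involved. It is a simplified, hypothetical rendition (the variable and function names are made up for the example), not the actual kernel code:

/*
 * Hypothetical sketch of the per-tick burst adaptation described above.
 * Names are illustrative; the real kernel code differs in the details.
 */
static int poll_burst = 5;		/* cf. kern.polling.burst */
static int poll_burst_max = 150;	/* upper bound on the burst */
static int user_frac = 50;		/* cf. kern.polling.user_frac, percent of
					 * CPU reserved to user-level processing */

static void
adjust_burst(int cycles_spent_polling, int cycles_per_tick)
{
	/* Fraction of the last tick consumed by device processing. */
	int kern_frac = 100 * cycles_spent_polling / cycles_per_tick;

	if (kern_frac > 100 - user_frac) {
		if (poll_burst > 1)
			poll_burst--;	/* over budget: shrink the next burst */
	} else if (kern_frac < 100 - user_frac) {
		if (poll_burst < poll_burst_max)
			poll_burst++;	/* under budget: grow the next burst */
	}
	/* otherwise the workload matches the target: leave it unchanged */
}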
And in practice again...

Deferred processing of network events in BSD is accomplished by using network software interrupts (or NETISRs), which are processed with higher priority than user processes, but (generally) lower priority than hardware interrupts. There are 32 different NETISRs, numbered 0..31, with 0 being the highest priority and 31 the lowest. In our design, the polling loop is implemented as NETISR #0 (NETISR_POLL), which means it runs before all other NETISRs. An additional NETISR_POLLMORE is defined as #31, which runs after all other NETISRs. The burst parameter passed to the poll methods is computed by splitting the value produced by the control loop (and available as the sysctl variable kern.polling.burst) into chunks of maximum size kern.polling.each_burst, whose value is chosen sufficiently small to avoid overflowing intermediate queues in the system. As a consequence, near the beginning of each clock tick we have the following sequence of NETISRs being run:

[hardclock]...[POLL]...[POLLMORE][POLL]...[POLLMORE][POLL]...[POLLMORE]

We account the time from the first POLL to the last POLLMORE as spent in device processing, so deferred processing is included in the computation. Splitting the big burst into small chunks also helps reduce the chance that one device monopolizes intermediate queues to the detriment of other devices.
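The sketch below is a small user-space mockup (not the kernel code; all names are invented for illustration) of how a per-tick burst might be handed out to the registered *_poll() handlers in chunks of each_burst, mimicking the POLL/POLLMORE sequence above:

/*
 * User-space mockup of the POLL/POLLMORE chunking described above.
 * Everything here is illustrative; it is not the actual kernel code.
 */
#include <stdio.h>

#define MAX_HANDLERS	8

typedef void (*poll_handler_t)(int count);	/* simplified *_poll() shape */

static poll_handler_t handlers[MAX_HANDLERS];
static int nhandlers;

static int poll_burst = 150;	/* per-tick budget, cf. kern.polling.burst */
static int poll_each_burst = 5;	/* chunk size, cf. kern.polling.each_burst */

static void
dummy_poll(int count)
{
	printf("polling a device for at most %d packets\n", count);
}

int
main(void)
{
	int done = 0;

	handlers[nhandlers++] = dummy_poll;	/* a driver "registers" itself */

	/* One clock tick: POLL passes repeat (POLLMORE reschedules them)
	 * until the whole burst has been offered to the drivers. */
	while (done < poll_burst) {
		int chunk = poll_burst - done;
		int i;

		if (chunk > poll_each_burst)
			chunk = poll_each_burst;
		for (i = 0; i < nhandlers; i++)
			handlers[i](chunk);	/* each driver gets at most "chunk" */
		done += chunk;
	}
	return (0);
}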
Summary of controlling variables

There are a few sysctl variables that control the operation of the polling code; they are summarised below. In most cases you will only need to touch kern.polling.enable and kern.polling.user_frac.

kern.polling.enable        set to 1 to enable polling globally, 0 to disable it
kern.polling.user_frac     percentage of CPU reserved to user-level processing (1..99)
kern.polling.burst         burst size computed by the control loop, i.e. the maximum amount of work done per tick
kern.polling.each_burst    maximum size of the chunks into which the burst is split
kern.polling.idle_poll     if set, devices are also polled in the idle loop
kern.polling.poll_in_trap  if set, devices are also polled when entering a trap
Performance data

Improved performance is not the main goal of this work; I consider the improved responsiveness to be by far more important. Nevertheless, polling does give improved performance, whose magnitude depends on the system and the offered load. Here are some data points from our testbed machines, which have a 750MHz Athlon processor and a single 33-MHz PCI bus with a 4-port 21143-based card and four other 100Mbit cards. In all tests, we try to use a full-speed stream of minimum-size ethernet packets (corresponding to 148,000 packets per second, or 148Kpps). Unless otherwise specified, we enabled fastforwarding, which means that packets are processed within the interrupt/polling handler context instead of being queued for later processing. This slightly reduces the load on the system, and tries to reduce the chance of livelock (which, however, can still occur, as shown by the last two tests).
At least with basic forwarding (no ipfw rules), our system is limited by the PCI bus. But the interesting number from these tests is that even when bombarded with almost 600Kpps (full-speed streams on 4 interfaces) the system is still able to forward traffic at its peak performance.

Q & A on device polling