[@background] main style for this presentation
{tr small, Software Packet Scheduling at Hardware Speed}
{br, Luigi Rizzo, Univ. di Pisa}
// @@bg,img/tower_640x480.jpg@@
// XXX put the style in the css for proper printing

PSPAT
----------------------------------------------
{ce ti,PSPAT: Software Packet Scheduling at Hardware Speed}
{author,Luigi Rizzo, Paolo Valente*, Giuseppe Lettieri, Vincenzo Maffione\\
Universita` di Pisa, *Univ. di Modena e Reggio Emilia\\
[[http://info.iet.unipi.it/\~luigi/research.html]]}
{ce,Work supported by H2020 project SSICLOPS\\ }
Paper at [[http://info.iet.unipi.it/\~luigi/papers/20160921-pspat.pdf]]

Problem and use case
-------------
- Network links way too fast
- NICs and operating systems are catching up
- VMs are catching up, too
- Demand for Network Function Virtualization

Cloud operators need robust, fast packet scheduling
- resource management
- isolation and protection

Software packet scheduling is not there yet

Why do we care about software schedulers
------------------
First block in the network path for VMs
- there are legitimate users with high PPS
- need to protect the virtual switch and the rest of the stack

Hardware does not always give perfect isolation
- the bus (PCIe) can be a bottleneck
- scheduling after the bottleneck is ineffective

Scheduling
----------------
Contract between client and resource manager
. {b, "If YOU behave, I'll give you some service guarantee"}

Guarantees (rate allocation, loss, latency...):
- should be a binding promise by the system
- should not be affected by others' behaviour

We need
- algorithms with provable service guarantees
- known and robust behaviour under overload (practice)
- overload may cause failure to guarantee rate or latency

Terminology
-------------
{b,Flow}
- collection of traffic managed as a single entity by the SA
- number can be large

{b,Scheduling Algorithm (SA)}
- just the algorithm that decides which flow to serve next
- different goals (priority, latency, fixed or weighted rate allocation)
- can reason about complexity and guarantees

{b,Packet Scheduler}
- the whole system: Queues + Scheduling Algorithm + I/O
- operating conditions and performance are an issue

Requirements and state of the art
-----------------------------
Requirements
* request rates up to 1..10~Mpps
* latency <10~us

Theory gives us several Scheduling Algorithms\\
Tradeoffs between complexity and guarantees (T-WFI)
* {b,DRR} (deficit round robin): 20 ns/decision, O(N) delay
* {b,WF2Q+} (Weighted Fair Queueing): O(log N) time, O(1) delay
* {b,QFQ/QFQ+} (Quick Fair Queueing): 40-50 ns/decision, O(1) delay
(an illustrative DRR sketch follows the next two slides)

Practice gives us some promising tools
- fast CPUs, buses, multiport NICs
- fast I/O frameworks

Traditional Software Packet Scheduler
------------
@@fr sz4x,pspat-sw-sched.png@@
{b,PROS}
- no hardware dependencies
- large choice of algorithms
\\
{b,CONS}
- heavy lock contention in accessing the scheduler
- under congestion, I/O becomes serialized
- scalability can be problematic

TC delivers 2~Mpps, decreasing with the number of clients

Hardware Packet Schedulers
------------
@@fr sz4x,pspat-hw-sched.png@@
{b,PROS}
- fully parallel down to the NIC
- reduced system load due to HW offloading
\\
{b,CONS}
- limited choice of algorithms
- the bus is still a point of contention
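
A DRR decision, for reference
----------------------------
An illustrative toy version of the DRR decision mentioned on the "Requirements"
slide -- not the Linux or dummynet implementation; the structures and names are
invented for the example.

    /* Deficit Round Robin: each active flow gets up to quantum+deficit
     * bytes per round; O(1) work per packet, O(N) delay bound. */
    #include <stddef.h>
    #include <sys/queue.h>

    struct pkt  { struct pkt *next; int len; };
    struct flow {
        TAILQ_ENTRY(flow) link;
        struct pkt *head;       /* per-flow FIFO */
        int quantum;            /* bytes granted per round */
        int deficit;            /* unused credit carried across rounds */
    };
    TAILQ_HEAD(flowlist, flow);

    struct pkt *drr_dequeue(struct flowlist *active)
    {
        struct flow *f;

        while ((f = TAILQ_FIRST(active)) != NULL) {
            if (f->head == NULL) {              /* flow went idle */
                TAILQ_REMOVE(active, f, link);
                f->deficit = 0;
                continue;
            }
            if (f->head->len <= f->deficit) {   /* enough credit: send */
                struct pkt *p = f->head;
                f->head = p->next;
                f->deficit -= p->len;
                return p;
            }
            f->deficit += f->quantum;           /* recharge, go to the tail */
            TAILQ_REMOVE(active, f, link);
            TAILQ_INSERT_TAIL(active, f, link);
        }
        return NULL;                            /* nothing to send */
    }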
PCIe access issues
-------------------------------------
@@fr sz4x,pcie-write.png@@
- PCIe arbitration is round robin, {b,not programmable}
- PCIe service rate is limited by the NIC
- the PCIe bus can saturate as well
@@fr sz5x,pcie-congestion.png@@

Dilemma
-----------------
SW flexible but slow, HW not as good as we would like
#, {b,Denial:} we don't need fast schedulers
   - what about NFV?
#, {b,Faith:} hardware will get better
   - what about existing hardware?
#, {b,Various approximate solutions}
   - trivial schedulers (FIFO, DRR: fast but poor delay guarantees)
   - active queue management (RED, CoDel: rely on {b,everyone} behaving)
   - bounded number of queues: rely on quiet neighbours

PSPAT: Packet Scheduling with PArallel Transfers
------------
@@fr sz4x,pspat-our-sched.png@@
{b,Decouple scheduling and transmission}
- a dedicated arbiter thread runs the SA (sequential)
- traffic is released at link rate to the device driver
- possibly, one or more threads perform transmission in parallel

{b green,Results}
* reduced contention, increased parallelism
* large speedup compared to TC
* the architecture permits a worst-case analysis

PSPAT components
---------------------------------
@@fr sz4x,pspat-our-sched.png@@
- M {b,Clients} and {b,Client Mailboxes} CM[x]
- C {b,cores} on a shared memory system
- C {b,Client Lists} CL[c], clients recently active on core {i,c}
- 1 {b, ARBITER}
- N {b,flows}
- 1..T {b,Transmit queues} on the NIC
- 0..T {b,Transmit Mailboxes} and {b,Dispatchers}

N and M can be very large. T can be kept as small as needed.

CLIENT operation
---------------------------------
@@fr sz3x,pspat-our-sched.png@@
Client x on core c wants to transmit packet p:
- insert p into CM[x]
- if (CL[c].tail != x) append x to CL[c]
- done (illustrative C sketch a few slides ahead)
\\
{b,PROS and CONS}
* {green,no contention accessing the lock-free mailbox}
* {green,backpressure through mailbox-full indication}
* {green,some work offloaded to arbiter and dispatcher}
- {red,possible additional latency}
- {red,possibly, reduced backpressure (similar to CoDel)}

MAILBOX
--------------------
Memory-based interactions can be slow
{ce,@@sz7x,mem-speed.png@@}
* avoid passing queue pointers, as in FastForward (saves a barrier)
* use the slot state (NULL or not) to indicate new packets
* avoid W-W conflicts by releasing slots lazily
* reduce R-W conflicts by rate limiting memory accesses

ARBITER operation
---------------------------------
@@fr sz3x,pspat-our-sched.png@@
Continuously runs a three-phase task
# grab packets from active CM[x] (\~1 CM per core),\\
  {tt,enqueue()} packets to the Scheduling Algorithm;\\
  trim client lists CL[c]
# {tt,dequeue()} packets at link rate (leaky bucket) to TM
# notify Dispatchers of new work

{b,PROS and CONS}
* {green,arbitrary choice of Scheduling Algorithm}
* {green,arbitrary split/merge of traffic}
* {green,no lock contention}
* {red,needs to scan all mailboxes -- does it scale?}

ARBITER: Limit the number of mailboxes to scan
-----------------------------
Only need to look at mailboxes that may have traffic:
- only clients which ran on a core {i,and} sent packets since the previous scan
- Client Lists keep track of those clients
- scans are very frequent (every few us)

Client Lists rarely have more than 1 entry.
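
Client/arbiter handoff: illustrative sketch
---------------------------------
A minimal C sketch of the mailbox handoff described on the previous slides, under
assumptions: one single-producer/single-consumer mailbox per client, slots marked
by NULL / non-NULL pointers, and names and sizes invented here -- not PSPAT's
actual code.

    /* Toy client mailbox (CM[x]): the slot state indicates new packets,
     * no shared head/tail indexes are exchanged between the two sides. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MB_SLOTS 512

    struct mbox {
        _Atomic(void *) slot[MB_SLOTS]; /* NULL = empty, else a packet */
        unsigned prod;                  /* private to the client */
        unsigned cons;                  /* private to the arbiter */
    };

    /* CLIENT side, on core c: no locks; a full mailbox gives backpressure.
     * The caller then appends client x to CL[c] if it is not already the tail. */
    bool client_send(struct mbox *cm, void *pkt)
    {
        unsigned i = cm->prod % MB_SLOTS;

        if (atomic_load_explicit(&cm->slot[i], memory_order_acquire) != NULL)
            return false;               /* mailbox full */
        atomic_store_explicit(&cm->slot[i], pkt, memory_order_release);
        cm->prod++;
        return true;
    }

    /* ARBITER side: drain one mailbox and feed the Scheduling Algorithm.
     * Slots are released only here; releases could be batched/lazy to
     * reduce W-W conflicts, as on the MAILBOX slide. */
    int arbiter_drain(struct mbox *cm, void (*enqueue_sa)(void *))
    {
        int n = 0;

        for (;;) {
            unsigned i = cm->cons % MB_SLOTS;
            void *pkt = atomic_load_explicit(&cm->slot[i], memory_order_acquire);

            if (pkt == NULL)
                break;                  /* no new packets */
            enqueue_sa(pkt);            /* hand to the Scheduling Algorithm */
            atomic_store_explicit(&cm->slot[i], NULL, memory_order_release);
            cm->cons++;
            n++;
        }
        return n;
    }

The arbiter then runs {tt,dequeue()} at link rate and hands the result to a
Transmit Mailbox, as on the ARBITER operation slide.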
ARBITER: scan speed
-----------------------
- inactive CLs and CMs are likely in L1, so only 2-4 ns each
- active entries may cause a read stall; rate-limit accesses to keep stalls within 10%
- plus, of course, skb access and {tt,enqueue()} cost
- try to keep a scan within 2-5 us

Some self-correcting mechanisms if a round is slow:
- caching and batching become more effective
- mailboxes fill up, stopping the high speed clients

DISPATCHER
---------------------------------
@@fr sz3x,pspat-our-sched.png@@
* grab packets from the Transmit Mailbox, run the device driver
  (illustrative sketch in the backup slide at the end)
\\
{b,PROS and CONS}
* {green, number can be configured as needed}
* {green, operate at a controlled rate, never congested}
* {green, can be implemented in the arbiter if using a fast backend}
* {red, potentially, extra latency}

PSPAT implementation
--------------------
Two versions
* in-kernel, for complete compatibility with TC:
  - intercept traffic in {tt,__dev_queue_xmit()},
  - deliver to {tt,dev_hard_start_xmit()}
  - reuses Linux QDISC code
  - kernel module to implement mailboxes and threads
* userspace, for fast prototyping and optimized performance
  - supports userspace networking (netmap, DPDK...)
  - can use fast scheduling code from dummynet

Performance analysis
--------------------
Metrics:
- throughput and latency

Platforms:
- I7 with 40G NIC and Linux 4.7 (in-kernel PSPAT)
- dual Xeon E5-2640 (userspace)

Sources (one per core, pinned):
- UDP sockets (not very fast)
- pkt-gen (the netmap version), very fast
- Linux pktgen bypasses __dev_queue_xmit()

Packet schedulers:
- none (HW), TC, PSPAT

Throughput measurements
-------------------------
* Clients send as fast as possible
* variable number of clients
* schedulers use QFQ (DRR is marginally faster)
* TC and PSPAT rates set higher than the scheduler's capacity
* measurements in PPS, as that is the relevant metric

Throughput with regular UDP (I7)
---------------------
{ce, @@sz7x,udp-tput-i7.png@@}
- 2x speedup (limited by the qdisc code)

Throughput for userspace PSPAT (Xeon)
---------------------
{ce, @@sz7x,udp-tput-xeon.png@@}

High speed I/O
--------------------
{ce, @@sz4x,udp-tput-xeon.png@@ @@sz4x,netmap-tput-i7.png@@}
- Scheduling decisions alone are extremely fast
- using netmap we are in 15-20~Mpps territory

Rate allocation
-----------------
Not shown in the graphs: {red,TC fails to meet rate allocation!}
- every {tt,enqueue()} is immediately followed by a {tt,dequeue()}
- a scheduler slower than the link means there is never any queueing
- the allocation just matches the request rate

PSPAT addresses this
- first grab all requests, then run {tt,dequeue()}
- queues build up even when the scheduler is slow

One way latency measurements
--------------------------
Experiments with different link rates and numbers of clients
* one client has {b,weight=100}, sends at half the reserved bandwidth
* other clients have {b,weight=1}, send as fast as possible

Theory says latency is proportional to MSS/RATE (worked example a few slides ahead)

One way latency measurements (1)
------------------------
{ce,@@sz5x,pcie-congestion.png@@}
- No big surprises for PSPAT:\\
  a couple of extra us due to rate-limited scans and handoffs
- Note the huge effect of congestion on the PCIe bus

Latency versus rate, Xeon + Linux 2.6.32
-------------------
{ce,@@sz5x,pspat-latency-1.png@@}

Latency versus rate, I7 + Linux 4.5
-------------------
{ce,@@sz5x,pspat-latency-2.png@@}

Latency under overload, I7
----------------------
{ce,@@sz5x,pspat-overload-latency.png@@}
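
Latency bound: back-of-envelope
------------------------
A worked instance of the MSS/RATE bound quoted on the one-way latency slide;
the figures are illustrative, not measured:

{ce, MSS = 1500 bytes = 12000 bits, so MSS/RATE = 12 us at 1 Gbit/s and 1.2 us at 10 Gbit/s}

O(1)-delay schedulers (WF2Q+, QFQ) keep the extra per-flow latency within a small
multiple of this value, independent of the number of competing flows.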
Future work
---------------------
* Distribute the in-kernel version
* Explore throughput improvement with separate dispatchers and batching
* Run measurements over different NICs

[[http://info.iet.unipi.it/\~luigi/research.html]]
[[http://info.iet.unipi.it/\~luigi/papers/20160921-pspat.pdf]]
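
Backup: dispatcher loop, illustrative sketch
---------------------
A minimal sketch of the DISPATCHER described earlier, reusing the toy mailbox from
the client/arbiter sketch; the driver hook is a placeholder, not the actual PSPAT code.

    /* DISPATCHER: drain the Transmit Mailbox filled by the arbiter and push
     * packets to the device driver. Traffic arrives at link rate, so this
     * stage never sees congestion. Reuses struct mbox / MB_SLOTS from the
     * client/arbiter sketch. */
    void dispatcher_loop(struct mbox *tm, int (*drv_xmit)(void *))
    {
        for (;;) {
            unsigned i = tm->cons % MB_SLOTS;
            void *pkt = atomic_load_explicit(&tm->slot[i], memory_order_acquire);

            if (pkt == NULL)
                continue;           /* nothing released yet: poll (or sleep) */
            (void)drv_xmit(pkt);    /* e.g. a netmap/DPDK/driver TX call */
            atomic_store_explicit(&tm->slot[i], NULL, memory_order_release);
            tm->cons++;
        }
    }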