Chapter 3. Buffers

When you begin talking to vendors about data center switches, you’ll hear and read about buffers. Some of the vendors have knock-down, drag-out fights about these buffers, and often engage in all sorts of half-truths and deceptions to make you believe that their solution is the best. So, what is the truth? As with most things, it’s not always black and white.

To begin, we need to look at the way a switch is built. That starts with the switch fabric.

Note

The term fabric is used because the interconnecting lines of ports and switches evoke the weave of a fabric viewed through a microscope. And all this time I thought there was some cool scientific reason.

Imagine a matrix in which every port on the switch has a connection for input (ingress) and another for output (egress). If we put all of the ingress ports on the left and all of the egress ports on top and then interconnect them all, it would look like the drawing in Figure 3-1. To make the examples easy to understand, I’ve constructed a simple, though thoroughly unlikely, three-port switch. The ports are numbered ethernet1, ethernet2, and ethernet3, which are abbreviated e1, e2, and e3.

Looking at the drawing, remember that e1 on the left and e1 on the top are the same port. This is very important to understand before moving forward. Remember that modern switch ports are generally full duplex. The drawing simply shows the ins on the left and the outs on the top. Got it? Good. Let’s continue.

First, the fabric allows more than one conversation to occur at a time, provided the ports in each conversation are discrete from the ports in the other conversations. I know, gibberish, right? Bear with me, and all will become clear.

Figure 3-1. Simple switch fabric of a three-port switch

Remember that full duplex means transmit and receive can happen at the same time between two hosts (or ports, in our case). To help solidify how the fabric drawing works, take a look at Figure 3-2, in which I’ve drawn up how a full-duplex conversation would look between ports e1 and e2.

Look at how the input of e1 goes to the point on the fabric where it can traverse to the output of e2. Now look at how the same thing is happening so that the input of e2 can switch to the output of e1. This is what a full-duplex conversation between two ports on a switch looks like on the fabric. By the way, you should be honored, because I detest those little line jumpers and haven’t used one in probably 10 years. I have a feeling that this chapter is going to irritate my drawing sensibilities, but I’ll endure because I have deadlines to meet, and after staring at the drawings for two hours, I couldn’t come up with a better way to illustrate my point.

Figure 3-2. Full duplex on a switch fabric

Now that we know what a single port-to-port full-duplex conversation looks like, let’s consider a more complex scenario. Imagine if you will, that while ports e1 and e2 are happily chattering back and forth without a care in the world, some jackass on e3 wants to talk to e2. Because Ethernet running in full duplex does not listen for traffic before transmitting, e3 just blurts out what he needs to say. Imagine that you are having a conversation with your girlfriend on the phone when your kid brother picks up the phone and plays death metal at full volume into the phone. It’s like that, but without the heavy distortion, long hair, and tattoos.

Assuming for a moment that the conversation is always on between e1 and e2, when e3 sends its message to e2, what happens? In our simple switch, e3 will detect a collision and drop the packet. Wait a minute, a collision? I thought full-duplex networks didn’t have collisions! Full-duplex conversations should not have collisions, but in this case, e3 tried to talk to e2 and e2 was busy. That’s a collision. Figure 3-3 shows our collision in action. The kid brother is transmitting on e3, but e2’s output port is occupied, so the death metal is dropped. If only it were that simple in real life.

Figure 3-3. Switch fabric collision

If you think that this sounds ridiculous and doesn’t happen in the real world, you’re almost right. The reason it doesn’t seem to happen in the real world, though, is largely because Ethernet conversations are rarely always on, and because of buffers.

In Figure 3-4, I’ve added input buffers to our simple switch. Now, when port e3 tries to transmit, the switch can detect the collision and buffer the packets until the output port on e2 becomes available. The buffers are like little answering machines for Ethernet packets. When you hang up with your girlfriend, the death metal can be delivered in all its loud glory because the output port (you) is finally available. God bless technology.

Figure 3-4. Switch fabric with input buffers

This is cool and all, but these input buffers are not without their limitations. Just as an answering machine tape (anyone remember those?) or your voicemail inbox can fill up, so too can these buffers. When the buffers become full, packets are dropped. Whether the first packets in the buffer are dropped in favor of buffering the newest packets or the newest packets are dropped in favor of the older packets is up to the person who wrote the code.
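
If you like seeing that in code, here’s a minimal sketch in Python (not from any real switch, so take the tiny buffer size and packet names as pure illustration) of a bounded input buffer. Dropping the newest arrivals when the buffer is full is usually called tail drop; throwing out the oldest buffered packet to make room is called head drop.

    from collections import deque

    class InputBuffer:
        """A bounded FIFO buffer, like the input buffers in Figure 3-4."""

        def __init__(self, capacity, policy="tail-drop"):
            self.queue = deque()
            self.capacity = capacity        # maximum packets we can hold
            self.policy = policy            # "tail-drop" or "head-drop"
            self.drops = 0

        def enqueue(self, packet):
            if len(self.queue) < self.capacity:
                self.queue.append(packet)
                return True
            if self.policy == "tail-drop":
                self.drops += 1             # buffer full: drop the newest packet
                return False
            self.queue.popleft()            # head-drop: drop the oldest packet instead
            self.drops += 1
            self.queue.append(packet)
            return True

        def dequeue(self):
            """Called when the output port finally becomes available."""
            return self.queue.popleft() if self.queue else None

    buf = InputBuffer(capacity=3)
    for pkt in ["p1", "p2", "p3", "p4", "p5"]:    # e2's output is busy, so everything buffers
        buf.enqueue(pkt)
    print(list(buf.queue), "drops:", buf.drops)   # ['p1', 'p2', 'p3'] drops: 2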

So, if the buffers can fill up, thus dropping packets, the solution is to put in bigger buffers, right? Well, yes and no. The first issue is that buffers add latency. Sending packets over the wire is fast; storing packets in memory, then reading them back out and sending them, takes time. The memory used in these buffers is much faster than, say, computer RAM (it’s more like the Layer 2 (L2) cache in your CPU), but the fact remains that buffering increases latency. Increased latency is usually better than dropped packets, right? As usual, it depends.

Dropped packets might be OK for something like FTP that will retransmit lost packets, but for a UDP-RTP stream like VoIP, increased latency and dropped packets can be disastrous. And what about environments like Wall Street, where microseconds of latency can mean a missed sale opportunity costing millions of dollars? Dropped packets mean retransmissions, which means waiting, but bigger buffers still mean waiting—they just mean waiting less. In these cases, bigger buffers aren’t always the answer.

In the example I’ve shown, I started with the assumption that the full-duplex traffic to and from e1 and e2 is always on, which is almost never the case. In reality, Ethernet traffic tends to be very bursty, especially when there are many hosts talking to one device. Consider scenarios like email servers, or even better, NAS towers.

Network Attached Storage (NAS) traffic can be very unpredictable on the wire. If you have 100 servers talking to a single NAS tower on a single IP address, the traffic to and from the NAS tower can spike in sudden, drastic ways. This can be a problem in many ways, but one of the most insidious is the microburst.

A microburst is a burst that doesn’t show up on reporting graphs because most sampling is done using five-minute averages. If a monitoring system polls the switch every five minutes and then subtracts the number of bytes (or bits, or packets) from the number reported during the last poll, the resulting graph will show only an average of each five-minute interval. Because pictures are worth 1,380 words (adjusted for inflation), let’s take a look at what I mean.

In Figure 3-5, I’ve taken an imaginary set of readings from a network interface. Once every minute, the switch interface was polled and the number of bits per second was determined. That number was recorded with a timestamp. If you look at the data, you’ll see that once every 6 to 10 minutes or so, the traffic spikes to 50 times its normal value. These numbers are pretty small, but the point I’m trying to make is how the reporting tools might present this information.

The graph on the top shows each poll, from each minute, and includes a trend line. Note that the trend line is at about 20,000 bits per second (bps) on this graph.

Figure 3-5. Microbursts and averages

Now take a careful look at the bottom graph. In this graph, the data looks very different because instead of including every one-minute poll, I’ve changed the polling to once every five minutes. In this graph, the data seems much more stable and doesn’t appear to show any sharp spikes. More important, though, is the fact that the trend line seems to be up at around 120,000 bps.
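
If you’d rather see the effect in numbers than in graphs, here’s a rough Python sketch. The traffic values are invented for illustration, but the mechanics are exactly what a poller does: average the same data over longer intervals and the spikes quietly disappear.

    # Invented per-minute readings (bps): mostly quiet, with an occasional microburst.
    per_minute = [20_000] * 30
    for spike_at in (7, 15, 24):               # every so often, traffic spikes to 50x normal
        per_minute[spike_at] = 1_000_000

    def resample(samples, interval):
        """Average the samples over a longer interval, like a 5-minute poller would."""
        return [sum(samples[i:i + interval]) / interval
                for i in range(0, len(samples), interval)]

    print("1-minute peak:", max(per_minute))               # 1,000,000 bps - the burst is obvious
    print("5-minute peak:", max(resample(per_minute, 5)))  # 216,000 bps - the burst is blunted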

This is typical of data being skewed by the sample rate, and it can be a real problem when perception doesn’t match reality. The reality is closer to the top graph, but the perception is usually closer to the bottom graph. Even the top graph might be wrong, though! Switches operate at the microsecond or even nanosecond level. So, what happens when a 10 Gbps interface has 15 Gbps of traffic destined for it, all within a single second or less? Wait, how can a 10 Gbps interface have more than 10 Gbps being sent to it?

Remember the fabric drawing in Figure 3-3? Let’s look at that on a larger scale. As referenced earlier, imagine a network with 100 servers talking to a single NAS tower on a single IP address. What happens if, say, 10 of those servers push 5 Gbps of traffic to the NAS tower at the same instant in time? The switch port connecting to the NAS tower will send out 10 Gbps (because that is the max), and 40 Gbps of traffic will be queued.
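
Here’s the arithmetic behind that, along with a bonus calculation: how long a generously sized buffer would survive that kind of load. The 52 MB buffer is a number I made up for illustration, not a spec from any particular switch.

    servers = 10
    per_server_gbps = 5
    egress_gbps = 10                                   # the single 10 Gbps port facing the NAS tower

    ingress_gbps = servers * per_server_gbps           # 50 Gbps arriving all at once
    excess_gbps = ingress_gbps - egress_gbps           # 40 Gbps with nowhere to go

    buffer_mb = 52                                     # hypothetical buffer size, megabytes
    buffer_bits = buffer_mb * 8 * 1_000_000
    fill_time_ms = buffer_bits / (excess_gbps * 1e9) * 1000

    print(f"Ingress {ingress_gbps} Gbps, egress {egress_gbps} Gbps, excess {excess_gbps} Gbps")
    print(f"A {buffer_mb} MB buffer fills in about {fill_time_ms:.1f} ms")   # roughly 10 ms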

Network switches are designed to forward packets (frames, to be pedantic) at the highest rate possible. Few devices outside of the networking world can actually send and receive data at the rates the networking devices are capable of sending. In the case of NAS towers, the disks add latency, the processing adds latency, and the OS of the device simply might not be able to deliver a sustained 10 Gbps data stream. So, what happens when our switch has a metric pant-load of traffic to deliver and the NAS tower can’t accept it fast enough?

If the switch delivers the packets to the output port but the attached device can’t receive them, the packets will again be buffered, but this time in an output queue. Figure 3-6 shows our three-port switch with output buffers added.

Figure 3-6. Switch fabric with output buffers

As you might imagine, the task of figuring out when traffic can and cannot be sent to and from interfaces can be a complicated affair. It was simple when the interface was either available or not, but with the addition of buffers on both sides, things become more complicated, and this is an extreme simplification. Consider the idea that different flows might have different priorities, and the entire affair becomes even more complicated.

The process of determining when, and if, traffic can be sent to an interface is called arbitration. Arbitration is usually managed by an Application-Specific Integrated Circuit (ASIC) within the switch and generally cannot be configured by the end user. Still, when shopping for switches, some of the techniques used in arbitration will come up, and understanding them will help you decide what to buy. Now that we understand why input and output buffers exist, let’s take a look at some terms and some of the ways in which traffic is arbitrated within the switch fabric:

FIFO

First In/First Out buffers are those that deliver the oldest packets from the buffer first. When you drive into a tunnel and the traffic in the tunnel is slow, assuming no change in the traffic patterns within the tunnel, the cars will leave the tunnel in the same order in which they entered: the first car into the tunnel will also be the first car out of the tunnel.

Blocking

Blocking is the term used when traffic cannot be sent, usually due to oversubscription. A nonblocking switch is one in which there is no oversubscription, and each port is capable of receiving and delivering wire-rate traffic to and from another interface in the switch. If there are 48 10 Gb interfaces and the switch has a fabric speed of 480 Gbps (full duplex), the switch can be said to be nonblocking, but be careful because some vendors will be less than honest about these numbers. For example, stating that a 48-port 10 Gb switch has a 480 Gbps backplane does not necessarily indicate that the switch is nonblocking, because traffic can flow in two directions in a full-duplex environment. 480 Gbps might mean that only 24 ports can send at 10 Gbps, whereas the other 24 receive at 10 Gbps. This would be 2:1 oversubscription to most people, but when the spec sheet simply says 480 Gbps, people assume the best. Clever marketing and the omission of details like this are more common than you might think.

Head-of-Line (HOL) blocking

Packets can be (and usually are) destined for a variety of interfaces, not just one. Consider what happens when the packet at the head of a FIFO input buffer is destined for an output queue that cannot clear quickly enough: that packet cannot be switched, the input buffer begins to fill behind it, and none of those waiting packets can be switched either, even though they might be destined for other interfaces. This single packet, sitting at the head of the line, is preventing all of the packets behind it from being switched, as shown in Figure 3-7. Using the car analogy, imagine that there is a left turn directly outside the end of the tunnel. It’s rarely used, but when someone sits there, patiently waiting for a break in oncoming traffic, everyone in the tunnel must wait for this car to move before they can exit the tunnel.

Note

If you’re reading this in a country that drives on the left side of the road, please apply the following regular expression to my car analogies as you read: s/left/right/g. Thanks.

Figure 3-7. Head-of-line blocking
Virtual Output Queuing (VOQ)

VOQ is one of the methods deployed by switch vendors to help eliminate the HOL blocking problem on their higher-end switches (shown in Figure 3-8). The idea is to have a buffer for each output interface, positioned on the input side of the fabric and replicated on every interface; with that in place, HOL blocking is practically eliminated.

Figure 3-8. Virtual Output Queuing

Now, because there is a virtual output queue for every interface on the input side of the fabric, should the output queue become full, the packets destined for the full output queue will sit in their own virtual output queue, leaving the virtual output queues for all of the other interfaces unaffected. In our left-turn-at-the-end-of-the-tunnel example, imagine an additional left-turn-only lane being installed. While the one car waits to turn left, the cars behind it can simply pass because the waiting car is no longer blocking traffic.

Allocating a virtual output queue for every possible output queue would quickly become unscalable, especially on switches with thousands of interfaces. Instead, each input can have a smaller set of VOQs, which can be dynamically allocated as needed. The idea is that eight simultaneous flows is probably more than enough for all but the most demanding of environments.
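
Here’s a stripped-down sketch in Python of why VOQ helps. It’s nothing like what a real ASIC does (there’s no real arbitration here at all), and the port names and traffic are invented, but it shows the difference between a single FIFO input buffer and per-output virtual queues.

    from collections import deque, defaultdict

    # Packets waiting on e3's ingress side: (destination port, payload).
    # e2 is congested; e1 is wide open.
    arrivals = [("e2", "metal1"), ("e1", "mail1"), ("e1", "mail2")]
    congested = {"e2"}

    # Single FIFO input buffer: the packet at the head is stuck, so everything is stuck.
    fifo = deque(arrivals)
    sent_fifo = []
    while fifo and fifo[0][0] not in congested:
        sent_fifo.append(fifo.popleft())
    print("FIFO sent:", sent_fifo)                     # [] -- HOL blocking in action

    # Virtual output queues: one queue per destination, so e1's traffic sails through.
    voqs = defaultdict(deque)
    for dst, pkt in arrivals:
        voqs[dst].append(pkt)
    sent_voq = [pkt for dst, q in voqs.items() if dst not in congested for pkt in q]
    print("VOQ sent:", sent_voq)                       # ['mail1', 'mail2']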

If you start reading up on buffers elsewhere, you are likely to encounter dire warnings about excessively large buffers, and something colorfully referred to as buffer bloat. Buffer bloat describes the idea that hardware vendors have included more and more buffer space in an attempt to outperform competitors, to the point where all that buffering can do more harm than good. Even though buffer bloat may be a real concern in the home internet environment, it is likely not a concern in the data center.

Consider what happens when you stream a movie from your favorite streaming source (let’s call them Stream-Co). The servers might have 10 Gbps interfaces, which are connected with 10 Gbps switches, and because they’re a big provider, they might even have 10 Gbps internet feeds. The internet is interconnected with pretty fast gear these days, so let’s say, just for fun, that all the connections from Stream-Co to your ISP network are 10 Gbps. Yeah, baby—fast is good! Now, your cable internet provider switches your stream at 10 glorious gigabits per second, until it gets to the device that connects to your cable modem. Suppose that you have a nice connection, and you can download at 50 Mbps. Can you see the problem?

The kickin’ 10 Gbps data flow from Stream-Co has screamed across the country (or even the world) until it gets right to your virtual doorstep, at which point the speed goes from 10 Gbps to 50 Mbps. The difference in speed is not 10:1 like it is in a data center switch, but rather 200:1!

Now let’s play a bit and assume that the cable distribution device has 24 MB buffers. Remember that 24 MB at 10 Gbps is about 20 ms. Well, that same 24 MB at 50 Mbps is almost 4 seconds! Buffering for 20 ms is not a big deal, but buffering for 4 seconds will confuse the TCP windowing system, and your performance might be less than optimal, to say the least. Additionally, although 24 MB is 4 seconds at 50 Mbps, remember that it’s only 0.019 seconds at 10 Gbps. In other words, this buffer would take less than one-tenth of a second to fill, but 4 seconds to empty.
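
If you want to check my math (please do), here’s the calculation in Python, using the 24 MB buffer from the example above.

    def drain_time_seconds(buffer_mbytes, link_mbps):
        """How long a full buffer takes to empty at a given link speed."""
        return (buffer_mbytes * 8) / link_mbps          # megabytes -> megabits, then divide by Mbps

    buffer_mb = 24
    print(f"{buffer_mb} MB at 10 Gbps: {drain_time_seconds(buffer_mb, 10_000):.3f} s")  # about 0.019 s
    print(f"{buffer_mb} MB at 50 Mbps: {drain_time_seconds(buffer_mb, 50):.1f} s")      # about 3.8 s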

Think about this, too: propagation delay (the time it takes for packets to travel over distance) from New York to California might be 100 ms over multiple providers. Let’s add that much on top for computational delay (the amount of time it takes for servers, switches, and routers to process packets), which gives us 200 ms. That’s one-fifth of a second, which is a pretty long time in our infinitely connected high-speed world. Imagine that your service provider is getting packets in 200 ms but is buffering multiple seconds of your traffic. To quote some guy I met on the beach in California, that’s not cool, man.

My point with this talk of buffer bloat is to consider all the information before coming to rash conclusions. You might hear vendors pontificate about how big buffers are bad. Big buffers within the data center make a lot more sense than big buffers for cable modems.

Conclusion

Deep buffers can be a significant advantage in a networking switch, and because Arista was one of the first vendors to champion that idea, other vendors attacked the idea before themselves embracing it. Between buffer size, not oversubscribing the fabric (Chapter 5), and the many other features that Arista has embraced (Chapter 6), it has become a force to be reckoned with in the world of networking.
