Multichassis Link Aggregation (MLAG) is the open-standard (and thus, Arista) term for linking a port-channel or Link Aggregation Group (LAG) to multiple switches instead of just one. The technology accomplishes the same basic goal as Cisco’s Virtual Port Channel (vPC), although, in my experience, MLAG is simpler to configure and more forgiving when used.
As just mentioned, LAG stands for Link Aggregation Group, an open-standard term for bonding multiple physical links into a single logical link. In Cisco parlance, this technology is called EtherChannel. Different vendors use different terms for similar solutions, but LAG has become the accepted cross-vendor term for the idea. Why would you want to do this? Let’s take a look.
With a traditional network design, interconnecting three switches at Layer 2 (L2) results in a loop. Loops are bad, so Spanning Tree Protocol (STP) blocks the interface on the link farthest from the root. Figure 18-1 shows an example of this.
In this scenario, there is a LAG connecting switch A to switch B. Switch C connects to both A and B switches, forming a loop. STP has blocked the interface on switch C that leads to switch B in order to break said loop. This design will allow for failover if the link between switches A and C were to fail, but the failover can take 30 seconds or more (substantially less if rapid STP is used). Not only that, but only one-half of the available bandwidth to and from switch C is available for use. Wouldn’t it be nice if we could use that extra link? Even better, if we used LAG technology, a single link failure wouldn’t incur an outage because the second link would already be active.
With MLAG, two Arista switches fool the third switch (or any other Link Aggregation Control Protocol [LACP]–capable device) into thinking that it is connected to a single device. In other words, two Arista switches appear to be one Arista switch to LACP, as shown in Figure 18-2.
With MLAG active using two 10 Gbps links, switch C sees a 20 Gbps logical interface to a single device, even though it is connected to two devices. Arista accomplishes this feat by advertising the same chassis-ID from both switch A and switch B. To do this, switch A and switch B must communicate over the A–B switch link, which must be configured with a VLAN that acts as a peer-link.
MLAG is configured within something called an MLAG Domain. The MLAG Domain ID identifies the switch to another switch that will share MLAGs. Let’s go ahead and build an MLAG pair.
The first thing we need to do is make sure that both MLAG peers are on the same revision of code. Will it work if the switches have different code? Yes, but in a more limited fashion than it used to. Today if you try to peer incompatible revs of EOS, the switches will report an error and refuse to peer. I’m not a fan of this, but from a TAC point of view, it greatly lowers the number of possible permutations that they need to support. To determine what versions are compatible with what versions, look in the release notes.
This isn’t as bad as it might seem at first glance. Looking at the table illustrated in Figure 18-3, EOS 4.21.1F will pair with EOS 4.14.16M, so that’s not so bad. The deal is generally that the last or most current revision of code will be supported from each major release, so if you’re upgrading from 4.14.5M to 4.21.1F, you won’t need to go from 4.14 to 4.15 to 4.16, and so on. You will need to go from 4.14.5M to 4.14.16M, and from there you can go to 4.21.1F. Note also that this is strictly an MLAG compatibility issue and has nothing to do with upgrading standalone switches. You can absolutely upgrade right from 4.14.5M straight to 4.21.1F on a single switch that is not part of an MLAG pair. I’d test that in a lab environment first to see how it might affect your environment, but there is no limitation in the software outside of MLAG.
You can check the MLAG ISSU compatibility between an installed image and another image local to the switch by using the show mlag issu compatibility image command. Let’s look at an example of one that passes the test and one that does not. First, here’s the revision of code running on my switch:
Arista#sho ver | grep image
Software image version: 4.19.10M
The switch (a 7280R) is running EOS version 4.19.10M. Here are all of the EOS images that I have stored on flash:
Arista#dir EOS*
Directory of flash:/EOS*

-rwx   613330599   Oct 8 2018     EOS-4.19.10M.swi
-rwx   638234211   Mar 20 01:57   EOS-4.20.1F.swi
-rwx   700978970   Nov 12 2018    EOS-4.21.1F.swi

3269361664 bytes total (85319680 bytes free)
First, I’m going to run the check against version 4.20.1F:
Arista#sho mlag issu compatibility flash:EOS-4.20.1F.swi
/mnt/flash/EOS-4.20.1F.swi (4.20.1F) is MLAG ISSU incompatible with the
current image (4.19.10M). A reboot with this image may cause packet loss.
Please consult the release notes to find a compatible image.

The new image is compatible with these releases, which may also be
compatible with the current version:
EOS-4.16.6M-INT
EOS-4.16.6M
EOS-4.16.7FX-MLAGISSU-TWO-STEP
EOS-4.16.7M
EOS-4.16.8M
EOS-4.16.8FX-MLAGISSU-TWO-STEP
EOS-4.16.9M
EOS-4.16.10M
EOS-4.16.11M
EOS-4.16.12M
EOS-4.16.13M
EOS-4.16.13FX-MLAGISSU-TWO-STEP
EOS-4.17.0F-INT
EOS-4.17.0F
EOS-4.17.1F
EOS-4.17.1.1FX-MDP-INT
EOS-4.17.2F
EOS-4.17.3F
EOS-4.17.4M
EOS-4.17.5M
EOS-4.17.6M
EOS-4.17.7M
EOS-4.18.0F
EOS-4.18.0F-INT
EOS-4.18.1.1F
EOS-4.18.2.1F
EOS-4.18.2-REV2-FX
EOS-4.18.3.1F
EOS-4.18.4F
EOS-4.18.4.2F
EOS-4.18.5M
EOS-4.19.0F
EOS-4.19.1F
EOS-4.19.2F
Arista#
Well, that certainly threw a lot of output! As I tell network engineers all the time, when you see a page of output, step away from the keyboard and read it! In this case, when I first encountered this output, I had to read it probably six times before it sank in due to an odd bit of technically accurate grammar. Let me run that command again using grep to include only the lines that I want to highlight:
Arista#sho mlag issu compatibility flash:EOS-4.20.1F.swi | grep -A3 incompatible
/mnt/flash/EOS-4.20.1F.swi (4.20.1F) is MLAG ISSU incompatible with the
current image (4.19.10M). A reboot with this image may cause packet loss.
Please consult the release notes to find a compatible image.
The phrase that trips me up is “is MLAG ISSU incompatible.” I would probably prefer something like “is NOT MLAG ISSU compatible,” or something even simpler like, “Nope. WON’T WORK!” but then I suppose that’s why I’m not a developer. When running this command, I look for the list of versions, because if the output of the command spits out a long list of versions, it’s telling you that you should probably use one of those instead of the one you tried.
Let’s try that with a different version:
Arista#sho mlag issu compatibility flash:EOS-4.21.1F.swi
/mnt/flash/EOS-4.21.1F.swi (4.21.1F) is MLAG ISSU compatible with the current image (4.19.10M).
Not only does it say, “is MLAG ISSU compatible,” but there is a notable absence of suggested versions to try instead of the one I checked against. This means that we’re good to go. For the rest of the chapter, both of my switches will be running 4.21.1F, so let’s build a simple MLAG setup using the network shown in Figure 18-4.
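If you want to script that reading habit, here is a minimal Python sketch. It assumes you have already captured the text of the compatibility check somehow; the function name parse_issu_check is hypothetical, and the only things it relies on are the two phrases and the EOS- prefixed release names visible in the output above.

```python
import re


def parse_issu_check(output: str) -> dict:
    """Classify the text of an MLAG ISSU compatibility check.

    Hypothetical helper: it relies only on the verdict phrase and the
    "EOS-" release names seen in the sample output in this chapter.
    """
    verdict = re.search(r"is MLAG ISSU (incompatible|compatible)", output)
    if verdict is None:
        raise ValueError("no MLAG ISSU verdict found in output")
    # Suggested releases are whitespace-separated tokens starting with "EOS-".
    # The image path (/mnt/flash/EOS-...) does not start with "EOS-", so it is skipped.
    suggested = [tok for tok in output.split() if tok.startswith("EOS-")]
    return {"compatible": verdict.group(1) == "compatible", "suggested": suggested}
```

An empty suggested list alongside compatible set to True corresponds to the “good to go” case described above.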
We need to create a peer-link over which the two switches can communicate. This link can be a single link, but for redundancy, it should always be a port-channel containing a minimum of two physical links. In this example, there are two 48-port switches, so let’s use the last two interfaces, e47 and e48:
Arista-A(config)#int e47-48
Arista-A(config-if-Et47-48)#channel-group 1000 mode active
Next, we configure the port-channel to be a trunk:
Arista-A(config-if-Et47-48)#int po 1000
Arista-A(config-if-Po1000)#switchport mode trunk
If you’re used to Cisco switches, you’ll notice that the switch did not bark at us about trunk encapsulation. Here’s what would happen on a Cisco switch:
Cisco-1(config)#int f1/0/7
Cisco-1(config-if)#switchport mode trunk
Command rejected: An interface whose trunk encapsulation is "Auto" can not be configured to "trunk" mode.
Arista does not negotiate trunk encapsulation, because it supports only dot1q trunks. Older Cisco switches also support Inter-Switch Link (ISL), which is a Cisco proprietary protocol. But enough of my attention deficit issues; let’s continue.
Notice that there is absolutely nothing special about this link. It is a port-channel running as a trunk. This is not an MLAG; rather, it’s the link used to connect the two peers and, as such, is called the peer-link.
With the port-channel configured as a trunk, we need to create a VLAN that will be used only for MLAG peer-to-peer communication. The Arista examples use VLAN 4094, so let’s keep that tradition alive:
Arista-A(config)#vlan 4094
Arista-A(config-vlan-4094)#trunk group MLAG-Peer
The trunk group MLAG-Peer command creates a trunk group. A trunk group is a sort of inclusion (or exclusion depending on your point of view) group. When you create a trunk, all VLANs are included on that trunk by default unless you specify otherwise. When we put a VLAN into a trunk group, that VLAN is no longer included in trunks by default. As a result, we now need to assign the same group to the peer-link in order to include that VLAN:
Arista-A(config-vlan-4094)#int po 1000
Arista-A(config-if-Po1000)#switchport trunk group MLAG-Peer
VLAN 4094 will be included only on trunks that are also assigned to the MLAG-Peer trunk group. By doing this, when we create a new trunk, by default VLAN 4094 will not be included. This keeps the MLAG peer-link traffic on this link, and only on this link (unless you add the MLAG-Peer trunk group to another trunk, but don’t do that).
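The inclusion rule is easier to see as a few lines of logic. The following Python sketch models only the behavior described above (a VLAN with no trunk group rides every trunk; a VLAN in a trunk group rides only trunks assigned that group). The function name and data layout are my own for illustration, and the model deliberately ignores allowed-VLAN lists and anything else that can prune a trunk.

```python
def vlans_carried(port_trunk_groups: set, vlan_trunk_groups: dict) -> set:
    """Return the VLANs a trunk carries under the trunk-group rule.

    port_trunk_groups: trunk groups assigned to the trunk interface.
    vlan_trunk_groups: maps VLAN ID -> set of trunk groups on that VLAN.
    Simplified model; allowed-VLAN lists are not considered.
    """
    carried = set()
    for vlan, groups in vlan_trunk_groups.items():
        # No trunk group on the VLAN: included on every trunk by default.
        # Otherwise the trunk must share at least one group with the VLAN.
        if not groups or groups & port_trunk_groups:
            carried.add(vlan)
    return carried


vlans = {1: set(), 100: set(), 4094: {"MLAG-Peer"}}
peer_link = vlans_carried({"MLAG-Peer"}, vlans)   # Po1000 in this chapter
other_trunk = vlans_carried(set(), vlans)         # any trunk without the group
```

In this model the peer-link carries VLANs 1, 100, and 4094, while a trunk without the MLAG-Peer group carries only 1 and 100.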
The trunk group names for the peer VLAN should be configured to be the same on both switches. Although they are locally significant, do yourself a favor and keep them the same on the two peers. The configuration for VLANs and VLAN trunk groups must be identical in order to successfully establish an MLAG association between two switches.
Now that we know this VLAN is limited to the peer-link, we can disable spanning-tree on the VLAN:
Arista-A(config)#no spanning-tree vlan 4094
Note that this is a global command, and not an interface command. It will fail with an % Incomplete command message if run from interface configuration mode because the same syntax is used to set cost and port priority there.
Because Multiple Spanning Tree (MST) is the default on Arista switches, and MST is not VLAN based, this command will not have the same result that it would if Rapid-PVST (RPVST) were enabled. It is still a best practice to disable Spanning Tree from the MLAG peer VLAN in case RPVST is ever enabled.
Disabling STP is almost always a bad idea. In this case, the MLAG peer-link always needs to be up in order to prevent a split-brain scenario. Because the peer-link is using a trunk group, a loop on this VLAN should never occur. The only way a loop could possibly occur would be (in this example) for the MLAG-Peer trunk group to be included on other links from the MLAG pair. So don’t do that. Ever.
With the physical link and trunk set up, we’re now going to make a Layer 3 (L3) connection between the two switches, as shown in Figure 18-5.
Because MLAG peers communicate with each other over L3, we must assign an IP address to the VLAN on each side:
Arista-A(config)#int vlan 4094
Arista-A(config-if-Vl4094)#ip address 10.255.255.1/30
Arista-A(config-if-Vl4094)#no autostate
The no autostate command keeps the L3 Switch Virtual Interface (SVI) up regardless of whether there are any interfaces active in the VLAN.
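A /30 leaves exactly two usable host addresses, which is all a point-to-point peer relationship needs. As a quick sanity check before pasting addresses into both switches, a sketch like the following (plain Python standard library, nothing Arista-specific; the helper name same_subnet is mine) confirms that the two ends land in the same subnet.

```python
import ipaddress


def same_subnet(local: str, peer: str) -> bool:
    """True if two interface addresses (address/prefix) share a network."""
    return ipaddress.ip_interface(local).network == ipaddress.ip_interface(peer).network


local = ipaddress.ip_interface("10.255.255.1/30")
# hosts() lists the usable addresses in the subnet.
usable = [str(host) for host in local.network.hosts()]
peers_ok = same_subnet("10.255.255.1/30", "10.255.255.2/30")
```

For the addressing used in this chapter, usable holds 10.255.255.1 and 10.255.255.2, one for each peer.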
Now, we must configure MLAG itself:
Arista-A(config)#mlag
Arista-A(config-mlag)#local-interface vlan 4094
Arista-A(config-mlag)#peer-address 10.255.255.2
Arista-A(config-mlag)#peer-link port-channel 1000
Arista-A(config-mlag)#domain-id Arista-AB
The commands should be relatively obvious. We’ve assigned the MLAG local interface to be the VLAN SVI we just created (VLAN 4094); we’ve told the switch that the peer for this MLAG domain is at the IP address 10.255.255.2; the peer-link is riding over port-channel 1000; and the MLAG domain ID is Arista-AB (I try to make the domain ID somehow relate to both switch hostnames).
At this point the MLAG peers look like what is shown in Figure 18-6.
The domain ID is the means whereby the switch differentiates different MLAG groups. I show this in more detail later in this chapter. The MLAG domain ID is case-sensitive and must match on both sides.
At this point, the status of the peer-link should be connected. This can be shown with the command show mlag:
Arista-A#sho mlag
MLAG Configuration:
domain-id              : Arista-AB
local-interface        : Vlan4094
peer-address           : 10.255.255.2
peer-link              : Port-Channel1000
peer-config            : consistent

MLAG Status:
state                  : Active
negotiation status     : Connected
peer-link status       : Up
local-int status       : Up
system-id              : 2a:99:3a:06:6f:37

MLAG Ports:
Disabled               : 0
Configured             : 0
Inactive               : 0
Active-partial         : 0
Active-full            : 0
The last section that begins with MLAG Ports shows all zeroes because we have not created any MLAG interfaces yet, so let’s go ahead and create a simple MLAG.
To reiterate, here are the relevant MLAG configurations for both Arista-A and Arista-B:
------------------------------------- -------------------------------------
| Arista-A                           | Arista-B                           |
------------------------------------- -------------------------------------
| vlan 4094                          | vlan 4094                          |
|   trunk group MLAG-Peer            |   trunk group MLAG-Peer            |
| !                                  | !                                  |
| interface Port-Channel1000         | interface Port-Channel1000         |
|   description [ MLAG Peer-Link ]   |   description [ MLAG Peer-Link ]   |
|   switchport mode trunk            |   switchport mode trunk            |
|   switchport trunk group MLAG-Peer |   switchport trunk group MLAG-Peer |
| !                                  | !                                  |
| interface Ethernet47               | interface Ethernet47               |
|   description [ MLAG Peer ]        |   description [ MLAG Peer ]        |
|   channel-group 1000 mode active   |   channel-group 1000 mode active   |
| !                                  | !                                  |
| interface Ethernet48               | interface Ethernet48               |
|   description [ MLAG Peer ]        |   description [ MLAG Peer ]        |
|   channel-group 1000 mode active   |   channel-group 1000 mode active   |
| !                                  | !                                  |
| interface Vlan4094                 | interface Vlan4094                 |
|   description [ MLAG Link ]        |   description [ MLAG Link ]        |
|   no autostate                     |   no autostate                     |
|   ip address 10.255.255.1/30       |   ip address 10.255.255.2/30       |
| !                                  | !                                  |
| mlag configuration                 | mlag configuration                 |
|   domain-id Arista-AB              |   domain-id Arista-AB              |
|   local-interface Vlan4094         |   local-interface Vlan4094         |
|   peer-address 10.255.255.2        |   peer-address 10.255.255.1        |
|   peer-link Port-Channel1000       |   peer-link Port-Channel1000       |
|                                    |                                    |
------------------------------------- -------------------------------------
By the way, if you think that side-by-side output is cool, that’s from an eAPI script I wrote that allows me to compare any command from any two Arista switches, provided they’re running eAPI (and I have the passwords, of course). I use this in my classes all the time for troubleshooting. To learn more about eAPI, see Chapter 30.
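The formatting half of that idea is simple enough to sketch. To be clear, this is not my eAPI script: it is a minimal Python illustration that assumes you already have the two configurations as lists of lines (fetching them over eAPI is left out entirely), and the function name side_by_side is hypothetical.

```python
from itertools import zip_longest


def side_by_side(left: list, right: list, width: int = 34) -> list:
    """Pair each left line with the right line at the same position.

    Lines that differ are flagged so the eye lands on them first.
    This is a positional comparison, not a true diff.
    """
    rows = []
    for l_line, r_line in zip_longest(left, right, fillvalue=""):
        flag = "" if l_line == r_line else " <-- differs"
        rows.append(f"| {l_line:<{width}} | {r_line:<{width}} |{flag}")
    return rows


rows = side_by_side(
    ["interface Vlan4094", "  ip address 10.255.255.1/30"],
    ["interface Vlan4094", "  ip address 10.255.255.2/30"],
)
print("\n".join(rows))
```

Flagging the differing rows is a design choice for troubleshooting: on an MLAG pair, the peer addresses are supposed to differ, and almost nothing else is.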
In this example, I’ve set up two Arista switches (Arista-A and Arista-B) connected to a third Arista switch that’s been cleverly named Arista-C. The first two Arista switches will form an MLAG pair, while the C switch will view the link as a regular port-channel. Figure 18-7 depicts how the network looks before we continue.
Take careful note that everything we’re doing on Arista-C has nothing to do with the MLAG configurations on the two MLAG peers (Arista A and B). This is a very important distinction because nothing MLAG-related “escapes” the MLAG domain. There is no MLAG negotiation outside of the two peers! The only thing Arista-C will see coming from the MLAG peers is LACP.
To further prove that point, here’s how I’ve configured Arista-C:
Arista-C(config)#int e7-8
Arista-C(config-if-Et7-8)#channel-group 999 mode active
That’s it! This switch has absolutely nothing to do with MLAG and has no idea that MLAG is in the mix. The only thing it sees is LACP. To Arista-C, the two MLAG peers appear to be a single chassis.
This forms a simple port-channel (Po999) comprising the physical links, Et7 and Et8. All ports are 10 Gbps. The port-channel will use LACP due to the mode active keywords in the channel-group commands.
The problem with the network configuration as it stands is that one of the interfaces in the triangle of network connections will be error-disabled. This is not due to Spanning Tree, but rather LACP on Arista-C, which will receive two different chassis-IDs on Et7 and Et8. Because those two interfaces are bonded together in a port-channel on Arista-C, LACP expects the remote devices to be a single device. To make that happen, we need to configure the two MLAG peers (Arista A and B) to do that. Luckily, this step is really quite simple.
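The check that LACP is effectively making can be sketched in a couple of lines. This Python snippet is only a model of the rule described above (every member of a bundle must see the same partner chassis-ID); the function name is mine, and the MAC values are illustrative stand-ins for what Arista-C would learn from each peer.

```python
def bundle_is_consistent(partner_ids: dict) -> bool:
    """True if every member link reports the same LACP partner chassis-ID.

    partner_ids maps member interface name -> chassis-ID learned on it.
    Simplified model: real LACP also compares keys and other fields.
    """
    return len(set(partner_ids.values())) <= 1


# Before MLAG: each peer advertises its own ID (illustrative values).
before = bundle_is_consistent({"Et7": "28:99:3a:06:6e:0f", "Et8": "28:99:3a:06:6e:ed"})
# With MLAG: both peers advertise one agreed-upon ID.
after = bundle_is_consistent({"Et7": "2a:99:3a:06:6e:0f", "Et8": "2a:99:3a:06:6e:0f"})
```

In this model, before evaluates to False and after to True, which is the whole trick MLAG plays on the downstream switch.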
First, all ports to be bonded between MLAG peers must be in a port-channel. You cannot bond physical interfaces, even (as is the case here) if there is only one on each physical switch. Therefore, the first thing we need to do is to put the physical interface on each MLAG peer into a port-channel:
Arista-A(config)#int e1
Arista-A(config-if-Et1)#channel-group 1 mode active
You must do this on both MLAG peers:
Arista-B(config)#int e1
Arista-B(config-if-Et1)#channel-group 1 mode active
Do the interfaces and port-channel numbers need to match? No, but do yourself a favor and make them match.
I strongly urge you to keep the port-channel assignments the same on the MLAG peers. I’ve worked on installations where the MLAG peers shared an MLAG using different port-channel interfaces, and it was a nightmare to debug during an outage. Keep it simple, and you’ll keep your job.
Now we need a way to bond these two port-channels together across the MLAG pair. To do that, we configure the port-channel itself and apply an MLAG number to the port-channel:
Arista-A(config-if-Et1)#int po 1
Arista-A(config-if-Po1)#mlag 1
And again, we must do this on both of the MLAG peers:
Arista-B(config-if-Et1)#int po 1
Arista-B(config-if-Po1)#mlag 1
That’s it! After all the peer-link stuff is done and the MLAG adjacency is formed, the creation and linking of port-channels is really all that needs to be done from a daily moves-adds-changes perspective. Figure 18-8 illustrates what we’ve built.
It is important to remember that, logically, Figure 18-9 shows how switch C sees the network with MLAG enabled on switches A and B. At this point, switch C has no idea that switches A and B are two different devices, at least so far as LACP is concerned. This is a very important thing to understand because at L3, there are still three devices in the mix. I’m not going to go into a lot of detail on that right now, but remember that MLAG is almost exclusively an L2 thing.
To see the status of individual MLAG interfaces, use the show mlag interfaces command:
Arista-A(config)#sho mlag int
                                                                local/remote
mlag       desc             state           local     remote    status
---------- ---------------- --------------- --------- --------- -------------
1          [ Arista-C ]     active-full     Po1       Po1       up/up
Here is the output of the same switch with three configured MLAGs, of which only one is active:
Arista-A#sho mlag int
                                                                local/remote
mlag       desc             state           local     remote    status
---------- ---------------- --------------- --------- --------- -------------
1          [ Arista-C ]     active-full     Po1       Po1       up/up
3                           inactive        Po3       Po3       down/down
5                           inactive        Po5       Po5       down/down
If MLAG is active, but the peer’s link (not the peer-link!) is down for whatever reason, the status of the MLAG will be Active-partial:
Arista-A#sho mlag int
                                                                local/remote
mlag       desc             state           local     remote    status
---------- ---------------- --------------- --------- --------- -------------
1          [ Arista-C ]     active-partial  Po1       Po1       up/down
Here is the same output from the peer with the interface that’s down. Check out the local/remote status and how it’s flipped from the other side because the local interface is always shown first:
Arista-B#sho mlag int
                                                                local/remote
mlag       desc             state           local     remote    status
---------- ---------------- --------------- --------- --------- -------------
1          [ Arista-C ]     active-partial  Po1       Po1       down/up
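The relationship between the local/remote status and the state column in these four outputs can be summarized in a tiny Python sketch. It covers only the three states shown here; the function name is mine, and the real switch has more states (the show mlag summary also counts Disabled and Configured ports) that this model does not attempt to reproduce.

```python
def mlag_port_state(local_up: bool, remote_up: bool) -> str:
    """Map local/remote link status to the state shown in the examples above."""
    if local_up and remote_up:
        return "active-full"
    if local_up or remote_up:
        # One side is down; which side depends on which peer you ask.
        return "active-partial"
    return "inactive"


seen_from_a = mlag_port_state(local_up=True, remote_up=False)   # up/down on Arista-A
seen_from_b = mlag_port_state(local_up=False, remote_up=True)   # down/up on Arista-B
```

Both peers report active-partial for the same failure; only the order of the local/remote status differs.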
By the way, if you encounter a scenario in which someone has used nonmatching port-channel and MLAG numbers, the show mlag interfaces command will be where you’d look to figure it out. Also, smack them in the back of the head for doing that. It’s OK. They probably deserve it.
The output of show mlag shows you totals as opposed to specific interface information. In this case there is one configured MLAG interface that is active-partial:
Arista-A#sho mlag
MLAG Configuration:
domain-id              : Arista-AB
local-interface        : Vlan4094
peer-address           : 10.255.255.2
peer-link              : Port-Channel1000
peer-config            : inconsistent

MLAG Status:
state                  : Active
negotiation status     : Connected
peer-link status       : Up
local-int status       : Up
system-id              : 2a:99:3a:06:6e:0f
dual-primary detection : Disabled

MLAG Ports:
Disabled               : 0
Configured             : 0
Inactive               : 0
Active-partial         : 1
Active-full            : 0
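Because this output is a plain list of key and value pairs, it is easy to turn into a dictionary for scripted checks. The following Python sketch assumes the text has already been captured (for example, pasted from a terminal session) with one pair per line; parse_show_mlag is a hypothetical helper, not an Arista tool. In practice, structured output from eAPI would be a better starting point than scraping text.

```python
def parse_show_mlag(text: str) -> dict:
    """Parse 'key : value' lines from captured show mlag text into a dict."""
    parsed = {}
    for line in text.splitlines():
        # Split on the first colon only: the system-id value contains colons.
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Section headings such as "MLAG Status:" have no value; skip them.
        if sep and key and value:
            parsed[key] = value
    return parsed


sample = """MLAG Configuration:
domain-id              : Arista-AB
peer-config            : inconsistent
MLAG Status:
system-id              : 2a:99:3a:06:6e:0f
MLAG Ports:
Active-partial         : 1
"""
info = parse_show_mlag(sample)
needs_attention = info["peer-config"] != "consistent"
```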
To get some more detail regarding the state of MLAG in general, use the show mlag detail command:
Arista-A#sho mlag detail
MLAG Configuration:
domain-id              : Arista-AB
local-interface        : Vlan4094
peer-address           : 10.255.255.2
peer-link              : Port-Channel1000
peer-config            : inconsistent

MLAG Status:
state                  : Active
negotiation status     : Connected
peer-link status       : Up
local-int status       : Up
system-id              : 2a:99:3a:06:6e:0f
dual-primary detection : Disabled

MLAG Ports:
Disabled               : 0
Configured             : 0
Inactive               : 0
Active-partial         : 1
Active-full            : 0

MLAG Detailed Status:
State                           : primary
Peer State                      : secondary
State changes                   : 4
Last state change time          : 0:29:21 ago
Hardware ready                  : True
Failover                        : False
Last failover change time       : never
Secondary from failover         : False
Peer MAC address                : 28:99:3a:06:6e:ed
Peer MAC routing supported      : True
Reload delay                    : 300 seconds
Non-MLAG reload delay           : 300 seconds
Peer ports errdisabled          : False
Lacp standby                    : False
Configured heartbeat interval   : 4000 ms
Effective heartbeat interval    : 4000 ms
Heartbeat timeout               : 60000 ms
Last heartbeat timeout          : never
Heartbeat timeouts since reboot : 0
UDP heartbeat alive             : True
Heartbeats sent/received        : 22003/22004
Peer monotonic clock offset     : 24.884958 seconds
Agent should be running         : True
P2p mount state changes         : 1
Fast MAC redirection enabled    : False
You might have noticed that the output of show mlag and show mlag detail in the previous examples showed that the configuration was inconsistent. If you’ve ever used Cisco’s vPC, you know that it can be a bit finicky if things aren’t configured properly. In the early days of the Cisco Nexus, I went through some pretty terrible outages due to this behavior, so I was pretty happy to discover that Arista does not enforce config sanity. Of course, that can be a pretty sharp double-edged sword, but thankfully Arista has included the means to check config sanity between your MLAG peers using the show mlag config-sanity command.
Arista-A#show mlag config-sanity
No global configuration inconsistencies found.

Interface configuration inconsistencies:

Feature     Attribute           Interfaces   Local value   Peer value
----------- ------------------- ------------ ------------- ----------
bridging    access-vlan         mlag1 Po1    100           -
Somewhere along the line, someone (I bet it was me) configured the MLAG’d port-channel to have the switchport access vlan 100 command on one side but not the other. Sure enough, here’s Arista-A:
Arista-A#sho run int po 1
interface Port-Channel1
   description [ Arista-C ]
   switchport access vlan 100
   mlag 1
Here’s Arista-B:
Arista-B#sho run int po 1
interface Port-Channel1
   description [ Arista-C ]
   mlag 1
After banging my head on the desk to celebrate my stupidity, I removed the offending command and checked again:
Arista-A(config)#int po 1
Arista-A(config-if-Po1)#no switchport access vlan
Arista-A(config-if-Po1)#sho mlag config-sanity
No global configuration inconsistencies found.
No per interface configuration inconsistencies found.
If you’d prefer to see all the sanity information instead of only what’s wrong, you can tack the all keyword on the end of the command. Be prepared for some verbosity, though:
Arista-A(config-if-Po1)#sho mlag config-sanity all
Global configuration:

Feature   Attribute                        Local value     Peer value
--------- -------------------------------- --------------- ---------------
bridging  admin-state vlan 1               active          active
bridging  admin-state vlan 5               active          active
bridging  admin-state vlan 100             active          active
bridging  admin-state vlan 4094            active          active
bridging  MLAG-Peer trunk-group vlan 4094  True            True
lag       lacp port-id range               [0,0]           [0,0]
lag       lacp system-priority             32768           32768
mlag      dual-primary-action              dualPriActNone  dualPriActNone
mlag      dual-primary-detection-delay     0               0
mlag      heartbeat-interval               4000            4000
mlag      peer-mac-routing-enabled         False           False
mlag      reload-delay                     0               0
mlag      reload-delay lacp mode           False           False
mlag      reload-delay non-mlag            0               0
stp       4094 disabled-vlan               True            True
stp       bpduguard rate-limit interval    0               0
stp       bridge assurance                 True            True
stp       forward-time                     15              15
stp       hello-time                       2000            2000
stp       loopguard                        False           False
stp       max-age                          20              20
stp       max-hops                         20              20
stp       mode                             mstp            mstp
stp       mst pvst border                  False           False
stp       portchannel guard                False           False
stp       portfast bpdufilter              False           False
stp       portfast bpduguard               False           False
stp       transmit hold-count              6               6

Interface configuration:

Feature   Attribute              Interfaces  Local value  Peer value
--------- ---------------------- ----------- ------------ ------------
bridging  access-vlan            mlag1 Po1   -            -
bridging  switchport-mode        mlag1 Po1   -            -
bridging  trunk-allowed vlan     mlag1 Po1   -            -
bridging  trunk-native-vlan      mlag1 Po1   -            -
lag       lacp fallback          mlag1 Po1   none         none
lag       lag mode               mlag1 Po1   lacp         lacp
vxlan     100 vlan-to-vni        Vx1         10100        10100
vxlan     arp-ip-address-local   Vx1         False        False
vxlan     multicast-group        Vx1         0.0.0.0      0.0.0.0
vxlan     source-interface       Vx1         Loopback1    Loopback1
vxlan     udp-port               Vx1         4789         4789
I recommend that the show mlag config-sanity command be the last step in any change-controls that involve MLAG. It can’t hurt, and it might just save your job.
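Conceptually, the sanity check is a comparison of attribute values between the two peers. The Python sketch below illustrates that idea only; it says nothing about how EOS implements the command. The function name and the attribute dictionaries are hypothetical, and a dash stands in for an unset value, as in the output above.

```python
def find_inconsistencies(local: dict, peer: dict) -> list:
    """Return (attribute, local value, peer value) for every mismatch.

    An attribute missing on one side is reported with "-" for that side.
    """
    rows = []
    for attr in sorted(set(local) | set(peer)):
        local_value = local.get(attr, "-")
        peer_value = peer.get(attr, "-")
        if local_value != peer_value:
            rows.append((attr, local_value, peer_value))
    return rows


# The Po1 mistake from this section: access VLAN set on one side only.
arista_a = {"access-vlan": "100", "lag mode": "lacp"}
arista_b = {"lag mode": "lacp"}
problems = find_inconsistencies(arista_a, arista_b)
```

Here problems contains a single row for access-vlan, mirroring the one-line finding that the real command reported.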
I’ve shown you how to configure MLAG, but I haven’t really explained how it all works. Let’s fix that.
When you connect two Arista switches by configuring MLAG, the two peers negotiate by comparing their system MAC addresses. The one with the lower system MAC address will be the winner and become the primary member of the pair. The loser (there are no trophies for second place in MLAG!) becomes the secondary peer. You cannot force this or override this behavior.
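As a worked example of that comparison, here is a small Python sketch using the two system MAC addresses that appear in this chapter’s lab output. The helper names are mine, and the sketch models only the initial negotiation described above: lowest system MAC wins.

```python
def mac_to_int(mac: str) -> int:
    """Convert a MAC in colon, dot, or dash notation to an integer."""
    digits = mac.replace(":", "").replace(".", "").replace("-", "")
    return int(digits, 16)


def elect_primary(mac_a: str, mac_b: str) -> str:
    """Return the MAC that wins the initial negotiation (the lower one)."""
    return min(mac_a, mac_b, key=mac_to_int)


# Arista-A and Arista-B system MACs from the lab in this chapter.
winner = elect_primary("2899.3a06.6e0f", "28:99:3a:06:6e:ed")
```

The two addresses differ only in the last byte (0f versus ed), so Arista-A wins and becomes primary.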
The primary peer “owns” the MLAG peer and is responsible for a bunch of stuff (technical term, that) that we’ll get to, but for now understand that the winner of the negotiation uses its system MAC address as the MLAG System ID. The MLAG System ID (MSI) is then used as the chassis-ID that’s sent to the devices that connect to the MLAG domain using port-channels. This is how MLAG tricks those devices into thinking that there is a single device: send an agreed-upon common chassis-ID from two different physical switches.
You can see the system MAC address by using the show version command:
Arista-A#sho ver | grep MAC
System MAC address: 2899.3a06.6e0f
You can see the MLAG System ID by using show mlag:
Arista-A#sho mlag
MLAG Configuration:
domain-id              : Arista-AB
local-interface        : Vlan4094
peer-address           : 10.255.255.2
peer-link              : Port-Channel1000
peer-config            : consistent

MLAG Status:
state                  : Active
negotiation status     : Connected
peer-link status       : Up
local-int status       : Up
system-id              : 2a:99:3a:06:6e:0f
dual-primary detection : Disabled

MLAG Ports:
Disabled               : 0
Configured             : 0
Inactive               : 0
Active-partial         : 1
Active-full            : 0
If you have an eagle-eye, you might have noticed that the two addresses are actually one bit off. Here I put them next to each other with the system-id slightly reformatted so that the numbers line up:
2899.3a06.6e0f
2a99:3a06:6e0f
Why this change? This has to do with the fact that the MLAG System ID remains unless both MLAG peers reboot or deconfigure MLAG. This can lead to a rare problem that goes something like this:
Imagine two switches in an MLAG pair. We’ll call them A and B. The two switches have the system MAC addresses of aaaa:aaaa:aaaa and bbbb:bbbb:bbbb, respectively. Because the lower MAC address wins, the MLAG System ID (MSI) becomes aaaa:aaaa:aaaa. Now, imagine that the primary switch fails. The secondary switch takes over (more about that in a bit), but the MSI is still aaaa:aaaa:aaaa. Now, let’s further imagine that switch A has failed and gets sent back to Arista for service. The new switch that takes its place has a system MAC address of cccc:cccc:cccc. When it joins the MLAG pair, the MSI is not renegotiated, and the new switch becomes the secondary in the pair. The MSI is still aaaa:aaaa:aaaa, even though that switch doesn’t physically exist in the pair.
Here’s where things get weird.
Suppose that the original switch A is returned and is then put in the network, not as one of the original peers, but as a third switch to be connected to the MLAG peers. This new third switch that we’ll call Switch C has a system MAC address of aaaa:aaaa:aaaa. Can you see the problem? Switch C (which used to be switch A, remember) will now try to connect to the MLAG pair using LACP, but LACP will see switch C’s own chassis-ID coming in from the MLAG pair and will err-disable the port.
To prevent this scenario from happening, the MLAG System ID is actually a derivation of the system-MAC address, and to accomplish that, the MSI is the winner’s system MAC address with the locally administered bit set. From Wikipedia:
The second bit of the first byte of a MAC address determines the type of OUI. If the bit is 0 then it is an OUI globally assigned by the IEEE; if the bit is 1 then it is a locally administered MAC address.
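In practice, “the second bit of the first byte” means the bit with value 0x02 in the first octet. The Python sketch below applies that to the system MAC from my lab. The function name is hypothetical, and the code simply reproduces the arithmetic described here rather than anything taken from EOS itself.

```python
def mlag_system_id(system_mac: str) -> str:
    """Derive an MLAG System ID by setting the locally administered bit."""
    digits = system_mac.replace(":", "").replace(".", "").replace("-", "").lower()
    if len(digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {system_mac!r}")
    octets = [int(digits[i:i + 2], 16) for i in range(0, 12, 2)]
    # 0x02 in the first octet is the locally administered bit.
    octets[0] |= 0x02
    return ":".join(f"{octet:02x}" for octet in octets)


msi = mlag_system_id("2899.3a06.6e0f")
print(msi)
```

The first octet goes from 0x28 (binary 0010 1000) to 0x2a (binary 0010 1010), which is exactly the one-bit difference between the show version and show mlag outputs. Because a factory-assigned address should have that bit cleared, the derived ID should never match a switch’s own burned-in system MAC.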
To see if your peer is primary or secondary, use the show mlag detail command, which also shows you the MLAG System ID:
Arista-A#sho mlag detail
MLAG Configuration:
domain-id              : Arista-AB
local-interface        : Vlan4094
peer-address           : 10.255.255.2
peer-link              : Port-Channel1000
peer-config            : consistent

MLAG Status:
state                  : Active
negotiation status     : Connected
peer-link status       : Up
local-int status       : Up
system-id              : 2a:99:3a:06:6e:0f
dual-primary detection : Disabled

MLAG Ports:
Disabled               : 0
Configured             : 0
Inactive               : 0
Active-partial         : 1
Active-full            : 0

MLAG Detailed Status:
State                           : primary
Peer State                      : secondary
State changes                   : 4
Last state change time          : 22:05:12 ago
Hardware ready                  : True
Failover                        : False
Last failover change time       : never
Secondary from failover         : False
Peer MAC address                : 28:99:3a:06:6e:ed
Peer MAC routing supported      : True
Reload delay                    : 300 seconds
Non-MLAG reload delay           : 300 seconds
Peer ports errdisabled          : False
Lacp standby                    : False
Configured heartbeat interval   : 4000 ms
Effective heartbeat interval    : 4000 ms
Heartbeat timeout               : 60000 ms
Last heartbeat timeout          : never
Heartbeat timeouts since reboot : 0
UDP heartbeat alive             : True
Heartbeats sent/received        : 41441/41441
Peer monotonic clock offset     : 24.456001 seconds
Agent should be running         : True
P2p mount state changes         : 1
Fast MAC redirection enabled    : False
Because the output of show mlag detail is so verbose and gets used a lot during failover scenarios, I’m paring that output down in various ways from this point on. I don’t want this book to be 700 pages. You’re welcome.
Again, there is no way to force one side to be primary short of rebooting the primary switch in order to force a failover. For someone like me who likes all of the devices on the left side of my Visio drawings to be active, this is maddening. There is also no command that you can use to force a failover (well, you could reboot one of them, but that seems excessive). Because I get to work at Arista, I asked the developers why they would deprive me of forced-failover joy, and the answer I received was basically that there is no reason or benefit to having one side be primary over the other. It’s taken me years to accept that, but in my experience, it’s true. I’ve moved on and let it go. Mostly.
When the primary switch reboots for whatever reason, the secondary switch becomes primary. Note that the MLAG System ID remains the same. Remember that in my lab Arista-A was primary, so I went and rebooted it. With it rebooting, I looked at Arista-B:
Arista-B#sho mlag det | grep State
State : primary
Peer State : primary
State changes : 9
Curious as to why both sides are primary, I asked the developers, who said that this is by design: this peer last saw the other peer as primary and assumes that it still is, but because it has lost its connection, it has also assumed the role of primary. When the other side comes up and communicates again, the status will change. Indeed, after the other switch comes back up, we see a better status:
Arista-B#sho mlag det | grep State
State : primary
Peer State : secondary
State changes : 9
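The two outputs above can be read mechanically. Here is a small sketch of that interpretation in Python; the function name describe_roles and the wording of the messages are mine, and the logic only encodes the explanation given in the text, not anything EOS exposes.

```python
def describe_roles(state, peer_state):
    """Interpret the State / Peer State pair from show mlag detail.

    Hypothetical helper that encodes the explanation in the text: a
    switch that loses its peer takes over as primary but keeps showing
    the last role it saw the peer in.
    """
    if state == "primary" and peer_state == "primary":
        return ("peer unreachable: local switch took over and still "
                "remembers the peer as primary")
    if state == peer_state:
        return "unexpected: both sides report " + state
    return "normal: local is " + state + ", peer is " + peer_state


print(describe_roles("primary", "primary"))
print(describe_roles("primary", "secondary"))
```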
Remember, if Arista-A failed outright and I replaced it with a new switch, there would no longer be a physical switch with that MAC address in the mix, but the MLAG System ID would remain unchanged unless MLAG is completely deconfigured from both switches in the MLAG domain.
After the Arista-A switch comes back up it remains the secondary switch even though it has the lower system MAC address because there is no preemption. Again, it doesn’t matter which side is primary, so the switches don’t fail over unless there is an outage. Arista does not do preemption because that would just cause more potential network instability, so why force it?
Arista-A#sho mlag det | grep State
State : secondary
Peer State : primary
State changes : 2
When Arista-A (the original primary that failed) comes back online, all of the interfaces on that switch with the exception of L3 interfaces and the MLAG peer-link pairs are set to errdisabled:
Arista-A#sho int status
Port       Name                 Status       Vlan       Duplex  Speed  Type
Et1                             errdisabled  1          auto    auto   1000BASE-T
Et2                             errdisabled  1          full    10G    Not Present
Et3                             errdisabled  1          full    10G    Not Present
Et4                             errdisabled  1          full    10G    Not Present
Et5                             errdisabled  1          full    10G    Not Present
Et6                             errdisabled  1          full    10G    Not Present
Et7                             errdisabled  1          full    10G    Not Present
[--output removed--]
Et45                            errdisabled  1          full    10G    Not Present
Et46                            errdisabled  1          full    10G    Not Present
Et47       [ MLAG Peer ]        connected    in Po1000  full    10G    10GBASE-CR
Et48       [ MLAG Peer ]        connected    in Po1000  full    10G    10GBASE-CR
Et49/1                          errdisabled  1          full    100G   Not Present
Et50/1                          errdisabled  1          full    100G   Not Present
Et51/1                          errdisabled  1          full    100G   Not Present
Et52/1                          errdisabled  1          full    100G   Not Present
Et53/1                          errdisabled  1          full    100G   Not Present
Et54/1                          errdisabled  1          full    100G   Not Present
Ma1                             connected    routed     a-full  a-1G   10/100/1000
Po1        [ Arista-C ]         notconnect   1          full    unconf N/A
Po1000     [ MLAG Peer-Link ]   connected    trunk      full    20G    N/A
This is to protect your network. If there were something more seriously wrong and this switch were endlessly rebooting, you wouldn’t want all of your connected port-channels to bounce and rehash constantly. Think of this as a type of hold-down timer that lets the network stabilize after an outage or planned reboot.
How long do they stay errdisabled? The reload-delay timer defaults to 300 seconds for fixed-configuration switches, and to 1200 or 1800 seconds for chassis-based switches depending on the hardware platform (starting around EOS 4.21 or so). You can change this behavior with the mlag configuration command reload-delay seconds. Any value configured below the default will result in a warning when a reload is done (see “MLAG In-Service Software Upgrade”):
Arista-A(config)#mlag configuration
Arista-A(config-mlag)#reload-delay 120
On newer EOS code you can actually define the behavior of non-MLAG interfaces separately from those that belong to an MLAG. A non-MLAG interface is one that does not participate in an MLAG, which I suppose is pretty obvious. The reason for this ability is really to allow L3 interfaces to come up before the L2 MLAG port-channels so that routing protocols can stabilize before MLAGs are rehashed. We do this by using the reload-delay non-mlag timer command:
Arista-A(config-mlag)#reload-delay non-mlag 60
Remember, all MLAG configurations should be the same on both sides:
Arista-B(config)#mlag configuration
Arista-B(config-mlag)#reload-delay 120
Arista-B(config-mlag)#reload-delay non-mlag 60
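Because the two sides have to match, this is an easy thing to check with a script. The sketch below is illustrative only: the helper names are hypothetical, it assumes you have each peer’s MLAG configuration as text, and it uses the 300-second fixed-configuration default quoted above as the warning threshold.

```python
import re

# Default for fixed-configuration switches, as quoted in the text.
DEFAULT_RELOAD_DELAY = 300


def reload_delays(config_text):
    """Return the reload-delay values found, e.g. {'mlag': 120, 'non-mlag': 60}."""
    delays = {}
    for line in config_text.splitlines():
        m = re.match(r"\s*reload-delay\s+(?:(non-mlag)\s+)?(\d+)\s*$", line)
        if m:
            delays[m.group(1) or "mlag"] = int(m.group(2))
    return delays


def check_pair(config_a, config_b):
    """List the problems found when comparing the two peers' reload delays."""
    a, b = reload_delays(config_a), reload_delays(config_b)
    problems = []
    if a != b:
        problems.append(f"peers differ: {a} vs {b}")
    for name, delays in (("A", a), ("B", b)):
        # A switch with no reload-delay line is treated as using the default.
        secs = delays.get("mlag", DEFAULT_RELOAD_DELAY)
        if secs < DEFAULT_RELOAD_DELAY:
            problems.append(
                f"{name}: reload-delay {secs} is below {DEFAULT_RELOAD_DELAY}")
    return problems


cfg_a = "mlag configuration\n   reload-delay 120\n   reload-delay non-mlag 60\n"
cfg_b = "mlag configuration\n   reload-delay 120\n"

for problem in check_pair(cfg_a, cfg_b):
    print(problem)
```

In this example Arista-B is missing the non-mlag line, so the script reports the mismatch as well as the below-default value on both switches.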
You can see how much time is left with the show mlag and show mlag detail commands during a reload:
Arista-A#sho mlag det | grep state
state : Active/Reload (0:01:55 left)
Last state change time : 0:00:22 ago
P2p mount state changes : 1
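If you are scripting around a reload and want to wait out the timer, the remaining time can be pulled from that state line. This is a minimal sketch that assumes the "(H:MM:SS left)" format shown above; the function name is hypothetical.

```python
import re


def reload_seconds_left(state_value):
    """Return the seconds remaining from a value such as
    'Active/Reload (0:01:55 left)', or None if no timer is shown."""
    m = re.search(r"\((\d+):(\d{2}):(\d{2}) left\)", state_value)
    if not m:
        return None
    hours, minutes, seconds = (int(g) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds


print(reload_seconds_left("Active/Reload (0:01:55 left)"))
print(reload_seconds_left("Active"))
```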
You can see what the configured delay is by using show mlag detail:
Arista-A#sho mlag detail | grep delay
Reload delay : 120 seconds
Non-MLAG reload delay : 60 seconds
You can see whether MLAG is what’s holding down your interfaces with the show mlag detail command, as well:
Arista-A#sho mlag det | grep err
Ports errdisabled : True
Again, this allows all of the upper-level protocols to stabilize before traffic is forwarded over the links. Additionally, ports don’t always come up in the order in which we might expect. For example, the peer-link should always come up first in order for MLAG to work properly, but I always configure the peer-link to be the last ports on the switch. If the switch were to initialize ports in the order in which they are shown in the configuration, the peer-link would come up last. The delay is applied to all non-peer-link ports to prevent that from happening.
Again, you can configure this interval by using the reload-delay command within the MLAG configuration, although you should take care when altering this value given that network instability can result when the delay is too short.
The time it takes for a switch to finish booting varies based on the number of ports in the switch and the complexity of the configuration. For example, a 7516R with more than 1,000 ports will take a bit longer to come up than a 7150 with only 24 ports. The 300-second timer value was chosen as a conservative value for a typical 1–rack unit (RU) switch. If you’re using chassis switches with hundreds of ports, the value might need to be higher, and Arista recommends 12 minutes (720 seconds) for big chassis deployments.
Remember that the other link in the MLAG interface (e33 on Arista-B in this example) is up and forwarding traffic during the Arista-A outage. So long as your devices are dual homed to both switches using MLAG, they should stay online while one of the switches in the MLAG pair reboots.
Split-brain is the scenario in which the peer-link somehow fails completely and both MLAG peers become primary devices. That’s considered bad, though surprisingly in a truly dual-homed environment in which everything is working at L2, it might not be the end of the world. But let’s assume that it’s bad (because it usually is) and see what we can do to prevent it.
Arista calls a split-brain situation dual-primary and has thus created a feature in EOS 4.21 called dual-primary detection. This is similar in principle to that other vendor’s feature called the Peer Keepalive Link in vPC. To configure dual-primary detection, you must set the peer-address heartbeat ip-address command in MLAG configuration mode on each side. Here is the configuration for Arista-A:
Arista-A(config)#mlag configuration
Arista-A(config-mlag)#peer-address heartbeat 10.0.0.8
Here is the matching configuration on Arista-B. For each switch, these IP addresses are the management interface IPs on the other peer.
Arista-B(config)#mlag configuration
Arista-B(config-mlag)#peer-address heartbeat 10.0.0.7
With that configured, you can also alter the behavior should a dual-primary state be detected with the dual-primary command. The only real option here is the number of seconds of delay, which you can set from 1 to 1,000 seconds (the last keyword all-interfaces in the command has been abbreviated to make it fit on the page). I’ve configured it the same way on both sides:
Arista-A(config-mlag)#dual-primary detection delay 10 action errdisable all-int
Arista-B(config-mlag)#dual-primary detection delay 10 action errdisable all-int
To see whether dual-primary is configured, use the show mlag detail command:
Arista-B#sho mlag detail | grep -i Dual
dual-primary detection : Configured
Dual-primary detection delay : 10
Dual-primary action : errdisable-all
With this configured, if the peer-link should go down (you can’t shut down the peer-link with interface commands, so it would need to be a hard failure), whichever switch is secondary will take over as primary immediately but will then start dual-primary detection, which basically listens for heartbeats from the IP address configured in the heartbeat command. It does this only after the delay (if so configured). If dual-primary is detected, it will err-disable all interfaces. When and if the dual-primary state clears, normal MLAG operation should continue.
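To make the sequence concrete, here is a deliberately simplified model of the decision the former secondary makes. It is a sketch of the behavior as described above, not how EOS implements it; the function name, parameters, and return strings are all invented for illustration.

```python
def dual_primary_action(peer_link_up, seconds_since_link_loss,
                        heartbeat_seen, delay=10):
    """Simplified model of dual-primary detection on the former secondary.

    Assumptions (mine, not Arista's): a heartbeat from the peer's
    configured address means the peer is still alive, and the check is
    only made once the configured delay has elapsed.
    """
    if peer_link_up:
        return "normal"
    if seconds_since_link_loss < delay:
        # Already acting as primary, but detection has not started yet.
        return "waiting"
    if heartbeat_seen:
        # Peer-link is down but the peer still answers: dual-primary.
        return "errdisable all interfaces"
    # No heartbeat: the peer really is gone, so staying primary is correct.
    return "stay primary"


print(dual_primary_action(False, 3, True))
print(dual_primary_action(False, 10, True))
print(dual_primary_action(False, 10, False))
```

The point of the heartbeat over the management network is the last two cases: without it, the switch cannot tell a dead peer from a dead peer-link.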
What if you need to connect one MLAG pair to another MLAG pair (or a pair of Cisco switches using vPC, etc.)? Guess what? Wait for it…nothing changes. Well, you get to use the terrible phrase Bow Tie MLAG, so that’s something.
Remember, MLAG exists to trick LACP into working. MLAG does not need to be “compatible” with another vendor’s solution because the LACP implementation already works. Cisco’s vPC solution accomplishes much the same thing (though internally in very different ways), so all an Arista MLAG pair should see from vPC is LACP, and all a Cisco vPC pair should see from Arista is, again, LACP.
The two switches on the top (A and B) in Figure 18-10 are an MLAG pair, and the two switches on the bottom (C and D) are an MLAG pair. To connect them together as shown, each pair should have its own MLAG domain ID. Actually, that really doesn’t matter—they can be the same—which is contrary to what I wrote in the first edition. The MLAG domain is locally significant to the MLAG domain (it doesn’t leak out) unless you try to attach a third switch somehow, which isn’t allowed, anyway.
What you’ll find if you build this is that it will work if they all have the same MLAG domain. So why require an MLAG domain at all? To make sure that the two configured devices should really be peering. I don’t like having multiple pairs with the same MLAG domain name, but I’ve seen it more than once. Similarly, because the MLAG configuration is local to the peers, I’ve seen multiple MLAG pairs in an environment using the same IP addresses for the peer-links on each pair! I don’t recommend this, but it does seem to work, and I’ve seen many customers who have done this. If it were my network, I can tell you that you’d be fixing that, though. While it might work at L2, if you then migrate to an L3 dynamic environment and do something like redistribute-connected, you’ll get those IP addresses advertised from every pair.
MLAG In-Service Software Upgrade (ISSU) is a feature enabled on EOS version 4.9.3 and later, and at this point I really hope you’re using code that’s much later than 4.9.3. With MLAG ISSU, you can upgrade an MLAG switch pair with minimal (subsecond) packet loss and no STP reconvergence. Without MLAG ISSU or if you upgrade while ignoring the switch’s dire warnings regarding the state of MLAG ISSU, you’ll likely have one or more network topology changes that will result in one or more STP reconvergence events, and no one wants that.
The Arista documentation on MLAG ISSU indicates that the following steps need to be followed in this order to properly upgrade an MLAG ISSU switch pair:
1. Verify primary/secondary state of MLAG on each switch using the show mlag detail command, or to be brief, the show mlag det | grep State (with a capital “S”) command.
2. Ensure configuration consistencies.
3. Resolve ISSU warnings (from the output of reload).
4. Upgrade MLAG secondary switch.
5. Monitor MLAG status using show mlag detail.
6. Confirm MLAG secondary status.
7. Upgrade MLAG primary peer switch.
8. Confirm overall MLAG status.
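The part of this procedure that people get wrong is the order, so it is worth encoding if you automate upgrades. The sketch below takes the State value you collected from each switch in the first step and returns the order in which to upgrade them. The function name and the input format are hypothetical.

```python
def upgrade_order(roles):
    """Given {'switch name': 'primary' | 'secondary'}, return the
    switches in upgrade order: secondary first, then primary.

    Refuses to continue unless there is exactly one of each, since
    anything else means the pair is not in a healthy state.
    """
    secondaries = [name for name, role in roles.items() if role == "secondary"]
    primaries = [name for name, role in roles.items() if role == "primary"]
    if len(roles) != 2 or len(secondaries) != 1 or len(primaries) != 1:
        raise ValueError(f"expected one primary and one secondary, got {roles}")
    return [secondaries[0], primaries[0]]


print(upgrade_order({"Arista-A": "primary", "Arista-B": "secondary"}))
```

Remember that the roles swap after the first switch reloads only if the primary went down, so collect the states again before each upgrade rather than reusing an old answer.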
When upgrading chassis switch peers that contain dual supervisors, you’ll need to upgrade the standby supervisors on both switches, then upgrade the active supervisor on the MLAG secondary, and finally upgrade the last remaining supervisor.
By having switches running MLAG ISSU code, the switches will know whether they can be upgraded without causing an outage. If they cannot, the switch will give you a warning when rebooting. Here’s an example of such a warning on a switch running 4.21.1F:
Arista-A#reload
If you are performing an upgrade, and the Release Notes for the new version of EOS indicate that MLAG is not backwards-compatible with the currently installed version (4.21.1F), the upgrade will result in packet loss.

Stp is not restartable. Topology changes will occur during the upgrade process.

The following MLAGs are not in Active mode. Traffic to or from these ports will be lost during the upgrade process:

                                                                  local/remote
mlag       desc             state            local     remote     status
---------- ---------------- ---------------- --------- ---------- ----------
1          [ Arista-C ]     active-partial   Po1       Po1        up/down

The configured reload delay of 120 seconds is below the recommended value of 300 seconds. A longer reload delay allows more time to rollback an unsuccessful upgrade due to incompatibility.

System configuration has been modified. Save? [yes/no/cancel/diff]:
As I often joke in my classes, network engineers seem genetically predisposed to being incapable of reading walls of text. If you see a bunch of text like this after typing reload, read the damn screen!
Using the reload now command will cause the switch to bypass these warnings, so don’t use the reload now command when doing an MLAG ISSU upgrade. This is not meant to be a trick to avoid walls of text, even if that’s why a bunch of us do it.
Here’s a list of common ISSU warnings and the ways to resolve them.
“MLAG is not backwards-compatible”: The version to which you’re upgrading might not be compatible with the version you’re on. But then again, it might! Read the release notes to make sure that it is.
“Stp is not restartable”: Usually waiting 30 to 120 seconds will reward you with this warning resolving itself. To see the status of STP restartability (I totally made that word up), use the show spanning-tree bridge detail command:
Arista-A#sho spanning-tree bridge detail | inc agent
Stp agent restartable : True
“The following MLAGs are not in Active mode”: The MLAG shown is not active on the other switch in the MLAG pair. If it should be, bring it up. This is a warning that you’ll end up black-holing a device if you continue the reload, so make sure that this is what you’re expecting.
“The configured reload delay ... is below the recommended value”: Remember the reload delay we talked about earlier in this chapter? Well, if the switch thinks that it’s too low (lower than the default of 300 seconds for top-of-rack switches and 600 seconds for modular switches), it will bark at you with this warning.
errdisabled interfaces: This is usually an indication that you’re impatient and haven’t waited long enough for the peer to reboot. Remember, the peer’s MLAG-enabled interfaces will stay in an errdisabled state for the duration of the reload delay after booting, assuming the other switch is up, and if you’re on a switch that shows this warning, that’s a good assumption.
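For those of us who really won’t read the wall of text, a script can at least point at the parts that matter. This is a rough sketch that looks for phrases taken from the sample reload output earlier in this section and pairs each with the resolution described above. The phrase list is an assumption based on that one 4.21.1F sample; other EOS versions may word the warnings differently, and the function name is hypothetical.

```python
# Phrases copied from the sample reload output in this chapter, mapped
# to the resolution described in the text.
WARNING_MARKERS = {
    "MLAG is not backwards-compatible":
        "read the release notes for the target version",
    "Stp is not restartable":
        "wait 30 to 120 seconds and check again",
    "not in Active mode":
        "bring the MLAG up on the peer, or accept the outage",
    "below the recommended value":
        "raise the reload delay or accept the shorter rollback window",
}


def issu_warnings(reload_output):
    """Return the advice for every known warning found in the output."""
    # Collapse whitespace so a phrase wrapped across lines still matches.
    flat = " ".join(reload_output.split())
    return [advice for marker, advice in WARNING_MARKERS.items()
            if marker in flat]


sample = """Stp is not restartable. Topology changes will occur during
the upgrade process.
The configured reload delay of 120 seconds is below the recommended
value of 300 seconds."""

for advice in issu_warnings(sample):
    print(advice)
```

An empty list means none of the known phrases appeared, which is not the same thing as the switch being safe to upgrade. You still have to read the screen.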
The biggest step you should take before considering an MLAG ISSU upgrade is to carefully read the release notes and Transfer of Information (TOI) documents found on the Arista support site. You can find them alongside the EOS binary images. Don’t be afraid to call or email your Arista sales engineer or open a TAC case either. Some shops don’t do upgrades often enough to remain sharp on the syntax and gotchas, and these folks love to help.
For an example of using Layer 3 with MLAG, check out Chapter 21, which builds an L3 Equal-Cost MultiPathing (ECMP) network including VXLAN terminating on an MLAG pair.
When MLAG is configured, one of the switches in the MLAG cluster will become the primary switch. The MLAG primary switch will do all of the STP processing, and changes to the secondary will have no effect. There is a pretty big caveat to that statement, though: changes made to the secondary MLAG switch’s STP configuration will be accepted into the running-config, but they will not take effect until the primary MLAG switch relinquishes its role as primary. At that point, all of the commands entered on the secondary (now primary) switch will suddenly become active. What’s worse, you might not see this coming. Allow me to demonstrate.
I have two switches, Arista-A and Arista-B, configured as an MLAG pair. I have STP left to defaults, and Arista-A is the primary switch in the MLAG domain. I’ll be working on Arista-B, so here’s proof that it’s the MLAG secondary switch:
Arista-B(config)#sho mlag detail | grep State
State : secondary
Peer State : primary
State changes : 2
And here’s the Spanning Tree status:
Arista-B(config)#sho spanning-tree
MST0
  Spanning tree enabled protocol mstp
  Root ID    Priority    32768
             Address     2899.3a06.6769
             Cost        0 (Ext) 5999 (Int)
             Port        100 (Port-Channel1)
             Hello Time  2.000 sec  Max Age 20 sec  Forward Delay 15 sec

  Bridge ID  Priority    32768  (priority 32768 sys-id-ext 0)
             Address     2a99.3a06.6e0f
             Hello Time  2.000 sec  Max Age 20 sec  Forward Delay 15 sec

Interface        Role       State      Cost      Prio.Nbr Type
---------------- ---------- ---------- --------- -------- --------------------
Et1              designated forwarding 20000     128.247  P2p Edge
Et34             alternate  discarding 2000      128.234  P2p
PEt1             designated forwarding 20000     128.1    P2p Edge
PEt34            alternate  discarding 2000      128.34   P2p
Po1              root       forwarding 1999      128.100  P2p
Now, I’ll go into that switch (Arista-B) and start mucking with STP. I want to make the priority lower to force it to be the root:
Arista-B(config)#spanning-tree root primary
When I make this change, nothing happens:
Arista-B(config)#sho spanning-tree | grep Priority
Root ID    Priority    32768
Bridge ID  Priority    32768  (priority 32768 sys-id-ext 0)
Frustrated because my change has no effect, I decide to hardcode the priority even lower:
Arista-B(config)#spanning-tree priority 4096
Huh–still no change:
Arista-B(config)#sho spanning-tree | grep Priority
Root ID    Priority    32768
Bridge ID  Priority    32768  (priority 32768 sys-id-ext 0)
Beyond frustrated, I start to drink heavily because nothing makes a network change go more smoothly than alcohol.
If I hardcoded the priority to primary (8192) and then 4096, why didn’t it show my change? Disgusted and impatient, I rebooted the other switch, because that was so much easier than reading the documentation. Imagine, though, that instead of me rebooting a switch in a lab that these switches were in production, and after my changes didn’t work, I gave up and walked away. You know, because that’s what happens in real data centers. Anyway, for whatever reason, maybe months later, Arista-A (the primary MLAG switch) reboots. I’ll simulate this with a hard reload of Arista-A:
Arista-A(config)#reload now
Broadcast message from root@Arista-A (Sat Jan 26 21:20:42 2019):
The system is going down for reboot NOW!
All of a sudden and without any real warning, Arista-B is now the root bridge with a priority of 4096:
Arista-B(config)#sho spanning-tree | grep Priority
Root ID    Priority    4096
Bridge ID  Priority    4096  (priority 4096 sys-id-ext 0)
This happened because Arista-B is now the MLAG primary, as evidenced by the output of show mlag detail | grep State:
Arista-B(config)#sho mlag det | grep State
State : primary
Peer State : primary
State changes : 3
The fact that this happens like this is not really a problem; it is functioning by design. The problem is that when configuring STP on the secondary MLAG switch, there are no warnings that your changes are being saved, and no warnings that any changes made will take effect when and if this switch becomes the primary. Be very careful about making changes to STP when configuring the MLAG secondary switch.
This behavior was recorded on switches running EOS 4.21.1F. When I told Arista about it some six years ago, developers there told me that “the configuration should be the same on both peers.” Um...thanks.
To be fair, the developers have since added the show mlag config-sanity command, and had I followed my own advice from earlier in the chapter and issued that command at the end of my change control instead of walking away and not backing out my changes (honestly, I would probably have fired myself if I’d done that), the switch would have told me that I was in danger. Or would it?
Sadly, this is one of the few things that show mlag config-sanity does not check. I asked the developers about this, and they said that it was by design without any further explanation. Here’s proof of the fact that it’s not included. First, here’s the relevant configuration from Arista-A:
Arista-A#sho run section span
spanning-tree mode mstp
no spanning-tree vlan 4094
And here is Arista-B’s relevant configuration:
Arista-B#sho run section span
spanning-tree mode mstp
no spanning-tree vlan 4094
spanning-tree mst 0 priority 4096
As you can see, Arista-B has a Spanning Tree priority of 4096, which is a big change from the default of 32768 on Arista-A. Here’s what show mlag config-sanity says on Arista-A:
Arista-A#sho mlag config-sanity
No global configuration inconsistencies found.
No per interface configuration inconsistencies found.
Here’s what show mlag config-sanity says on Arista-B:
Arista-B#sho mlag config-sanity
No global configuration inconsistencies found.
No per interface configuration inconsistencies found.
The lesson to learn here is that the configurations should be the same on both peer switches, and you should always make sure that’s the case both with the show mlag config-sanity command and something like the sho run section span (or similar) command.
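Comparing the two Spanning Tree sections by eye works for three lines, but not for thirty. Since config-sanity won’t do it, here is a minimal sketch of a comparison you could run against the saved output of sho run section span from each peer. The function name is hypothetical, and the comparison is a simple line-by-line set difference that ignores ordering.

```python
def stp_differences(section_a, section_b):
    """Return (lines only on A, lines only on B) for two config sections."""
    lines_a = {line.strip() for line in section_a.splitlines() if line.strip()}
    lines_b = {line.strip() for line in section_b.splitlines() if line.strip()}
    return sorted(lines_a - lines_b), sorted(lines_b - lines_a)


arista_a = """spanning-tree mode mstp
no spanning-tree vlan 4094
"""

arista_b = """spanning-tree mode mstp
no spanning-tree vlan 4094
spanning-tree mst 0 priority 4096
"""

only_a, only_b = stp_differences(arista_a, arista_b)
print("Only on Arista-A:", only_a)
print("Only on Arista-B:", only_b)
```

Run against the two configurations shown above, this flags the priority line on Arista-B, which is exactly the change that was lying in wait.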
One last note, because this comes up a lot: no, you should not disable STP if you’re using MLAG (or any vendor’s MLAG-like technology). Ask any networking consultant whether they’ve heard of a Spanning Tree event being caused by someone bringing in a home office switch and connecting it where it didn’t belong. I know I’ve seen that more than once. Hell, I had a client who refused to run more than two Ethernet runs to each cube, insisting that should anyone need more ports, they could just bring in a switch from home. This is an outage waiting to happen, and STP is the last line of defense against the loop-inducing server guy who needs 14 ports on his desk. Do yourself a favor and outlaw switches on (or under) desks. And keep STP running, because when you outlaw desktop switches, only outlaws will have desktop switches...or something.
MLAG works great if you’re in need of a multihomed L2 design. I’ve taught people who favor end-to-end L3 designs who seem to get angry that MLAG exists, which always kind of amuses me. Arista is not forcing anyone to use any one design over another. If you need an L2 solution, MLAG is great. If you need L3, go for it.