There are some pretty useful diagnostic tools on Arista switches, some of which we’ve already covered, such as tcpdump. Sometimes, we need to know more detail about what the switch is doing, and that’s where performance monitoring comes into play.
In Linux, you can follow live updates to a log file by using the tail -f filename
command. You can do something similar when viewing your system log in EOS with the command-line interface (CLI) command show logging follow
. Syslog messages are stored in Linux where any Linux person would expect to see them, namely, in /var/log/:
[admin@Arista ~]$ ls /var/log EosInit error.log messages NorCalInit eventMon.db nginx-access.log Post fneserver nginx-error.log agents inotifyrun-local.log ntpstats btmp inotifyrun-secure.log ppp cli inotifyrun-sys.log qt cron kernel.debug secure eos lastlog spooler eos-console libvirt startup-config-output eos-console-sync logrot.log tallylog eos-monitor maillog wtmp eos-monitor-sync mcelog yum.log
/var/log/messages contains all of the syslog messages that you would see on the console. Note that you need sudo
to view this file:
[admin@Arista ~]$ cd /var/log [admin@Arista log]$ tail messages tail: cannot open 'messages' for reading: Permission denied [admin@Arista log]$ sudo tail messages Nov 29 02:07:34 Arista Stp: %SPANTREE-6-STABLE_CHANGE: Stp state is now stable Nov 29 02:07:56 Arista Cli: %SYS-5-CONFIG_I: Configured from console by admin on vty3 (10.0.0.100) Nov 29 02:08:01 Arista CROND[10746]: (root) CMD (/etc/archivecheck.sh &> /dev/null) Nov 29 02:08:33 Arista Cli: %SYS-5-CONFIG_E: Enter configuration mode from console by admin on vty3 (10.0.0.100) [--output truncated--]
The filesystem in EOS resides in memory, so anything outside of flash: will not survive a reboot. Additionally, some switches can be shipped with solid-state drives (SSDs), and if you have a switch that contains one, the system automatically archives many of the system log files onto that drive, which is drive: in CLI and /mnt/drive in Linux:
[admin@Arista log]$ cd /mnt/drive [admin@Arista drive]$ ls aquota.user archive lost+found var_archive.2016-11-21-03:20:02.dir var_archive.2016-11-21-06:19:05.dir var_archive.2016-11-21-14:30:02.dir var_archive.2016-11-21-19:15:02.dir var_archive.2016-11-27-02:30:02.dir var_archive.2016-11-27-22:29:02.dir var_archive.2016-11-27-22:45:02.dir
A great tool on Linux systems is the top
command. The top
command produces output that automatically updates every few seconds (the default is three seconds in Linux). You can call this command from the CLI with the show process top
command, or from Bash with the command top
. Here’s a sample output from a live 7280SE-72. This is a very healthy switch with nothing unusual going on. Depending on the switch platform you’re using, the processes near the top might change:
Arista(config)#sho proc top once top - 02:22:23 up 1:42, 1 user, load average: 0.28, 0.29, 0.29 Tasks: 289 total, 1 running, 288 sleeping, 0 stopped, 0 zombie %Cpu(s): 7.0 us, 1.2 sy, 0.0 ni, 91.5 id, 0.1 wa, 0.1 hi, 0.0 si, 0.0 st KiB Mem: 3796192 total, 3640996 used, 155196 free, 225716 buffers KiB Swap: 0 total, 0 used, 0 free, 1909320 cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3853 root 20 0 985m 264m 118m S 7.9 7.1 5:14.88 SandFap 3463 root 20 0 708m 97m 20m S 4.0 2.6 0:01.45 Aaa 3882 root 20 0 698m 114m 37m S 4.0 3.1 0:15.62 SandMact 11881 admin 20 0 25892 12m 9028 R 4.0 0.3 0:00.11 top 1 root 20 0 7900 3992 2500 S 0.0 0.1 0:02.22 systemd 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root 20 0 0 0 0 S 0.0 0.0 0:00.06 ksoftirqd/0 5 root 20 0 0 0 0 S 0.0 0.0 0:00.09 kworker/u:0 6 root rt 0 0 0 0 S 0.0 0.0 0:00.19 migration/0 8 root rt 0 0 0 0 S 0.0 0.0 0:00.19 migration/1 9 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0 10 root 20 0 0 0 0 S 0.0 0.0 0:00.06 ksoftirqd/1 11 root 20 0 0 0 0 S 0.0 0.0 0:00.07 kworker/0:1 13 root rt 0 0 0 0 S 0.0 0.0 0:00.52 migration/2
Because I’m a maniac with programming skills, I wrote a program called corecrusher that consumes 100% CPU resources for any number of CPU cores. Here I ran it for 30 seconds so that it would consume three cores. Notice that the CPU idle is at only 21.9% and you can see that there are three processes called corecrusher
, each consuming 99.8% of a CPU core.
Arista(config)#sho proc top once top - 02:32:00 up 1:51, 1 user, load average: 1.41, 0.60, 0.40 Tasks: 293 total, 5 running, 288 sleeping, 0 stopped, 0 zombie %Cpu(s): 77.4 us, 0.7 sy, 0.0 ni, 21.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 3796192 total, 3646380 used, 149812 free, 226364 buffers KiB Swap: 0 total, 0 used, 0 free, 1911628 cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 12573 admin 20 0 5040 652 332 R 99.8 0.0 0:27.50 corecrusher 12574 admin 20 0 5040 652 332 R 99.8 0.0 0:27.48 corecrusher 12575 admin 20 0 5040 652 332 R 99.8 0.0 0:27.54 corecrusher 3853 root 20 0 985m 264m 118m S 8.8 7.1 5:40.01 SandFap 12601 admin 20 0 25872 12m 9140 R 5.9 0.3 0:00.19 top 3553 root 20 0 689m 99m 24m R 2.9 2.7 1:40.72 Smbus 1 root 20 0 7900 3992 2500 S 0.0 0.1 0:02.26 systemd 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root 20 0 0 0 0 S 0.0 0.0 0:00.06 ksoftirqd/0 5 root 20 0 0 0 0 S 0.0 0.0 0:00.09 kworker/u:0 6 root rt 0 0 0 0 S 0.0 0.0 0:00.19 migration/0 8 root rt 0 0 0 0 S 0.0 0.0 0:00.19 migration/1
There are a variety of ways to sort this output, but the default is by CPU utilization. For 99% of what you’d use this tool for, that’s fine. The %CPU
column is what determines the process’s place on the list, but remember that processes on a Linux system can bounce up and down on this list in milliseconds, and you will likely see just that as you watch.
I recommend doing some further reading on the top
command online, but there are a couple of things to always look at when you use it. To begin, the very first line shows three numbers after the words load average:
. These numbers will vary from switch to switch, and they’re not necessarily an indication of anything unless you know what they are normally. In other words, a 7280R switch in one environment might run with a load average of (for example) 0.9, whereas another might run at an average of 1.5. The load average is the average number of processes in the run queue over the interval in question (1 min, 5 mins, 15 mins), and there are so many things that contribute to these numbers that they’re not terribly useful on a switch, in my opinion. That being said, if you see a number like 25.99, there’s probably something bad happening.
The next bold line in the output shows this:
%Cpu(s): 77.4 us, 0.7 sy, 0.0 ni, 21.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
This is a very important line, though again, it’s only for the snapshot of time when the last top
iteration ran. In other words, it’s not historical; it’s a snapshot. The numbers from left to right indicate the following values:
The combined CPUs spent 77.4% of their time on user processes (mostly agents on an Arista switch).
The combined CPUs spent 0.7% of their time on system processes (kernel, etc.).
The combined CPUs spent 0% of their time on user processes that have been niced (this is a Unix thing used for setting priorities that you really don’t need to worry about). Unless you’ve really messed around in Bash, this should probably always be 0, though there are some processes like khelper that pop up with a –20 nice from time to time in order to keep them high priority (a higher niceness means a lower priority).
The combined CPUs are spending 21.9% of their time completely idle. The bigger this number is, the happier the switch is. Note that this is not strictly true, because the forwarding of frames is done by the Application-Specific Integrated Circuit (ASIC) so even with 100% CPU utilization, the switch will likely be operating well.
The CPU spent 0% of its time waiting for I/O to complete.
The CPU spent 0% of its time servicing hardware interrupts.
The CPU spent 0% of its time servicing software interrupts.
The CPU spent 0% of its time stolen by a hypervisor.
On a healthy switch, idle should be high, whereas system and user should be low. On a busier switch, user might spike up and down as processes vie for the CPU’s attention. Remember, though, that this is just control-plane stuff like Open Shortest Path First (OSPF), Border Gateway Protocol (BGP), and Spanning Tree Protocol (STP). Packets are forwarded by ASIC, and not the CPU. Still, high CPU should be monitored, and you should contact Arista support if your CPU stays unreasonably high.
Debugging processes is significantly different in EOS than it is in other vendors’ industry-standard operating system environments. The amount of debug information available in EOS is staggering, but figuring out where to look can be a bit overwhelming if you’re used to using other vendors’ debug commands. After you get the hang of how it works in EOS, though, you’ll be amazed at the power. You’ll also quite possibly never do this unless TAC tells you to, but seeing as how there are limited places to see this documented, I figured it would be a good addition to this book.
First, EOS now has a debug
command, but it’s fairly limited in scope, covering only some basic IP and OSPF topics:
Arista(config)#debug ip ? general General debugging ospf Ospf protocol
In EOS, the concept of debugging is mostly replaced by the concept of tracing. Almost any process (EOS agents) can be traced with—you guessed it—the trace
command. You can’t just go typing trace ospf packets
, though, so step away from the keyboard and keep reading.
Tracing in EOS can produce an overwhelming amount of data, so by default, the output of your trace does not go to the console or to your Secure Shell (SSH) sessions. Because of the large amount of data, output first must be directed to a file. In this section, I show you how to view the output to the screen via a couple of methods, but let’s take a look at how to get a trace started first.
Be careful when using trace
. It can consume system resources, though unlike debug
in other vendors’ operating systems, it does not take priority over other processes, nor does it stop everything else to deliver its output to the console. (Seriously, who thought that was a good idea?)
Using OSPF as an example, it quickly becomes obvious that you don’t trace protocols in EOS, but rather agents. In fact, all agents have hooks in them for the trace system, which is a pretty clever way for developers to gather information from their agents via built-in methods. To learn the name of the agent that you want to trace, issue the show agent names
command:
Arista#sho agent names Aaa Acl AgentMonitor AradLanz Arp ArpInspection Asu Avago Bfd BugAlert CapiApp Cdp Chl822X [ -- lots of output removed in the interest of brevity --] Vm VmTracer Vxlan VxlanController VxlanSwFwd Wbem Xcvr XcvrSlice Xmpp ZeroTouch
Now we have a list of agents that we can trace. One of the frustrating things about this process is that it might not always be obvious what you’re looking for. For example, in this list, there are no agents named Ospf
, Bgp
, or Rip
, and the agents that are listed don’t have very useful descriptions. That output, however, is from EOS 4.15.5M, and things have changed a bit on newer code.
On newer revisions of code (EOS 4.21.1F in this example), there are actually agents for BGP and OSPF:
Arista(config)#sho agent name | egrep -i "ospf|bgp" Bgp Ospf Ospf-vrf Ospf3 Ospf3-vrf
Those agents aren’t running by default, though.
Arista(config)#sho trace Bgp % Agent 'Bgp' is not running
If you’re running FlexRoute (Chapter 20) on a supported Arista switch, those agents will be running:
Arista(config)#service routing protocols model multi-agent Arista(config)#sho trace Bgp Global trace settings for agent Bgp ----------------------------------------------- Tracing sent to stderr date: enabled time: enabled [--output truncated--]
Back to a default switch, if you know a bit about routing, you might recognize the idea of a Routing Information Base (RIB), and there’s a process for that, so let’s see where that leads us. To find out details about what you can trace within an agent, use the show trace
agent-name
command. Let’s do this for the Rib
process:
Arista(config)#sho trace Rib Global trace settings for agent Rib ----------------------------------------------- Tracing sent to stderr date: enabled time: enabled PID: disabled TID: enabled facility name: enabled (width 20) trace level: enabled filename: disabled (width 20) line number: disabled function name: disabled (width 20) Trace facility settings for agent Rib is ----------------------------------------------- AclAggregator enabled ............ AclApi enabled ............ AdjacencyHelper enabled ............ Agent enabled ............ AgentIdManager enabled ............ AgentLogging enabled ............ AgentStageCommonSm enabled ............ AgentStartupModeSm enabled ............ ArcherBackup enabled ............ ArcherEm enabled ............ ArrowATPServerMain enabled ............ ArrowArRTOp enabled ............ ArrowArRtOp enabled ............ ArrowArTableInterface enabled ............ ArrowArUtils enabled ............ ArrowClientLib enabled ............ ArrowDB enabled ............ ArrowLocalTable enabled ............ ArrowProxy enabled ............ ArrowSerDes enabled ............ ArrowServer enabled ............ BfdStatAgentLib enabled ............ BfdStaticRouteSm enabled ............ BgpSmashInfo enabled ............ ConnectionManager enabled ............ Dash::Local<bool> enabled ............ [-- Wow, there's a lot of stuff (removed) --] Rib enabled ............ Rib::Adv enabled ............ Rib::Bgp enabled ............ Rib::Bgp::Keepalive enabled ............ Rib::Bgp::Normal enabled ............ Rib::Bgp::Notification enabled .......... Rib::Bgp::Open enabled ............ Rib::Bgp::Policy enabled ............ Rib::Bgp::Route enabled ............ Rib::Bgp::State enabled ............ Rib::Bgp::Task enabled ............ Rib::Bgp::Timer enabled ............ Rib::Bgp::Update enabled ............ Rib::DebugMessages enabled ............ Rib::DynPolicyRoutes enabled ............ Rib::Isis enabled ............ Rib::Mio enabled ............ Rib::Normal enabled ............ Rib::Ospf enabled ............ Rib::Ospf3 enabled ............ Rib::Parse enabled ............ Rib::Policy enabled ............ Rib::Rip enabled ............ Rib::Route enabled ............ Rib::RouteControl enabled ............ Rib::State enabled ............ Rib::Task enabled ............ Rib::Te enabled ............ Rib::Timer enabled ............ Rib::Tracing enabled ............ [-- Even more stuff removed --] VxlanIntfAggregator enabled ............ Watchdog enabled ............ XFire enabled ............ XFireServer enabled ............
We found it! We’re so clever.
If you are running FlexRoute, the Rib
process will not be running.
Now that we know the name of the agent (Rib
), we can begin to specify trace commands. The first step should always be to point the output to a file. You can put the file anywhere, but Arista recommends that it be stored on flash: or, better yet, drive: if your switch has an SSD drive, given that those locations usually have the most space and can survive a reboot. To do so, use the trace
agent-name
filename
filename
command in configuration mode:
Arista(config)#trace Rib filename flash:Rib.txt
Notice how the agent names all begin with capital letters? That’s not a requirement on the command line for specifying the agent name. I’m just obsessive. If you specify a filename with a capital letter like I have, that filename will need to be consistent when referencing it later.
Now there are two paths you can take. You can either trace it all (generally a bad idea), or you can restrict what you want to see. Because we’re looking to trace BGP, we need to include only the items with BGP in the name, which I’ve taken the liberty of highlighting in bold. Actually, there are a couple more, so let’s try a different way to see them:
Arista(config)#sho trace Rib | grep Bgp BgpSmashInfo enabled ............ GatedBgpHelper enabled ............ Rib::Bgp enabled ............ Rib::Bgp::Keepalive enabled ............ Rib::Bgp::Normal enabled ............ Rib::Bgp::Notification enabled ............ Rib::Bgp::Open enabled ............ Rib::Bgp::Policy enabled ............ Rib::Bgp::Route enabled ............ Rib::Bgp::State enabled ............ Rib::Bgp::Task enabled ............ Rib::Bgp::Timer enabled ............ Rib::Bgp::Update enabled ............ RibBgpPlugin enabled ............ RibGatedBgp enabled ............
Note that these Rib::Bgp
entries appear only if BGP is running.
To include only these entries, use the trace
agent-name
enable
trace-facility-name levels
command. Let me show you how to figure all that out. First, we know the agent name (Rib
), so let’s begin there:
Arista(config)#trace Rib enable ? WORD Trace facility name
The Trace facility name
is one or more entries from that long list we got from using the show trace Rib
command. Look at the leftmost lines in bold from that output. They all begin with Rib::Bgp::
. To include them all, we can use a wildcard and specify Rib::Bgp::*
, as shown in the next example.
This is one more place where capitalization matters. If you specify rib::bgp::*
(no caps), it won’t work. If you use caps the way you might think you should such as RIB::BGP::*
, that also will not work. You need to specify exactly what’s listed from the output of show trace
agent-name
, including proper case and the right number of colons.
That leaves one more item to determine: levels.
Arista(config)#trace Rib enable Rib::Bgp::* ? all Enable tracing at all levels levels Enable tracing at one or more levels
This one is easy, just use all
. Sure, you can drill down even further if you’d like, and to do so, specify the word levels
with a question mark (?
) to see what the possibilities are:
Arista(config)#trace Rib enable Rib::Bgp::* levels ? 0 Enable tracing at level 0 1 Enable tracing at level 1 2 Enable tracing at level 2 3 Enable tracing at level 3 4 Enable tracing at level 4 5 Enable tracing at level 5 6 Enable tracing at level 6 7 Enable tracing at level 7 8 Enable tracing at level 8 9 Enable tracing at level 9 coverage Enable tracing at level coverage function Enable tracing at level function
Like I said, just use all
. (But feel free to mess with each setting to your heart’s content. It’s not like I can stop you!)
Arista(config)#trace Rib enable Rib::Bgp::* all
Note that this will alter the output of the show trace Rib
command to include references to what is being traced:
Arista(config)#sho trace Rib | grep Bgp Trace facility settings for agent Rib is Rib::Bgp::Task/*cf, Rib::Bgp::Normal/*cf,Rib::Bgp::Keepalive/*cf,Rib::Bgp:: Update/*cf,Rib::Bgp::State/*cf,Rib::Bgp::Timer/*cf,Rib::Bgp/*cf,Rib::Bgp:: Open/*cf,Rib::Bgp::Notification/*cf,Rib::Bgp::Route/*cf,Rib::Bgp::Policy/*cf BgpSmashInfo enabled ............ GatedBgpHelper enabled ............ Rib::Bgp enabled 0123456789cf Rib::Bgp::Keepalive enabled 0123456789cf Rib::Bgp::Normal enabled 0123456789cf Rib::Bgp::Notification enabled 0123456789cf Rib::Bgp::Open enabled 0123456789cf Rib::Bgp::Policy enabled 0123456789cf Rib::Bgp::Route enabled 0123456789cf Rib::Bgp::State enabled 0123456789cf Rib::Bgp::Task enabled 0123456789cf Rib::Bgp::Timer enabled 0123456789cf Rib::Bgp::Update enabled 0123456789cf RibBgpPlugin enabled ............ RibGatedBgp enabled ............ SrTeBgpPolicyStatus enabled ............
The numbers in the right side of the Rib::Bgp::
entries indicate the logging levels. Because we chose all
for the logging level, they’re all included. Had we chosen only level 9
, for example, only the 9
would be shown for those entries.
At this point, the trace is active, and provided there is something going on with the agent you’ve chosen, the file will begin to fill up:
Spine-1(config)#dir Rib* Directory of flash:/Rib* -rwx 28464 Jan 31 14:15 Rib.txt 3440762880 bytes total (724103168 bytes free)
As a quick aside, the trace rib enable Rib::Bgp::* all
command will be expanded in the running-config file to something that looks like this, which is also shown in the output of show trace Rib
:
trace Rib setting Rib::Bgp::Task/*cf,Rib::Bgp::Normal/*cf, Rib::Bgp::Keepalive/*cf,Rib::Bgp::Update/*cf,Rib::Bgp::State/*cf, Rib::Bgp::Timer/*cf,Rib::Bgp::Open/*cf,Rib::Bgp::Notification/*cf, Rib::Bgp::Route/*cf,Rib::Bgp::Policy/*cf
Don’t worry about that. It’s just all of the items you included by using the asterisk wildcard on one line, separated by commas.
A file that’s quietly filling with trace output isn’t very useful by itself, so let’s see what’s in it. First, you can monitor the output directly with the trace monitor
agent-name
command. Press Ctrl-C to break out of this output. This is not like using the terminal monitor command in that other company’s operating system; when you issue the trace monitor
agent-name
command, you cannot use the CLI until you break the output:
Arista(config)#trace monitor Rib --- Monitoring /mnt/flash/Rib.txt --- 14:17:08.654507 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option TTL(16) value 1 14:17:08.654538 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option VirtualRoutingandForwarding(39): not supported 14:17:08.654572 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option TOS(17) value 192 14:17:08.654624 task_addr_local: task BGP_65006.10.10.6.2+179 address 10.10.6.1 14:17:08.654658 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option DontRoute(5) value 1 14:17:08.654689 BGP NORMAL: bgp_need_peer_resolution: timeout is 0, resolution disabled 14:17:08.654842 task_connect: task BGP_65006.10.10.6.2+179 addr 10.10.6.2+179: Operation now in progress 14:17:08.654899 task_timer_uset: timer BGP_65006.10.10.6.2+179_Connect <OneShot Set Processing> set to offset 2:28 at 14:19:36 14:17:08.654930 bgp_set_connect_timer: 10.10.6.2 (AS 65006) timer started with timeout 148 14:17:08.654986 task_timer_dispatch: returned from BGP_65006.10.10.6.2+179_ Connect, rescheduled in 2:28 at 14:19:36 14:17:11.654925 task_process_sockets: processing BGP_65006.10.10.6.2+179 fd 101 event 8220
The method I prefer to use is more Unix centric. Using the tail
filename
command shows you the last few lines of a file. For a constantly updating file like this one, the tail –f
filename
command continuously updates until broken (again, with Ctrl-C):
Arista(config)#bash tail -f /mnt/flash/Rib.txt 14:19:03.700397 BGP TASK: bgp_recv_msg_job: processed 1 messages. yield_now no 14:19:03.700426 task_job_delete: delete background job recv msgq job for task BGP.0.0.0.0+179 14:19:03.700518 task_timer_dispatch: calling BGP_65002.10.10.2.2+60402_HoldTime, late by 0.000 14:19:03.700580 task_timer_uset: timer BGP_65002.10.10.2.2+60402_HoldTime <OneShot Set Processing> set to offset 2:59.999796 at 14:22:03 14:19:03.700629 task_timer_dispatch: returned from BGP_65002.10.10.2.2+60402_ HoldTime, rescheduled in 2:59.999 at 14:22:03 14:19:07.550984 task_timer_dispatch: calling BGP.0.0.0.0+179_PolicyCache_Cleanup, late by 0.000 14:19:07.551150 task_job_create: create background job DedupPtrCache Purge for task BGP.0.0.0.0+179 14:19:07.551185 task_timer_dispatch: returned from BGP.0.0.0.0+179_PolicyCache_ Cleanup, timer now inactive 14:19:07.551223 task_job_delete: delete background job DedupPtrCache Purge for task BGP.0.0.0.0+179 14:19:07.551273 task_timer_uset: timer BGP.0.0.0.0+179_PolicyCache_Cleanup <OneShot> set to offset 10 at 14:19:17
I like this method because I can pipe using grep
(which doesn’t work with trace
from the CLI):
Arista(config)#trace monitor Rib | grep 65006 % Invalid input
Here’s what I’m looking for by using Bash commands from Cli
:
Arista(config)#bash tail -f /mnt/flash/Rib.txt | grep 65006 14:22:04.655769 task_timer_dispatch: calling BGP_65006.10.10.6.2+179_Connect, late by 0.000 14:22:04.655874 bgp_connect_timeout: BGP_65006.10.10.6.2+179_Connect 14:22:04.655937 bgp_event: peer 10.10.6.2 (AS 65006) old state Active event ConnectRetry new state Connect 14:22:04.655981 bgp_trap_backward: sending trap for 10.10.6.2 (AS 65006) 14:22:04.656247 task_set_socket: task BGP_65006.10.10.6.2+179 socket 101 14:22:04.656296 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option NonBlocking(8) value 1 14:22:04.656339 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option ReUseAddress(3) value 1 14:22:04.656380 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option TTL(16) value 1 14:22:04.656419 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option VirtualRoutingandForwarding(39): not supported 14:22:04.656458 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option TOS(17) value 192 14:22:04.656553 task_addr_local: task BGP_65006.10.10.6.2+179 address 10.10.6.1 14:22:04.656593 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option DontRoute(5) value 1
Traces stay running even when you’re not looking at them, and they will run, forever consuming disk space, unless you shut them off. Here’s a directory listing from my switch where I left a few traces running for only an hour!
Agent(config)#dir *.txt Directory of flash:/*.txt -rwx 122612063 Jan 7 03:50 RIB-Debug.txt -rwx 152724544 Jan 11 00:22 Rib.txt -rwx 189626048 Jan 11 00:22 STP.txt 1862512640 bytes total (0 bytes free)
That’s 45 megabytes of file space chewed up by something that I don’t need. Let’s not forget that those traces are still running, which means they’re wasting system resources.
Though I did run out of disk space, that’s more likely due to the fact that this is my test lab switch, which has 4 different versions of EOS on the flash: drive, each of which is over 500 MB. If you run these Trace files in your home directory (which runs in memory, remember), then you may cause more significant harm by filling up a filesystem in EOS.
Unfortunately, at least as of EOS version 4.21.1F, I know of no way to disable all traces at once like you can with the undebug all
command in other operating systems. To disable tracing, use the no trace
agent-name
enable * all
command. This works regardless of what other trace facility names you may have chosen, such as Rib::Bgp::
.
Remember that you might have multiple traces running, too, and you’ll need to stop them all individually.
Arista(config)#no trace Sysdb enable * all Arista(config)#no trace Rib enable * all
Don’t forget that although you’ve stopped the traces, the files still reside on disk. If you want to save them elsewhere, you can copy them off using a variety of protocols such as FTP, SCP, TFTP, and more. Here’s a list:
Arista(config)#copy flash:Rib.txt ? boot-extensions Copy to boot extensions configuration certificate: Destination file path drive: Destination file path extension: Destination file path file: Destination file path flash: Destination file path ftp: Destination file path http: Destination file path https: Destination file path running-config Update (merge with) current system configuration scp: Destination file path sftp: Destination file path sslkey: Destination file path startup-config Copy to startup configuration system: Destination file path terminal: Destination file path tftp: Destination file path
Usually, I’ve gotten what I’ve needed from them, so I just delete them, instead:
Arista(config)#del Rib.txt Arista(config)#del Sysdb.txt
With the traces disabled, leaving the filename entries cluttering the configuration doesn’t make any sense, so as a final step, we can clean those up too:
Arista(config)#no trace Rib filename flash:/Rib.txt Arista(config)#no trace Stp filename flash:/Sysdb.txt
Tracing can seem a bit overwhelming if you’re used to other vendors’ offerings and the well-known debug
commands, but I think if you spend some time getting used to this format, you’ll quickly discover that it’s more powerful than debug
and certainly more flexible. The fact that it won’t lock up the switch pushing logs to the console is a nice bonus, too.
Honestly, though, most people I know only use Trace when TAC tells them to.
Here’s a neat instance of something not going quite according to plan. In this case, I have a switch with roughly 400,000 static routes (because I’m a menace and that’s what I do). I did a config replace
on it with a base configuration so that I could work on another chapter, which has tied up the CPU removing all of those CLI commands. When I SSH to the switch during that process, check out what happens:
gad@[AW-Lab]:~$ switch-01 Warning: Permanently added 'student-01,10.0.0.1' (ECDSA) to the list of known hosts. Last login: Wed Jan 30 21:29:28 2019 from 10.1.0.100 Cannot connect to CLI. Entering standalone shell. Note: Standalone CLI does not share data with other agents, but it can run certain commands such as 'bash' to troubleshoot the system within authentication and authorization constraints. Standalone>
Whoa! In all the years that I’ve been working on Arista switches, I’d never seen this. Let’s see what we can see!
Standalone>en Standalone#sho run ! Command: show running-config ! device: Student-01 (unknown-hardware, EOS-4.21.1F) ! ! boot system flash:/EOS-4.21.1F.swi ! prompt Standalone%R%v%P ! transceiver qsfp default-mode 4x10G ! spanning-tree mode mstp ! no aaa root ! no ip routing ! end
Well that’s new. What about interfaces?
Standalone#sho int Standalone#
Nothing! There’s not even a management interface! The management interface is a kernel interface in Linux, and so it is not controlled by the ASIC, and I can’t even see that! Some poking around has found that this is a special Cli
mode that loads without access to SysDB or any other agents. Here’s the relevant help output from the Cli
command in Linux:
[admin@switch-02 ~]$ Cli --help | grep -A1 "without Sysdb" -l, --standalone run in standalone mode (without Sysdb). Implies --disable-guards.(also see --standalone-guards
At this point, I mutter, “fascinating” in my best Mr. Spock voice and then try to trigger standalone mode myself. I couldn’t get that to work and quickly gave up because I had other things to do.
What’s the point of this? To show you that if you do see this, not to panic. After about a minute, I was able to SSH to the switch normally, even with my standalone shell still active. I just thought it was an interesting (fascinating, even) curiosity and thought I’d share it.
Bottom line: if you see this, don’t panic. Wait a minute or two and try to SSH again. I bet it will work. If it doesn’t after a reasonable amount of time, something more serious might be afoot, at which point it might be time to call TAC. Or reboot. Maybe panic. It all depends on your environment and caffeine intake.
Arista Support, or TAC (Technical Assistance Center), is a fantastic resource, and my experience with it has been far superior to that other company that has nothing to do with crystallized cottonseed oil. TAC has helped me when I’ve discovered bugs, it has helped me diagnose configuration problems, and on the rare occasion when we had a switch go bad under warranty, TAC shipped us a new one in no time. After you’re registered with Arista Support as a valid customer, opening a TAC case is as simple as calling, or if the issue isn’t critical, you can use email, as well. And with your valid and registered email address, you can download documentation, EOS revisions, whitepapers, and a host of other useful documents from the Arista.com website.
There is also a fabulous online forum and repository of knowledge at the EOS Central website. At this site, you can post a question to the forum, and you might be surprised to find one of the company founders answering it. There is also a development blog and some tech tips that are well worth a look.
Of course, you can always send me an email, too, but I’m a cranky, old, recalcitrant nerd who doesn’t have a lot of free time, so you’d probably do better emailing Arista. That is, of course, unless you want to tell someone how much you like this book, in which case I’m probably the better choice.
The concept of troubleshooting is one that I am asked about quite a bit, and it’s a tough topic to write about within a single chapter. The idea of writing about common things that go wrong is really what people want, but in my experience, I don’t really see common things that go wrong aside from my favorite thing to troubleshoot, which is user error. My favorite complaint from students and entrenched engineers alike is, “I did everything right, but it still doesn’t work.”