Chapter 21. Proactive Monitoring and Alerting

System management has been greatly improved in Windows 2003. Great strides have been made toward making Windows 2003 better able to monitor itself and alert administrators to problems. Almost all aspects of the operating system can be exposed to various monitoring systems. Service availability, resource usage, replication queues, and almost anything else an administrator would care to know about can be monitored across the enterprise. Remote systems can all report to a central location so that the health of the entire network can be viewed from a single console. This chapter endeavors to give you an idea of what technologies are available for systemwide monitoring and offers insights into the different types of monitoring and how they can be used together to give a holistic view of the network. This chapter also makes recommendations on areas on which administrators should focus their monitoring and gives examples of scripts that can be triggered by monitoring events to try to fix problems and alert the appropriate resources that an event has occurred.

Leveraging Windows Management Instrumentation

Right out of the box, Windows 2003 has a number of items that give it excellent insight into its own operation. Event viewers, SNMP traps, and performance monitors have long been available to Windows systems to allow them to track their own health. Windows 2003 makes these and many other monitoring components available through a central mechanism called Windows Management Instrumentation, or WMI.

Understanding WMI

WMI is Microsoft’s implementation of WBEM, or Web-Based Enterprise Management. WBEM was designed to provide one method for accessing management data that originates from disparate sources. WBEM has been developed over the years by a consortium of companies that all shared a common vision for how monitoring should be implemented. The old methods of proprietary monitoring subsystems for each operating system or platform have given way to an open standard for monitoring, independent of platform- or OS-specific APIs. Like most “open standards,” various companies have created their own implementations, but these exist as supersets of the original WBEM requirements and follow the Common Information Model and Desktop Management Interface standards set forth by the Distributed Management Task Force.

Excellent Source for WMI Scripts

The Internet is an excellent source for finding commonly used WMI scripts. Rather than reinvent the wheel, you can check to see if another scripter has already created a script that does what you need.

Uses for WMI

WMI enables you to query the system for events and cause those events to trigger actions. Actions can be as simple as adding entries to a log file or as complex as changing system parameters and rebooting a system. Windows 2003 ships with several built-in providers for accessing specific subsystems:

  • Performance Monitor Provider—Provides access to Windows NT Performance Monitor data.

  • Registry Provider—Provides access to system Registry data.

  • Registry Event Provider—Sends events when changes occur to Registry keys, values, or trees.

  • SNMP Provider—Provides access to events and data from SNMP devices.

  • Windows NT Event Log Provider—Provides access to data and event notifications from the Windows NT Event Log.

  • Win32 Provider—Provides access to data from the Win32 subsystem.

  • WDM Provider—Provides access to data and events from device drivers that conform to the WMI interface.

  • Shadow Copy Provider—Supplies management functions for the Windows Server 2003 Shared Folders feature.

  • Storage Volume Provider—Enumerates all volumes known to the Windows Mount Manager and manages volume drive letters, mount points, and storage quotas.

By using these providers, WMI can be leveraged to act on information captured from these sources. For example, event notification could be used to detect hardware events or errors, and the event could then be passed to WMI for corrective action based on the specific event that occurred. A Network Interface Card (NIC) might detect the presence of an Ethernet signal and send notification to a script that disables the wireless NIC, eliminating the possibility of a wireless connection being used as an entry point to a wired network.

Similarly, the Event Log Provider could pass an event to a WMI script that watches for a specific Event ID and triggers a restart of a service to fix a known bug. This can be especially useful with internal software that is still under development. If an application were known to have a memory leak, WMI could watch the process and restart a specific service when the process consumed more than 256MB of memory, or some other threshold. At the same time, the WMI script could alert a developer via e-mail and pass along specific system parameters, gathered via WMI queries, that could help the developer troubleshoot the process.
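As a minimal sketch of that idea, the following script uses a WMI event query to watch a hypothetical process named LeakyApp.exe and restart its (equally hypothetical) service, LeakySvc, when the working set crosses 256MB. The names, threshold, and polling interval are all placeholders to adjust for your environment:

' Sketch: watch a leaky process and restart its service when it crosses 256MB
strComputer = "."
Set objWMI = GetObject("winmgmts:{impersonationLevel=impersonate}!\\" & _
    strComputer & "\root\cimv2")

' Poll every 30 seconds for the process exceeding the working set threshold
Set colEvents = objWMI.ExecNotificationQuery( _
    "SELECT * FROM __InstanceModificationEvent WITHIN 30 " & _
    "WHERE TargetInstance ISA 'Win32_Process' " & _
    "AND TargetInstance.Name = 'LeakyApp.exe' " & _
    "AND TargetInstance.WorkingSetSize > 268435456")

Set objEvent = colEvents.NextEvent   ' blocks until the threshold is crossed

' Restart the associated service (hypothetical name)
Set colSvc = objWMI.ExecQuery("SELECT * FROM Win32_Service WHERE Name = 'LeakySvc'")
For Each objSvc In colSvc
    objSvc.StopService
    WScript.Sleep 5000
    objSvc.StartService
Next
WScript.Echo "LeakySvc restarted after memory threshold was exceeded."

Alerting the developer could then be handled by a command-line mailer, as discussed later in this chapter.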

There Are Additional Providers

There are additional providers above and beyond the ones included in Windows 2003. When adding services such as load balancing or clustering, check to see if specific WMI providers are available so that those functions can be accessed via WMI as well.

Leveraging Scripts for Improved System Management

The WMI scripting API can be used for multiple purposes:

  • Enumeration—A WMI scripter can write scripts to enumerate processes running on a local system or settings on a device.

  • Method execution—A WMI scripter can create scripts to initiate or shut down a specific process or application on a remote computer.

  • Queries—Rather than enumerate, a WMI script could query a subsystem to determine if a specific process or device is running on a local or remote system.

  • Event registration and reception—A WMI script would watch for specific events from the Windows NT Event Log on a local computer.

  • Instance modification—WMI can be leveraged for modification of information. For example, workstations created with Sysprep could change their own names to their unique processor ID automatically via a WMI script upon their first boot.

  • Local and remote access—WMI scripts can be targeted to operate on local resources as well as remote resources. This can be especially powerful in allowing one machine to monitor and potentially fix another.

Through combinations of these functions, you have amazing flexibility in creating complex scripts that can react to a number of situations and trigger events, alerts, or even other scripts. By chaining these scripts one can create fairly advanced logic for determining how a system should deal with its own triggers.

Basic WMI Scripts

Here is an example of a basic script that will query a machine called Godzilla for its total physical memory:

' Connect to the WMI service on the target machine
strServer = "Godzilla"
Set wbemServices = GetObject("winmgmts:\\" & strServer)

' Enumerate the memory configuration and report the total physical memory
Set wbemObjectSet = wbemServices.InstancesOf("Win32_LogicalMemoryConfiguration")
For Each wbemObject In wbemObjectSet
    WScript.Echo "Physical Memory (kb): " & wbemObject.TotalPhysicalMemory
Next

By feeding this script a series of machine names, WMI could be used to perform a memory audit on a series of machines to see if they needed to be upgraded prior to an OS upgrade.
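As a minimal sketch of such an audit, assuming a list of machine names in a hypothetical C:\Scripts\servers.txt (one name per line), the same query could be wrapped in a loop:

' Run the memory query against every machine named in servers.txt
Set fso = CreateObject("Scripting.FileSystemObject")
Set file = fso.OpenTextFile("C:\Scripts\servers.txt", 1)   ' 1 = ForReading
Do Until file.AtEndOfStream
    strServer = Trim(file.ReadLine)
    If strServer <> "" Then
        Set wbemServices = GetObject("winmgmts:\\" & strServer)
        Set wbemObjectSet = wbemServices.InstancesOf("Win32_LogicalMemoryConfiguration")
        For Each wbemObject In wbemObjectSet
            WScript.Echo strServer & " - Physical Memory (kb): " & _
                wbemObject.TotalPhysicalMemory
        Next
    End If
Loop
file.Close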

The next script shows how to select targets at runtime. It prompts the user for a target machine and for a target application to terminate on that machine:

do
    SrvrName = inputbox("Enter the name of the system to kill an application on.", "Input")
loop until SrvrName <> ""
do
    target = inputbox("Enter the name of the program you wish to kill.", "Kill a Process")
loop until target <> ""

' Connect to the remote machine and terminate every matching process
for each process in GetObject("winmgmts:{impersonationLevel=impersonate}!\\" & _
    SrvrName & "\root\cimv2").ExecQuery _
    ("select * from Win32_Process where Name='" & target & "'")
    process.Terminate(0)
Next
wscript.echo "Program terminated."

A script like this could be handy for remote administration. If a user called the help desk and complained of a runaway process that they did not have rights to end, the help desk could run a script like this to end the process for the user. This would eliminate the need for a technician to be dispatched to the user’s location.

Defaulting to the Local Machine

If a script uses a target (strServer = “.”), it defaults to the local machine. If a script only deals with a single remote system, that system’s name can be entered there instead.

Building Services

By combining C++ or Visual Basic with WMI calls, services can be written to use WMI functions. This enables you to effectively run a script continuously without the system overhead of a script running over and over. A service of this type is registered with the Create method of the Win32_Service class (inherited from Win32_BaseService). It is recommended to run WMI services of this type under the System context, which prevents users from being able to simply stop the service.
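As a rough sketch of how such a service might be registered from a script, the Create method can be called as follows. The service name, display name, and path are hypothetical, and the path must point to a real compiled service binary:

Const OWN_PROCESS = 16        ' service runs in its own process
Const ERROR_NORMAL = 1        ' log the error and continue on failure
Set objWMI = GetObject("winmgmts:{impersonationLevel=impersonate}!\\.\root\cimv2")
Set objSvcClass = objWMI.Get("Win32_Service")

' Register the (hypothetical) monitoring service under the System context
errReturn = objSvcClass.Create("WMIMonitor", "WMI Monitoring Service", _
    "C:\Scripts\WMIMonitorSvc.exe", OWN_PROCESS, ERROR_NORMAL, _
    "Automatic", False, "LocalSystem", "")
WScript.Echo "Create returned: " & errReturn   ' 0 indicates success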

Use the AT Scheduler

To control services running under the system context it is necessary to access them via the system context. The easiest way to do this is to use the AT scheduler to launch a command window with an /interactive switch. Anything done from this command window will run under the SYSTEM context.
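For example, the following schedules an interactive command window (the time shown is arbitrary); anything launched from that window runs as SYSTEM:

at 14:01 /interactive cmd.exe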

Building Temporary Event Consumers

A temporary event consumer is created by a consuming application and ends when the application ends. In this way the subscriptions are only active when the consuming application is running. This is one way to reduce the CPU load on a system that is using WMI.
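The following sketch is a simple temporary consumer: the subscription exists only while the script runs, and it echoes new error entries from the event log as they are written:

' Temporary event consumer - the subscription ends when this script ends
Set objWMI = GetObject("winmgmts:{impersonationLevel=impersonate}!\\.\root\cimv2")
Set colEvents = objWMI.ExecNotificationQuery( _
    "SELECT * FROM __InstanceCreationEvent " & _
    "WHERE TargetInstance ISA 'Win32_NTLogEvent' " & _
    "AND TargetInstance.Type = 'Error'")
Do
    Set objEvent = colEvents.NextEvent   ' blocks until an error is logged
    WScript.Echo objEvent.TargetInstance.SourceName & ": " & _
        objEvent.TargetInstance.Message
Loop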

Building Permanent Event Consumers

A permanent event consumer is a COM object that is able to receive a WMI event at all times. Like a temporary event consumer, it relies on a set of persistent objects and filters that are set up to capture a WMI event. When an event occurs that matches a filter, WMI loads the permanent event consumer and notifies it about the event. Because a permanent consumer is implemented as entries in the WMI repository plus an executable file that is registered with WMI, delivery of events continues across reboots of the operating system for as long as WMI is running.

A permanent event consumer continues to receive events until its registration is explicitly cancelled. A permanent event consumer is a combination of WMI classes, filters, and COM objects that reside on the system.

A logical event consumer is an instance of the permanent consumer class. The properties of a logical event consumer specify the actions to take when notified of an event, but do not define the event queries with which they are associated. When signaled, WMI automatically loads the COM object that represents the permanent event consumer into active memory. Typically, this occurs during startup or in response to a triggering event. After being activated, the permanent event consumer acts as a normal event consumer, but remains until specifically unloaded by the operating system.
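As a sketch of how the pieces fit together (all names here are hypothetical), a permanent consumer can be registered by creating a filter, a consumer, and a binding in the root\subscription namespace. This example uses the standard CommandLineEventConsumer to restart a stopped service:

' Register a permanent consumer in the root\subscription namespace
Set objWMI = GetObject("winmgmts:{impersonationLevel=impersonate}!\\.\root\subscription")

' 1. The filter: the event query to watch for
Set objFilter = objWMI.Get("__EventFilter").SpawnInstance_
objFilter.Name = "LeakySvcStoppedFilter"
objFilter.QueryLanguage = "WQL"
objFilter.EventNamespace = "root\cimv2"
objFilter.Query = "SELECT * FROM __InstanceModificationEvent WITHIN 30 " & _
    "WHERE TargetInstance ISA 'Win32_Service' " & _
    "AND TargetInstance.Name = 'LeakySvc' AND TargetInstance.State = 'Stopped'"
Set pathFilter = objFilter.Put_

' 2. The consumer: the action to take when the filter fires
Set objConsumer = objWMI.Get("CommandLineEventConsumer").SpawnInstance_
objConsumer.Name = "RestartLeakySvc"
objConsumer.CommandLineTemplate = "cmd /c net start LeakySvc"
Set pathConsumer = objConsumer.Put_

' 3. The binding that ties the filter to the consumer
Set objBinding = objWMI.Get("__FilterToConsumerBinding").SpawnInstance_
objBinding.Filter = pathFilter.Path
objBinding.Consumer = pathConsumer.Path
objBinding.Put_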

Be Sure to Test the Service and Watch for Memory Leaks

When creating permanent event consumers, be sure to test the service and watch for memory leaks. Some WMI classes are known to have memory leaks. If the memory consumption of svchost.exe increases continuously, this is a common sign of a WMI memory leak.

Deciding What to Monitor

There is a plethora of parameters that you can monitor. Some are very useful; some are not. By limiting the monitored items to only those in which you are interested, there is less chance of missing important information due to the sheer volume of incoming data. Because different types of monitoring have access to different types of data, the following sections each end with recommendations of specific items to monitor with that type of monitoring.

Monitoring Hardware

Simple hardware monitoring, such as pinging a device to see if it responds, is one way to determine if a machine is up and running. This essentially tests layers 1, 2, and 3 of the OSI model. The problem with this type of monitoring is that it only tells you whether the box itself is physically responding to ping. It does not tell you whether the machine has a particular service running or whether that service is in fact running properly. Hardware monitoring is a good basis for other types of monitoring because it enables you to do things such as event correlation. If a number of machines stop responding to a particular query and a router is not responding to ping, the software can safely assume that the machines are unreachable because the router itself is down. For this level of monitoring, basic hardware monitoring is fairly effective.
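A minimal sketch of such a check, using the Win32_PingStatus class available on Windows XP and Windows Server 2003 (the address shown is a placeholder):

' Basic hardware check - ping a device via Win32_PingStatus
Set objWMI = GetObject("winmgmts:\\.\root\cimv2")
Set colPings = objWMI.ExecQuery( _
    "SELECT * FROM Win32_PingStatus WHERE Address = '192.168.1.1'")
For Each objPing In colPings
    If IsNull(objPing.StatusCode) Then
        WScript.Echo "Device is NOT responding to ping."
    ElseIf objPing.StatusCode = 0 Then
        WScript.Echo "Device is responding to ping."
    Else
        WScript.Echo "Device is NOT responding to ping."
    End If
Next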

Ensure that All Interfaces Are Being Monitored

When monitoring a network with redundant connections it is critical to ensure that all interfaces are being monitored. If only the “far side” IP addresses are being monitored, a packet could be taking an alternative route to get there and a failed local interface could be missed.

Recommended monitoring points include the following:

  • Local routers

  • Remote routers

  • ISP

  • Switches

  • VPN devices

Port-Level Monitoring

Going beyond simple ping tests is an effective way to get more information about a system’s health. Well-known services such as SMTP or POP3 can be monitored easily by querying the server on ports 25 or 110, respectively. This goes beyond the simple ping test by ensuring that the port is available, which generally means that the application is running. This enables you to determine whether or not a particular service is available, which can be especially effective in finding problems before users report them. One of the primary goals of monitoring is to proactively capture system problems before users discover them. Most monitoring packages on the market support the capability to monitor a system at a port level.

By combining hardware-level monitoring with port-level monitoring, a clever administrator can reduce the load on the monitoring system. If the link to a remote network is not responding to a physical ping test, there is no reason for the software to continue processing port checks on machines on that network. This allows the system to perform event correlation to reduce the number of false positives, which in turn reduces the number of alerts being sent. The result is less traffic on the network as well as a reduction of the “boy who cried wolf” effect.

The Netstat Command

An easy way to find the ports to monitor on a system is via the Netstat command. By identifying the ports that multiple users are connecting to, you can identify the ports that are important to the end users. Running Netstat -nao on a Windows Server 2003 system will list source IP addresses, destination addresses, and the ports on which they are connecting. Additionally, it will list the process identifier (PID) associated with each connection. These PIDs can be compared against Task Manager to see which processes are connecting on which ports.
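For example, to narrow the listing to a single port of interest (port 25 here, chosen arbitrarily), the output can be filtered with the built-in find command:

netstat -nao | find ":25"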

Some recommended monitoring points are as follows:

  • Port 25 (SMTP)

  • Port 110 (POP3)

  • Port 80 (Web)

  • Port 443 (Secure Web)

  • Port 53 (DNS)

  • Port 1723 (PPTP)

  • Port 3389 (Remote Desktop Protocol)

Service-Level Monitoring

Going one step further in the area of monitoring is the ability to query a service to see if it is running properly. Rather than simply seeing that port 25 is responding, a monitoring package can send an SMTP query to see if the server responds with the correct greeting. This enables you to ensure that an e-mail server, for example, is correctly receiving e-mail messages. Similarly, software packages that perform service-level monitoring can query the operating system to see if a service is in the “running” state. Services that have failed or that have been stopped can be identified in this manner.
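As a sketch of the second technique, the following script lists any automatic services that are not currently running on the example server Godzilla:

' Flag automatic services that are not in the Running state
strServer = "Godzilla"
Set objWMI = GetObject("winmgmts:{impersonationLevel=impersonate}!\\" & _
    strServer & "\root\cimv2")
Set colServices = objWMI.ExecQuery( _
    "SELECT * FROM Win32_Service WHERE StartMode = 'Auto' AND State <> 'Running'")
For Each objService In colServices
    WScript.Echo objService.DisplayName & " is " & objService.State
Next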

Identify Dependencies for All Services

When possible, identify dependencies for all services to help reduce redundant alerts. For example, in Exchange, if the System Attendant service is down, there is no point in checking the Information Store service. This would only generate a needless alert.

Because the services running will be very specific to the type of server in question it is recommended that you research the system to determine the specific services needed for the server to properly do its job. For example, on an Exchange server you might monitor the following:

  • Microsoft Exchange IMAP4

  • Microsoft Exchange Information Store

  • Microsoft Exchange Management

  • Microsoft Exchange MTA Stacks

  • Microsoft Exchange POP3

  • Microsoft Exchange Routing Engine

  • Microsoft Exchange Site Replication Service

  • Microsoft Exchange System Attendant

  • Simple Mail Transfer Protocol

  • World Wide Web Publishing Service

  • Antivirus

  • Antispam

Application-Level Monitoring

Monitoring systems at the application level enables you to pull useful performance information from the system. Not only can you determine whether or not a service is running, but you can also determine how well it is running. Specific performance metrics such as SMTP queue sizes or mailbox sizes can be monitored to determine the health of the system. Databases can be monitored for critical things like available file locks, replication status, or even current user load. This type of monitoring allows thresholds to be used to determine when reactive measures should be taken to address a system problem. By layering several types of application-level monitoring, complex tests can be performed on the system. Rather than simply pinging a Web server to make sure it’s running, querying it on port 80, or checking to see if the World Wide Web Publishing Service is running, an application-level monitoring system can send a specific query to the Web server and determine whether the correct response was received. This level of monitoring gives you an impressive level of insight into the workings of the network.
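A minimal sketch of such a check, assuming a hypothetical intranet URL and a marker string the page is expected to contain:

' Application-level check - request a page and verify its content
On Error Resume Next   ' a failed connection raises an error, not a status
Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
objHTTP.Open "GET", "http://intranet/status.htm", False
objHTTP.Send
If Err.Number <> 0 Then
    WScript.Echo "Web application unreachable: " & Err.Description
ElseIf objHTTP.Status = 200 And InStr(objHTTP.ResponseText, "All systems OK") > 0 Then
    WScript.Echo "Web application responded correctly."
Else
    WScript.Echo "Web application check FAILED."
End If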

This type of monitoring can be exceptionally useful in the area of capacity planning. By monitoring and logging application-level performance counters, you can use long-term system-usage tracking to determine when a resource will become insufficient.

Not unlike service-level monitoring, the key monitoring points of application-level monitoring will vary by application based on the role of the server. An Exchange server, for example, might be monitored for the following types of items:

  • SMTP Queue growth

  • MTA Queue growth

  • MAPI transaction time, average

  • Mailbox sizes

  • NDR count

  • Information Store size

  • User load

  • Concurrent connections

  • Traffic on connectors to foreign mail systems

However, a SQL server might be more concerned with the following:

  • Transaction response time

  • Number of long running transactions

  • Error log tracking

  • Process blocks

  • Page-level locks

  • Table-level locks

  • Exclusive locks

  • Shared locks

  • Log space

  • Database space

  • Cache hit rate

Application-Level Monitoring Solutions

Most application-level monitoring solutions require the installation of a monitoring agent on the target system. This allows the monitored system to have greater knowledge of its applications but could potentially increase the CPU load on the monitored system.

In any case, it is critical for the administrator who is implementing the monitoring solution to work very closely with the application owners to ensure that the important monitoring points are being captured, both from an alerting standpoint as well as from a capacity monitoring and planning standpoint.

Performance Monitoring

Although a monitoring system is quite useful for spotting problems and outages, it can also be used to measure and track the performance of the system. By identifying key performance metrics such as memory usage or database transaction times, you can not only be aware of outages but also see changes in system performance that would affect the end-user experience. By logging these performance metrics, you also have the capability to see a long-term view of the performance of the system. Trends in system usage and trends in resource usage become extremely valuable when tracked over extended periods of time. This information can be used to predict when upgrades will be needed.
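As a sketch of how two common counters can be sampled from a script, the formatted performance counter classes can be read through a refresher object (two samples are needed because cooked values such as CPU time are computed from deltas):

' Sample CPU and available memory via the formatted perf counter classes
Set objWMI = GetObject("winmgmts:\\.\root\cimv2")
Set objRefresher = CreateObject("WbemScripting.SWbemRefresher")
Set colCPU = objRefresher.AddEnum(objWMI, _
    "Win32_PerfFormattedData_PerfOS_Processor").ObjectSet
Set colMem = objRefresher.AddEnum(objWMI, _
    "Win32_PerfFormattedData_PerfOS_Memory").ObjectSet

objRefresher.Refresh          ' first sample establishes the baseline
WScript.Sleep 1000
objRefresher.Refresh          ' second sample yields valid cooked values

For Each objCPU In colCPU
    If objCPU.Name = "_Total" Then
        WScript.Echo "CPU usage (%): " & objCPU.PercentProcessorTime
    End If
Next
For Each objMem In colMem
    WScript.Echo "Available memory (MB): " & objMem.AvailableMBytes
Next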

Performance Monitoring’s Real Value

Although performance monitoring is useful for identifying problems with a system, its real value comes in long-term trend identification.

Some recommended monitoring points are as follows:

  • CPU usage

  • Available memory

  • Available disk space

  • Transaction rate

  • Network utilization

  • Disk I/O

Monitoring Pitfalls

There are many different types of monitoring packages on the market, each with its own pros and cons. You should be aware of a few things when picking a monitoring package.

Resist the Temptation...

Resist the temptation to turn on monitoring for each and every subsystem available. Limit the scope of the monitoring to data points that will actually be used either for long term performance trending or for failure notification. Enabling too many monitoring points only serves to cloud the valid data and discourage you from addressing all of the data. Monitoring too many data points also imposes an unnecessarily harsh load on the system and reduces the scalability of the monitoring system.

A lot of monitoring packages use agents, meaning that some piece of code must be installed on each machine to be monitored. This introduces an unknown to the server. You should always baseline the performance of a server before adding a monitoring agent so that any negative impact on the server’s performance can be accurately measured.

Also, be aware of packages that utilize protocols built into the operating system. Almost every monitoring package on the market supports Simple Network Management Protocol, and SNMP is built into most Windows operating systems. Unfortunately, the version built into older versions of Windows isn’t terribly secure: it sends its traps in clear text. Although updates to SNMP are available and have been included in service packs, most administrators don’t know to reinstall their latest service pack after loading SNMP on the system. Also, an administrator should never leave the default community strings in place! This is a huge security risk.

Other monitoring packages gather information about a Windows system through the use of NetBIOS calls. Although at first this seems like a good idea, keep in mind that if a legitimate monitoring system can gather vital information about a server via NetBIOS requests, so can any other system. Never enable NetBIOS for monitoring purposes on a system that is reachable from outside your network. This goes for DMZs and wireless networks alike.

Determining What to Monitor and Alert Upon

Monitoring and alerting go hand in hand. The real value in system monitoring is being able to alert an administrator if something goes wrong. As such, it is important to determine what parameters should result in an alert being generated. As a general rule, any outage in a nonredundant system should generate an immediate alert. Any outage in a redundant system that would leave only a single system running should also generate an immediate alert. Any security-related event should generate an alert.

Most monitoring and alerting systems today not only track the failures of systems, subsystems, and services but they can also detect these items coming back online. It is always worthwhile to trigger an event when a system comes back into an “up” state. This can mean the difference between coming in to the office at 4 a.m. and going back to sleep.

Mail System Alerts

Alerts regarding a mail system should not be sent via e-mail only. If e-mail is the only choice, try to have more than one mail server available to relay the message. Similarly, alerts regarding an Internet connection should be sent via a pager that is not dependent on the Internet connection.

Hardware Alerting

Although the knowledge that a computer is responding to a ping is of only limited value, the knowledge that a computer has stopped responding to ping is quite useful. Most network hardware places a high priority on ICMP traffic, so it is unusual for a router or switch to fail to respond to ping due to excessive load. This means that hardware monitoring rarely results in false positives. As such, failures detected by hardware monitoring should always be alerted. These alerts should use event correlation so that the failure of a router generates an alert but the apparent failures of the routers beyond it do not create additional alerts.

E-Mail Is an Excellent Vehicle for Sending Alerts

Always be aware of whether an alert method is dependent on the device that is being monitored. E-mail is an excellent vehicle for sending alerts, but it becomes much less useful when it’s e-mailing a pager to tell it that the Internet connection is down.

Port-Level Alerting

It is not unusual for a port-level monitoring event to time out before the system has properly responded. As such, it is often necessary to set thresholds for reporting failures. Rather than having a single port-level failure immediately trigger an alert, the system should be set to require multiple failures on consecutive cycles before generating an alert.

Service-Level Alerting

Service-level monitoring looks to the operating system to determine if a service is running. As such, service-level checks rarely produce false positives. Service failures reported via the operating system should immediately generate alerts. Services returning to a “running” state should also generate an alert. These types of “up and down” alerts are often used to determine system uptime.

Application-Level Alerting

Application-level alerts are almost always generated by specific counters meeting specific thresholds. Because application parameters can spike under burst conditions, it is necessary to set thresholds for how long a specific parameter must remain at a specific level before triggering an alert. By doing so, false positives can be greatly reduced.

As with application-level monitoring, you should work closely with the application owner to determine thresholds for alerting. The application owner will have a much better understanding of the application and will know how to spot abnormal behavior.

Performance Alerting

Performance counters tend to fluctuate greatly during the operation of a system. As a result, point-in-time monitoring can very easily result in false positives. By setting thresholds not only for the value of a performance counter but also for how long the counter must remain above that value, an administrator can greatly reduce the number of false positives generated by the system.
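The following sketch illustrates the idea: sample CPU every 30 seconds and alert only after five consecutive samples above 80%. All values are placeholders to tune per server:

' Alert only on sustained CPU load, not on point-in-time spikes
Const THRESHOLD = 80          ' percent
Const REQUIRED_SAMPLES = 5    ' consecutive samples above threshold
Set objWMI = GetObject("winmgmts:\\.\root\cimv2")
Set objRefresher = CreateObject("WbemScripting.SWbemRefresher")
Set colCPU = objRefresher.AddEnum(objWMI, _
    "Win32_PerfFormattedData_PerfOS_Processor").ObjectSet
objRefresher.Refresh          ' baseline sample
overCount = 0
Do
    WScript.Sleep 30000
    objRefresher.Refresh
    For Each objCPU In colCPU
        If objCPU.Name = "_Total" Then
            If CLng(objCPU.PercentProcessorTime) > THRESHOLD Then
                overCount = overCount + 1
            Else
                overCount = 0
            End If
        End If
    Next
    If overCount >= REQUIRED_SAMPLES Then
        WScript.Echo "ALERT: CPU has stayed above " & THRESHOLD & "% for " & _
            REQUIRED_SAMPLES & " consecutive samples."
        overCount = 0
    End If
Loop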

Be Aware of Any Service-Level Agreements

Be aware of any service-level agreements when determining the thresholds for triggering a performance-based alert. If a system has an SLA requiring it to be back in service within one hour, using a threshold of more than 10 minutes would be ill-advised.

Administrators must depend on their familiarity with servers to determine the thresholds for alerting. Although a system might run fine at 75% utilization, if the system normally spikes to no more than 10% utilization, a sustained load of 20% might be enough to cause the administrator some concern. Avoid falling into the trap of only generating alerts if something is “pegged” at max utilization for an extended period of time. Any sustained and drastic change in system behavior is most likely a sign that something is wrong and the appropriate resources should be notified.

Alerting Pitfalls

There are many ways to get an alert to the appropriate resource, and it is to your advantage to use more than one method for each alert. The two most commonly used alert methods are e-mail and pager. E-mail can be very effective, but it is susceptible to failures in the e-mail system and the Internet connection. There is nothing more annoying than receiving two messages back to back stating “The Internet router is down!” and “The Internet router is back up!” Mail server failures can elicit the same response. Whenever possible, have the alert sent via a medium that isn’t a single point of failure. E-mail along with a pager that is dialed via a phone line is an excellent way to ensure that critical alerts reach their intended target.

If alerts are going to an onsite 24/7 resource, make sure that the staff is responding correctly to alerts by performing regular tests. Monitoring a “fake” server that can be used to trigger alerts is a good way to keep the monitoring staff on their toes. An onsite monitoring staff that ignores alerts or doesn’t know how to react isn’t benefiting anyone.

Don’t just take the default values offered by the monitoring package. Some servers just don’t behave in a normal fashion. Generate alerts on values that are outside the server’s normal behavior. Don’t always assume that a resource will hit 100% if there is a problem. Similarly, don’t focus only on high utilization. If a server has been between 30%–40% utilization for the past year, a sustained 5% should be just as alarming as a sustained 75%. Both situations suggest that something bad has happened.

Responding to Problems Automatically

Traditionally, systems are monitored so that problems can be seen immediately and the appropriate administrator can be notified of the issue. The administrator would then go to the site where the system is located to perform the necessary maintenance tasks to return the system to a usable state. Modern monitoring systems can not only alert administrators to problems, they can also react to system events and process commands to attempt to fix the problem on their own. In this way, problems can be responded to automatically. Often simple fixes can be attempted by the system itself; if these fixes fail, the problem can be escalated to an administrator who can handle the problem in person.

Reactive Monitoring Systems Are No Replacement for Qualified Technical Resources

Although many monitoring systems are quite sophisticated, they are only as good as the administrator that configured their responses. Even then, reactive monitoring systems are no replacement for qualified technical resources.

Triggering External Scripts

One of the simplest ways to allow a system to repair itself in the event of simple problems is by triggering an external script. External scripts also enable the clever administrator to extend the abilities of the monitoring system. Rather than using static reactions to an event, an external script can call other programs to perform more advanced tasks, such as determining which administrator is on call and paging that person rather than statically paging the same person every time. External scripts also enable an administrator to stack events, such as triggering a pager, sending e-mails to multiple recipients, or simply trying to restart a series of services.

External Scripts Might Be Prevented

Some external scripts might be prevented from running by OS-level security settings. MAPISEND, for example, will not be allowed by default with Outlook 2000 SP-1 and higher because the default security settings don’t allow an external script to use a MAPI profile without user intervention. This behavior can be excluded for a system in the Exchange configuration. Always be aware of these types of limitations with scripts and test them before they are used in production.

External scripts often require additional programs to fully execute all the items an administrator would typically want to occur. One of the most useful things an administrator can do is initiate a command-line e-mail message. Programs like Mapisend.exe can be leveraged to send messages to different resources indicating the situation that has occurred:

mapisend -u Outlook -p password -r [email protected] -s "%hostname% is down" -m "at %time% on %date%"

Services Recovery and Notification

Windows 2003 has a built-in function that allows services not only to attempt to restart themselves, but also to alert the system in some way that a restart has occurred. On the Recovery tab of a service’s Properties there are options for what the system should do on the first, second, and subsequent failures, as well as an option to determine when to reset the failure counter. The options are Take No Action, Restart the Service, Run a Program, and Restart the Computer. Although restarting the service might seem very tempting, it is preferable to run an external program instead. This external program should restart the service, but it should also alert you that the service has been restarted. If the monitoring system in use doesn’t natively detect service failure, the program could send an SNMP trap, send a message to the monitoring system, or trigger an e-mail to alert you that the event has occurred. This gives you great flexibility in integrating this built-in function with almost any type of monitoring system.

Append an Entry to a Log File

Have the program that restarts the service append an entry to a log file by passing an error message along with the %time% and %date% parameters to enable you to check a single file to determine when and how often a service is failing and restarting. This information, covering a long stretch of time, can be a very useful troubleshooting tool.
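A minimal sketch of such a recovery program follows. The service name and log path are hypothetical, and VBScript’s Now function stands in for the %time% and %date% shell parameters:

' Recovery action: restart the failed service and log the event
Const ForAppending = 8
strService = "Simple Mail Transfer Protocol"

Set objWMI = GetObject("winmgmts:{impersonationLevel=impersonate}!\\.\root\cimv2")
Set colSvc = objWMI.ExecQuery( _
    "SELECT * FROM Win32_Service WHERE DisplayName = '" & strService & "'")
For Each objSvc In colSvc
    objSvc.StartService
Next

' Append a timestamped entry so failures can be tracked over time
Set fso = CreateObject("Scripting.FileSystemObject")
Set log = fso.OpenTextFile("C:\Logs\restarts.log", ForAppending, True)
log.WriteLine Now & " - " & strService & " failed and was restarted"
log.Close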

Using Microsoft Operations Manager for Advanced Automation

Microsoft has made a huge push into the arena of enterprisewide system monitoring. Microsoft released Microsoft Operations Manager, or MOM, as a tool that integrates tightly into the monitoring and alerting process for Microsoft’s technologies. The result is a monitoring application that supports hardware-level, port-level, service-level, and application-level monitoring. By having access to Microsoft’s full source code for all Microsoft applications, the MOM developers were able to create management packs that could gather every last iota of useful information from an application and allow rules and thresholds to determine when to generate alerts or log useful information.

Understanding MOM

MOM, as a comprehensive monitoring and alerting package, consists of data providers, event correlation, filters, rules, knowledge packs, and knowledge base integration that work together to not only monitor a system, but also to link the administrator to solutions for problems. Rather than just identify a problem, MOM is able to link the user to the Microsoft Knowledge Base to suggest solutions to the issue that has arisen. Similarly, MOM enables you to store problem resolution in a local knowledge base so that other site administrators can learn from the past experiences of their coworkers. Rather than reinvent the wheel each time, MOM helps companies pool together the “islands of knowledge” that exist at any company. By putting the resources into a central location, it is easier for administrators to draw from it.

Benefits of MOM

MOM is oriented around three primary goals—managing, tuning, and securing Windows and Windows-based applications.

In the area of managing, MOM offers full-time monitoring of all aspects of the Windows server–based environment. It provides proactive alerting and responses by using built-in filters and logic to recognize events and conditions that can lead to failure in the future.

Like most monitoring applications, MOM collects long-term trending data about the performance of a system. MOM takes this concept one step further by providing suggestions for improving performance and enabling you to compare the results of performance adjustments to historical information. This addresses one of the fundamental issues with performance tuning, which is having a valid benchmark of the data that can be referenced historically to see if changes to the system are actually improving the performance of the system. MOM provides the empirical data needed to measure the effect of system tuning.

Windows 2000 and Windows 2003 provide excellent auditing capabilities. The problem is that this can produce an incredible amount of data that must be reviewed regularly by the system administrator. The sheer volume of data limits the amount of attention an administrator can give to it, making it nearly impossible to really review the security logs for subtle security problems. The natural tendency of the system administrator is to reduce the number of items being audited. Although this frees up time for the administrator, it reduces the amount of valuable data entering the system. Unlike an administrator, MOM will tirelessly monitor the logs on every server around the clock, correlating individual events to identify potential hacking attempts or security breaches. MOM can be an administrator’s best friend because it is able to take on the tedious task of reviewing the event logs on all servers in the enterprise to determine if the conditions for a failure are present.

Statistics suggest that 40% of system outages are caused by application failure, including software bugs, application-level errors, and interoperability problems. Another 40% of outages are attributed to operator errors, including configuration errors, entering data incorrectly, and failure to monitor. The remaining 20% are attributed to hardware failures, power failures, natural disasters, and so on.

As you can see, application-level errors and operator errors together account for 80% of system outages. As such, the greatest return on investment for system uptime is to focus on application failures and operator errors. Although the end users are very good at spotting and reporting system outages, it is greatly preferred to predict potential outages and fix them proactively.

In large companies, administrators tend to work in groups with other administrators who are knowledgeable in a specific area. By putting these specialists together in teams, systems can be effectively managed by these experts. The downside to this philosophy is that a company ends up creating isolated containers of knowledge. Groups that specialize in managing a specific application might not be knowledgeable about the operating system that it runs on. Similarly, applications that are dependent on other applications are usually managed by administrators who only understand their own application, not the applications upon which they are dependent. The result of this is that information outside a group’s area of expertise is not well utilized. An Exchange support group might be getting error messages in the event log that reference data about the connection to a SAN. Without SAN knowledge, the Exchange group can’t know if the log entries are problems or simply informative messages. This can make it very easy to ignore potential problems.

MOM attempts to combat this type of issue by providing its own expertise and knowledge. MOM can correlate events with other events and predict the actual outcome. MOM draws information from each of the separate systems in the network and places it in a single location. Equally important, MOM stores this information long term. A busy administrator can easily miss key event log entries because they are overwritten by other events. MOM, on the other hand, reads each and every log diligently and reacts to events based on filters and logic. By storing these key events centrally over a long period of time, administrators are able to go back and look at historic events on a server. By having access to all the data centrally, MOM is able to act on the big picture rather than only be able to react to individual system problems.

Similarly, by having access to all the data and seeing the big picture, MOM is able to filter out false positives by understanding what errors are actually results of a “lowest common denominator” error. For example, if MOM knows that the local router interface is down, it knows not to report all objects known to be on the far side of that router as down as well. It knows that the service checks or application parameter checks on those systems are failing because the system is unreachable. This drastically reduces the number of false positives and reduces the load on the system.

The other area in which MOM really shines is in helping to secure the servers in the enterprise. MOM is able to monitor remote servers for the presence of security patches and hot fixes. Because MOM is tied in with the Microsoft Knowledge Base, it is able to determine what patches should be on a system based on the services it sees the system running.

Having a centralized view of a distributed environment makes managing the security of the environment much easier. By being able to monitor such a large number of events on a server and having access to a centralized knowledge base, MOM is able to perform a basic level of Intrusion Detection as well. MOM will recognize patterns in traffic and events on a server that most administrators will miss.

Third-Party Monitoring and Alerting

There are many other third-party monitoring solutions on the market that provide various levels of monitoring, reporting, alerting, and trend analysis. Some of the more popular ones are

  • HP OpenView

  • Unicenter TNG

  • Servers Alive

  • WhatsUp Gold

  • BMC Patrol

  • SiteScope

  • MRTG

Aside from HP OpenView and BMC Patrol, most of these applications are meant for smaller networks and do not provide the depth of monitoring options that an administrator would get from something like MOM. For small environments, these applications do a good job of alerting administrators when monitored parameters surpass a particular threshold, but they fall short in providing knowledge base links, local knowledge bases, and correlation of events to determine holistic situations.

Improving Monitoring Via SMS

Most administrators view Systems Management Server (SMS) as purely a tool for distributing software. Although it is very good at this task, a clever administrator can leverage the capabilities of SMS to further enhance the monitoring environment. Most monitoring packages deal exclusively with servers and network hardware; SMS, on the other hand, is focused mostly on desktops. Because licensing a monitoring package to monitor desktops is usually prohibitively expensive, SMS is a logical choice because it is already gathering information on all the desktops.

SMS software inventory reports are a great source of data to mine for potential intrusions into the network. Monitoring systems for unexpected software packages is a great way to catch viruses or Trojan horses that install themselves on a system. After all, one of the key points of monitoring is to improve network security. As any administrator will tell you, the greatest threat to a network’s security is its end users, and SMS is a great tool to keep tabs on end users’ computers.

Summary

This chapter touched on many aspects of monitoring and alerting. You learned that monitoring can come from many sources. ICMP ping checks, OS service statuses, port query responses, and detailed application interrogation can all supply an administrator with information about the status of a system. Monitoring systems can leverage existing standards like SNMP (Simple Network Management Protocol), WMI, or NetBIOS queries or they might require dedicated agents to be installed on the systems to be monitored.

Monitoring information can be useful in both a short-term and long-term sense. Short-term monitoring can determine whether a service stops, whether a piece of hardware changes status, or whether a system stops responding to service requests. Performance counters can be queried for point-in-time information to see if they are within expected ranges. This information can also be useful from a long-term tracking standpoint. By tracking when services go from up to down, an administrator can begin to accurately track system uptime. Performance counters can be used from a capacity-planning standpoint to predict when resources will become insufficient. Administrators can plot system resource consumption against the number of users supported to gain valuable insight into the actual capacity of the systems. By having this information, upgrades to the system can be accurately measured in terms of the increase in capacity.

This chapter has shown the various types of monitoring available on the market and has made suggestions about how to determine what parameters should be monitored on different types of servers.

This chapter has introduced the reader to the concepts of scripting. Scripting is an excellent way to extend the capabilities of a monitoring and alerting system by creating complex responses to simple inputs. Rather than just raise a flag when an error occurs, a script can attempt to fix the situation and alert multiple resources about the situation.

You also saw how dedicated monitoring and alerting applications like MOM take the concept to new heights. By centralizing monitoring results, MOM also centralizes the information and expertise. By applying its own knowledge of software applications, MOM is often better able to diagnose problems than the administrator is. MOM intelligently links the administrator to knowledge specific to the situation encountered. By offering knowledge packs for all Microsoft applications, MOM has a significant advantage over other monitoring systems.
