Chapter 4. Monitoring Application Performance with System Monitor

The Microsoft Windows 2000 family of servers provides a graphical performance-monitoring tool called System Monitor, which is a Microsoft Management Console (MMC) snap-in. (It was called Performance Monitor, or Perfmon, in previous versions of Windows NT.) This snap-in assists you with monitoring the performance of your Web applications, database systems, and hardware resources on Windows-based servers. The Performance console has two parts: System Monitor, for monitoring your system in real time, and Performance Logs and Alerts, for monitoring your system in logging mode.

In this chapter we discuss how to use the MMC snap-in, which resources to monitor, and how to interpret some of the most common system counters. Performance objects and counters specific to the .NET Framework are discussed in Chapter 7, ASP.NET and IIS counters and objects are detailed in Chapter 6, and SQL Server–specific counters and objects in Chapter 8. Note that we refer to the Windows 2000 server family in this chapter, not Windows .NET Server, because at the time of this writing, Windows .NET Server is still in beta.

Using System Monitor

The units of measurement used to monitor hardware and software resources through System Monitor are called counters, which are then further grouped into categories called objects. In some cases, counters also have instances. For example, when monitoring the processor activity of a Web server, you monitor the % Processor Time counter, which is found under the Processor object. If there is more than one processor in the server, you can choose to monitor the total activity of all of the processors or instances for each individual processor.

A default set of System Monitor objects and counters is made available when you install Windows 2000 Server. If a specific application, such as SQL Server, is installed on the server, SQL Server–specific objects and counters are also made available. In this case, System Monitor uses remote procedure calls to collect information from SQL Server.

System Monitor consumes a small amount of CPU and disk resources on the system it is monitoring, which you should keep in mind when measuring and determining the performance of systems and applications. If you are monitoring the machine remotely, System Monitor will also consume bandwidth from the network card. On highly utilized systems and applications hosted on the Windows 2000 platform, this overhead could be cause for concern, especially if you’re monitoring production systems, but in most dedicated test environments it should not be a concern.

In addition to monitoring performance counters, System Monitor allows you to

  • view selected system performance objects and counters in real time.

  • log performance counter information for later analysis.

  • monitor multiple Windows 2000 servers from one instance of System Monitor.

  • create alerts to notify you when certain performance thresholds or conditions occur. For example, when the processor time goes above 90 percent, you can configure the alert to log an entry in the Event Viewer, send a network message, start a performance data log file, run a program when the condition occurs, or all of the above.

  • trace events to record data when certain activities occur for processes, threads, disk I/O, network I/O, file details and page faults. Trace logs require a parsing tool to interpret the output. You can create such a tool using APIs provided at the following location: http://msdn.microsoft.com/msdn-files. Tracing is rarely used, except by Microsoft support providers.

This chapter looks at the first two methods in detail and also provides an overview of the fourth (creating notification alerts).

Viewing Real-Time Performance Data

In real-time mode you can view performance data in three different categories: chart, histogram, or report. Viewing data in real time is useful if you know what you are looking for or if you want to verify the existence of a particular bottleneck. When you have to run a stress test overnight for a bottleneck to appear, the best approach is to log the performance data. See the section "Logging and Viewing Logged Data" later in this chapter for more information on this topic.

Chart View

The real-time chart view is an excellent method for identifying trends in the data over time and comparing multiple instances of a counter such as % Processor Time. For the most part you will find yourself using chart view for your performance tests, so it’s important to become familiar with this particular view to take full advantage of its rich feature set. There are two views available under chart view: graph view and histogram view. An example of the graph view is shown in Figure 4-1.

System Monitor graph view
Figure 4-1. System Monitor graph view

Graph View

To explain the power of chart view for monitoring performance, we’ll walk you through an example of using the graph and histogram views. Let’s say you want to monitor processor utilization of a multiprocessor system. First, launch System Monitor by following these steps:

  1. Click Start, point to Programs, then Administrative Tools, and then click Performance. The Performance console that contains System Monitor opens as shown in Figure 4-2.

    Tip

    You can also launch System Monitor by typing perfmon in the Run dialog box.

    System Monitor console tree
    Figure 4-2. System Monitor console tree
  2. Click System Monitor in the console tree on the left to view System Monitor in the right pane.

  3. If the monitor is not in chart view, click the View Chart button at the top of the right pane to switch to chart view.

Counter Color

Now that you have a graph view containing processor utilization information, let’s explore some of the features of System Monitor you can use to work with the performance data you are collecting. Some of the features are specific to chart view, while others can be used under the report and histogram views as well.

It is important to use distinct colors to identify the data you are monitoring. You can choose each color from a palette (in the Property Name list box of the System Monitor Properties dialog box) or you can base the colors on system colors defined using the Display icon in Control Panel. When using the palette, note the following:

  • BackColorCtl refers to the area surrounding the chart.

  • BackColor refers to the chart data-display area.

  • ForeColor refers to the color of the text in the display and legend.

Counter Scale

Depending on the counter you are monitoring, there will be times when you’ll need to adjust the counter’s scale so the counter information displayed makes more sense. For instance, when monitoring memory-related counters, you will need to adjust the scale of the counters that deal with bytes of information. The counter scale can be set to any power of 10 from 0.0000001 to 1000000.
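To make the scale concrete: the displayed value is simply the raw counter value multiplied by the scale factor, so a large byte count needs a small power of 10 to fit on the 0-to-100 chart. The following Python sketch is a hypothetical helper (not part of System Monitor) that picks such a factor within the allowed range:

    # Hypothetical helper (not part of System Monitor): pick a power-of-ten
    # scale factor so that a raw counter value plots within a 0-100 chart.
    def pick_scale(raw_value, chart_max=100.0):
        scale = 1.0
        # Shrink large values such as byte counts...
        while raw_value * scale > chart_max and scale > 0.0000001:
            scale /= 10.0
        # ...or stretch very small values such as fractions of a second.
        while raw_value * scale < chart_max / 10.0 and scale < 1000000:
            scale *= 10.0
        return scale

    available_bytes = 164 * 1024 * 1024        # roughly 164 MB of available memory
    scale = pick_scale(available_bytes)
    print(scale, available_bytes * scale)      # about 1e-07 and a plotted value near 17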

Counter Characteristics

When you’re selecting counters for data collection, note the counter characteristics. Some counters are formulated as instantaneous and others are averages. Instantaneous counters display the most recent measurements, while averaging counters display the average of the last two measurements over the period between two samples. For the processor object counters on a multiprocessor server, each processor will be listed in the Instances selection box. System Monitor will list the instances from zero. For example, a four-processor server will be listed from instance 0 to 3, where the third instance represents the fourth processor.

It is important that you understand the implications of the counter type on the data. For example, if Transactions/sec is being monitored, be aware that the value is calculated as the number of transactions counted during the sample interval divided by the number of seconds in that interval. Additionally, note how to interpret a spike when working with averaging counters. For example, when you first begin to monitor the % Processor Time counter, you may see an initial spike in processor usage. For an accurate view of processor utilization, wait until the second or third reading for the average value of the counter.
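To make the averaging behavior concrete, here is a minimal Python sketch of how a rate counter such as Transactions/sec is derived from two successive raw readings; the sample values are made up for illustration:

    # Sketch of how a rate counter such as Transactions/sec is derived:
    # the difference between two successive raw readings divided by the
    # elapsed seconds between samples. The values below are made up.
    def rate_counter(previous_count, current_count, interval_seconds):
        return (current_count - previous_count) / interval_seconds

    # Two raw transaction totals sampled 15 seconds apart.
    print(rate_counter(previous_count=12000, current_count=12450, interval_seconds=15))
    # Prints 30.0, i.e., 30 transactions/sec for that interval.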

Parent Instance Name

If you are monitoring threads of the Microsoft Windows Explorer process, track the Windows Explorer instance of the Thread object (Windows Explorer would be the parent instance), and then each thread running Windows Explorer (these threads are child instances). The instance index allows you to track these child instances. The instance index for the thread you want might be 0, 1, and so on, for each thread, preceded by the number sign (#). The operating system configures System Monitor properties to display duplicate instances by default. Instance index 0 is hidden; numbering of additional instances starts with 1. You cannot monitor multiple instances of the same process unless you display instance indexes.

Computer Name

Each object has counters that are used to measure various aspects of performance, such as transfer rates for disks or the amount of processor time consumed for processors.

Computer Name is the name of the computer that will be displayed at the bottom of the chart view. Be careful when collecting the same objects and counters from different servers; use the colors and different fonts to distinguish between instances on different servers.

Value Bar

The value bar under the chart contains statistical information for the currently selected counter. To turn the value bar on or off, right-click anywhere in the chart, select Properties, click the General tab in the System Monitor Properties dialog box, and then select or clear Value Bar under Display Elements.

The values displayed in the value bar are as follows:

  • Last

    The last value displayed for the currently selected counter.

  • Average

    The average value of the currently selected counter.

  • Minimum

    The minimum value of the currently selected counter.

  • Maximum

    The maximum value of the currently selected counter.

  • Duration

    The total elapsed time displayed in the graph. This value is based on the interval value you set, which determines how often counter data is collected. For more information on the interval setting, see the section "How Often Should You Collect Data?" later in this chapter. (A short sketch showing how these statistics relate to the underlying samples follows this list.)
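As a rough illustration of how these statistics relate to the logged samples, the following Python sketch computes them from a short list of made-up counter readings; treating Duration as samples times interval is an approximation of what System Monitor displays:

    # Sketch: the value-bar statistics computed from a short list of sampled
    # counter values and the sample interval. The readings are made up.
    samples = [12.0, 35.5, 41.2, 28.7, 55.1]   # e.g., % Processor Time readings
    interval_seconds = 15

    last = samples[-1]
    average = sum(samples) / len(samples)
    minimum = min(samples)
    maximum = max(samples)
    duration = (len(samples) - 1) * interval_seconds   # elapsed time shown in the graph

    print(last, round(average, 1), minimum, maximum, duration)
    # Prints 55.1 34.5 12.0 55.1 60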

Histogram View

The histogram view is the preferred method of viewing data when monitoring multiple instances of the same counter. For example, you can compare the % Disk Read Time for all of the drives in your server to understand which drive is being taxed with read requests. You can switch to histogram view by clicking the Histogram button on the toolbar or pressing Ctrl+B. Additionally, you can select Histogram on the General tab of the System Monitor Properties dialog box.

Histogram view
Figure 4-3. Histogram view

Report View

Report view is extremely useful when monitoring counters dealing with logical and physical I/O, such as disk or network I/O. For example, if you have to monitor all the processes running on your Web server concurrently, doing so under chart view would create an extremely hard-to-read graph or histogram. Instead, you can switch to report view for an easy-to-read view of the data.

To view real-time data as a report, click the Report button on the toolbar or press Ctrl+R. Alternatively, you can select Report on the General tab of the System Monitor Properties dialog box.

How Often Should You Collect Data?

For both real-time performance monitoring and data logging, you can set a specific interval for data collection. The interval you set for data collection will have a significant impact on your ability to capture potential performance bottlenecks. For the most part, the type of bottleneck you are investigating will determine the interval period you set. For instance, if you are monitoring a problem that manifests itself slowly, such as a memory leak, you should set a longer interval period. On the other hand, if the bottleneck tends to occur frequently, set a lower interval period. When you’re not sure of the bottleneck or when it’s occurring, setting the interval period to 15 minutes should be sufficient to start with.

Also consider the overall length of time you want to monitor when choosing this interval. Updating every 15 seconds is reasonable if you will be monitoring for no more than four hours. If you’ll be monitoring a system for eight hours or more, do not set an interval shorter than 300 seconds (five minutes). Setting the update interval to a frequent rate (that is, a low value) can cause the system to generate a large amount of data, which can be difficult to work with especially if you’re simultaneously monitoring a large number of counters.

Monitoring many objects and counters can also generate a large amount of data and consume disk space. Try to strike a balance between the number of objects you monitor and the sampling frequency to keep log file size within manageable limits.
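As a rough illustration of how the interval and the number of counters drive data volume, the following Python sketch estimates the number of samples and the approximate size of a CSV log. The bytes-per-value figure is an assumption; actual file sizes depend on the file format and the counters you select.

    # Rough estimate of how much data a counter log generates. The
    # bytes-per-value figure is an assumption for a CSV log; actual sizes
    # depend on the file format and the counters selected.
    def estimate_log(interval_seconds, duration_hours, counter_count, bytes_per_value=12):
        samples = int(duration_hours * 3600 / interval_seconds)
        size_mb = samples * counter_count * bytes_per_value / (1024 * 1024)
        return samples, round(size_mb, 2)

    print(estimate_log(interval_seconds=15, duration_hours=4, counter_count=20))
    # (960, 0.22) -- modest
    print(estimate_log(interval_seconds=1, duration_hours=24, counter_count=100))
    # (86400, 98.88) -- a frequent rate over a long run adds up quickly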

If you prefer to maintain a long update interval when logging, you can still view data fluctuations that occur between those intervals. To do so, see the next section, "Logging and Viewing Logged Data" for information about manipulating time ranges within logs.

Logging and Viewing Logged Data

One of the most valuable features of System Monitor is its logging capability. Regular logging of performance data allows you to compare system behavior before and after a change to the system’s hardware, software, or application. For example, suppose your company decides to launch a new marketing campaign selling its most popular widgets at 50 percent off. This sale causes a dramatic increase in traffic to the company’s Web site. Logging performance data will allow you to compare and contrast the effects of the increased user transactions before, during, and after the sale. This information can then be analyzed to determine whether the Web site is suffering from any bottlenecks or whether you have adequate hardware to support future marketing campaigns.

You can capture System Monitor data for a specific period of time and then analyze and compare the performance log files later on. This will allow you to view the system behavior over time, which can reveal trends in system usage that you might not see when viewing real-time data. For example, in your log files you might find that disk utilization is typically high from 8 P.M. to 10 P.M. and lower for the remainder of the time. You might be able to attribute this trend to heavy data entry or a database backup during these hours.

Note

When you use System Monitor, it is most efficient to start the log locally on the server you want to collect data from. You can access the log file later from a remote system if you need to. If you must log over the network, reduce the number of objects and counters to the most critical ones.

To start logging information on a Windows 2000 server using System Monitor, follow these steps:

  1. Click Start, point to Programs, then Administrative Tools, and then click Performance.

  2. Expand Performance Logs And Alerts in the left pane of the performance window.

  3. Select Counter Logs in the console tree.

  4. In the right pane, right-click and choose New Log Settings from the shortcut menu as shown in Figure 4-4.

    Create log
    Figure 4-4. Create log
  5. Enter a name to identify the log settings in the New Log Settings dialog box as shown in Figure 4-5, and then click OK. We chose IBuyspy for the name of this log setting.

    New Log Settings
    Figure 4-5. New Log Settings
  6. The name you choose for the log setting will appear as the title for the Performance Log dialog box. Click the Add button to open the Select Counters dialog box.

  7. Add the counters that you want saved to the log file by highlighting them and then clicking the Add button. When finished adding counters, click Close in the Select Counters dialog box. On the General tab of the Performance Log dialog box, you can also set the interval for sampling data.

  8. Click the Log Files tab to set log file–specific information, such as location, file name, file type, and maximum file size limit. Set the file type to CSV, especially if you are collecting data over a long period of time, to keep the log manageable. As shown in Figure 4-6, in this example we save the file locally.

    Note

    It is always a good idea to select a long sample interval when collecting performance data over an extended period of time. This is especially important when saving data as a binary file, because binary files are always larger than CSV files.

    Log file information
    Figure 4-6. Log file information
  9. Click the Schedule tab to set the schedule for logging. If you do not enter a time for the logging to stop, it will continue until you stop it manually.

  10. Click OK. The log file name should appear in the right pane. The green icon next to the log name shows logging has started, as shown in Figure 4-7.

    Log file started
    Figure 4-7. Log file started

    The icon will be red if the logging is stopped. You can start and stop logging manually in this window by right-clicking the name of the log setting and choosing Start or Stop from the shortcut menu.

After you have collected information and stopped the logging process, use the following steps to load logged data into the system and view it.

  1. Click System Monitor in the left pane to view System Monitor in the right pane.

  2. Click the View Log File Data button at the top of the right pane to open the Select Log File dialog box, as shown in Figure 4-8. Navigate to the folder, select the log file you want to view, and then click Open.

    Log File dialog box
    Figure 4-8. Log File dialog box
  3. Click the Add button to add the counters you want to view from the selected log file. When you close the Add Counters dialog box you will see the selected counters in System Monitor, as shown in Figure 4-9. Only the counters selected for initial logging will be available for viewing.

    Add Counters
    Figure 4-9. Add Counters

You can narrow your view of the log file by following these additional steps:

  1. After you have loaded your log file into System Monitor, click the Properties button at the top of the right pane.

  2. In the System Monitor Properties dialog box, click the Source tab.

  3. Near the bottom of the Source tab, drag both ends of the time range bar to include the times that you want to see in the view range, as shown in Figure 4-10.

    Log file information
    Figure 4-10. Log file information
  4. (Optional) Use the other tabs in the System Monitor Properties dialog box to change the different characteristics of the output.

  5. Click OK to return to System Monitor and view the selected range of the data.
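If you saved the log as a CSV file, you can also summarize it outside System Monitor. The following Python sketch is a rough example that assumes the first column of the CSV log holds the sample timestamps and the remaining column headers are counter paths; the file name and counter path shown are placeholders, and the exact header layout of your log may differ:

    import csv
    import statistics

    # Rough sketch: summarize one counter column from a CSV counter log.
    # The file name and counter path below are placeholders; the first
    # column of the log is assumed to hold the sample timestamps.
    def summarize(path, column_name):
        values = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                raw = (row.get(column_name) or "").strip()
                if raw:                      # skip empty or missing samples
                    values.append(float(raw))
        return {
            "samples": len(values),
            "average": statistics.mean(values),
            "minimum": min(values),
            "maximum": max(values),
        }

    # Hypothetical usage with a counter path as it might appear in the header:
    # print(summarize("perf_log.csv", r"\\WEBSRV01\Processor(_Total)\% Processor Time"))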

Monitoring Remote Computers

With System Monitor you can remotely collect and monitor data from multiple machines. There could be several reasons why you would want to monitor remote machines—for example, the server in question could be located in a lab or production environment across the country.

Before you start monitoring remote computer performance, you must have the Access This Computer From The Network right. To grant this right, follow these steps:

  1. From the Administrative Tools folder, launch the Local Security Policy program.

  2. Double-click the Local Policies folder to expand it.

  3. Double-click the User Rights Assignment folder. The currently defined list of policy rights will be displayed.

  4. Find the policy Access This Computer from the Network and double-click to open it. A list of users and groups assigned to the policy will be displayed.

  5. To assign additional users or groups to the policy click the Add User or Group button.

To monitor a remote computer follow these steps:

  1. Start System Monitor and then press Ctrl+I to open the Add Counters dialog box.

  2. In the Select Counters From Computer list, select or type the name of the computer you want to monitor.

You should keep your connection speed in mind when monitoring computers remotely. If you have a slow connection speed (128 Kbps or slower) you may want to switch from the default chart view to report view. This way you’ll be passing far fewer graphics over the wire, and avoiding delays in results being displayed on your screen.

Monitoring Objects, Counters, and Instances for Performance Bottlenecks

In this section we elaborate on some of the most commonly used performance counters. These counters are key when you’re attempting to determine processor, memory, and disk bottlenecks. We provide some real-world examples that show how to determine each bottleneck. We do not cover all of the system counters, as that would require a book of its own.

Note

If you don’t see a counter mentioned that you need information on, you can use online help to get a description of each counter. To access this information, open System Monitor and follow the steps to add a counter, described earlier in this chapter. When the Add Counters dialog box opens, click the counter you want information about and then click the Explain button.

Processor Bottlenecks

When analyzing the performance of a Web application, one of the most commonly observed components is CPU utilization. The server’s CPU is performing complex operations and therefore is a logical place to start when observing the performance of the Web server. The general processor information is contained in the processor object. The primary objective of monitoring processor-specific counters is to identify any potential processor bottlenecks on the server. As a best practice you should limit CPU utilization to an average of 75 percent or below for each processor, although short bursts of 100 percent utilization could be tolerable depending on the nature and user base of the application. High CPU utilization can lead to high context switching (discussed later in the chapter) which causes undesirable overhead. So, though you may be running at 90 percent CPU utilization, you might not be getting optimal throughput compared to when you’re running at 70 percent or 75 percent utilization.

For a system with multiple processors, System Monitor lists an instance for each processor in the Add Counters dialog box. You can also view the average value of all processors by monitoring the _Total instance. In a single-processor system, System Monitor lists the _Total instance and one processor instance; both refer to the single processor.

Below is a list of counters that should be monitored when investigating a processor level bottleneck. Additionally, we describe best practices when using the Processor object counters.

  • % Processor Time

    The percentage of elapsed time that the processor spends executing a non-idle thread. It is calculated by measuring the duration of the idle thread that is active in the sample interval, and subtracting that time from the interval duration. (Each processor has an idle thread that consumes cycles when no other threads are ready to run.) This counter is the primary indicator of processor activity, and displays the average percentage of busy time observed during the sample interval. It is calculated by monitoring the time that the service is inactive and subtracting that value from 100 percent.

  • % Privileged Time

    The percentage of elapsed time that the process threads spent executing code in privileged mode. When a Windows system service is called, the service will often run in privileged mode to gain access to system-private data. Such data is protected from access by threads executing in user mode. Calls to the system can be explicit or implicit, such as page faults or interrupts. Unlike some early operating systems, Windows uses process boundaries for subsystem protection in addition to the traditional protection of user and privileged modes. Some work done by Windows on behalf of the application might appear in other subsystem processes in addition to the privileged time in the process.

  • % User Time

    The percentage of time the thread is running in the code of a user-mode process or code other than the operating system’s code. % User Time should always be checked against % Privileged Time on the System and Processor objects, because together they account for the total non-idle time.

  • % Interrupt Time

    The time the processor spends receiving and servicing hardware interrupts during sample intervals. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network interface cards and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended during interrupts. Most system clocks interrupt the processor every 10 milliseconds, creating a background of interrupt activity. This counter displays the average busy time as a percentage of the sample time.

  • Interrupts/sec

    The average rate, in incidents per second, at which the processor received and serviced hardware interrupts. It does not include deferred procedure calls (DPCs), which are counted separately. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network interface cards, and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended. The system clock typically interrupts the processor every 10 milliseconds, creating a background of interrupt activity.

To observe the efficiency of a multiprocessor computer, use the counters listed in Table 4-1.

Table 4-1. Counters for Multiprocessor Computers

Counter: Process\% Processor Time
Description: The sum of processor time on each processor for all threads of the process.

Counter: Processor(_Total)\% Processor Time
Description: Lists activity for all processors in the computer. It is the average non-idle time of all processors during the time interval divided by the number of processors. Note that the counter will have a value of 50 percent if all processors are busy for half of the sample interval or if half of the processors are busy for the entire interval.
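The 50 percent example in Table 4-1 is easy to verify with a little arithmetic. The following Python sketch uses made-up per-processor busy times over a 10-second sample interval to show that both situations produce the same Processor(_Total) value:

    # Two made-up four-processor scenarios over a 10-second sample interval.
    # Processor(_Total) % Processor Time is the combined busy time of all
    # processors divided by (number of processors x interval length).
    interval = 10.0
    all_busy_half = [5.0, 5.0, 5.0, 5.0]     # every processor busy for half the interval
    half_busy_all = [10.0, 10.0, 0.0, 0.0]   # half the processors busy the whole interval

    def total_processor_time(busy_seconds, interval):
        return 100.0 * sum(busy_seconds) / (len(busy_seconds) * interval)

    print(total_processor_time(all_busy_half, interval))    # 50.0
    print(total_processor_time(half_busy_all, interval))    # 50.0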

A processor bottleneck occurs when demand for processor time outstrips what the system or the applications deployed on it can supply. Processor requests queue up and CPU utilization stays high until the queue empties, which causes system response to degrade.

When you find that the processor utilization on a server is consistently high (90 percent or higher) it usually leads to processes queuing up, waiting for processor time, and causing a bottleneck. Such a sustained high level of processor usage is unacceptable for a server.

Let’s discuss an example of high processor utilization. If you are monitoring an IIS server hosting a single Web site that relies upon a legacy COM+ application written in Visual Basic 6 to parse through extensive XML documents, you may find that the COM+ application is utilizing more than 90 percent of the processor’s time. This high processor utilization by the COM+ application affects the Web application’s ability to handle new connections to the site. If you understand the type of bottleneck (in this case, a processor bottleneck) and the root cause of the bottleneck (a processor-hungry COM+ application), you can decide how to handle the resource problem. One solution may be to physically separate the COM+ application from the Web server, or to convert your code to more efficient and faster-performing managed code.

Note

When examining processor usage, keep in mind the role of the computer and the type of work being done. High processor values on a SQL server are less desirable than on a Web server.

There are two methods for correcting most processor bottlenecks. The first is to add faster or additional processors to your system. The downside to this option is that it’s not cost-effective and is a temporary solution. The next surge in traffic to your Web site will cause you to scramble to add additional hardware or replace the old servers with newer, faster servers. The other and more appropriate route is to analyze the software to see which specific process or portion of the application is causing this bottleneck. As a rule, you should always try to performance-tune your software before resorting to the more costly route of adding hardware. In addition to monitoring counters found under the Processor object, there are other counters found under the System object that you should monitor when verifying the existence of a processor bottleneck.

System Object

The System object and its associated counters measure aggregate data for threads running on the processor. They provide valuable insights into your overall system performance. The following system counters are the most important to monitor.

  • Processor Queue Length

    The number of threads in the processor queue. Unlike the disk counters (discussed later in the chapter), this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload.

    One way to determine if a processor bottleneck exists with your application is to monitor the System Processor Queue Length counter. A sustained queue length along with an over-utilized processor (90 percent and above) is a strong indicator of a processor bottleneck.

    When monitoring the Processor Queue Length counter we generally do not want to see a sustained processor queue length of 2 or more along with high processor utilization. If you find that the queue length is 2 or higher, but your processor utilization is consistently low, you may be dealing with some form of processor blocking rather than a bottleneck.

    You can also monitor Processor % Interrupt Time for an indirect indicator of the activity of disk drivers, network adapters, and other devices that generate interrupts.

  • Context Switches/sec

    The combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is pre-empted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service. It is the sum of Thread\Context Switches/sec for all threads running on all processors in the computer and is measured in numbers of switches. There are context switch counters on the System and Thread objects. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

    A system that experiences excessive context switching due to inefficient application code or poor system architecture can pay an extremely high cost in terms of resource usage. Your goal should always be to decrease the amount of context switching occurring at your application or database servers. Context switches essentially prevent the server from getting any real work done: valuable processor resources are spent dealing with threads that can no longer run because they are blocked waiting for a logical or physical resource or have put themselves to sleep. Symptoms of high context switching can include lower throughput coupled with high CPU utilization, which begins to occur at switching levels of 15,000 or higher. You can determine whether context switching is excessive by comparing it with the value of Processor % Privileged Time. If this counter is at 40 percent or more and the context-switching rate is high, then you should investigate the cause of the high rate of context switches.

    Finally, when monitoring your system you should make sure that the System\Context Switches/sec counter, which reports system-wide context switches, is close to, if not identical to, the value provided by the _Total instance of the Thread\Context Switches/sec counter. Monitoring this over time can help you determine the range by which the two counters’ values might vary.
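Pulling the processor thresholds from this chapter together (a sustained per-processor queue length of 2 or more with utilization at 90 percent or higher, and a context-switch rate of 15,000 or more with % Privileged Time at 40 percent or more), a rough Python sketch like the following could flag these symptoms when post-processing logged averages. The function and its inputs are illustrative assumptions, not part of System Monitor:

    # Sketch: flag processor-bottleneck symptoms from averaged counter values.
    # Thresholds are the rules of thumb given in this chapter; the inputs are
    # assumed to be sustained averages, not single samples.
    def diagnose_processor(avg_queue_length, processor_count, avg_cpu_percent,
                           context_switches_per_sec, privileged_time_percent):
        queue_per_cpu = avg_queue_length / processor_count
        findings = []
        if queue_per_cpu >= 2 and avg_cpu_percent >= 90:
            findings.append("likely processor bottleneck (sustained queue plus high CPU)")
        elif queue_per_cpu >= 2:
            findings.append("possible processor blocking rather than a bottleneck")
        if context_switches_per_sec >= 15000 and privileged_time_percent >= 40:
            findings.append("excessive context switching worth investigating")
        return findings or ["no processor symptoms flagged"]

    print(diagnose_processor(avg_queue_length=9, processor_count=4, avg_cpu_percent=95,
                             context_switches_per_sec=22000, privileged_time_percent=45))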

Disk Bottlenecks

Disk space is a recurring problem. No matter how much drive space you configure your servers or network storage devices with, your software seems to consume it. However, disk bottleneck problems are related to time, not disk space. When the disk becomes the limiting factor in your server, it is because the components involved in reading from and writing to the disk cannot keep pace with the rest of the system.

The parts of the disk that create a time bottleneck are less familiar than the megabytes or gigabytes of space. They include the I/O bus, the device bus, the disk controller, and the head stack assembly. Each of these components contributes to and, in turn, limits the performance of the disk configuration.

System Monitor measures different aspects of physical and logical disk performance. To truly understand the state of disk resource consumption you will need to monitor several disk counters, and in some instances you will need to monitor them for several days. On top of this, you will probably find yourself working through some mathematical formulas to determine whether or not a disk bottleneck exists at your server. These formulas are detailed in the real-world example below. Before we delve into them, however, let’s review some of the counters you will monitor when hunting down a disk bottleneck. These counters will allow you to troubleshoot, capacity plan, and measure the activity of your disk subsystem. Some of them also supply the raw numbers required by the disk bottleneck formulas.

  • Average Disk Queue Length

    The average number of both read and write requests that were queued for the selected disk during the sample interval.

  • Average Disk Read Queue Length

    The average number of read requests that were queued for the selected disk during the sample interval.

  • Average Disk Write Queue Length

    The average number of write requests that were queued for the selected disk during the sample interval.

  • Average Disk sec/Read

    The average time, in seconds, of a read of data from the disk.

  • Average Disk sec/Transfer

    The time, in seconds, of the average disk transfer.

  • Disk Reads/sec

    The rate of read operations on the disk.

  • Disk Writes/sec

    The rate of write operations on the disk.

How the ACE Team Discovered a Disk Bottleneck

An internal product team at Microsoft was interested in evaluating server hardware from two different vendors. These servers would be used to host the SQL database for a Web application the team was designing. This Web application would be accessed by several thousand customers simultaneously; therefore, selecting the right hardware was critical to the success of the project. The product team was interested in conducting several stress tests and monitoring the effect these tests had on the SQL server’s resources.

A stress test harness was developed that simulated production environment activity. The stress harness was written using Visual Basic and run on client machines as a Win32 application. One hundred client machines were configured to execute the stress test harness. The stress harness was designed to spawn instances that simulated five users per instance, each connecting to a different database (that is, db1 through db5) on the server. The workflow resulted in each client executing a SQL batch file via ADO or an OSQL instance for each operation. These batch files were generated by using SQL Profiler to trace manual user navigation of the site and then saving the trace as a SQL batch file. The operations performed in this manner for these tests were:

  • Load the login page

  • Select a user name and hit enter

  • Load the tasks page

  • Submit actual work times to the manager

  • Load the resource views page

  • Set and save notification reminders

  • Delegate one task to another resource

The client machines were configured so that all of the 500 databases at the SQL server would be accessed during the tests. This helped prevent any one of the databases from receiving a majority of the SQL transactions. After configuring the client machines, the stress test harness was started and run for 20 minutes (15 minutes were set aside as a warm up period). During these 20 minutes, performance data at the SQL server was collected for benchmark purposes.

A wait time of 10 and 60 seconds was used when executing the load against the targeted databases. Each simulated user started the test at a random offset from the global start time of the test and performed one operation. The user would then wait either 10 or 60 seconds before beginning the next operation.

Executing both scenarios produced significantly high disk read and write times, which prompted an investigation into the disk capacity of the hardware being utilized. The calculations indicated that the I/Os per disk exceeded the number of I/Os the manufacturer specified the disk could successfully handle.

The performance data collected during the 10 second and 60 second wait-time benchmark indicated the existence of a disk bottleneck at Server 1. In order to verify this, our team applied the performance data gathered from the physical disk activity to the following formula:

I/Os per Disk = [Reads + (4×Writes)] / Number of Disks

If the calculated I/Os per disk exceeded the capacity for the server, this would verify the existence of a disk bottleneck. The disk I/O capacity and calculated I/Os per disk are outlined below. Note that for each of the calculations, 85 random I/Os per disk is used as the capacity for a disk in a RAID 5 configuration.

10-Second Wait Time Test Scenario on Server 1

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [269.7 + (4×74.6) ] / 5

Calculated I/Os per disk = 113.62 random I/Os per disk

At 113.62 random I/Os per disk, Server 1 is suffering from a disk bottleneck, because the capacity of each disk in the server is only 85 random I/Os per disk.

10-Second Wait Time Test Scenario on Server 2

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [138.3 + (4×43.0)] / 4

Calculated I/Os per disk = 77.7 random I/Os per disk

At 77.7 random I/Os per disk, Server 2 is below the capacity of 85 random I/Os per disk; therefore, no disk bottleneck exists.

60-Second Wait Time Test Scenario on Server 1

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [294.8 + (4×71.8) ] / 5

Calculated I/Os per disk = 116.4 random I/Os per disk

At 116.4 random I/Os per disk, Server 1 is suffering from a disk bottleneck, because the capacity of each disk in the server is only 85 random I/Os per disk.

60-Second Wait Time Test Scenario on Server 2

Disk I/O capacity = 85 random I/Os per disk

Calculated I/Os per disk = [68.9 + (4×24.0) ] / 4

Calculated I/Os per disk = 41.2 random I/Os per disk

At 41.2 random I/Os per disk, Server 2 is significantly below the capacity of 85 random I/Os per disk; therefore, no disk bottleneck exists. At 113.62 and 116.4 random I/Os per disk, respectively, Server 1 exceeds the 85 random I/Os per disk the manufacturer specified that each disk could sustain, confirming the disk bottleneck.
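The arithmetic in the four scenarios can be reproduced directly from the formula. The following Python sketch plugs in the Disk Reads/sec and Disk Writes/sec values reported above, with the factor of 4 on writes reflecting the RAID 5 write penalty; minor rounding differences from the quoted figures are possible:

    # Reproduce the RAID 5 I/Os-per-disk check used in the example above.
    # The factor of 4 applied to writes reflects the RAID 5 write penalty.
    def ios_per_disk(reads_per_sec, writes_per_sec, disk_count):
        return (reads_per_sec + 4 * writes_per_sec) / disk_count

    CAPACITY = 85   # random I/Os per disk assumed for a RAID 5 member disk

    scenarios = {
        "10-sec wait, Server 1": (269.7, 74.6, 5),
        "10-sec wait, Server 2": (138.3, 43.0, 4),
        "60-sec wait, Server 1": (294.8, 71.8, 5),
        "60-sec wait, Server 2": (68.9, 24.0, 4),
    }

    for name, (reads, writes, disks) in scenarios.items():
        load = ios_per_disk(reads, writes, disks)
        verdict = "disk bottleneck" if load > CAPACITY else "within capacity"
        print(f"{name}: {load:.1f} I/Os per disk ({verdict})")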

Disk Architecture Matters to Performance

Today, many Web applications are built to interact with a database server. Many, if not all, of the applications we test use SQL Server 2000, and in most cases we find some significant performance gains by tuning the SQL server. These wins come through optimization of the SQL code, database schema, or disk utilization. When designing the architecture of your database, you will be required to select how data and log files are read from and written to disk. For example, do you want to write your log files to a RAID device or a non-RAID device? If you do not make the right choices, you can end up with a disk bottleneck. In one such case we were able to apply formulas that proved or disproved the existence of a disk bottleneck. You will find details of the project and the formulas utilized in the real-world example above.

Memory

When analyzing the performance of your Web applications, you should determine if a system is starving for memory due to a memory leak or other application fault, or if the system is simply over-used and requires more hardware. In this section we discuss the counters you should monitor to determine the existence and then cause of the memory bottleneck. (Note that there are tools available to you other than System Monitor to analyze memory utilization of a server. It may be worth your while to investigate some of these tools, as they can save time when monitoring the system.)

  • Page faults/sec

    The average number of pages faulted per second. It is measured in number of pages faulted per second because only one page is faulted in each fault operation; hence this is also equal to the number of page fault operations. This counter includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory). Most processors can handle large numbers of soft faults without significant consequences. However, hard faults, which require disk access, can cause significant delays.

  • Available Bytes

    Indicates how many bytes of memory are currently available for use by processes. Pages/sec provides the number of pages that were either retrieved from disk due to hard page faults or written to disk to free space in the working set due to page faults.

  • Page Reads/sec

    The rate at which the disk was read to resolve hard page faults. It shows the number of read operations, without regard to the number of pages retrieved in each operation. A hard page fault occurs when a process references a page in virtual memory that is not in the working set or elsewhere in physical memory, and must be retrieved from disk. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It includes read operations to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files. Compare the value of Memory\Page Reads/sec to the value of Memory\Pages Input/sec to determine the average number of pages read during each operation (a small sketch follows this list).

  • Page Writes/sec

    The rate at which pages are written to disk to free up space in physical memory. Pages are written to disk only if they are changed while in physical memory, so they are likely to hold data, not code. This counter shows write operations, without regard to the number of pages written in each operation. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.

  • Pages/sec

    The rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files.
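As noted in the Page Reads/sec description, comparing related counters yields derived figures such as the average number of pages retrieved per read operation. A minimal Python sketch with made-up sample values:

    # Derived figure suggested above: the average number of pages retrieved
    # per read operation. The sample values are made up for illustration.
    pages_input_per_sec = 120.0    # Memory\Pages Input/sec
    page_reads_per_sec = 40.0      # Memory\Page Reads/sec

    pages_per_read = pages_input_per_sec / page_reads_per_sec
    print(pages_per_read)          # 3.0 pages retrieved per read, on average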

How the ACE Team Discovered a Memory Leak

In this example we discuss how we were able to determine the existence of a memory leak in an application that was submitted to our team for performance testing. Performance analysts on our team met with the development team to understand some of the common user scenarios for the Web application. The analyst discussed existing performance issues the development team was aware of. The developers were concerned about memory usage by COM+ applications running on the Web server. Keeping this in mind, the analyst thought the best approach to ruling out memory issues would be to execute a series of stress tests. These tests would help to uncover memory utilization issues at the server if they truly existed.

The analyst built test scripts of the user scenarios provided by the development team and executed a short stress test. Performance logs recorded resource utilization at the server hosting the COM+ application. During this one-hour test the analyst observed memory consumption of approximately 20 MB. He noted that this memory was still not released three hours after the test was stopped. These findings prompted a further investigation into the application’s memory consumption (see Table 4-2).

A 12-hour continuous stress test was conducted to analyze the application’s memory behavior. At the end of the 12-hour continuous test it was discovered that, in addition to heavy CPU activity, growth in private bytes was significant for the test period and the server was extremely low on virtual memory (see Table 4-3). Of the 671 MB acquired by the dllhost private bytes, 640 MB was still allocated three hours after the test ended. Virtual memory growth appeared to be centered almost entirely on private bytes for the dllhost process. For the 1-hour test, the memory grew only from 38 to 58 megabytes. For the 12-hour test, this growth was much higher, from 368 to 671 megabytes. The memory was not released until the server was rebooted.

The dllhost process was then analyzed to identify the processes that were involved in the execution of the dllhost, to narrow down the potential memory leak to a specific process. After identifying the exact process causing the memory leak, the code for that process was profiled and the developer was able to pinpoint exactly where in his code memory was not being managed correctly. Of course, with managed code you won’t find yourself running into the slew of memory management issues you did in the days of unmanaged code.

Table 4-2. Summary of 1-hour Test Results

Windows 2000 IIS 5.0                     ~Average-IIS     ~Maximum/Total-IIS
System-% Total Processor Time            55%              100%
Inetinfo-% Total Processor Time          .5%              1%
Dllhost-% Total Processor Time           41%              100%
Memory: Available in Megabytes           164 MB           185 MB
Memory: Pages/sec                        0                .2
Inetinfo: Private in Megabytes           14 MB            14 MB
Dllhost: Private in Megabytes            38 MB            56 MB

Table 4-3. Summary of 12-hour Test Results

Windows 2000 IIS 5.0                     ~Average-IIS     ~Maximum/Total-IIS
System-% Total Processor Time            69%              100%
Inetinfo-% Total Processor Time          .6%              1.5%
Dllhost-% Total Processor Time           71%              100%
Memory: Available in Megabytes           56 MB            196 MB
Memory: Pages/sec                        51               295
Inetinfo: Private in Megabytes           14 MB            14.4 MB
Dllhost: Private in Megabytes            368 MB           671 MB

Memory leaks should be investigated by monitoring Memory\Available Bytes, Process\Private Bytes, and Process\Working Set. A memory leak typically shows Process\Private Bytes and Process\Working Set increasing while Memory\Available Bytes decreases. Verify this in Task Manager by identifying the PID and then tracing it back to your application. Memory leaks should always be confirmed by running a performance test for an extended period of time to verify the application’s reaction when all available memory is depleted.
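A simple way to turn the growing-Private-Bytes pattern into a repeatable check is to fit a trend line to the logged values. The following Python sketch is a rough illustration with made-up hourly samples; a real investigation would use the full log, a longer window, and the corresponding Available Bytes trend:

    # Rough leak check: fit a linear trend to Process\Private Bytes samples.
    # A steadily positive slope over a long test, with Available Bytes falling,
    # matches the pattern described above. The samples below are made up.
    def slope_mb_per_hour(samples_mb, interval_seconds):
        n = len(samples_mb)
        xs = [i * interval_seconds / 3600.0 for i in range(n)]    # hours
        mean_x = sum(xs) / n
        mean_y = sum(samples_mb) / n
        numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mb))
        denominator = sum((x - mean_x) ** 2 for x in xs)
        return numerator / denominator

    private_bytes_mb = [368, 420, 473, 521, 575, 622, 671]   # hourly samples
    print(round(slope_mb_per_hour(private_bytes_mb, interval_seconds=3600), 1))
    # Roughly 50.5 MB of sustained growth per hour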

Create and Configure Alerts

You can configure the Performance Logs and Alerts service to fire off alerts when a specified performance event has occurred at the server. For example, if the available memory at the Web server drops below 20 MB, an event could be triggered that satisfies one or all of the following conditions:

  • Logs an entry to the application event log

  • Sends a network message to a specified user

  • Starts a performance data log

  • Runs a specified program

There are several instances when configuring an alert to trigger an event helps increase your testing efficiency. One is when you are running an extended stress test. Let’s say the stress test must be run over a 24-hour period and you are particularly interested in what happens with the Web server’s memory. You could configure an alert that records an event to the application event log each time a spike occurs with the Pages/Sec counter. This way, you don’t have to try to count the number of spikes in an enormous log file. You can simply sort the application event log for each instance you are most concerned with.
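If you do end up with a long log anyway, the spike counting described above is also easy to do offline. The following Python sketch is a rough equivalent, assuming the Pages/sec values have already been extracted from the log into a list; the threshold and sample values are illustrative assumptions:

    # Rough offline equivalent of the alert described above: count how many
    # times Memory\Pages/sec spiked above a threshold during a long test.
    # The threshold and sample values are assumptions for illustration.
    def count_spikes(values, threshold):
        spikes = 0
        above = False
        for value in values:
            if value > threshold and not above:
                spikes += 1          # count each excursion above the threshold once
            above = value > threshold
        return spikes

    pages_per_sec = [0, 2, 180, 260, 3, 1, 0, 140, 95, 4, 310, 290, 5]
    print(count_spikes(pages_per_sec, threshold=100))    # 3 excursions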

To create an alert follow these steps:

  1. To open Performance, click Start, point to Programs, point to Administrative Tools, and then click Performance.

  2. Double-click Performance Logs and Alerts, and then click Alerts. Any existing alerts will be listed in the details pane. A green icon indicates that an alert is running; a red icon indicates an alert has been stopped or is not currently active.

  3. Right-click a blank area of the details pane and click New Alert Settings.

  4. In Name, type the name of the alert, and then click OK.

  5. To define a comment for your alert, along with counters, alert thresholds, and the sample interval, use the General tab. To define actions that should occur when counter data triggers an alert, use the Action tab, and to define when the service should begin scanning for alerts, use the Schedule tab.

Note

You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. The subkey is:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log Queries

In general, administrators have this access by default. Administrators can grant access to users using the Security menu in Regedt32.exe. In addition, to run the Performance Logs and Alerts service (which is installed by Setup and runs in the background when you configure a log to run), you must have the right to start or otherwise configure services on the system. Administrators have this right by default and can grant it to users by using Group Policy.

Caution

Incorrectly editing the registry may severely damage your system. Before making changes to the registry, you should back up any valued data on the computer.

To define counters and thresholds for an Alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Alerts.

  3. In the details pane, double-click the alert.

  4. In Comment, type a comment to describe the alert as needed.

  5. Click Add.

For each counter or group of counters that you want to add to the log, perform the following steps:

  1. To monitor counters from the computer on which the Performance Logs and Alerts service will run, click Use Local Computer Counters.

    Or, to monitor counters from a specific computer regardless of where the service is run, click Select Counters From Computer and specify the name of the computer you want to monitor.

  2. In Performance object, click an object to monitor.

  3. In Performance counters, click one or more counters to monitor.

  4. To monitor all instances of the selected counters, click All Instances. (Binary logs can include instances that are not available at log startup but subsequently become available.)

    Or, to monitor particular instances of the selected counters, click Select Instances From List, and then click an instance or instances to monitor.

  5. Click Add.

  6. In Alert When The Value Is, specify Under or Over, and in Limit, specify the value that triggers the alert.

  7. In Sample Data Every, specify the amount and the unit of measure for the update interval.

  8. Complete the alert configuration using the Action and Schedule tabs.

Note

When creating a monitoring console for export, be sure to select Use Local Computer Counters. Otherwise, counter logs will obtain data from the computer named in the text box, regardless of where the console file is installed.

To define actions for an alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Alerts.

  3. In the details pane, double-click the alert.

  4. Click the Action tab.

  5. To have the Performance Logs and Alerts service create an entry visible in Event Viewer, select Log An Entry in the Application Event Log.

  6. To have the service trigger the messenger service to send a message, select Send a Network Message to and type the name of the computer on which the alert message should be displayed.

  7. To run a counter log when an alert occurs, select Start Performance Data Log and specify the counter log you want to run.

  8. To have a program run when an alert occurs, select Run This Program and type the file path and name or click Browse to locate the file. When an alert occurs, the service creates a process and runs the specified command file. The service also copies any command-line arguments you define to the command line that is used to run the file. Click Command Line Arguments and select the appropriate check boxes for arguments to include when the program is run.

To start or stop a counter log, trace log, or alert manually, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and click Counter Logs, Trace Logs, or Alerts.

  3. In the details pane, right-click the name of the log or alert you want to start or stop, and click Start to begin the logging or alert activity you defined, or click Stop to terminate the activity.

    Note

    There may be a slight delay before the log or alert starts or stops, indicated when the icon changes color (from green for started to red for stopped, and vice versa).

To remove counters from a log or alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Counter Logs or Alerts.

  3. In the details pane, double-click the name of the log or alert.

  4. Under Counters, click the counter you want to remove, and then click Remove.

To view or change properties of a log or alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts.

  3. Click Counter Logs, Trace Logs, or Alerts.

  4. In the details pane, double-click the name of the log or alert.

  5. View or change the log properties as needed.

To define start or stop parameters for a log or alert, follow these steps.

  1. Open Performance.

  2. Double-click Performance Logs and Alerts, and then click Counter Logs, Trace Logs, or Alerts.

  3. In the details pane, double-click the name of the log or alert.

  4. Click the Schedule tab.

  5. Under Start log, click one of the following options:

    • To start the log or alert manually, click Manually. When this option is selected, to start the log or alert, right-click the log name in the details pane, and click Start.

    • To start the log or alert at a specific time and date, click At, and then specify the time and date.

  6. Under Stop Log, select one of the following options:

    • To stop the log or alert manually, click Manually. When this option is selected, to stop the log or alert, right-click the log or alert name in the details pane, and click Stop.

    • To stop the log or alert after a specified duration, click After, and then specify the number of intervals and the type of interval (days, hours, and so on).

    • To stop the log or alert at a specific time and date, click At, and then specify the time and date. (The year box accepts four characters; the others accept two characters.)

    • To stop a log when the log file becomes full, select options as follows:

      • For counter logs, click When the Log File is Full. The file will continue to accumulate data according to the file-size limit you set on the Log Files tab (in kilobytes up to two gigabytes).

      • For trace logs, click When the n-MB Log File is Full. The file will continue to accumulate data according to the file-size limit you set on the Log Files tab (in megabytes).

    When setting this option, take into consideration your available disk space and any disk quotas that are in place. An error might occur if your disk runs out of disk space due to logging.

  7. Complete the properties as appropriate for logs or alerts:

    • For logs, under When a Log File Closes, select the appropriate option:

      • If you want to configure circular (continuous, automated) counter or trace logging, select Start a New Log File.

      • If you want to run a program after the log file stops (for example, a copy command for transferring completed logs to an archive site), select Run This Command. Also type the path and file name of the program to run, or click Browse to locate the program.

    • For alerts, under When An Alert Scan Finishes, select Start a New Alert Scan if you want to configure continuous alert scanning.

To delete a log or alert, follow these steps:

  1. Open Performance.

  2. Double-click Performance Logs and Alerts.

  3. Click Counter Logs, Trace Logs, or Alerts.

  4. In the details pane, right-click the name of the log or alert, and click Delete.

When you schedule a log to close at a specific time and date or close the log manually, the Start a New Log File option is unavailable.

Conclusion

This chapter discussed how to utilize System Monitor to assist you in performance testing applications and identifying system-level bottlenecks. We reviewed several sets of objects and counters you must monitor to find these system-level bottlenecks. Understanding System Monitor is critical to successful performance testing and analysis.
