Chapter 7. Performance Analysis of Managed Code

In this chapter, we will explore the performance considerations relevant to the .NET common language runtime environment (CLR). First we will review the CLR features that have the most influence on the performance of .NET Web applications, and go into detail about specific performance counters used to analyze typical .NET Web application behavior. Next, we will discuss two applications that Microsoft uses when profiling managed code: Compuware’s DevPartner Studio and Xtremesoft’s AppMetrics.

CLR and Performance

The common language runtime is the part of the .NET Framework that provides the management we refer to when we speak of .NET managed code. For .NET applications, CLR stands in for the Windows kernel, providing vital services such as loading, memory management and protection, exception handling, and the means to easily interoperate with other components and applications. In addition to reprising the features of a classic runtime environment, CLR also takes on the job of compiling .NET applications on the system where they will actually be running.

Microsoft’s reasons for creating a new runtime environment go beyond the scope of this book, but many of the particular features and trade-offs of CLR’s design are of immediate interest.

Microsoft Intermediate Language

The biggest difference between traditional applications and .NET applications is that .NET applications are not directly compiled into native instructions for the processor on which they will eventually run. Instead, .NET applications are compiled from any number of .NET languages (such as Visual Basic .NET, C++ .NET, or C#) into Microsoft Intermediate Language (MSIL), which is then packaged and distributed in the form of assemblies. An assembly is a file or set of files containing objects compiled into MSIL and a manifest that describes them.

Note

You can browse the contents of an assembly using the tool ildasm.exe, which Microsoft provides with the .NET Framework.
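
For example, to inspect a hypothetical assembly from the command line (the assembly name is illustrative):

ildasm MyAssembly.dll

This opens a browsable view of the assembly’s manifest, its classes, and the MSIL for each method.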

With this design, code represented as MSIL can be analyzed and managed by CLR. Its benefits include garbage collection, whereby CLR determines which objects in memory are no longer in use and automatically de-allocates them, and memory type safety, meaning that CLR knows how a given object in memory is meant to be accessed and can verify in advance that no executable code will misuse it. In addition, managed code simplifies interoperability between applications and components written in different languages.

The Just-in-Time Compiler

Code written in MSIL is never executed directly. Instead, CLR uses a built-in compiler called the Just-in-Time Compiler (JIT) to generate native machine instructions for execution.

Code is typically compiled only as needed. When a process calls a method for the first time, the JIT steps in and compiles the method on the spot. (If another process, or another instance of the same application, later calls the same method, the JIT must compile the method again for that process.) One part of this process is verification, in which CLR verifies that the code is safe, meaning it accesses objects in memory only as they are intended to be accessed. After the code is compiled, execution proceeds from the address where the generated native instructions are located. Finally, when the process terminates, the native instructions that were generated are discarded.

This process provides a huge performance advantage when measured against classic Web applications written using ASP. Classic ASP is interpreted, meaning that it carries the overhead cost of interpreting code as it goes along, never reducing that code to a more efficient compiled form the way ASP.NET does.

However, the case is not as clear cut when measured against classic compiled applications. Compiling code at run time, instead of ahead of time, obviously incurs a performance impact. Microsoft has taken measures to minimize the impact, and in a few cases, JIT compiled code can even outperform its unmanaged counterpart.

One performance benefit of compiling code at run time is that so much more is known about the operating environment at run time than the developer could possibly have known at design time. Certain optimizations may be available to the JIT based on the number of system processors and their individual features, as well as what other system resources are available and how they are being used at the time.

On the other hand, only a limited amount of optimization can be done before the time required to optimize the code has the potential to outweigh the benefit of optimization. Recognizing this, the JIT implements certain algorithms to avoid optimizations that are unlikely to save as much time as it costs to attempt them.

Note

If you’re interested in quantifying exactly how the JIT affects performance, you’ll find a number of helpful performance counters in the .NET CLR Jit performance object.

The Pre-JIT Alternative

Included with the .NET Framework is the tool ngen.exe, used to compile assemblies from MSIL into native instructions at the time they are installed, in a process referred to as Pre-JIT. At first glance, Pre-JITting looks like the best of all worlds—why compile at run time when the compiler can still benefit from knowing the details of the system at install time?
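
For example, to precompile a hypothetical assembly at install time, you would run something like the following (the assembly name is illustrative; later versions of the Framework use the ngen install syntax instead):

ngen MyAssembly.dll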

The truth is that the impact of JITting at run time is most noticeable when the application is first loaded. Since Web applications rarely reload, if ever, there’s little reason to Pre-JIT them. Another reason not to Pre-JIT is that you miss out on the optimizations made available by knowing the state of the system at run time.

Note

On the other hand, the JIT could afford to spend more time computing code optimizations at install time than it can at run time. The current version of .NET does not take advantage of this, but future versions may do so, possibly making Pre-JIT more suitable for Web-based applications.

The Life and Times of a .NET Web Application

Now that we’ve introduced JIT, we will explore some of the other ways CLR influences the performance of an application over the course of its execution. Bear in mind that as far as CLR is concerned, it does not matter which high-level programming languages the application components were written in. By the time CLR encounters them, they are either managed assemblies written in MSIL, or they are unmanaged code to be run outside of CLR.

Load Time—AppDomains

When CLR loads a new application, it is placed in a special memory area set aside for it called an AppDomain. Because CLR provides memory type safety, multiple AppDomains can safely cohabit within the same process. Applications in the same AppDomain function as a group in the sense that they can share data quickly and efficiently, and if the AppDomain is unloaded, all applications and assemblies loaded into that domain are unloaded together.
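
As an illustration, here is a minimal C# sketch of creating, using, and unloading an AppDomain (the assembly name is hypothetical):

using System;

class AppDomainDemo
{
    static void Main()
    {
        // Create a new AppDomain inside the current process.
        AppDomain domain = AppDomain.CreateDomain("WorkerDomain");

        // Run a hypothetical assembly inside the new domain.
        domain.ExecuteAssembly("HelloWorld.exe");

        // Unloading the domain unloads everything loaded into it, all at once.
        AppDomain.Unload(domain);
    }
}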

Run Time—Interoperability

As a .NET application runs, it may make calls into unmanaged code, such as COM components or standard Windows DLLs. Whenever execution of a thread passes between managed code and unmanaged code, a transition is said to occur. These transitions carry certain costs.

One cost of making a transition is that the arguments and return values being passed between the caller and callee must be marshaled. Marshaling is the process of arranging the objects in memory according to the expectations of the code that will process them. Naturally, data types such as strings and complex structures are more expensive to marshal than simple types like integers.

Note

In the case of strings, it is often necessary to convert them to different formats such as ANSI and Unicode. This is an example of an expensive marshalling operation.
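
To make the transition concrete, here is a minimal C# sketch of a call into unmanaged code whose string arguments CLR must marshal from .NET’s native Unicode format into the ANSI format this Win32 API expects:

using System;
using System.Runtime.InteropServices;

class InteropDemo
{
    // Each call below transitions from managed to unmanaged code, and the
    // string arguments are marshaled from Unicode to ANSI for MessageBoxA.
    [DllImport("user32.dll", CharSet = CharSet.Ansi)]
    static extern int MessageBoxA(IntPtr hWnd, string text, string caption, uint type);

    static void Main()
    {
        MessageBoxA(IntPtr.Zero, "Hello from managed code", "Interop", 0);
    }
}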

Another cost of transitioning concerns CLR’s memory manager, known as the garbage collector. (The garbage collector will be discussed in more detail later in the chapter.) Whenever a transition into unmanaged code occurs, CLR must identify all the objects referenced by the call to unmanaged code, to ensure the garbage collector does not move them and thereby disrupt the unmanaged thread. Objects that have been identified as possibly in use by unmanaged code are said to be pinned.

Note

Obviously, the most desirable behavior for an application is to minimize the number of transitions needed to do a given amount of work. When testing, use the # of marshalling counter in the .NET CLR Interop performance object to locate areas where application threads are repeatedly transitioning between modes and doing only a small amount of work before transitioning back.

Run Time—Garbage Collection

One of CLR’s most prominent features is automatic memory management, better known as garbage collection. Rather than requiring developers to implement their own memory management, CLR automatically allocates memory for objects when they are created, and periodically checks to see which objects the application is done using. Those objects that are no longer in use are marked as garbage and collected, meaning that the memory they occupy is made available for use by new objects.

Generations and Promotion

Naturally, garbage collection needs to be fast, since time spent managing memory comes at the expense of time spent letting the application do its job.

One assumption about memory management that has withstood considerable scrutiny can be summarized by simply saying that the vast majority of objects are usually needed for only a short amount of time. Microsoft’s garbage collector (GC) makes the most of this by sorting objects into three categories, or generations, numbered 0, 1, and 2. Each generation has a heap size, which refers to the total number of bytes that can be occupied by all objects in that generation. These heap sizes change over the course of an application’s execution, but their initial sizes are usually around 256 KB for generation 0, 2 MB for generation 1, and 10 MB for generation 2.

Objects in generation 0 are the youngest. Any time an application creates a new object, the object is placed in generation 0. If there is not enough room on the generation 0 heap to accommodate the new object, then a generation 0 garbage collection occurs. During a collection, every object in the generation is examined to see if it is still in use. Those still in use are said to survive the collection, and are promoted to generation 1. Those no longer in use are de-allocated. You will notice that the generation 0 heap is always empty immediately after it is collected, so there is always room to allocate a new object (unless the system is out of memory, as we discuss below).

Note

You may wonder what happens if a new object is so large that its size exceeds the space available on the generation 0 heap all by itself. Objects larger than a threshold (85,000 bytes in released versions of the .NET Framework) are allocated on a special heap all their own, known as the large object heap. You’ll find performance counters to track the large object heap size in the .NET CLR Memory performance object.

In the course of promoting objects from generation 0 to generation 1, the GC must check to see if there is room to store the promoted objects in generation 1. If there is enough room on the generation 1 heap to accommodate objects promoted from generation 0, then the GC terminates, having collected only generation 0. If, on the other hand, the capacity of the generation 1 heap would be exceeded by promoting objects into it from generation 0, then generation 1 is collected as well. Just as before, objects that are no longer in use are de-allocated, while all surviving objects are promoted, this time to generation 2. You’ll notice that after generation 1 is collected, its heap is occupied only by those objects newly promoted from generation 0.

Just as generation 1 must sometimes be collected to make room for new objects, so must generation 2. Just as before, unused objects in generation 2 are de-allocated, but the survivors remain in generation 2. Immediately after a collection of generation 2, its heap is occupied by surviving as well as newly promoted objects.

Immediately following a collection, the surviving objects on a heap are rearranged so as to be adjacent to each other in memory, and the heap is said to be compacted.

Notice that any time generation 1 is collected, so is generation 0, and whenever generation 2 is collected, the GC is said to be making a full pass because all three generations are collected.
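
You can observe allocation and promotion directly from managed code; here is a minimal C# sketch (the values shown assume no other collections intervene):

using System;

class GenerationDemo
{
    static void Main()
    {
        object obj = new object();
        Console.WriteLine(GC.GetGeneration(obj)); // 0: new objects start in generation 0

        GC.Collect(); // obj is still referenced, so it survives and is promoted
        Console.WriteLine(GC.GetGeneration(obj)); // 1

        GC.Collect(); // a full pass; obj survives again
        Console.WriteLine(GC.GetGeneration(obj)); // 2
    }
}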

As long as only a few objects need to be promoted during a collection, the garbage collector is operating efficiently, making the most memory available with the least amount of work. To optimize the likelihood that it will operate efficiently, the garbage collector is also self-tuning, adjusting its heap sizes over time according to the rate at which objects are promoted. If too many objects are being promoted from one heap to another, the GC increases the size of the younger heap to reduce the frequency at which it will need to collect that heap. If, on the other hand, objects are almost never promoted out of a heap, this is a sign that the GC can reduce the size of the heap and improve performance by reducing the application’s working set.

The exception here is generation 2: since objects are never promoted out of generation 2, the GC’s only choice is to increase the size of the generation 2 heap when it starts getting full. If your application’s generation 2 heap grows too steadily for too long, this is probably a sign that the application should be reviewed for opportunities to reduce the lifetime of objects. When generation 2 can no longer accommodate promoted objects, the garbage collector cannot allocate space for new objects, and attempts to create new objects will cause a System.OutOfMemoryException.

The GC also attempts to keep the size of the generation 0 heap within the size of the system’s L2 cache. This keeps memory I/O costs to a minimum during the most frequent collections. When monitoring your application, it may be helpful to see if it allows the GC to take advantage of this optimization.

Pinned Objects

As mentioned earlier, pinned objects are those that have been marked as possibly in use by threads executing unmanaged code. When the GC runs, it must ignore pinned objects, because changing a pinned object’s address in memory (when compacting or promoting it) would cause severe problems for the unmanaged thread. Pinned objects therefore survive any collection that occurs while they are pinned.

When monitoring application performance, pinned objects indicate memory that cannot be managed or reclaimed by the garbage collector. Pinned objects are usually found in places where the application is using significant amounts of unmanaged code.
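
For illustration, here is a minimal C# sketch that pins a buffer explicitly with a GCHandle, the kind of operation interop marshaling performs on your behalf (the buffer and its use are hypothetical):

using System;
using System.Runtime.InteropServices;

class PinningDemo
{
    static void Main()
    {
        byte[] buffer = new byte[1024];

        // Allocating a pinned handle prevents the GC from moving the buffer
        // until the handle is freed; it also registers in the # GC Handles counter.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr address = handle.AddrOfPinnedObject();
            // ... pass address to unmanaged code here ...
        }
        finally
        {
            handle.Free(); // always free the handle, or the buffer stays pinned
        }
    }
}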

Finalization

Some objects might store references to unmanaged resources such as network sockets or mutexes. Since de-allocating such an object would destroy the last reference to the unmanaged resource without releasing it, developers can specify that the GC must give the object a chance to clean up after itself before it is de-allocated, in a process called finalization.

Finalization carries several performance costs. For example, objects awaiting finalization cannot be de-allocated by the garbage collector until they are finalized. Moreover, if an object pending finalization references other objects, then those objects are considered to be in use, even if they are otherwise unused. In contrast to the garbage collector, the programmer has no way to directly control the finalization process. Since there are no guarantees as to when finalization will occur, it is possible for large amounts of memory to become tied up at the mercy of the finalization queue.
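
One common way to limit these costs, sketched below, is to give callers a deterministic cleanup path so that the finalizer serves only as a safety net (the class and its resource handle are hypothetical):

using System;

class SocketHolder : IDisposable
{
    IntPtr handle; // stand-in for an unmanaged resource such as a socket

    public void Dispose()
    {
        ReleaseHandle();

        // The resource is already released, so tell the GC not to finalize
        // this object; it can now be collected without an extra promotion.
        GC.SuppressFinalize(this);
    }

    ~SocketHolder() // finalizer: runs only if Dispose was never called
    {
        ReleaseHandle();
    }

    void ReleaseHandle()
    {
        if (handle != IntPtr.Zero)
        {
            // ... release the unmanaged resource here ...
            handle = IntPtr.Zero;
        }
    }
}

class Demo
{
    static void Main()
    {
        SocketHolder holder = new SocketHolder();
        holder.Dispose(); // deterministic cleanup; no finalization cost
    }
}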

When a garbage collection occurs, objects pending finalization are promoted instead of collected, and tracked by the Finalization Survivors counter in the .NET CLR Memory performance object. Objects referenced by finalization survivors are also promoted, and tracked by the Promoted Finalization counters in the .NET CLR Memory performance object.

When monitoring an application that uses objects that require finalization, it is important to watch out for excessive use of memory by objects that are pending finalization directly or otherwise.

Differences Between Workstation and Server GC

Whenever a collection occurs, the GC must suspend execution of those threads that access objects whose locations in memory will change as they are promoted or compacted. Choosing the best behavior for the GC depends on the type of application.

Desktop applications that interact directly with individual users tend to allocate far fewer memory objects than Web-based applications that serve hundreds or even thousands of users. For desktop applications, minimizing the latency of any individual garbage collection is a higher priority than optimizing the rate at which memory is reclaimed; for server applications, the opposite is true.

Therefore, Microsoft implements the GC in two different modes. Note that the best GC is not chosen automatically: CLR will use the Workstation GC (mscorwks.dll) unless the developer specifies that the application requires the Server GC (mscorsvr.dll) instead.

Note

In our experience with most Web application scenarios, we have found that the Server GC outperforms the Workstation GC.
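
In version 1.x of the .NET Framework, the Server GC can be selected only by an unmanaged host process (ASP.NET, for example, loads it for its worker process on multiprocessor machines). Beginning with version 2.0, an application configuration file can request it directly; a minimal sketch:

<configuration>
  <runtime>
    <!-- Requests the Server GC; recognized by .NET Framework 2.0 and later -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>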

Run Time—Exceptions

Whenever a method encounters a situation it can’t deal with in the normal course of execution, it creates an exception object that describes the unexpected condition (such as out of memory or access denied). The exception is then thrown, meaning the thread signals CLR that it is in a state of distress, and cannot continue executing until the exception has been handled.

When an exception is thrown, the manner of its disposal will depend on whether or not the application has code to handle the exception. Either CLR will halt the application because it cannot handle the exception gracefully, or CLR will execute the appropriate exception handler within the application, after which the application may continue execution. (An application could be designed to terminate gracefully after handling certain exceptions; in that case we would say that the application continues, if only to terminate as intended.)

Suppose method main() calls method foo(), which in turn calls method bar(), and bar() throws a System.FileNotFoundException. The CLR suspends execution of the thread while it looks for an exception filter that matches the thrown exception. Method bar() might have an exception handler whose filter specifies System.DivideByZeroException. The FileNotFoundException would not match this filter, and so CLR would continue in search of a matching exception filter. If none of the exception filters specified by function bar() matched the exception, the system would recurse up the call stack from bar() to the function that called it, in this case, foo(). Now, suppose foo() has an exception handler that specifies System.FileNotFoundException. The exception handler in foo() will execute, thereby catching the exception.

When we speak of throw-to-catch depth, we refer to the number of layers up the call stack CLR must traverse to find an appropriate exception handler. In our hypothetical example, the throw-to-catch depth was 1. If bar() had caught its own exception, the depth would have been 0. And if CLR had needed to recurse all the way up to main(), the depth would have been 2.

Once an exception has been caught, execution of the application resumes inside a block of code called a finally block. The purpose of a finally block is to clean up after whatever operations might have been interrupted by the exception. Finally blocks are optional, but every finally block that exists between the method that threw the exception and the method that caught it will be executed before the application resumes regular execution.

Therefore, in our example above, if functions foo() and bar() each implement a finally block, both will execute before program flow returns to normal. If the developer chose not to write a finally block for bar(), but did write one for foo(), the finally block in foo() would still execute.
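
Here is the scenario above as a minimal C# sketch, with Main(), Foo(), and Bar() standing in for main(), foo(), and bar():

using System;
using System.IO;

class Program
{
    static void Main()
    {
        Foo();
    }

    static void Foo()
    {
        try
        {
            Bar();
        }
        catch (FileNotFoundException)
        {
            // Matches the exception thrown in Bar: a throw-to-catch depth of 1.
        }
        finally
        {
            // Foo's finally block runs before normal execution resumes.
        }
    }

    static void Bar()
    {
        try
        {
            throw new FileNotFoundException();
        }
        catch (DivideByZeroException)
        {
            // The filter does not match, so CLR keeps searching up the stack.
        }
        finally
        {
            // Bar's finally block still executes, even though Bar did not
            // catch the exception.
        }
    }
}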

Exceptions in Unmanaged Code

When managed code calls unmanaged code, and that unmanaged code throws an exception which it does not catch, the exception is converted into a .NET exception, and CLR becomes involved in attempting to handle it. As with any other .NET exception, CLR will halt the application if the exception is not handled.

Unmanaged exceptions, which do not concern CLR, won’t be tabulated by any of the .NET CLR performance counters. On the other hand, .NET exceptions that originated in unmanaged code will be tabulated by the # of Exceps Thrown counters once they are converted. When tabulating .NET exceptions converted from unmanaged code, the Throw to Catch Depth performance counter will only count stack frames within the .NET environment, causing the throw-to-catch depth to appear shorter than it actually is.

Exceptions and Performance

Exception handling is expensive. Execution of the involved thread is suspended while CLR recurses through the call stack in search of the right exception handler, and when it is found, the exception handler and some number of finally blocks must all have their chance to execute before regular processing can resume.

Exceptions are intended to be rare events, and it is assumed that the cost of handling them gracefully is worth the performance hit. When monitoring application performance, some people are tempted to hunt for the most expensive exceptions. But why tune an application for the case that isn’t supposed to happen? An application that disposes of exceptions quickly is still just blazing through exceptions instead of doing real work. Therefore, we recommend that you work to identify the areas where exceptions most often occur, and let them take the time they need so that your application can continue running gracefully.

.NET Performance Counters

Now that you have been introduced to those aspects of the .NET Framework that have a direct impact on the performance of your Web application, we will discuss some of the new .NET performance counters that allow you to measure the performance of the .NET Framework and your managed code. This section is not intended to discuss all of the counters; doing so would require far more than a chapter of material. Instead, we set out to present those counters that would give you the most bang for your buck. The counters presented below, in our opinion, are the ones that can tell the most about your application in the shortest amount of time. Note that this subset of counters does not represent all of the requirements for monitoring the performance of your .NET Web application. Depending on your system architecture, you may find it necessary to monitor other .NET related counters along with counters not specific to .NET.

Tip

If you are interested in capturing performance counter data as part of an application that you are developing, you can use the System.Diagnostics.PerformanceCounter class, which is available from any managed language.
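
For example, here is a minimal C# sketch that samples one of the memory counters discussed below (the instance name aspnet_wp is illustrative; substitute your application’s process name):

using System;
using System.Diagnostics;

class CounterSample
{
    static void Main()
    {
        // Sample the generation 0 heap size of the ASP.NET worker process.
        PerformanceCounter gen0 = new PerformanceCounter(
            ".NET CLR Memory", "Gen 0 heap size", "aspnet_wp");
        Console.WriteLine("Gen 0 heap size: {0} bytes", gen0.NextValue());
    }
}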

.NET CLR Memory Object

All of the counters found under this object relate to memory usage by the .NET Framework. Whether you are running a .NET Web application or a .NET desktop application, these counters will help you understand how the framework is using the system’s memory resources. It is important to note that if your application consists of both managed and unmanaged code, these counters will only draw a partial picture of memory usage, since they do not track memory use by unmanaged code even though it may be running as part of the same application.

# GC Handles Performance Counter

The # GC Handles performance counter displays the current number of garbage collection handles in use. Garbage collection handles are handles to resources outside of CLR and the managed environment. A single handle may occupy only a tiny amount of memory in the managed heap; however, the unmanaged resource it represents could actually be very expensive. For instance, if a particular user scenario required the allocation of an unmanaged resource such as a network socket, then each time a user executed that scenario, an object would be created along with a corresponding GC handle. Under heavy load, specifically when this scenario is exercised repeatedly, your Web site would create a large number of GC handles, possibly causing your application to become unstable.

# Gen 0 Collections

This and the following two counters are important for understanding how efficiently memory is being cleaned up. The # Gen 0 Collections counter displays the number of times generation 0 objects have been garbage collected since the start of your application. Each time an object that is still in use is garbage collected at generation 0, it is promoted from generation 0 to generation 1. As we described earlier, one scenario in which generation 0 promotions occur is if your Web application needs to create a new object whose required memory resources exceed the resources available at generation 0. In that case an object remaining in use at the generation 0 level would be promoted, freeing the resources needed for the newest object. The rate of Gen 0 collections will usually correspond with the rate at which the application allocates memory.

# Gen 1 Collections

This counter displays the number of times the generation 1 heap has been collected since the start of the application. You should monitor this counter in the same fashion as the # Gen 0 Collections counter. If you see numerous collections at generation 1, it is an indication that there is not sufficient room on the generation 1 heap for the objects being promoted from generation 0. As a result, objects will be promoted from generation 1 to generation 2, leading to high resource utilization at the generation 2 level.

# Gen 2 Collections

This counter displays the number of times generation 2 objects have been garbage collected since the start of the application. Of the three counters reporting generation-level collection information (# Gen 0 Collections, # Gen 1 Collections, and # Gen 2 Collections), # Gen 2 Collections is the most important to monitor. With Web applications, if you are seeing high activity for this counter, the aspnet_wp process could be forced to restart: ASP.NET recycles the worker process when its memory consumption, driven largely by growth of the generation 2 heap, exceeds the limit configured for the process (the memoryLimit attribute in the <processModel> section of machine.config).

# Total Committed Bytes

This counter displays the amount of virtual memory committed by your application. It is obviously ideal for an application to require as little memory as possible, thereby reducing the amount of work required for the garbage collector to manage it.

% Time in GC

This counter indicates the amount of time spent by the garbage collector on behalf of an application to collect and compact memory. If your application is not optimized, you will see the garbage collector working constantly, promoting and deleting objects. This time spent by the garbage collector reflects its use of critical processor and memory resources.

Gen 0 heap size

The Gen 0 heap size counter displays the maximum bytes that can be allocated in generation 0. The generation 0 size is dynamically tuned by the garbage collector; therefore, the size will change during the execution of an application. A reduced heap size reflects that the application is economizing on memory resources, thereby allowing the GC to reduce the size of the application’s working set.

Gen 0 Promoted Bytes/sec

This counter displays the number of bytes promoted per second from generation 0 to generation 1. Even though your application may exhibit a high number of promotions, you may not see a high number of promoted bytes per second if the objects being promoted are extremely small in size. You should monitor the Gen 1 heap size counter along with this counter in order to verify whether promotions are resulting in poor resource allocation at the generation 1 level.

Gen 1 heap size

This counter displays the current number of bytes in generation 1. Unlike its Gen 0 heap size counterpart, the Gen 1 heap size counter does not display the maximum size of generation 1. Instead, it displays the current amount of memory allocated to objects at the generation 1 level. When monitoring this counter, you will want to monitor the # Gen 0 Collections counter simultaneously. If you find a high number of generation 0 collections occurring, you will find the generation 1 heap size increasing along with them. Eventually, objects will need to be promoted to generation 2, leading to inefficient memory utilization.

Gen 1 Promoted Bytes/sec

Gen 1 Promoted Bytes/sec displays the number of bytes promoted per second from generation 1 to generation 2. Similar to the approach for the Gen 0 Promoted Bytes/sec counter, you should monitor the Gen 2 heap size counter when monitoring the Gen 1 Promoted Bytes/sec counter. The two counters will provide you with a good indication of how much memory is being allocated for objects being promoted from generation 1 to generation 2.

Gen 2 heap size

This counter displays the current number of bytes in generation 2. When monitoring an application that is experiencing a high number of promotions from generation 1 to generation 2, the generation 2 heap size will increase since objects cannot be further promoted.

.NET CLR Loading

The following counters, found under the .NET CLR Loading performance object, when used alongside other counters such as % Processor Time, allow you to gain a more detailed understanding of how loading .NET applications, AppDomains, classes, and assemblies affects system resources.

Total AppDomains

This counter displays the peak number of AppDomains (application domains) loaded since the start of the application. As mentioned earlier, AppDomains are a secure and versatile unit of processing that CLR can use to provide isolation between applications running in the same process. AppDomains are particularly useful when you need to run multiple applications within the same process. In the case of a Web application, you may find yourself having to run multiple applications within the aspnet_wp process. From a performance standpoint, understanding the number of AppDomains currently running on the server is critical because each time you create or destroy an AppDomain, system resources are taxed. Just as important is the need to understand the type of activity occurring between AppDomains. For example, if your applications must cross AppDomain boundaries during execution, this will result in context switches. Context switches (as discussed in Chapter 4) are expensive, particularly when a server is experiencing 15,000 context switches per second or more.

Total Assemblies

This counter displays the total number of assemblies loaded since the start of the application. Assemblies can be loaded as domain-neutral when their code can be shared by all AppDomains, or they can be loaded as domain-specific when their code is private to the AppDomain. If the assembly is loaded as domain-neutral from multiple AppDomains, then this counter is incremented once only. You should be aware of the total number of assemblies loaded on the server because of the resources needed to create and destroy them. Sometimes developers will load assemblies that aren’t really required by the application. Alternatively, developers may not be aware of how many assemblies they are truly loading because they are making an indirect reference.

Total Classes Loaded

This counter displays the total number of classes loaded in all of the assemblies since the start of the application. Keep in mind that unless a class is static, it has a constructor and must be instantiated before its methods can be called; repeatedly instantiating classes is more resource intensive than creating an object once and calling that object’s methods.

.NET CLR LocksAndThreads

When tracking down a bottleneck that could be related to thread or process contention, the .NET CLR LocksAndThreads performance object is the best place to start. Here, we describe those counters under the .NET CLR LocksAndThreads performance object that can help rule out possible contention issues quickly and efficiently.

Contention Rate/sec

This counter displays the number of times per second that threads in the run time attempt to acquire a managed lock unsuccessfully. It should be noted that under conditions of heavy contention, threads are not guaranteed to obtain locks in the order they’ve requested them.
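
As an illustration, the following minimal C# sketch manufactures the kind of contention this counter reports (the sleep inside the lock is a stand-in for real work):

using System;
using System.Threading;

class ContentionDemo
{
    static readonly object gate = new object();

    static void Worker()
    {
        // Ten threads compete for a single managed lock; unsuccessful
        // first attempts to acquire it register as contention.
        lock (gate)
        {
            Thread.Sleep(10); // hold the lock long enough to force contention
        }
    }

    static void Main()
    {
        for (int i = 0; i < 10; i++)
        {
            new Thread(new ThreadStart(Worker)).Start();
        }
    }
}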

Total # of Contentions

This counter displays the total number of times threads in CLR have attempted to acquire a managed lock unsuccessfully.

Current Queue Length

This counter displays the total number of threads currently waiting to acquire some managed lock. If you see that the queue length continues to grow under constant application load, you may be dealing with an irresolvable lock rather than a resolvable lock. The difference between irresolvable and resolvable locks is that irresolvable locks are caused when an error within the application code’s logic makes it impossible for the application to release a lock on an object.

.NET CLR Exceptions

Applications that throw excessive numbers of exceptions can be extremely resource intensive. Ideally, an application should not throw any exceptions. However, many times developers will intentionally throw exceptions as part of the error checking process. This exception-generating code should be cleaned up before taking an application into production. Here we have listed two counters found under the .NET CLR Exceptions object. If you choose to monitor only one of these, you should pay most attention to the # of Exceps Thrown/sec counter. If you see this counter exceed 100 exceptions per second, your application code warrants further investigation.

# of Exceps Thrown

This counter displays the total number of exceptions thrown since the start of the application. These include both .NET exceptions and unmanaged exceptions that are converted into .NET exceptions (for example, a null pointer reference exception in unmanaged code would get rethrown in managed code as a .NET System.NullReferenceException), but excludes exceptions which were thrown and caught entirely within unmanaged code. This counter includes both handled and unhandled exceptions. Exceptions that are rethrown will be counted again. This counter is an excellent resource when you are attempting to determine what portion of the code may be generating a high number of exceptions. You could do this by walking through the application while simultaneously monitoring this counter. When you find a sudden jump in the exception count, you can go back and review the code that was executed during that portion of the walkthrough in order to pin down where an excessive number of exceptions are thrown.

# of Exceps Thrown /sec

This counter displays the number of exceptions thrown per second. These include both .NET exceptions and unmanaged exceptions that get converted into .NET exceptions but excludes exceptions that were thrown and caught entirely within unmanaged code. This counter includes both handled and unhandled exceptions. As mentioned earlier, if you monitor a consistently high number of exceptions per second thrown (100 or more), you will need to review the source code in order to determine why and where these exceptions are being thrown.

.NET CLR Security

Depending on how much emphasis you place on the security of your Web application, you will find the following set of counters to be either extremely active or hardly used. Security checks should be performed when truly necessary: conducting security checks of your application is critical even if they affect application performance. However, using the security features of the .NET Framework unwisely will not only create security holes in your application, but will also create performance issues stemming from poor application design.

# Link Time Checks

Many times you will monitor a counter and see excessive activity for that counter. This activity can be deceiving unless you truly understand what is going on with the counter. The # Link Time Checks counter is just one example. The count displayed is not indicative of serious performance issues, but it is indicative of security system activity. This counter displays the total number of link-time Code Access Security (CAS) checks since the start of the application. A link-time CAS check occurs when a caller makes a call to a callee demanding execution of an operation. The link-time check is performed once per caller and at only one level, thus making it less resource expensive than a stack walk.

% Time in RT checks

This counter displays the percentage of elapsed time spent performing runtime Code Access Security (CAS) checks since the last such check. CAS allows code to be trusted to varying degrees and enforces these varying levels of trust depending on code identity. This counter is updated at the end of a runtime security check; it represents the last observed value and is not an average. If this counter shows a high percentage, you will want to revisit what is being checked and how often. Your application may be executing stack walks of unnecessary depth (the Stack Walk Depth counter is discussed next). Another cause of a high percentage of time spent in runtime checks could be numerous link-time checks.

Stack Walk Depth

This counter displays the depth of the stack during the last runtime CAS check. A runtime CAS check is performed by walking the stack. As an example, suppose your application calls an object that has four methods (methods A through D). If your code calls method A, a stack walk depth of 1 occurs. However, if you call method D, which in turn calls methods C, B, and A, a stack walk depth of 4 occurs.
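
For instance, here is a minimal C# sketch of a demand that triggers such a stack walk (the file path is hypothetical):

using System.Security.Permissions;

class CasDemo
{
    static void Main()
    {
        ReadConfigFile();
    }

    static void ReadConfigFile()
    {
        // Demand triggers a runtime CAS check: CLR walks the stack of every
        // caller above this frame, so the Stack Walk Depth counter reflects
        // how deep the call chain was when the demand occurred.
        FileIOPermission permission = new FileIOPermission(
            FileIOPermissionAccess.Read, @"C:\MyApp\config.xml");
        permission.Demand();

        // ... read the file here ...
    }
}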

Total Runtime Checks

This counter displays the total number of runtime CAS checks performed since the start of the application. Runtime CAS checks are performed when a caller makes a call to a callee demanding a particular permission. The runtime check is made on every call by the caller, and the check is done by examining the current thread stack of the caller. Utilizing information from this counter and that of the Stack Walk Depth counter, you can gain a good idea of the performance penalty you are paying for executing security checks. A high number for the total runtime checks along with a high stack walk depth indicates performance overhead.

Profiling Managed Code

In this next section we’ll be discussing how to instrument and profile your managed (and your unmanaged) code using Compuware’s DevPartner Studio 7.0. There are many good profilers available on the market, but we are using DevPartner Studio as an example because it is the profiler of choice used by the ACE Team at Microsoft.

Using Compuware DevPartner Studio

Compuware Corporation’s DevPartner Studio Professional Edition can assist you in creating reliable, high-performance applications. The performance analysis component makes it easy to pinpoint performance bottlenecks anywhere in your code, third party components, or operating system, even when source code is not available. An evaluation version of DevPartner Studio 7.0 Professional Edition can be obtained at http://www.compuware.com/products/devpartner/.

Profiling with DevPartner Studio

In many applications, a relatively small portion of the code is responsible for much of the application’s performance. The challenge is to quickly identify which parts of the code are the most likely candidates for changes that can improve performance, so developers can focus their limited time on tuning efforts that have a high probability of improving overall performance.

The performance analysis capability in DevPartner Studio measures the frequency of execution and execution time, down to the individual line of code, for a wide variety of components: Visual Basic, Visual C++, Visual Basic .NET, C#, and native C/C++, as well as Web applications using ASP.NET, JScript, and VBScript under IE or IIS.

Collecting performance data is straightforward with DevPartner Studio. For managed code, simply run your application with Performance Analysis enabled. For unmanaged code, enable the Instrumentation Manager and rebuild. While you are exercising your application, you can optionally use the session controls (start, stop, and clear) to focus your data collection on areas of specific interest.

One possible methodology is to collect performance data at the method level only (rather than the line level), avoiding the instrumentation step for the moment, assessing which methods are most expensive, and then running line-level data collection on the methods of most interest. This technique points you very quickly in the direction of which subset of methods are the most likely candidates for improvement.

Note

To avoid collecting data for all system (nonsource) files, check Exclude System Images on the DevPartner Performance and Coverage Exclude Images options page. Once you optimize your source code, turn off this option so you can examine how your application uses system code, especially if you are using the .NET Framework.

Profiling Session Window

Once you are done executing your application, performance data is displayed in a Session window, as shown in Figure 7-1. The filter pane on the left lists the source files and system images used during the session, along with the percentage of time spent in each file during execution. You can quickly browse to any file and view the methods contained within that file. In this example, note that we have an application with a mixture of native C++ and C# code. You can also select useful collections of files or methods (such as the Top 20 methods) in order to focus attention on which code is using the most execution time or is called most frequently. The Session data pane on the right provides the detailed method list and associated source code, along with overall summary information.

DevPartner Studio Performance Analysis Session Window
Figure 7-1. DevPartner Studio Performance Analysis Session Window

One way to proceed is to sort the session data by the average time spent in each method, and then to begin to examine the most expensive methods for possible improvements. By selecting a method in the Session data pane, you can examine the source code in more detail. Figure 7-2 provides example source code, which is annotated with the number of times each line of code has executed, the percentage of time spent in called (children) functions, and the total time spent executing the line of code. The most expensive line is also highlighted, which could be your starting point for candidate code to further tune.

DevPartner Studio Performance Analysis Source Code Window
Figure 7-2. DevPartner Studio Performance Analysis Source Code Window

Profiling Method Details

Another approach to improving performance is to explore the relationships between the functions called in your application. By selecting a method, the details of that method (including what other methods it calls and what methods call it) are displayed, as shown in Figure 7-3. The top section of the display identifies the selected method and contains performance data for the method. The Parents section lists all methods that called the selected method, and the Children section lists all methods called by the selected method. Using the Method Details view, you can quickly understand method calling relationships and costs, and traverse the calling sequence to better understand both how your own code works and the impact of calls to the supporting infrastructure.

DevPartner Studio Performance Analysis Method Details Window
Figure 7-3. DevPartner Studio Performance Analysis Method Details Window

Working with Distributed Applications

DevPartner Studio provides performance data gathering and reporting capabilities for the distributed application environment, including Web-based applications. It provides end-to-end profiling for distributed, component-based applications. For distributed Web-based applications, DevPartner collects data for Web applications created in Visual Studio .NET, as well as applications that use the scripting languages supported by IE and IIS.

When you run a distributed application, DevPartner can collect data for each separate local or remote process, including server session data, and correlate the session data. Data correlation combines session data from multiple processes into a single session file that you can view to analyze results for the entire application. DevPartner automatically correlates the session data between different processes when there are

  • DCOM-based calls between methods in different processes

  • HTTP requests between IE as client and IIS as server

To preserve the relationship between the methods of DCOM objects or the relationship between HTTP client and server (IE and IIS), DevPartner automatically correlates the data from those sessions. It then combines the correlated data with the client session data into a single session file. You can view the session file with the correlated data and navigate between calling and called methods in the Method Details view. You can use the Correlate Performance Files command by choosing DevPartner from the Tools menu to manually combine data from different session files when there is no COM-based relationship or client/server relationship between IE and IIS.

Effective Performance Analysis for .NET

The .NET Framework is particularly rich and complex, and you can accomplish a lot with a few lines of code. This offers great opportunities to developers, but can make it difficult to tune application performance. For example, you may discover that 95 percent of your application’s execution time is spent in the .NET Framework. How do you improve performance in that case? Here are some basics to make the performance analysis process using DevPartner Studio more productive.

Understand What You Want to Measure

Consider how your application behaves before you begin collecting performance data. For example, if you are profiling a Web services or ASP.NET application, think about how Web caching will affect your results. If your test run inputs the same data repeatedly, your application will fetch pages from the cache, skewing the performance data. In such a case, you could take pains to ensure variable input data, or more simply, edit the machine.config file to turn off caching while you test. Comment out the line that reads:

<add name="OutputCache" type="System.Web.Caching.OutputCacheModule"/> 
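
After you comment it out, the line will read:

<!-- <add name="OutputCache" type="System.Web.Caching.OutputCacheModule"/> -->

Remember to restore the line when you are done testing, since output caching substantially improves throughput in production.
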
Understand Start-up Costs

The .NET Framework performs many one-time initializations. To prevent these from skewing performance results, warm up the application by exercising all the features you want to profile, and then clear the data using the Clear button on the Session Control toolbar. Next, run a test that exercises the same features to get a more accurate performance picture.

Understand .NET Framework Costs

Use % with Children on the Method List or Source tab to see how much time you are spending in the .NET Framework. Use the Child Methods window in Method Details to drill into the .NET Framework to understand which calls are expensive and why. Rework the application to do less work or to call the .NET Framework less often.

Collect Complete Data for Distributed Applications

When you analyze performance for a Web application, a multi-tier client/server application, or an application that uses Web services, include all remote application components in the analysis. Use DevPartner Performance and Coverage Remote Agent to configure performance data collection on remote systems. If your application uses native C/C++ components, instrument the components for performance analysis before collecting data. Of course, the recommendations regarding awareness of application behavior, start-up costs, and .NET Framework costs apply equally to collecting data for server-side components.

Using AppMetrics to Monitor .NET Enterprise Services Components

COM+ provides COM (unmanaged code) components with services to let applications easily achieve higher scalability and throughput. For .NET Framework (managed code) components, these same services are also available through Enterprise Services. These services include transaction coordination across distributed resources, object pooling, role-based security, etc. You can set up your managed and unmanaged code components to use these services.

AppMetrics for Transactions (AppMetrics) is a monitoring system for Enterprise Services applications that we use internally to profile the performance of heavily used COM+ components. Many of the application groups we deal with find themselves having to wrap their managed code in COM+ components in order to communicate with legacy systems. Capturing performance data produced by AppMetrics enables us to easily determine whether or not the COM+ application may be the cause of poor application performance.

You can use AppMetrics to monitor managed and unmanaged Enterprise Services components. It is designed to monitor applications running in either pre-production or production environments. AppMetrics monitors applications without any code instrumentation.

AppMetrics Manager and Agent Monitors

To reduce the effects of monitoring on system and application performance, AppMetrics uses a Manager and Agent setup. In this arrangement, AppMetrics runs its Agent on the application server. The Agent collects data about the Enterprise Services components while using a minimal amount of system resources. This lets AppMetrics capture more precise data about the applications, whether they run under simulated or real load.

The Agent sends its data to a Manager, which resides on a separate machine. Based on this data, the Manager generates metrics about the applications and their components. These metrics include the total number of activations for component instances, the rates at which the instances finish, and the actual durations of individual instances. You can use these metrics to find any bottlenecks that may occur in your applications during runtime.

The Manager machine stores the metrics in a database. From here, you can generate reports about the application processes and their component instances.

Setting Up AppMetrics Manager and Agent

To evaluate your managed and unmanaged code components while they run under Enterprise Services, set up AppMetrics within your application system with the following tasks:

  1. On the AppMetrics Manager machine, add a Manager monitor.

  2. Add an Agent monitor to the Manager monitor. In effect, this creates the Agent monitor on the application server.

  3. Select the application(s) to be monitored on the application server.

    Diagnostics Application Configuration Panel
    Figure 7-4. Diagnostics Application Configuration Panel
  4. Within the selected applications, you can set up specific components for monitoring.

Pre-Production Monitoring in AppMetrics

For pre-production monitoring, AppMetrics offers a type of analysis that it calls the Diagnostics Monitor. The Diagnostics Monitor records details about individual component and transaction activity in a running application.

The Diagnostics Monitor displays the metrics in a drill-down report. This report shows durations for each active component in the application. It also shows the logical chain of method calls between the components. This is important because merely viewing the metrics for each component in isolation from other components does not tell the whole story. It says nothing about how a component may make calls to other components.

With the information about the method call chains between components in the drill-down report, you can begin to see the relationships between components during runtime. It lets you analyze the overall response time of a component based on its constituent parts.

From this analysis, you can determine if the bottleneck occurs either in the root component object or somewhere further down the method call chain in a subordinate component.

Figure 7-5 shows a snippet from an AppMetrics for Transactions report, which illustrates the following:

  • The logical chain of method activity.

  • An FMStocks7.DAL.Broker component that invokes a subordinate FMStocks7.GAM.7 component instance.

  • The CreditAccountBalance method, which was invoked on the subordinate component.

  • The unique naming convention for cross-application calls. When the code makes such calls, they are preceded with the other application name plus a colon. In this case, the FMStocks7.GAM.7 component resides in a different application from the FMStocks7.DAL.Broker component.

Diagnostics drill-down report
Figure 7-5. Diagnostics drill-down report

Note

In addition to the information shown here, the Diagnostics drill-down report also provides actual start times, end times, and error codes for each item.

With the information about cross-application calls in these reports, you can evaluate whether the subordinate component deserves to be in the other application. Cross-application calls can be expensive in terms of CPU utilization. They can also prolong the duration of the method call itself.

If one component makes several cross-application calls during a single transaction into a subordinate component, it may be advisable to make the subordinate component run in the same application process with the first component. You can accomplish this in several ways. For example, you can move the subordinate component into the same application with the first component. Alternatively, you can move the subordinate component into a library application.

Additional Diagnostics Information

Before an application goes to production, you will want to know its limitations. Figure 7-6 shows a snippet from an AppMetrics report that details method activity on the FMStocks7.GAM.7 component instances during a 10-minute period of intense load.

Observe that the average duration for the monitored methods is quite long and not all method calls succeeded during the period.

Diagnostics report on methods
Figure 7-6. Diagnostics report on methods

Observing the performance of a component over time can be quite revealing. Figure 7-7 shows a snippet from an AppMetrics for Transactions report, which illustrates component activity over a four-hour period of intense load.

Diagnostics report on components
Figure 7-7. Diagnostics report on components

Production Monitoring

To monitor Enterprise Services applications running in full production environments, AppMetrics offers the Production Monitor. This monitor generates metrics about the activity in Enterprise Services applications on an interval basis, but with lower overhead and information granularity than is available in the Diagnostics Monitor. More specifically, the Production Monitor calculates totals and rates of activity about components and their instances, as shown in Figure 7-8, Figure 7-9, and Figure 7-10.

Active components chart
Figure 7-8. Active components chart
Component rate chart
Figure 7-9. Component rate chart
Component duration chart
Figure 7-10. Component duration chart

The Production Monitor can also generate application-process metrics, such as application starts, stops, and crashes. The AppMetrics runtime UI shows process metrics for each Enterprise Services application process. The following example shows resource utilization by the process corresponding to FMStocks7.GAM.

Application Process Runtime view
Figure 7-11. Application Process Runtime view

The Production Monitor also offers proactive monitoring through alerts. This means that you can set up thresholds of activity for a specific component metric, such as its average duration. If AppMetrics detects activity above these levels in your system, it can send alerts by email or SNMP. AppMetrics can also respond to alert conditions by invoking a custom COM component, where you can program an automated response to the condition.

The AppMetrics runtime UI shows metrics for each monitored Enterprise Services component. Figure 7-12 shows that the FMStocks7.DAL.Broker component instances are taking an average of over ten seconds from creation to completion. Since such numbers fall outside the specified thresholds, an alert will be triggered.

Application Process Runtime view
Figure 7-12. Application Process Runtime view

Conclusion

Understanding how to profile and interpret the performance of managed code is key to building scalable and robust .NET Web applications. Profiling of managed code can be done using System Monitor and the .NET performance counters. This counter information can then be used in conjunction with any code instrumentation that you may be doing. If you want to read more about writing high-performing scalable code, look for the upcoming Microsoft Press book, Writing Scalable Code by Simon Meacham and Mike Parkes.
