Performance Measurement and Tools

Before we dive into the specifics of the CLR and .NET, we need to understand performance measurement in general, as well as the many tools available to us. You are only as powerful as the tools in your arsenal, and this chapter attempts to give you a solid grounding and set the stage for many of the tools that will be discussed throughout the book.

Choosing What to Measure

Before deciding what to measure, you need to determine a set of performance requirements. The requirements should be general enough to not prescribe a specific implementation, but specific enough to be measurable. They need to be grounded in reality, even if you do not know how to achieve them yet. These requirements will, in turn, drive which metrics you need to collect. Before collecting numbers, you need to know what you intend to measure. This sounds obvious, but it is actually a lot more involved than you may think. Consider memory. You obviously want to measure memory usage and minimize it. But which kind of memory? Private working set? Commit size? Paged pool? Peak working set? .NET heap size? Large object heap size? Individual processor heaps to ensure they are balanced? Some other variant? For tracking memory usage over time, do you want the average for an hour, the peak? Does memory usage correlate with processing load size? As you can see, there are easily a dozen or more metrics just for the concept of memory alone. And we have not even touched the concept of private heaps or profiling the application to see what kinds of objects are using memory!
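To make the memory example concrete, here is a minimal sketch (not from the book's sample code) that prints just a handful of the many different "memory" numbers available for a single process via the Process class and the GC API. Each one is a legitimate answer to "How much memory am I using?", which is exactly why you must be specific.

using System;
using System.Diagnostics;

class MemoryMetrics
{
    static void Main()
    {
        using (Process p = Process.GetCurrentProcess())
        {
            // Each of these is a different, equally valid notion of "memory usage."
            Console.WriteLine($"Working Set:        {p.WorkingSet64:N0} bytes");
            Console.WriteLine($"Private Bytes:      {p.PrivateMemorySize64:N0} bytes");
            Console.WriteLine($"Virtual Bytes:      {p.VirtualMemorySize64:N0} bytes");
            Console.WriteLine($"Peak Working Set:   {p.PeakWorkingSet64:N0} bytes");
            Console.WriteLine($"Paged System Bytes: {p.PagedSystemMemorySize64:N0} bytes");

            // And that is before considering the managed heap at all:
            Console.WriteLine($"GC Total Memory:    {GC.GetTotalMemory(forceFullCollection: false):N0} bytes");
        }
    }
}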

Be as specific as possible when describing what you want to measure.

Story In one large server application I was responsible for, we tracked its private bytes (see the section on Performance Counters in this chapter for more information about various types of memory measurement) as a critical metric and used this number to decide when we needed to do things like restart the process before beginning a large, memory-intensive operation. It turned out that quite a large amount of those “private bytes” were actually paged out over time and not contributing to the memory load on the system, which is what we were really concerned with. We changed our system to measure the working set instead. This had the benefit of “reducing” our memory usage by a few gigabytes. (As I said, this was a rather large application.)

Once you have decided what you are going to measure, come up with specific goals for each of those metrics. Early in development, these goals may be quite malleable, even unrealistic, but should still be based on the top-level requirements. The point at the beginning is not necessarily to meet the goals, but to force you to build a system that automatically measures you against those goals.

Your goals should be quantifiable. A high-level goal for your program might state that it should be “fast.” Of course it should. That is not a very good metric because “fast” is subjective and there is no well-defined way to know you are meeting that goal. You must be able to assign a number to this goal and be able to measure it.

Bad: “The user interface should be responsive.”

Good: “No operation may block the UI thread for more than 20 milliseconds.”

However, just being quantifiable is not good enough either. You need to be very specific, as we saw in the memory example earlier.

Bad: “Memory should be less than 1 GB.”

Good: “Working set memory usage should never exceed 1 GB during peak load of 100 queries per second.”

The second version of that goal gives a very specific circumstance that determines whether you are meeting your goal. In fact, it suggests a good test case.
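To show how that goal suggests a test case, here is a minimal sketch of what such a check might look like. The load generator is only a stub and every name in it is invented for illustration; a real test would drive the actual system at 100 queries per second and watch its process.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class WorkingSetGoalTest
{
    const long MaxWorkingSetBytes = 1L * 1024 * 1024 * 1024; // The 1 GB goal

    static async Task Main()
    {
        // Hypothetical load generator: 100 queries per second for one minute.
        Task load = GenerateLoadAsync(queriesPerSecond: 100, duration: TimeSpan.FromMinutes(1));

        long peakWorkingSet = 0;
        while (!load.IsCompleted)
        {
            using (Process p = Process.GetCurrentProcess())
            {
                peakWorkingSet = Math.Max(peakWorkingSet, p.WorkingSet64);
            }
            await Task.Delay(1000);
        }

        Console.WriteLine(peakWorkingSet <= MaxWorkingSetBytes
            ? $"PASS: peak working set {peakWorkingSet:N0} bytes"
            : $"FAIL: peak working set {peakWorkingSet:N0} bytes exceeds the goal");
    }

    static Task GenerateLoadAsync(int queriesPerSecond, TimeSpan duration)
    {
        // Stub: replace with code that drives the real system under test.
        return Task.Delay(duration);
    }
}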

Another major determining factor in what your goals should be is the kind of application you are writing. A user interface program must at all costs remain responsive on the UI thread, whatever else it does. A server program handling dozens, hundreds, or even thousands of requests per second must be incredibly efficient in handling I/O and synchronization to ensure maximum throughput and keep the CPU utilization high. You design a server of this type in a completely different way than other programs. It is very difficult to fix a poorly written application retroactively if it has a fundamentally flawed architecture from an efficiency perspective.

Capacity planning is also important. A useful exercise while designing your system and planning performance measurement is to consider what the optimal theoretical performance of your system is. If you could eliminate all overhead like garbage collection, JIT, thread interrupts, or whatever you deem is overhead in your application, then what is left to process the actual work? What are the theoretical limits that you can think of, in terms of workload, memory usage, CPU usage, and internal synchronization? This often depends on the hardware and OS you are running on. For example, if you have a 16-processor server with 64 GB of RAM with two 10 GB network links, then you have an idea of your parallelism threshold, how much data you can store in memory, and how much you can push over the wire every second. It will help you plan how many machines of this type you will need if one is not enough.
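As a sketch of that kind of back-of-envelope exercise, the snippet below computes rough ceilings for the example machine above. All of the per-request costs are invented placeholders, and the "10 GB" links are assumed to mean 10 gigabits per second; substitute your own estimates.

using System;

class CapacitySketch
{
    static void Main()
    {
        // Invented placeholder costs; substitute your own measured estimates.
        const int    ProcessorCount      = 16;
        const double CpuMsPerRequest     = 2.0;      // CPU time per request
        const double MemoryMbPerRequest  = 0.5;      // resident state per request
        const double WireBytesPerRequest = 8 * 1024; // bytes on the wire per request

        // Two links, assumed to be 10 gigabits per second each.
        const double NetworkBytesPerSec = 2 * 10_000_000_000.0 / 8;

        double cpuCeiling     = ProcessorCount * (1000.0 / CpuMsPerRequest);
        double memoryCeiling  = (64.0 * 1024) / MemoryMbPerRequest; // concurrent requests resident in RAM
        double networkCeiling = NetworkBytesPerSec / WireBytesPerRequest;

        Console.WriteLine($"CPU-bound ceiling:     {cpuCeiling:N0} requests/sec");
        Console.WriteLine($"Memory-bound ceiling:  {memoryCeiling:N0} concurrent requests");
        Console.WriteLine($"Network-bound ceiling: {networkCeiling:N0} requests/sec");
    }
}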

Premature Optimization

You have likely heard the phrase, coined by Donald Knuth, “Premature optimization is the root of all evil.” The context of the quote is in determining which areas of your program are actually important to optimize. This brings us to Amdahl’s Law, which describes the theoretical maximum speedup of a software program through optimization, in particular how it applies to sequential programs and picking which parts of a program to optimize. Micro-optimizing code that does not significantly contribute to overall inefficiency is largely a waste of time. This concept most obviously applies to micro-optimizations at the code level, but it can apply to higher levels of your design as well. You still need to understand your architecture and its constraints as you design or you will miss something crucial and severely hamstring your application. But within those parameters, there are many areas which are not important (or you do not know which sub-areas are important yet). It is not impossible to redesign an existing application from the ground up, but it is far more expensive than doing it right in the first place. When architecting a large system, often the only way you can avoid the premature optimization trap is with experience and examining the architecture of similar or representative systems. In any case, you must bake performance goals into the design up front. Performance, like security and many other aspects of software design, cannot be an afterthought, but needs to be included as an explicit goal from the start.

The performance analysis you will do at the beginning of a project is different from that which occurs once it has been written and is being tested. At the beginning, you must make sure the design is scalable, that the technology can theoretically handle what you want to do, and that you are not making huge architectural blunders that will forever haunt you. Once a project reaches testing, deployment, and maintenance phases, you will instead spend more time on micro-optimizations, analyzing specific code patterns, trying to reduce memory usage, etc.

You will never have time to optimize everything, so start intelligently. Optimize the most inefficient portions of a program first to get the largest benefit. This is why having goals and an excellent measurement system in place is critical—otherwise, you do not even know where to start.

Average vs. Percentiles

When considering the numbers you are measuring, decide what the most appropriate statistics are. Most people default to average, which is certainly important in most circumstances, but you should also consider percentiles. If you have availability requirements, you will almost certainly need to have goals stated in terms of percentiles. For example:

“Average latency for database requests must be less than 10ms. The 95th percentile latency for database requests must be less than 100ms.”

If you are not familiar with this concept, it is actually quite simple. If you take 100 measurements of something and sort them, then the 95th entry in that list is the 95th percentile value of that data set. The 95th percentile says, “95% of all samples have this value or less.” Alternatively, “5% of requests have a value higher than this.”

The general formula for calculating the index of the Pth percentile of a sorted list is:

0.01 * P * N

where P is the percentile and N is the length of the list; round the result to the nearest whole (1-based) index.

Consider a series of measurements for generation 0 garbage collection pause time in milliseconds with these values (pre-sorted for convenience):

1, 2, 2, 4, 5, 5, 8, 10, 10, 11, 11, 11, 15, 23, 24, 25, 50, 87

For these 18 samples, we have an average of 17ms, but the 95th percentile is much higher at 50ms. If you just saw the average number, you may not be concerned with your GC latencies, but knowing the percentiles, you have a better idea of the full picture and that there are some occasional GCs happening that are far worse.

This series also demonstrates that the median value (50th percentile) can be quite different from the average. The average of a series of measurements is often strongly influenced by values in the higher percentiles.
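Here is a minimal sketch that computes these statistics for the sample data above, using the index formula from earlier (rounded to the nearest whole entry). Be aware that there are several slightly different percentile conventions; for large sample counts they converge.

using System;
using System.Linq;

class PercentileExample
{
    static void Main()
    {
        double[] pausesMs = { 1, 2, 2, 4, 5, 5, 8, 10, 10, 11, 11, 11, 15, 23, 24, 25, 50, 87 };

        Console.WriteLine($"Average: {pausesMs.Average():F1} ms"); // ~17 ms
        Console.WriteLine($"Median:  {Percentile(pausesMs, 50)} ms"); // 10 ms
        Console.WriteLine($"95th:    {Percentile(pausesMs, 95)} ms"); // 50 ms
    }

    // Index = 0.01 * P * N, rounded to the nearest whole (1-based) entry of a sorted list.
    static double Percentile(double[] sortedValues, double p)
    {
        int index = (int)Math.Round(0.01 * p * sortedValues.Length);
        index = Math.Max(1, Math.Min(index, sortedValues.Length));
        return sortedValues[index - 1];
    }
}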

Percentile values are usually far more important for high-availability services. The higher availability you require, the higher percentile you will want to track. Usually, the 99th percentile is as high as you need to care about, but if you deal in a truly enormous volume of requests, the 99.99th, 99.999th, or even higher percentiles will be important. Often, the value you need to be concerned about is determined by business needs, not technical reasons.

Percentiles are valuable because they give you an idea of how your metrics degrade across your entire execution context. Even if the average user or request experience in your application is good, perhaps the 90th percentile metric shows some room for improvement. That is telling you that 10% of your execution is being impacted more negatively than the rest. Tracking multiple percentiles will tell you how fast this degradation occurs. How important this percentage of users or requests is must ultimately be a business decision, and there is definitely a law of diminishing returns at play here. Getting that last 1% may be extremely difficult and costly.

I stated that the 95th percentile for the above data set was 50ms. While technically true, it is not useful information in this case—there is not actually enough data to make that call with any statistical significance, and it could be just a fluke. To determine how many samples you need, just use a rule of thumb: You need one “order of magnitude” more samples than the target percentile. For percentiles from 0-99, you need 100 samples minimum. You need 1,000 samples for 99.9th percentile, 10,000 samples for 99.99th percentile, and so on. This mostly works, but if you are interested in determining the actual number of samples you need from a mathematical perspective, research sample size determination.

Put more precisely, the potential error varies with the square root of the number of samples. For example, 100 samples yields an error range of 90-110, or a 10% error; 1,000 samples yields an error range of 969-1,031, or about a 3% error.

Do not forget to also consider other types of statistical values: minimum, maximum, median, standard deviations, and more, depending on the type of metric you are measuring. For example, to determine statistically relevant differences between two sets of data, t-tests are often used. Standard deviations are used to determine how much variation exists within a data set.

Benchmarking

If you want to measure the performance of a piece of code, especially to compare it to an alternative implementation, what you want is a benchmark. The literal definition of a benchmark is a standard against which measurements can be compared. In terms of software development, this means precise timings, usually averaged across many thousands (or millions) of iterations.

You can benchmark many types of things at different levels, from entire programs down to single methods. However, the more variability that exists in the code under test, the more iterations you will need to achieve sufficient accuracy.

Running benchmarks is a tricky endeavor. You want to measure the code in real-world conditions to get real-world, actionable data, but creating these conditions while getting useful data can be trickier than it seems.

Benchmarks shine when they test a single, uncontended resource, the classic example being CPU time. You certainly can test things like network access time, or reading files off an SSD, but you will need to take more care to isolate those resources from outside influence. Modern operating systems are not designed for this kind of isolation, but with careful control of the environment, you can likely achieve satisfactory results.

Testing entire programs or submodules is more likely to involve the use of contended resources. Thankfully, such large-scope tests are rarely called for. A quick profile of an app will reveal the spots that use the most resources, allowing you to focus narrowly on those areas.

Small-scope micro-benchmarking most commonly measures the CPU time of single methods, often rerunning them millions of times to get precise statistics on the time taken.

In addition to hardware isolation, there are a number of other factors to consider:

  • Code must be JITted: The first time you run a method takes a lot longer than subsequent iterations.
  • Other Hidden Initialization: There are OS caches, file system caches, CLR caches, hardware caches, code generation, and myriad other startup costs that can impact the performance of code.
  • Isolation: If other expensive processes are running, they can interfere with the measurements.
  • Outliers: Statistical outliers in measurement must be accounted for and probably discarded. Determining what are outliers and what is normal variance can be tricky.
  • Narrowly Focused: CPU time is important, but so are memory allocation, I/O, thread blocking, and more.
  • Release vs. Debug Code: Benchmarking should always be done on Release code, with all optimizations turned on.
  • Observer Effects: The mere act of observing something necessarily changes what is being observed. For example, measuring CPU or memory allocations in .NET involves emitting and measuring extra ETW events, something not normally done.

The sample code that accompanies this book has a few quick-and-dirty benchmarks throughout, but for the above reasons, they should not be taken as the absolute truth.
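To show what "quick-and-dirty" means in practice, here is a minimal sketch of such a benchmark. It deals with only two of the issues listed above (JIT warm-up and dead-code elimination), which is precisely why a real benchmarking library is the better choice.

using System;
using System.Diagnostics;

class QuickBenchmark
{
    static void Main()
    {
        const int Iterations = 1_000_000;

        // Warm-up: make sure the method is JITted before timing starts.
        long sink = MethodUnderTest();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            // Consume the result so the JIT cannot eliminate the call.
            sink += MethodUnderTest();
        }
        sw.Stop();

        Console.WriteLine($"Total:    {sw.Elapsed.TotalMilliseconds:F1} ms (sink={sink})");
        Console.WriteLine($"Per call: {sw.Elapsed.TotalMilliseconds * 1_000_000.0 / Iterations:F1} ns");
    }

    // Placeholder for whatever code you actually want to measure.
    static long MethodUnderTest()
    {
        long x = 0;
        for (int i = 0; i < 100; i++) x += i;
        return x;
    }
}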

Instead of writing your own benchmarks, you should almost certainly use an existing library that handles many of the above issues for you. I’ll discuss a couple of options later in this chapter.

Useful Tools

If there is one single rule that is the most important in this entire book, it is this:

Measure, Measure, Measure!

You do NOT know where your performance problems are if you have not measured accurately. You will definitely gain experience and that can give you some strong hints about where performance problems are, just from code inspection or gut feel. You may even be right, but resist the urge to skip the measurement for anything but the most trivial of problems. The reasons for this are two-fold:

First, suppose you are right, and you have accurately found a performance problem. You probably want to know how much you improved the program, right? Bragging rights are much more secure with hard data to back them up.

Second, I cannot tell you how often I have been wrong. Case in point: While analyzing the amount of native memory in a process compared to managed memory, we assumed for a while that it was coming from one particular area that loaded an enormous data set. Rather than putting a developer on the task of reducing that memory usage, we did some experiments to disable loading that component. We also used the debugger to dump information about all the heaps in the process. To our surprise, most of the mystery memory was coming from assembly loading overhead, not this dataset. We saved a lot of wasted effort.

Optimizing performance is meaningless if you do not have effective tools for measuring it. Performance measurement is a continual process that you should bake into your development tool set, testing processes, and monitoring tools. If your application requires continual monitoring for functionality purposes, then it likely also requires performance monitoring.

The remainder of this chapter covers various tools that you can use to profile, monitor, and debug performance issues. I give emphasis to Visual Studio and software that is freely available, but know there are many other commercial offerings that can in some cases simplify various analysis tasks. If you have the budget for these tools, go for it. However, there is a lot of value in using some of the leaner tools I describe (or others like them). For one, they may be easier to run on customer machines or production environments. More importantly, by being a little “closer to the metal,” they will encourage you to gain knowledge and understanding at a very deep level that will help you interpret data, regardless of the tool you are using.

For each of the tools, I describe basic usage and general knowledge to get started. Sections throughout the book will give you detailed steps for very specific scenarios, but will often rely on you already being familiar with the UI and the basics of operation.

Tip Before digging into specific tools, a general tip for how to use them is in order. If you try to use an unfamiliar tool on a large, complicated project, it can be very easy to get overwhelmed, frustrated, or even get erroneous results. When learning how to measure performance with a new tool, create a test program with well-known behavior, and use the tool to prove its performance characteristics to you. By doing this, you will be more comfortable using the tool in a more complicated situation and less prone to making technical or judgmental mistakes.
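For example, a sketch of such a well-known-behavior program might deliberately burn a fixed amount of CPU and allocate a fixed amount of memory, so you can verify that the tool reports roughly those numbers before trusting it on real code:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class KnownBehavior
{
    static void Main()
    {
        // About 5 seconds of pure CPU work: a CPU profiler should attribute
        // nearly all samples to BurnCpu.
        BurnCpu(TimeSpan.FromSeconds(5));

        // About 100 MB of managed allocations in 1 KB chunks: a memory
        // profiler should attribute roughly that much to byte[].
        List<byte[]> keepAlive = AllocateMemory(totalBytes: 100 * 1024 * 1024, chunkSize: 1024);

        Console.WriteLine($"Done. Kept {keepAlive.Count:N0} chunks alive. Press Enter to exit.");
        Console.ReadLine(); // Keep the process alive for heap snapshots.
    }

    static void BurnCpu(TimeSpan duration)
    {
        var sw = Stopwatch.StartNew();
        double x = 0;
        while (sw.Elapsed < duration)
        {
            x += Math.Sqrt(x + 1);
        }
    }

    static List<byte[]> AllocateMemory(int totalBytes, int chunkSize)
    {
        var chunks = new List<byte[]>(totalBytes / chunkSize);
        for (int allocated = 0; allocated < totalBytes; allocated += chunkSize)
        {
            chunks.Add(new byte[chunkSize]);
        }
        return chunks;
    }
}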

Visual Studio

While it is not the only IDE, most .NET programmers use Visual Studio, and if you do, chances are this is where you will start to analyze performance. Different versions of Visual Studio come with different tools. This book will assume you have at least the Professional version installed, but I will also describe some tools found in higher versions as well. If you do not have the right version, then skip ahead to the other tools mentioned.

Assuming you installed Visual Studio Professional or higher, you can access the performance tools via the Analyze menu by selecting Performance Profiler (or use the default keyboard shortcut: Alt+F2).

Standard .NET applications will show at least three options, with more available depending on the specific type of application:

  • CPU Usage: Measures CPU usage per function.
  • Memory Usage: Shows garbage collections and allows you to take heap snapshots.
  • Performance Wizard: Uses VsPerf.exe to do ETW-based analysis of CPU usage (sampling or instrumentation), .NET memory allocation, and thread contention.
Profiling options in Visual Studio.

If you just need to analyze CPU or look at what is on the heap, then use the first two tools. The Performance Wizard can also do CPU analysis, but it can be a bit slower. However, despite being somewhat of a legacy tool, it can also track memory allocations and concurrency.

For superior concurrency analysis, install the free Concurrency Visualizer, available as an optional extension (Tools | Extensions and Updates… menu).

The Visual Studio tools are among the easiest to use, but if you do not already have the right version of Visual Studio, they are quite expensive. They are also fairly limited and inflexible in what they provide. If you cannot use Visual Studio, or need more capabilities, I describe free alternatives below.

Nearly all modern performance measurement tools use the same underlying mechanism (at least in Windows 8/Server 2012 and above kernels): ETW events. ETW stands for Event Tracing for Windows and this is the operating system’s way of logging all interesting events in an extremely fast, efficient manner. Any application can generate these events with simple APIs. Chapter 8 describes how to take advantage of ETW events in your own programs, defining your own or integrating with a stream of system events. Some tools, such as PerfView, can collect arbitrary ETW events all at once and you can analyze all of them separately from one collection session.

Sometimes I think of Visual Studio performance analysis as “development-time” while the other tools are for the real system. Your experience may differ and you should use the tools that give you the most bang for the buck.

CPU Profiling

This section will introduce the general interface for profiling with the CPU profiling options. The other profiler options (such as for memory) will be covered later in the book, in appropriate sections.

When you choose CPU Usage, the results appear in a window with a graph of CPU usage and a list of the most expensive methods.

CPU Usage results. Timeline, overall usage graph, and tree of the most expensive methods.

If you want to drill into a specific method, just double-click it in the list, and it will open up a method Call/Callee view.

CPU Usage Method Call/Callee Diagram. Shows the most expensive parts of a method.

If that option does not give you enough information, take a look at the performance wizard. This tool uses VsPerf.exe to gather important events.

The first screen of the Performance Wizard.

When you choose CPU (Sampling), it collects CPU samples periodically without modifying your program.

The Performance Wizard’s CPU sampling report view.

Though its interface differs from the CPU Usage view we saw earlier, this view shows you the overall CPU usage on a timeline, with a tree of expensive methods below it. There are also alternate reports you can view. You can zoom in on the graph and the rest of the analysis will update in response. Clicking on a method name in the table will take you to a familiar-looking Function Details view.

Details of the method’s CPU usage.

Below the function call summary, you will see the source code (if available), with highlighted lines showing the most expensive parts of the method.

There are other reports as well, including:

  • Modules: Which assemblies have the most samples in them.
  • Caller/Callee: An alternative to the Function Details view that shows tables of samples above and below the current method in the stack.
  • Functions: A quick way to see a table of all functions in the process.
  • Lines: A way to jump quickly to the most expensive individual code lines in the process.

Instead of sampling, you can choose to instrument the code. This modifies the original executable by adding instructions around each method call to measure the time spent. This can give more accurate reporting for very small, fast methods, but it has much higher overhead in execution time as well as the amount of data produced. Other than a lack of a CPU graph, the report looks and behaves the same as the CPU sampling report. The major difference in the interface is that it is measuring time instead of number of samples.

Command Line Profiling

Visual Studio can analyze CPU usage, memory allocations, and resource contentions. This is perfect for use during development or when running comprehensive tests that accurately exercise the product. However, it is very rare for a test to accurately capture the performance characteristics of a large application running on real data. If you need to capture performance data on non-development machines, say a customer’s machine or in the data center, you need a tool that can run outside of Visual Studio.

For that, there is the Visual Studio Standalone Profiler, which comes with the Professional or higher versions of Visual Studio. You will need to install it from your installation media separately from Visual Studio. On my ISO images for both the 2012 and 2015 Professional versions, it is in the Standalone Profiler directory. For Visual Studio 2017, the executable is VsPerf.exe and is located in C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Team Tools\Performance Tools.

To collect data from the command line with this tool:

  1. Navigate to the installation folder (or add the folder to your path)
  2. Run: VsPerfCmd.exe /Start:Sample /Output:outputfile.vsp
  3. Run the program you want to profile
  4. Run: VsPerfCmd.exe /Shutdown

This will produce a file called outputfile.vsp, which you can open in Visual Studio.

VsPerfCmd.exe has a number of other options, including all of the profiling types that the full Visual Studio experience offers. Aside from the most common option of Sample, you can choose:

  • Coverage: Collects code coverage data
  • Concurrency: Collects resource contention data
  • Trace: Instruments the code to collect method call timing and counts

Trace vs. Sample mode is an important choice. Which one to use depends on what you want to measure. Sample mode should be your default. It interrupts the process every few milliseconds and records the stacks of all threads. This is the best way to get a good picture of CPU usage in your process. However, it does not work well for I/O calls, which will not have much CPU usage, but may still contribute to your overall run time.

Trace mode requires modification of every function call in the process to record time stamps. It is much more intrusive and causes your program to run much slower. However, it records actual time spent in each method, so may be more accurate for smaller, faster methods.

Coverage mode is not for performance analysis, but is useful for seeing which lines of your code were executed. This is a nice feature to have when running tests to see how much of your product the tests cover. There are commercial products that do this for you, but you can do it yourself without much more work.

Concurrency mode records events that occur when there is contention for a resource via a lock or some other synchronization object. This mode can tell you if your threads are being blocked due to contention. See Chapter 4 for more information about asynchronous programming and measuring the amount of lock contention in your application.

Performance Counters

Performance counters are some of the simplest ways to monitor your application’s and the system’s performance. Windows has hundreds of counters in dozens of categories, including many for .NET. The easiest way to access these is via the built-in Windows utility Performance Monitor (PerfMon.exe).

PerfMon’s main window showing a processor counter for a small window of time. The vertical line represents the current instance and the graph will wrap around after 100 seconds by default.
One of the hundreds of counters in many categories, showing all of the applicable instances (processes, in this case).

Each counter has a category and a name. Many counters also have instances of the selected counter as well. For example, for the % Processor Time counter in the Process category, the instances are the various processes for which there are values. Some counters also have meta-instances, such as _Total or <Global>, which aggregate the values over all instances.

Many of the chapters ahead will detail the relevant counters for that topic, but there are general-purpose counters that are not .NET-specific that you should be familiar with. There are performance counters for nearly every Windows subsystem and these are generally applicable to every program.

However, before continuing, you should familiarize yourself with some basic operating system terminology:

  • Physical Memory: The actual physical memory chips in a computer. Only the operating system manages physical memory directly.
  • Virtual Memory: A logical organization of memory in a given process. Virtual memory size can be larger than physical memory. For example, 32-bit programs have a 4 GB address space, even if the computer itself only has 2 GB of RAM. Windows allows the program to access only 2 GB of that by default, but all 4 GB is possible if the executable has the large-address-aware bit set. (On 32-bit versions of Windows, large-address-aware programs are limited to 3 GB, with 1 GB reserved for the operating system.) As of Windows 8.1 and Server 2012 R2, 64-bit processes have a 128 TB address space, far larger than the 4 TB physical memory limit. Some of the virtual memory may be in RAM while other parts are stored on disk in a paging file. Contiguous blocks of virtual memory may not be contiguous in physical memory. All memory addresses in a process are for the virtual memory.
  • Reserved Memory: A region of virtual memory address space that has been reserved for the process and thus will not be allocated to a future requester. Reserved memory cannot be used for memory allocation requests because there is nothing backing it—it is just a description of a range of memory addresses.
  • Committed Memory: A region of memory that has a physical backing store. This can be RAM or disk.
  • Page: An organizational unit of memory. Blocks of memory are allocated in a page, which is usually a few KB in size.
  • Paging: The process of transferring pages between regions of virtual memory. The page can move to or from another process (soft paging) or the disk (hard paging). Soft paging can be accomplished very quickly by mapping the existing memory into the current process’s virtual address space. Hard paging involves a relatively slow transfer of data to or from a disk. Your program must avoid this at all costs to maintain good performance.
  • Page In: Transfer a page from another location to the current process.
  • Page Out: Transfer a page from the current process to another location, such as disk.
  • Context Switch: The process of saving and restoring the state of a thread or process. Because there are usually more running threads than available processors, there are often many context switches per second. These are pure overhead, so fewer is better, but it is difficult to know what an optimal absolute value should be.
  • Kernel Mode: A mode that allows the OS to modify low-level aspects of the hardware’s state, such as modifying certain registers or enabling/disabling interrupts. Transitioning to Kernel Mode requires an operating system call, and can be quite expensive.
  • User Mode: An unprivileged mode of executing instructions. There is no ability to modify low-level aspects of the system.

I will use some of these terms throughout the book, especially in Chapter 2 when I discuss garbage collection. For more information on these topics, look at a dedicated operating system book such as Windows Internals. (See the bibliography at the end of the book.)

The Process category of counters surfaces much of this critical information via counters with instances for each process, including:

  • % Privileged Time: Amount of time spent in executing privileged (kernel mode) code.
  • % Processor Time: Percentage of a single processor the application is using. If your application is using two logical processor cores at 100% each, then this counter will read 200.
  • % User Time: Amount of time spent in executing unprivileged (user mode) code.
  • IO Data Bytes/sec: How much I/O your process is doing.
  • Page Faults/sec: Total number of page faults in your process. A page fault occurs when a page of memory is missing from the current working set. It is important to realize that this number includes both soft and hard page faults. Soft page faults are innocuous and can be caused by the page being in memory, but outside the current process (such as for shared DLLs). Hard page faults are more serious, indicating data that is on disk but not currently in memory. Unfortunately, you cannot track hard page faults per process with performance counters, but you can see them for the entire system with the Memory\Page Reads/sec counter. You can do some correlation with a process’s total page faults plus the system’s overall page reads (hard faults). You can definitively track a process’s hard faults with ETW tracing via the Windows Kernel/Memory/Hard Fault event.
  • Pool Nonpaged Bytes: Typically operating system- and driver-allocated memory for data structures that cannot be paged out, such as operating system objects like threads and mutexes, but also custom data structures.
  • Pool Paged Bytes: Also for operating system data structures, but these are allowed to be paged out.
  • Private Bytes: Committed virtual memory private to the specific process (not shared with any other processes).
  • Virtual Bytes: Allocated memory in the process’s address space, some of which may be backed by the page file, possibly shared with other processes or private to the process.
  • Working Set: The amount of virtual memory currently resident in physical memory (usually RAM).
  • Working Set-Private: The amount of private bytes currently resident in physical memory.
  • Thread Count: The number of threads in the process. This may or may not be equal to the number of .NET threads. See Chapter 4 for a discussion of .NET thread-related counters.

There are a few other generally useful categories, depending on your application. You can use PerfMon to explore the specific counters found in these categories.

  • IPv4/IPv6: Internet Protocol-related counters for datagrams and fragments.
  • Memory: System-wide memory counters such as overall paging, available bytes, committed bytes, and much more.
  • Objects: Data about kernel-owned objects such as events, mutexes, processes, threads, semaphores, and sections.
  • Processor: Counters for each logical processor in the system.
  • System: Context switches, alignment fixes, file operations, process count, threads, and more.
  • TCPv4/TCPv6: Data for TCP connections and segment transfers.

It is surprisingly difficult to find detailed information on performance counters on the Internet, but thankfully, they are self-documenting! In the Add Counter dialog box in PerfMon, you can check the “Show description” box at the bottom to display details on the highlighted counter.

PerfMon also has the ability to collect specified performance counters at scheduled times and store them in logs for later viewing, or even perform a custom action when a performance counter passes a threshold. You do this with Data Collector Sets and they are not limited just to performance counter data, but can also collect system configuration data and ETW events.

To set up a Data Collector Set, in the main PerfMon window:

  1. Expand the Data Collector Sets tree.
  2. Right-click on User Defined.
  3. Select New.
  4. Select Data Collector Set.
  5. Give it a name, check Create manually (Advanced), and click the Next button.
  6. Check the Performance counter box under Create Data Logs and click the Next button.
  7. Click Add to select the counters you want to include.
  8. Click Next to set the path where you want to store the logs and Next again to select security information.
Data Collector Set configuration dialog box for setting up regular counter collections.
Specify the type of data you want to store.
Select the counters to collect.

Once done, you can open the properties for the collection set and set a schedule for collection. You can also run them manually by right-clicking on the job node and selecting Start. This will create a report, which you can view by double-clicking its node under Reports in the main tree view.

A saved report file. Use the toolbar buttons to change the view to a graph of the captured counter data.

To create an alert, follow the same process but select the Performance Counter Alert option in the Wizard.

It is likely that everything you will need to do with performance counters can be done using the functionality described here, but if you want to take programmatic control or create your own counters, see Chapter 7 for details. You should consider performance counter analysis a baseline for all performance work on your application.
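As a small preview of that programmatic access, here is a minimal sketch that reads two of the Process counters described above for the current process. It assumes the full .NET Framework (on .NET Core you would need the System.Diagnostics.PerformanceCounter package), and it ignores the instance-name suffixes you would see with multiple processes of the same name.

using System;
using System.Diagnostics;
using System.Threading;

class CounterReader
{
    static void Main()
    {
        string instance = Process.GetCurrentProcess().ProcessName;

        using (var cpu = new PerformanceCounter("Process", "% Processor Time", instance))
        using (var workingSet = new PerformanceCounter("Process", "Working Set", instance))
        {
            // The first NextValue call on a rate counter returns 0; it needs
            // two samples to compute a delta.
            cpu.NextValue();

            for (int i = 0; i < 5; i++)
            {
                Thread.Sleep(1000);
                Console.WriteLine($"CPU: {cpu.NextValue():F0}%  Working Set: {workingSet.NextValue():N0} bytes");
            }
        }
    }
}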

ETW Events

Event Tracing for Windows (ETW) is one of the fundamental building blocks for all diagnostic logging in Windows, not just for performance. This section will give you an overview of ETW and Chapter 8 will teach you how to create and monitor your own events.

Events are produced by providers. For example, the CLR contains the Runtime provider that produces most of the events we are interested in for this book. There are providers for nearly every subsystem in Windows, such as the CPU, disk, network, firewall, memory, and many, many more. The ETW subsystem is extremely efficient and can handle the enormous volume of events generated, with minimal overhead.

Each event has some standard fields associated with it, like event level (informational, warning, error, verbose, and critical) and keywords. Each provider can define its own keywords. The CLR’s Runtime provider has keywords for things like GC, JIT, Security, Interop, Contention, and more. Keywords allow you to filter the events you would like to monitor.
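To give a flavor of what a provider with keywords looks like from the code side (Chapter 8 covers this properly), here is a minimal EventSource sketch. The provider name, event names, and keyword values are all invented for illustration.

using System.Diagnostics.Tracing;

[EventSource(Name = "MyCompany-MyApp")]
sealed class MyAppEventSource : EventSource
{
    public static readonly MyAppEventSource Log = new MyAppEventSource();

    // Keywords let a listener filter to just the events it cares about,
    // much like the CLR provider's GC, JIT, and Contention keywords.
    public static class Keywords
    {
        public const EventKeywords Requests = (EventKeywords)0x1;
        public const EventKeywords Caching  = (EventKeywords)0x2;
    }

    [Event(1, Level = EventLevel.Informational, Keywords = Keywords.Requests)]
    public void RequestStart(string url) => WriteEvent(1, url);

    [Event(2, Level = EventLevel.Informational, Keywords = Keywords.Requests)]
    public void RequestStop(string url, long elapsedMs) => WriteEvent(2, url, elapsedMs);
}

// Usage: MyAppEventSource.Log.RequestStart("/home");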

A list of all GC Start events taken in a 60-second trace. Notice various pieces of data associated with the event, such as the Reason and Depth.

Each event also has a custom data structure defined by its provider that describes the state of some behavior. For example, the Runtime’s GC events will mention things like the generation of the current collection, whether it was background, and so on.

What makes ETW so powerful is that most components in Windows produce an enormous number of events describing nearly every aspect of an application’s operation, at every layer, so you can do the bulk of performance analysis with ETW events alone.

Many tools can process ETW events and give specialized views. In fact, starting in Windows 8, all CPU profiling is done using ETW events.

To see a list of all the ETW providers registered on your system, open a command prompt and type:

logman query providers

This will produce a large amount of output similar to the following:

Provider                               GUID
------------------------------------------------------------------
.NET Common Language Runtime           {E13C0D23-CCBC-4E12-931B...
ACPI Driver Trace Provider             {DAB01D4D-2D48-477D-B1C3...
Active Directory Domain Services: SAM  {8E598056-8993-11D2-819E...
Active Directory: Kerberos Client      {BBA3ADD2-C229-4CDB-AE2B...
Active Directory: NetLogon             {F33959B4-DBEC-11D2-895B...
ADODB.1                                {04C8A86F-3369-12F8-4769...
ADOMD.1                                {7EA56435-3F2F-3F63-A829...
Application Popup                      {47BFA2B7-BD54-4FAC-B70B...
Application-Addon-Event-Provider       {A83FA99F-C356-4DED-9FD6...
...

You can also get details on the keywords for a specific provider:

D:\>logman query providers "Windows Kernel Trace"

Provider                               GUID
------------------------------------------------------------------
Windows Kernel Trace                   {9E814AAD-3204-11D2-9A82...

Value               Keyword            Description
------------------------------------------------------------------
0x0000000000000001  process            Process creations/deletions
0x0000000000000002  thread             Thread creations/deletions
0x0000000000000004  img                Image load
0x0000000000000008  proccntr           Process counters
0x0000000000000010  cswitch            Context switches
0x0000000000000020  dpc                Deferred procedure calls
0x0000000000000040  isr                Interrupts
0x0000000000000080  syscall            System calls
0x0000000000000100  disk               Disk IO
0x0000000000000200  file               File details
0x0000000000000400  diskinit           Disk IO entry
0x0000000000000800  dispatcher         Dispatcher operations
0x0000000000001000  pf                 Page faults
0x0000000000002000  hf                 Hard page faults
0x0000000000004000  virtalloc          Virtual memory allocations
0x0000000000010000  net                Network TCP/IP
0x0000000000020000  registry           Registry details
0x0000000000100000  alpc               ALPC
0x0000000000200000  splitio            Split IO
0x0000000000800000  driver             Driver delays
0x0000000001000000  profile            Sample based profiling
0x0000000002000000  fileiocompletion   File IO completion
0x0000000004000000  fileio             File IO

Unfortunately, there is no good online resource to explain which events exist in the various providers. Some common ETW events for all Windows processes include those in the Windows Kernel Trace category:

  • Memory/Hard Fault
  • DiskIO/Read
  • DiskIO/Write
  • Process/Start
  • Process/Stop
  • TcpIp/Connect
  • TcpIp/Disconnect
  • Thread/Start
  • Thread/Stop

To see other events from this provider or others, you can collect ETW events and examine them yourself.

Throughout the book, I will mention the important events you should pay attention to in an ETW trace, particularly from the CLR Runtime provider. For the complete CLR ETW documentation, you can visit https://docs.microsoft.com/dotnet/framework/performance/etw-events-in-the-common-language-runtime.

PerfView

Many tools can collect and analyze ETW events, but PerfView, originally written by Microsoft .NET performance architect (and writer of this book’s Foreword) Vance Morrison, is one of the best for its sheer power. The previous screenshot of ETW events is from this tool.

PerfView is built upon an ETW processing engine called TraceEvent, which you can reuse yourself (See Chapter 8). But PerfView’s real utility lies in its extremely powerful stack grouping and folding mechanism that lets you drill into events at multiple layers of abstraction.

While other ETW analysis tools can be useful, I often prefer PerfView for a few reasons:

  1. It requires no installation so it is easy to run on any computer.
  2. It is extremely configurable and easily scriptable.
  3. You can pick which events to capture at a very granular level, which allows you, for example, to take hours-long traces of just a few categories of events.
  4. It generally causes very little impact to the machine or processes it monitors.
  5. It has unparalleled analysis capabilities through its extensive stack grouping and folding capability.
  6. You can customize PerfView with extensions for your own custom analysis that take advantage of the built-in stack grouping and folding functionality.
  7. It has integrated source browsing, including the source of the .NET Framework.
  8. Sophisticated analysis of asynchronous calls that use the Task Parallel Library.
  9. Support for IIS and ASP.NET.

Here are some common questions that I routinely answer using PerfView:

  • Where is my CPU usage going?
  • Who is allocating the most memory?
  • What types are being allocated the most?
  • What is causing my Gen 2 garbage collections?
  • How long is the average Gen 0 collection?
  • How much JITting is my code doing?
  • Which locks are most contentious?
  • What does my managed heap look like?

To collect and analyze events using PerfView follow these basic steps:

  1. From the Collect menu, select the Collect menu item.
  2. From the resulting dialog, specify the options you need.
    1. Expand Advanced Options to narrow down the type of events you want to capture.
    2. Check No V3.X NGEN Symbols if you are not using .NET 3.5.
    3. Optionally specify Max Collect Sec to automatically stop collection after the given time.
  3. Click the Start Collection button.
  4. If not using a Max Collect Sec value, click the Stop Collection button when done.
  5. Wait for the events to be processed.
  6. Select the view to use from the resulting tree.

During event collection, PerfView captures ETW events for all processes. You can filter events per-process after the collection is complete.

Collecting events is not free. Certain categories of events are more expensive to collect than others. For example, a CPU profile generates a huge number of events, so you should keep the profile time very limited (around a minute or two) or you could end up with multi-gigabyte files that you cannot analyze.

PerfView Interface and Views

Most views in PerfView are variations of a single type, so it is worth understanding how it works.

PerfView is mostly a stack aggregator and viewer. When you record ETW events, the stack for each event is recorded. PerfView analyzes these stacks and shows them to you in a grid that is common to CPU, memory allocation, lock contention, exceptions thrown, and most other types of events. The principles you learn while doing one type of investigation will apply to other types, since the stack analysis is the same.

You also need to understand the concepts of grouping and folding. Grouping turns multiple sources into a single entity. For example, there are multiple .NET Framework DLLs and which DLL a particular function is in is not usually interesting for profiling. Using grouping, you can define a grouping pattern, such as “System.*!=>LIB”, which coalesces all System.*.dll assemblies into a single group called LIB. This is one of the default grouping patterns that PerfView applies. If you wanted to, for example, collapse all method calls in the TimeZoneInfo class, you could have a group defined as:

“mscorlib.ni!System.TimeZoneInfo*->TIMEZONE”

This will cause TIMEZONE to appear throughout your stack in the place of any TimeZoneInfo methods.

Without a grouping, there are multiple layers of calls inside the TimeZoneInfo class.
By grouping these, you can hide their details and pretend that it is all just a single frame in the stack.

Folding allows you to hide some of the irrelevant complexity of the lower layers of code by counting its cost in the nodes that call it. As a simple example, consider where memory allocations occur—always via some internal CLR method invoked by the new operator. What you really want to know is which types are most responsible for those allocations. Folding allows you to attribute those underlying costs to their parents, code which you can actually control. For example, in most cases you do not care about which internal operations are taking up time inside String.Format; you really care about what areas of your code are calling String.Format in the first place. PerfView can fold those operations into the caller to give you a better picture of your code’s performance.

The call to DateTime.Now includes a deep chain of TimeZoneInfo method calls. Without folding, these can get a little noisy.
By folding the pattern “mscorlib.ni!System.TimeZoneInfo*”, all of the cost of those methods will be counted as the cost of calling DateTime.Now.

Folding patterns can use the groups you defined for grouping. So, for example, you can just specify a folding pattern of “LIB” which will ensure that all methods in System.* are attributed to their caller outside of System.*.

The user interface of the stack viewer needs some brief explanation as well.

A typical stack view in PerfView. The UI contains many options for filtering, sorting, and searching.

Controls at the top allow you to organize the stack view in multiple ways. Here is a summary of their usage, but you can click on the ? in the column headers to bring up a help file that gives you more details.

  • Start: Start time (in microseconds) which you want to examine.
  • End: End time (in microseconds) which you want to examine.
  • Find: Text to search for.
  • GroupPats: A semi-colon-delimited list of grouping patterns.
  • Fold%: Any stack that takes less than this percentage will be folded into its parent.
  • FoldPats: A semi-colon-delimited list of folding patterns.
  • IncPats: Stacks must have this pattern to be included in the analysis. This usually contains the process name.
  • ExcPats: Exclude anything with this pattern from analysis. By default, this has just the Idle process.

There are a few different view tabs:

  • By Name: Shows every node, whether type, method, or group. This is good for bottom-up analysis.
  • Caller-Callee: Focuses on a single node, showing you callers and callees of that node.
  • CallTree: Shows a tree of all nodes in the profile, starting at ROOT. This works well for doing top-down analysis.
  • Callers: Shows you all callers of a particular node.
  • Callees: Shows you all called methods of a particular node.
  • Notes: Allows you to save notes on your investigation in the ETL files themselves.

In the grid view, there are a number of columns. Click on the column names to bring up more information. Here is a summary of the most important columns:

  • Name: The type, method, or customized group name.
  • Exc %: Percentage of exclusive cost. For memory traces, it is the amount of memory attributed to this type/method only. For CPU traces, it is the amount of CPU time attributed to this method.
  • Exc: The sum of the sampled metric in just this node, excluding child nodes. For memory traces, the number of bytes attributed to this node exclusively. For CPU traces, the amount of time (in milliseconds) spent here.
  • Exc Ct: Number of samples exclusively in this node.
  • Inc %: Percentage of cost for this type/method and all its children. This is always at least as big as Exc %.
  • Inc: Cost of this node, including all children. For CPU usage, this is the amount of CPU time spent in this node plus all of its children.
  • Inc Ct: Number of samples on this node and all its children.

In the chapters that follow, I will give instructions for solving specific problems with various types of performance investigations. A complete overview of PerfView would be worth a book on its own, or at least a very detailed help file—which just so happens to come with PerfView. I strongly encourage you to read this manual once you have gone through a few simple analyses.

It may seem like PerfView is mostly for analyzing memory or CPU, but do not forget that it is really just a generic stack aggregation program, and those stacks can come from any ETW event. It can analyze your sources of lock contention, disk I/O, or any arbitrary application event with the same grouping and folding power.

CLR Profiler

CLR Profiler is a possible alternative to PerfView’s memory analysis capabilities if you want a graphical representation of the heap and relationships between objects. CLR Profiler can show you a wealth of detail. For example:

  • Visual graph of what the program allocates and the chain of methods that led to the allocation.
  • Histograms of allocated, relocated, and finalized objects by size and type.
  • Histogram of objects by lifetime.
  • Timeline of object allocations and garbage collections, showing the change in the heap over time.
  • Graphs of objects by their virtual memory address, which can show fragmentation quite easily.

I rarely use CLR Profiler because of some of its limitations and age, but it is still occasionally useful. It has unique visualizations that no other free tool currently matches. It comes with 32-bit and 64-bit binaries as well as documentation and the source code.

The basic steps to get a trace are:

  1. Pick the correct version to run: 32-bit or 64-bit, depending on your target program. You cannot profile a 32-bit program with the 64-bit profiler or vice-versa.
  2. Check the Profiling active box.
  3. Optionally check the Allocations and Calls boxes.
  4. If necessary, go to the File | Set Parameters… menu option to set options like command line parameters, working directory, and log file directory.
  5. Click the Start Application button
  6. Browse to the application you want to profile and click the Open button.
CLR Profiler’s main window.

This will start the application with profiling active. When you are done profiling, exit the program, or select Kill Application in CLR Profiler. This will terminate the profiled application and start processing the capture log. This processing can take quite a while, depending on the profile duration. (I have seen it take over an hour before.)

While profiling is going on, you can click the “Show Heap now” button in CLR Profiler. This will cause it to take a heap dump and open the results in a visual graph of object relationships. Profiling will continue uninterrupted, and you can take multiple heap dumps at different points.

When it is done, you will see the main results screen.

CLR Profiler’s Results Summary view, showing you the data it collected during the trace.

From this screen, you can access different visualizations of heap data. Start with the Allocation Graph and the Time Line to see some of the essential capabilities. As you become comfortable analyzing managed code, the histogram views will also become an invaluable resource.

Note While CLR Profiler is generally great, I have had a few major problems with it. First, it is a bit finicky. If you do not set it up correctly before starting to profile, it can throw exceptions or die unexpectedly. For example, I always have to check the Allocations or Calls boxes before I start profiling if I want to get any data at all. You should completely disregard the Attach to Process button, as it does not seem to work reliably. CLR Profiler does not seem to work well for truly huge applications with enormous heaps or a large number of assemblies. If you find yourself having trouble, PerfView may be a better solution because of its polish and extreme customizability through very detailed command-line parameters that allow you to control nearly all aspects of its behavior. Your mileage may vary. On the other hand, CLR Profiler comes with its own source code so you can fix it!

Windows Performance Analyzer

The Windows Assessment and Deployment Kit (Windows ADK, also part of the Windows SDK) contains a number of tools that aid in deploying operating systems and applications to machines. Inside it are a pair of tools called Windows Performance Recorder and Windows Performance Analyzer. These tools process ETW events in the same manner as PerfView. However, Windows Performance Analyzer excels in displaying hardware and operating system level information. It can display .NET events as well, but it is not as convenient as PerfView.

To capture a trace, invoke Windows Performance Recorder and start capturing.

The main interface of Windows Performance Recorder. Click More Options to customize what kind of events are captured.

After you are done capturing events, click the Save button, which will bring up an interface for you to provide more details, while WPR processes the captured data in the background.

It can take a few minutes for this process to complete.

The capture data file can be opened in any tool that can analyze ETW events, but there is a convenient button to open it directly in Windows Performance Analyzer.

Windows Performance Analyzer shows you a list of resource categories along the left-hand side. Double-clicking one opens a detailed view with a graph and a table suitable for that resource. For example, the details for memory usage show different categories of memory, such as active vs. committed memory, paged pool, private pages, and more.

Windows Performance Analyzer’s main interface, showing captured OS and hardware metrics.

Because this tool focuses more on general operating system resource usage issues, rather than .NET, I will not discuss it further in this book, but it is a useful tool to keep in mind when you are dealing with some classes of performance problems.

WinDbg

WinDbg is a general-purpose Windows Debugger distributed for free by Microsoft. If you are used to using Visual Studio as your main debugger, using this bare-bones, text-only debugger may seem daunting. Do not let it be. Once you learn a few commands, you will feel comfortable and after a while, you will rarely use Visual Studio for debugging except during active development.

WinDbg is far more powerful than Visual Studio and will let you examine your process in many ways you could not otherwise. It is also lightweight and more easily deployable to production servers or customer machines. In these situations, it is in your best interest to become familiar with WinDbg. By itself, however, WinDbg is not that interesting for managed code. To work with managed processes effectively, you will need to use .NET’s SOS extensions, which ship with each version of the .NET Framework. A very handy SOS reference cheat sheet is located at https://docs.microsoft.com/dotnet/framework/tools/sos-dll-sos-debugging-extension. You can also use SOS.dll from Visual Studio, but this is not as straightforward, and there are other benefits to becoming familiar with WinDbg, so WinDbg is the scenario I will cover.

With WinDbg and SOS together, you can quickly answer questions such as these:

  • How many of each object type are on the heap, and how big are they?
  • How big are each of my heaps and how much of them is free space (fragmentation)?
  • What objects stick around through a garbage collection?
  • Which objects are pinned?
  • Which threads are taking the most CPU time? Is one of them stuck in an infinite loop?

WinDbg is not usually my first tool (that is often PerfView), but it is often my second or third, allowing me to see things that other tools will not easily show. For this reason, I will use WinDbg extensively throughout this book to show you how to examine your program’s operation, even when other tools do a quicker or better job. (Do not worry; I will also cover those tools.)

Do not be daunted by the text interface of WinDbg. Once you use a few commands to look into your process, you will quickly become comfortable and appreciative of the speed with which you can analyze a program. The chapters in this book will add to your knowledge little by little with specific scenarios.

To get WinDbg, you must install the Windows SDK. You can choose to install only the debuggers if you wish.

To get started with WinDbg, do a simple tutorial with a sample program. The program will be basic enough—a straightforward, easy-to-debug memory leak. You can find it in the accompanying source code in the MemoryLeak project (available at http://www.writinghighperf.net).

using System;
using System.Collections.Generic;
using System.Threading;

namespace MemoryLeak
{
  class Program
  {
    static List<string> times = new List<string>();

    static void Main(string[] args)
    {
      Console.WriteLine("Press any key to exit");
      while (!Console.KeyAvailable)
      {
        times.Add(DateTime.Now.ToString());
        Console.Write('.');
        Thread.Sleep(10);
      }
    }
  }
}

Start this program and let it run for a few minutes.

Run WinDbg from where you installed it. It should be in the Start Menu if you installed it via the Windows SDK. Take care to run the correct version, either x86 (for 32-bit processes) or x64 (for 64-bit processes). Go to File | Attach to Process (or hit F6) to bring up the Attach to Process dialog.

WinDbg’s Attach to Process screen.

From here, find the MemoryLeak process. (It may be easier to check the By Executable sort option.) Click OK.

WinDbg will suspend the process (important to know if you are debugging a live production process!) and display any loaded modules. At this point, it will be waiting for your command. The first thing you usually want to do is load the CLR debugging extensions. Enter this command:

.loadby sos clr

If it succeeds, there will be no output.

If you get an error message that says “Unable to find module ‘clr’”, it most likely means the CLR has not yet been loaded. This can happen if you launch a program from WinDbg and break into it immediately. In this case, first set a breakpoint on the CLR module load:

sxe ld clr
g

The first command sets a breakpoint on the load of the CLR module. The g command tells the debugger to continue execution. Once you break again, the CLR module should be loaded and you can now load SOS with the .loadby sos clr command, as described previously.

At this point, you can do any number of things. Here are some commands to try:

!ProcInfo

This prints out some general debugging information about the process as a whole, including the environment variables that are set:

---------------------------------------
Environment
=::=::
=C:=C:\WINDOWS\system32
...many, many environment variables
---------------------------------------
Process Times
Process Started at: 2017 Nov  7 22:5:49.44
Kernel CPU time   : 0 days 00:00:00.01
User   CPU time   : 0 days 00:00:00.01
Total  CPU time   : 0 days 00:00:00.02
---------------------------------------
Process Memory
WorkingSetSize:    26572 KB       PeakWorkingSetSize:    26572 KB
VirtualSize:      717972 KB       PeakVirtualSize:      717972 KB
PagefileUsage:    566560 KB       PeakPagefileUsage:    566560 KB
---------------------------------------
44 percent of memory is in use.

Memory Availability (Numbers in MB)

                     Total        Avail
Physical Memory       4095         4095
Page File             4095         4095
Virtual Memory        4095         3783

More useful commands:

g

This stands for “Go” and continues execution. You cannot enter any commands while the program is running.

<Ctrl-Break>

This pauses a running program. Do this after you Go (g) to get control back.

.dump /ma d:\memorydump.dmp

This creates a full process dump to the selected file. This will allow you to debug the process’s state later, though since it is a snapshot, of course you will not be able to debug any further execution.

!DumpHeap -stat

DumpHeap shows a summary of all managed objects on the managed heap, including their size (for the object itself only, not anything it references), count, and other information. If you want to see every object on the heap of type System.String, type !DumpHeap -type System.String. You will see more about this command when investigating garbage collection.

~*kb

This is a regular WinDbg command, not from SOS. It prints the current stack for all threads in the process.

To switch the current thread to a different one, use the command:

~32s

This will change the current thread to thread 32. Note that thread numbers in WinDbg are not the same as thread IDs. WinDbg numbers all the threads in your process for easy reference, regardless of the Windows or .NET thread ID.

!DumpStackObjects

You can also use the abbreviated version: !dso. This dumps out the address and type of each object from all stack frames for the current thread.

Note that all commands located in the SOS debugging extension for managed code are prefixed with a ! character.

The other thing you need to do to be effective with the debugger is to set your symbol path so that the public symbols for Microsoft DLLs are downloaded and you can see what is going on in the system layer. Set your _NT_SYMBOL_PATH environment variable to this string:

symsrv*symsrv.dll*c:\sym*http://msdl.microsoft.com/download/symbols

Replace c:\sym with your preferred local symbol cache path (and make sure you create the directory). With the environment variable set, both WinDbg and Visual Studio will use this path to automatically download and cache the public symbols for system DLLs. During the initial download, symbol resolution may be quite slow, but once cached, it should speed up significantly. You can also use the .symfix command to automatically set the symbol path to the Microsoft symbol server and local cache directly:

.symfix c:\sym

If you have not used WinDbg before, do not be afraid to dive in and try it out. Once you memorize a small number of commands, you will be highly productive in no time. Deep mastery of WinDbg will come with time and experience, but it is worth the journey. You can do many types of analysis in WinDbg that are very difficult or impossible to do in other debuggers. See especially Chapter 2’s sections on investigating memory issues for many examples of WinDbg usage.

CLR MD

After you have used WinDbg for a while and seen the power available to you, you will likely have the thought, “I wish I could access this stuff programmatically.” Thankfully, you can! Microsoft.Diagnostics.Runtime (nicknamed “CLR MD”) is an open source library available at https://github.com/Microsoft/clrmd. It provides access to much of the functionality in SOS.dll through a convenient, easy-to-use API. CLR MD is designed to be a fairly low-level API, allowing you to easily build on top of it to provide richer functionality. In fact, some of PerfView’s functionality is built on top of CLR MD, so if PerfView is not giving you exactly what you need, you can go under the hood, so to speak, to this library, and build what you need.

The easiest way to obtain the library is via a NuGet package from http://NuGet.org. You can search for either “Microsoft.Diagnostics.Runtime” or “CLR MD”.

In this section, I’ll provide an overview of the tool and how to use it, but specific solutions to problems will be found in the relevant sections throughout the book.

Note The library is very much in active development, and you will see differences between the documentation and what is currently implemented. The API may also change further.

You can use this library either to attach to live processes (as a debugger) or to open heap dump files on disk. I’ll show examples of both.

To attach to a live process, you just need to supply a process ID. In this example, I’m explicitly starting a new process for convenience. Most examples of CLR MD in this book will come from the AnalyzeProcess sample code project accompanying this book.

static void Main(string[] args)
{
    // Let's create our own process to test with
    // TargetProcessName (and PrintHeader, below) are defined elsewhere in
    // the accompanying AnalyzeProcess sample.
    var startInfo = new ProcessStartInfo(TargetProcessName);
    startInfo.CreateNoWindow = true;
    startInfo.WindowStyle = ProcessWindowStyle.Hidden;
    
    var targetProcess = Process.Start(startInfo);
    Thread.Sleep(1000);
    using (DataTarget target = DataTarget.AttachToProcess(
        targetProcess.Id,
        10000, // timeout
        AttachFlag.Invasive))
    {
        PrintDumpInfo(target);

        var clr = target.ClrVersions[0].CreateRuntime();
    }
}

private static void PrintDumpInfo(DataTarget target)
{
    PrintHeader("Target Info");
    
    Console.WriteLine($"Architecture: {target.Architecture}");
    Console.WriteLine($"Pointer Size: {target.PointerSize}");
    Console.WriteLine("CLR Versions:");
    foreach(var clr in target.ClrVersions)
    {
        Console.WriteLine($"\t{clr.Version}");
    }            
}

This program will print out the following information:

Target Info
===========
Architecture: X86
Pointer Size: 4
CLR Versions:
        v4.7.2115.00

The clr object obtained after calling PrintDumpInfo is the main interface to most of the interesting functionality. Using it, you can, for example, iterate over every object in the heap:

var heap = clr.Heap;
foreach(var obj in heap.EnumerateObjects())
{
    int generation = heap.GetGeneration(obj.Address);
    Console.WriteLine(
      $"0x{obj.Address:x} - {obj.Type.Name}" +
      $" - Generation: {generation}");
}

Which produces output similar to:

0x30ec8ac - System.Byte[] - Generation: 0
0x30ecca0 - LargeMemoryUsage.B - Generation: 1
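
For example, you can combine the enumeration above with per-object sizes to get a rough equivalent of WinDbg’s !DumpHeap -stat. This is a minimal sketch using the same CLR MD 1.x API as the code above (it additionally assumes using System.Linq and System.Collections.Generic, and the exact API surface varies between CLR MD releases):

var heapStats = new Dictionary<string, (long Count, ulong TotalSize)>();
foreach (var obj in heap.EnumerateObjects())
{
    if (obj.Type == null)
    {
        continue;   // skip unreadable objects
    }
    ulong size = obj.Type.GetSize(obj.Address);
    heapStats.TryGetValue(obj.Type.Name, out var entry);
    heapStats[obj.Type.Name] = (entry.Count + 1, entry.TotalSize + size);
}

// Print the types using the most total heap space, largest first.
foreach (var kvp in heapStats.OrderByDescending(kvp => kvp.Value.TotalSize))
{
    Console.WriteLine(
        $"{kvp.Value.Count,10:N0} {kvp.Value.TotalSize,14:N0} {kvp.Key}");
}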

In addition to the heap, you can examine code:

foreach(var module in clr.Modules)
{
    foreach (var type in module.EnumerateTypes())
    {
        foreach(var method in type.Methods)
        {
            Console.WriteLine(method.Name);
        }
    }
}

This produces output like this:

Main
GetNewObject
.cctor
ToString
ToString
Equals

You can also open crash dumps. This is slightly more complicated because you must also obtain the mscordacwks.dll file that matches the CLR version(s) present in the dump. When attaching to a live process, this is trivial because it is guaranteed to be present on the machine. With a dump from a different machine, and potentially a different version of the CLR altogether, you must obtain it from that machine or download it from the Microsoft symbol server. This code shows you how to accomplish this:

{
    ...
    string dacFile = 
      GetDacFile(
        dataTarget.ClrVersions[0], 
        dataTarget);
    var clr = dataTarget.ClrVersions[0].CreateRuntime(dacFile);
    ...
}

private static string GetDacFile(ClrInfo clrInfo, 
                                 DataTarget target)
{            
    string location = clrInfo.LocalMatchingDac;
    if (string.IsNullOrEmpty(location) || !File.Exists(location))
    {
        // try to download from symbol server
        ModuleInfo dacInfo = clrInfo.DacInfo;
        try
        {
            location = target.SymbolLocator.FindBinary(dacInfo);
        }
        catch (WebException)
        {
            return null;
        }
    }
    return location;
}

This method is equivalent to calling CreateRuntime with no arguments, but it is useful to know how to do this yourself in case you have custom needs.
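
For reference, the dataTarget used above comes from opening the dump file itself rather than attaching to a live process. Here is a minimal sketch using the CLR MD 1.x API (the dump path is hypothetical):

using (DataTarget dataTarget =
    DataTarget.LoadCrashDump(@"d:\dumps\memorydump.dmp"))
{
    string dacFile = GetDacFile(dataTarget.ClrVersions[0], dataTarget);
    var clr = dataTarget.ClrVersions[0].CreateRuntime(dacFile);

    // From here, use clr.Heap, clr.Modules, and so on, exactly as with
    // a live process.
}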

You will see more examples of its power in later chapters, but here is a summary of some of the things it can tell you:

  • Enumerate all objects in the heap, and give information such as generation, whether they are pinned, etc.
  • Provide tools to find objects’ roots and sizes
  • Enumerate memory segments
  • Iterate over all methods in the process
  • Calculate IL and native code sizes

Note I have seen a couple of issues when using this library to examine the code in a truly huge DLL. The APIs in Microsoft.Diagnostics.Runtime rely on internal .NET APIs that may not have the most efficient implementation. In one case, I was using a dump file to calculate how much JITting had happened in a 500 MB DLL with 80,000 types, and hundreds of thousands of methods. I hit Ctrl-Break after about 36 hours. That is the only DLL I’ve had issues with.

IL Analyzers

There are many free and paid products out there that can take a compiled assembly and decompile it into IL, C#, VB.NET, or any other .NET language. Some of the most popular include Reflector, ILSpy, and dotPeek, but there are others.

These tools are valuable for showing you the inner details of other people’s code, something critical for good performance analysis. I use them most often to examine the .NET Framework itself when I want to see the potential performance implications of various APIs.

Converting your own code to readable IL is also valuable because it can show you many operations, such as boxing, that are not visible in the higher-level languages.
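
For example, a single innocuous-looking assignment can hide a heap allocation. Decompile the following method to IL and you will see a box instruction that is invisible in the C# source:

static object BoxAnInt()
{
    int count = 42;
    // Assigning a value type to an object reference boxes it. The IL for
    // this line contains a "box System.Int32" instruction, which allocates
    // a new object on the managed heap.
    object boxed = count;
    return boxed;
}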

ILSpy with a decompilation of Enum.HasFlag in C#. Decompilers are a powerful tool for learning how 3rd-party code works and performs.

Chapter 6 discusses the .NET Framework code and encourages you to train a critical eye on every API you use. Tools like ILSpy, dotPeek, and Reflector are vital for that purpose and you will use them frequently as you become more familiar with existing code. You will often be surprised at how much work goes into seemingly simple methods. Analyzing the assemblies of other developers and companies can teach you much about good (or bad) organization, design, and coding practices.

Some other things these tools can show you:

  • Assembly references
  • Assembly metadata such as target framework, processor architecture
  • IL code (we will make extensive use of this feature in this book)
  • Size of code

Most tools also have search capability to allow you to find types, methods, fields, or code statements.

MeasureIt

MeasureIt is a handy micro-benchmark tool by Vance Morrison (also the author of PerfView). It shows the relative costs of various .NET APIs in many categories including method calls, arrays, delegates, iteration, reflection, P/Invoke, and many more. It compares all the costs to calling an empty static function as a baseline.

MeasureIt is primarily useful to show you how design choices will affect performance at an API level. For example, in the locks category, it shows you that using ReaderWriterLock is about four times slower than just using a regular lock statement.

It is easy to add your own benchmarks to MeasureIt’s code. It ships with its own code packed inside itself—just run MeasureIt /edit to extract it. Studying this code will give you a good idea of how to write accurate benchmarks. There is a lengthy explanation in the code comments about how to do high-quality analysis, which you should pay special attention to, especially if you want to do some simple benchmarking yourself.

For example, it prevents the JIT compiler from inlining function calls:

[MethodImpl(MethodImplOptions.NoInlining)]
public void AnyEmptyFunction()
{
}

There are other tricks it uses such as working around processor caches and doing enough iterations to produce statistically significant results.

MeasureIt is handy because it has a number of built-in measurements of the CLR itself, which can give you a good idea of what the basics cost. If you are interested in benchmarking your own code, then read on to the next section.

BenchmarkDotNet

The standard in .NET benchmarking is probably the open-source project BenchmarkDotNet. This library handles many of the usual concerns about micro-benchmarking and does much more by:

  • Making it easy to pick code for benchmarking with attributes.
  • Generating isolated projects for each method under test.
  • Automatically calculating iteration count for the required precision.
  • Warming up code.
  • Performing statistical analysis.
  • Comparing performance across a number of different code environments, such as x86, x64, different JIT versions, GC configuration, and more.
  • Analyzing CPU, garbage collection, memory allocations, JIT, and various hardware counters.

Getting started is very easy. Here is a simple example, comparing the performance of foreach loops on an array versus IEnumerable. With simple attribute decoration, you can let the library do almost all the work for you.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;

namespace BenchmarkTest
{
    public class LoopBenchmarks
    {
        static int[] arr = new int[100];

        public LoopBenchmarks()
        {
            for (int i = 0; i < arr.Length; i++)
            {
                arr[i] = i;
            }
        }

        [Benchmark]
        public int ForEachOnArray()
        {
            int sum = 0;
            foreach (int val in arr)
            {
                sum += val;
            }
            return sum;
        }

        [Benchmark]
        public int ForEachOnIEnumerable()
        {
            int sum = 0;
            IEnumerable<int> arrEnum = arr;
            foreach (int val in arrEnum)
            {
                sum += val;
            }
            return sum;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<LoopBenchmarks>();
        }
    }
}

You can run this yourself with the BenchmarkTest sample code.

The output ends with this:

Total time: 00:00:43 (43.64 sec)

// * Summary *

BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-3930K CPU 3.20GHz (Ivy Bridge), 
  ProcessorCount=12
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 
     32bit LegacyJIT-v4.7.2102.0
  DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 
     32bit LegacyJIT-v4.7.2102.0


               Method |      Mean |     Error |    StdDev |
--------------------- |----------:|----------:|----------:|
       ForEachOnArray |  53.32 ns | 0.2083 ns | 0.1846 ns |
 ForEachOnIEnumerable | 561.69 ns | 7.2943 ns | 6.8231 ns |

// * Hints *
Outliers
  LoopBenchmarks.ForEachOnArray: Default -> 1 outlier  was  removed

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  1 ns   : 1 Nanosecond (0.000000001 sec)

// ***** BenchmarkRunner: End *****
// * Artifacts cleanup *

Notice that even for such simple code, it took a full 43 seconds to execute the benchmarks.

You can of course customize how these benchmarks work with additional configuration.
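
For example, decorating the benchmark class with a diagnoser attribute adds allocation and garbage collection statistics to the results table. This is a minimal sketch; see the BenchmarkDotNet documentation for the full set of jobs, diagnosers, and exporters:

[MemoryDiagnoser]   // adds GC collection counts and allocated bytes per operation
public class LoopBenchmarks
{
    // ...benchmark methods as before...
}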

To read more, visit http://benchmarkdotnet.org. Add it to your project directly from Visual Studio by installing the BenchmarkDotNet NuGet package.

Code Instrumentation

The old standby of brute-force debugging via console output is still a valid scenario and should not be ignored. Rather than console output, however, I encourage you to use ETW events instead, as detailed in Chapter 8.
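
A minimal sketch of such an event source follows, using the System.Diagnostics.Tracing.EventSource class (the source and event names here are hypothetical; Chapter 8 covers ETW events in detail):

using System.Diagnostics.Tracing;

[EventSource(Name = "MyCompany-MyApp")]
public sealed class AppEventSource : EventSource
{
    public static readonly AppEventSource Log = new AppEventSource();

    [Event(1, Level = EventLevel.Informational)]
    public void RequestStarted(string url) { WriteEvent(1, url); }

    [Event(2, Level = EventLevel.Informational)]
    public void RequestCompleted(string url, long elapsedMilliseconds)
    {
        WriteEvent(2, url, elapsedMilliseconds);
    }
}

// Usage:
//   AppEventSource.Log.RequestStarted(url);
//   ...do work...
//   AppEventSource.Log.RequestCompleted(url, stopwatch.ElapsedMilliseconds);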

Performing accurate code timing is also a useful feature at times. Never use DateTime.Now for tracking performance data; it is too slow and its resolution is too coarse for this purpose. Instead, use the System.Diagnostics.Stopwatch class to track the time span of small or large events in your program with high precision and low overhead.

var stopwatch = Stopwatch.StartNew();
...do work...
stopwatch.Stop();
TimeSpan elapsed = stopwatch.Elapsed;
long elapsedTicks = stopwatch.ElapsedTicks;

See Chapter 6 for more information about using times and timing in .NET.

If you want to ensure that your own benchmarks are accurate and reproducible, study the source code and documentation of MeasureIt, which highlight the best practices on this topic. Benchmarking well is often harder than you would expect, and performing benchmarks incorrectly can be worse than doing no benchmarks at all, because it will cause you to waste time on the wrong thing. In most cases, it is better to use a 3rd-party library like BenchmarkDotNet.

SysInternals Utilities

No developer, system administrator, or even hobbyist should be without this great set of tools. Originally developed by Mark Russinovich and Bryce Cogswell and now owned by Microsoft, these are tools for computer management, process inspection, network analysis, and a lot more. Here are some of my favorites:

  • ClockRes: Shows the resolution of the system’s clock (which is also the maximum timer resolution).
  • CoreInfo: Relates logical processors to physical processors, sockets, caches, and more.
  • Diskmon: Monitors all disk activity.
  • DiskView: Sector-by-sector utility for hard disks.
  • Handle: Shows which files are opened by which processes.
  • ListDLLs: Lists loaded DLLs.
  • NTFSInfo: Shows detailed information about NTFS volumes.
  • PsInfo: Displays OS, disk, user, and software information about a system.
  • ProcDump: A highly configurable process dump creator.
  • Process Explorer: A much better Task Manager, with a wealth of detail about every process.
  • Process Monitor: Monitor file, registry, and process activity in real-time.
  • RAMMap: Analyze the physical memory usage of the entire system.
  • SDelete: A secure file delete utility.
  • Strings: Searches binaries for strings.
  • VMMap: Analyze a process’s address space.

There are dozens more. You can download this suite of utilities (individually or as a whole) from https://docs.microsoft.com/sysinternals/.

Process Explorer is a highly advanced version of Task Manager that gives you an extreme amount of detail about each process, as well as the relationships among processes.
Process Monitor shows a live trace of file, registry, process, thread, and network events for the whole system. It can be useful, for example, to find out whether a process is reading from a specific file and when.

Database

The final performance tool is a rather generic one: a simple database—something to track your performance over time. The metrics you track are whatever is relevant to your project, and the format does not have to be a full-blown SQL Server relational database (though there are certainly advantages to such a system). It can be a collection of reports stored over time in an easily readable format, or just CSV files with labels and values. The point is that you should record it, store it, and build the ability to report from it.
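
Even a simple append-only CSV file is enough to make trends visible over time. Here is a minimal sketch (the file path, column names, and metric parameters are hypothetical; it assumes using System and using System.IO):

static void RecordRunMetrics(
    string buildNumber, double peakWorkingSetMB,
    double requestsPerSec, double p95LatencyMs)
{
    const string path = @"d:\perf\history.csv";
    Directory.CreateDirectory(Path.GetDirectoryName(path));

    // Write the header row once, on first use.
    if (!File.Exists(path))
    {
        File.AppendAllText(path,
            "DateUtc,Build,PeakWorkingSetMB,RequestsPerSec,P95LatencyMs\r\n");
    }

    // Append one labeled row per test run; a spreadsheet or script can chart trends.
    File.AppendAllText(path,
        $"{DateTime.UtcNow:yyyy-MM-dd},{buildNumber},{peakWorkingSetMB}," +
        $"{requestsPerSec},{p95LatencyMs}\r\n");
}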

When someone asks you if your application is performing better, which is the better answer?

Yes.

Or:

In the last 6 months, we have reduced CPU usage by 50%, memory consumption by 25%, and request latency by 15%. Our GC rate is down to one in every 10 seconds (it used to be every second!), and our startup time is now dominated entirely by configuration loading (35 seconds).

As mentioned earlier, bragging about performance gains is so much better with solid data to back it up!

Other Tools

You can find many other tools. There are plenty of static code analyzers, ETW event collectors and analyzers, assembly decompilers, performance profilers, and much more.

You can consider the list presented in this chapter as a starting point, but understand that you can do significant work with just these tools. Sometimes an intelligent visualization of a performance problem can help, but you will not always need it.

You will also discover that as you become more familiar with technologies like Performance Counters or ETW events, it is easy to write your own tools to do custom reporting or intelligent analysis. Many of the tools discussed in this book are automatable to some degree.

Measurement Overhead

No matter what you do, there is going to be some overhead from measuring your performance. CPU profiling slows your program down somewhat; performance counters require memory and/or disk space; and ETW events, as fast as they are, are not free.

You will have to monitor and optimize this overhead just like every other aspect of your program, and then decide whether the data you gain in a given scenario is worth the performance hit you will pay to collect it.

If you cannot afford to measure all the time, then you will have to settle for some kind of periodic profiling. As long as it happens often enough to catch issues, that is likely fine. However, do not underestimate the people cost of manual performance measurement—often, this can add up to a much higher cost than building a system that can automatically perform measurement for you.

You could also have “special builds” of your software, but this can be a little dangerous. You do not want these special builds to morph into something that is unrepresentative of the actual product.

As with many things in software, there is a balance you will have to find between having all the data you want and having optimal performance.

Summary

The most important rule of performance is Measure, Measure, Measure!

Know what metrics are important for your application. Develop precise, quantifiable goals for each metric. Average values are good, but pay attention to percentiles as well, especially for high-availability services. Ensure that you include good performance goals in the design up front and understand the performance implications of your architecture. Optimize the parts of your program that have the biggest impact first. Focus on macro-optimizations at the algorithmic or systemic level before moving on to micro-optimizations. When you are unsure about the performance of an algorithm, utilize benchmarking frameworks to test them.

Have a good foundation of performance counters and ETW events for your program. For analysis and debugging, use the right tools for the job. Learn how to use the most powerful tools like WinDbg and PerfView to solve problems quickly.
