B

 

Back Trace

Usually, when we analyze traces and find Anchor Message or Error Message, we backtrack using a combination of Data Flow and Message Sets. Then we select the appropriate log messages to form Back Trace leading to a possible root cause message:

This pattern is different from Error Thread pattern that just backtracks messages having the same TID (or, in general, ATID19). It is also different from Exception Stack Trace pattern that is just a serialized stack trace from memory snapshot.

 

Background and Foreground Components

A metaphorical bijection20 from the literary narratology to software narratology21 provides a pattern of Background and Foreground Components. We can easily illustrate it on pseudo-trace color diagrams. Suppose we troubleshoot a graphical issue using an ETW trace containing the output from all components of the problem system. Graphics components and their messages are considered foreground for a trace viewer (a person) against numerous background components (for example, database, file, and registry access, shown in shades of green):

Trace viewers (for example, CDFAnalyzer) can filter out background component messages and present only foreground components (that I propose to call the component foregrounding):

Of course, this process is iterative and parts of what once was foreground become background and candidate for further filtering:

 

Basic Facts

A typical trace is a detailed narrative accompanied by a problem description that lists essential facts. For this reason, the first task of any trace analysis is to check the presence of Basic Facts in the trace. If they are not visible or do not correspond, the trace was possibly not recorded during the problem occurrence or was taken from a different computer or under different conditions. Here is an example. A user “test01 cannot connect to a published application in a terminal services environment. We look at the trace and find this statement:

No   PID  TID  Date      Time         Statement​
[...]​
3903 3648 5436 4/29/2009 16:17:36.150 User Name: test01​
[...]

At least we can be sure that this trace was taken for the user test01 especially when we expect this or similar trace statements. If we do not see this trace statement, we may suppose that the trace was taken at the wrong time, for example, when the problem had already happened.

We suggest the following taxonomy of Basic Facts:

  • Functional facts: Expected a dialog to enter data.
  • Non-functional facts: CPU consumption 100%.
  • Identification facts: Application name, PID, username.

 

Bifurcation Point

The following two software traces from working and non-working software environments are a perfect example of the pattern. We borrow the name of this pattern from catastrophe theory22:

Working trace (the issue is absent):

#   PID  TID  Message​
[...]​
25  2768 3056 Trace Statement A ​
26  3756 2600 Trace Statement B ​
27  3756 2600 Trace Statement C ​
[...]​
149 3756  836 Trace Statement X (Query result: XXX)​
150 3756  836 Trace Statement 150.1 ​
151 3756  836 Trace Statement 151.1​
152 3756  836 Trace Statement 152.1 ​
153 3756  836 Trace Statement 153.1 ​
[...]

Non-working trace (the issue is present):

#   PID  TID  Message​
[...]​
27  2768 3056 Trace Statement A ​
28  3756 2176 Trace Statement B ​
29  3756 2176 Trace Statement C ​
[...]​
151 3756 5940 Trace Statement Y (Query result: YYY)​
152 3756 5940 Trace Statement 152.2 ​
153 3756 5940 Trace Statement 153.2 ​
154 3756 5940 Trace Statement 154.2 ​
155 3756 5940 Trace Statement 155.2 ​
[...]

First, we notice that in both traces, PID are the same (2768 and 3756) and we can conclude for this reason that most likely both traces came from the same environment and session. Second, messages A, B, C, and further are identical up to messages X and Y. The latter two messages differ greatly in their query results XXX and YYY. After that, message distribution differs greatly in both size and content. Despite the same tracing time, 15 seconds, Statement Current is 155 msg/s for working and 388 msg/s for the non-working case.

We can easily observe Bifurcation Points when tracing noise ratio is small and, for example, in the case of terminal services environments, we can achieve that by selecting appropriate tracing modules based on problem description or filtering irrelevant modules from full CDF traces.

The following diagram illustrates Bifurcation Point:

 

Blackout

We recently analyzed a Process Monitor log that had a several hour gap that we call Blackout. If you see such a pattern, it might have the following possible causes:

  • Some files from Split Trace are missing;
  • Split Trace file set was artificially created;
  • The tracing scope system was paused or frozen (for example, a virtualized system), or restarted;
  • The tracing itself was paused.

Blackout pattern is different from Visibility Limit where the latter is about the inherent inability to trace, but the former is only temporary inability due to circumstances listed above. It is also different from Discontinuity pattern where the latter is about gaps in individual Threads of Activity or Adjoint Threads of Activity.

 

Braid Group

One of the trace attributes we didn’t pay much attention to in the past is CPU. This column is present in some ETW-based trace implementations such as Citrix CDF traces. As any trace attribute, it can be used to form Adjoint Thread of Activity (as all messages from code executed on that particular CPU). As we already considered threads as braids23, we use braid groups24 as a further metaphor. In our case, we combine CPUs and threads into one group which uses permutation for CPU scheduling. Instead of permutations, twists may be modeled as changes of threads. The Braid Group analysis pattern is illustrated in the following diagram:

This is a preliminary description of the analysis pattern. We plan to elaborate on it in further case studies. For example, instead of multithreading, we can use multibraiding25.

 

Braid of Activity

If we consider a log as a text, ignore its column structure, and do search for the particular attribute value (for example, PID) we get Message Set consisting of messages having that attribute value as column (Adjoint Thread of Activity) and messages having that attribute value referenced in their message text. We call this pattern Braid of Activity because metaphorically it looks like Adjoint Threads of Activity cross each other (like multibraiding26):

 

Break-in Activity

This is a message or a set of messages that surface just before the end of Discontinuity of Adjoint Thread and possibly triggered it:

 

 

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset